Posted on X by Kwindla
The team at Speechmatics just shipped a really clean integration of realtime speaker diarization for voice agents. I’ve tinkered quite a bit with multi-speaker voice agent pipelines, and this is the best implementation I’ve seen so far.

Voice AI in 2025 is at a really interesting point. A wide variety of companies are deploying voice agents at scale for a range of use cases. (Customer support. Outbound telephone calls for healthcare workflows. Answering the phone for restaurants and other small businesses. Teaching and tutoring. Phone screens. User research. And many more.) But there are still a *lot* of interesting problems to solve and new components to build.

Speaker diarization means figuring out who is talking. If your voice agent knows that multiple people are talking, and who said what, you can do lots of useful things and build new kinds of interactions:
– incorporate “side conversations” between people into the LLM’s understanding of context
– ignore side conversations that the LLM shouldn’t respond to
– (relatedly) do a better job with turn detection and selective response, overall

For example, imagine a parent and child sitting together talking to an LLM tutor. The LLM can do a much better job guiding the lesson if the child and parent transcriptions are separate and properly marked.
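(A rough sketch of what the context-engineering side of that tutor example could look like, in Python. The segment type, speaker labels, and role mapping here are hypothetical, not the actual Speechmatics or Pipecat API; it just shows diarized STT output being turned into speaker-tagged turns for the LLM.)

```python
# Hypothetical sketch: turning diarized STT segments into speaker-labeled
# context for an LLM tutor. Names and structures are illustrative only.
from dataclasses import dataclass


@dataclass
class DiarizedSegment:
    speaker: str  # raw label from the diarization model, e.g. "S1", "S2"
    text: str


# Map raw diarization labels to roles the LLM can reason about.
SPEAKER_ROLES = {"S1": "child", "S2": "parent"}


def build_user_message(segments: list[DiarizedSegment]) -> dict:
    """Collapse a burst of diarized segments into one speaker-tagged turn."""
    lines = [
        f"[{SPEAKER_ROLES.get(seg.speaker, seg.speaker)}] {seg.text}"
        for seg in segments
    ]
    return {"role": "user", "content": "\n".join(lines)}


segments = [
    DiarizedSegment("S1", "I think the answer is twelve?"),
    DiarizedSegment("S2", "Take your time, count the groups again."),
]
print(build_user_message(segments)["content"])
# [child] I think the answer is twelve?
# [parent] Take your time, count the groups again.
```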
posted a great demo video, with source code, showing realtime speaker diarization in action. The code is running on the ESP32 embedded hardware that a lot of us are having fun hacking on these days!
This tech enables real-time speaker diarization on low-power devices like ESP32, solving:
– Robotics: Multi-user interactions, e.g., home assistants distinguishing family members.
– Drive-thrus: Accurate order-taking from car occupants.
– Elevators: Voice commands with user identification for security.
– Meetings: Portable transcription tools tagging speakers.
– Healthcare: Wearables monitoring patient-doctor talks.
– Automotive: In-car AI separating driver/passenger inputs.
Edge AI for privacy-focused, offline apps.
·
Original Post…
We played a game of Guess Who? using @Speechmatics diarization that knows who’s talking, running on a tiny ESP32 using WebRTC via @pipecat_ai from @trydaily Yes. Really. Matt Barty and I went up against “Humphrey”, trying to guess a mystery Brit … With diarization,
·
what about using semantics to iterate on the speech to text result? who’s doing this

·
> what about using semantics to iterate on the speech to text result? who’s doing this

One thing I see people doing in production is using the realtime (small chunks) transcription for the latency-sensitive voice conversation flow, and then also asking an LLM to clean up and improve the transcription based on all the context so far, to update what’s displayed in text to the user.

Also, it’s helpful to prompt the realtime LLM to tell it that it’s getting a transcribed audio stream and it should silently correct for and interpret the input accordingly.
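(A concrete sketch of that two-path pattern, assuming the OpenAI Python SDK; the model name and prompts are illustrative. The fast path tells the realtime model to expect ASR errors; a slower pass rewrites the running transcript purely for what’s displayed to the user.)

```python
# Sketch of the two-path transcript pattern described above.
from openai import OpenAI

client = OpenAI()

# Fast path: the realtime LLM is told up front that its input is an ASR
# stream, so it silently corrects likely transcription errors.
REALTIME_SYSTEM_PROMPT = (
    "You are a voice agent. The user messages are realtime speech-to-text "
    "output and may contain transcription errors. Silently infer the "
    "intended words and respond to the intended meaning; never comment on "
    "transcription mistakes."
)


# Slow path: periodically ask a second LLM to clean up the running transcript
# using all the context so far, and use the result only for the on-screen text.
def refine_transcript_for_display(raw_chunks: list[str]) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable text model works here
        messages=[
            {
                "role": "system",
                "content": (
                    "Clean up this speech-to-text transcript: fix obvious "
                    "misrecognitions, punctuation, and casing. Do not add "
                    "or remove content."
                ),
            },
            {"role": "user", "content": "\n".join(raw_chunks)},
        ],
    )
    return resp.choices[0].message.content
```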
·
> Also, it’s helpful to prompt the realtime LLM to tell it that it’s getting a transcribed audio stream and it should silently correct for and interpret the input accordingly.

that’s right, didn’t think of that
·
Support for more than one user on the entire agent stack is really sorely needed. WebRTC is doing its part, but the rest of the stack is lacking.

·
> WebRTC is doing its part, but the rest of the stack is lacking.

This is a good point. When the Speechmatics team first showed me their diarization capability and Pipecat integration, my initial reaction was “I already do that because with WebRTC I always have separate audio tracks for each speaker.” But that was a blind spot on my part. Once I started thinking about it, it was obvious that there are a lot of contexts in which you only have “one microphone.” And, in fact, I’d worked on some of those in the past, for example in-office healthcare voice pipelines.

I think this highlights that to build the best possible voice applications, you often need support for important capabilities at multiple layers of the stack. You want both separate tracks when you have that option, *and* excellent speaker diarization from STT models, *and* context engineering tooling that makes it easy to give the right multi-speaker tokens to the LLM.
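(A hypothetical sketch of that layering in Python, not a real Pipecat or Speechmatics interface: speaker identity comes from a dedicated WebRTC track when one exists, otherwise falls back to the STT diarization label, and both get normalized into the same speaker-tagged context for the LLM.)

```python
# Hypothetical sketch of "support at multiple layers": normalize speaker
# identity from either separate WebRTC tracks or single-mic STT diarization
# into one speaker-tagged transcript for the LLM. Names are illustrative.
from dataclasses import dataclass
from typing import Optional


@dataclass
class TranscriptEvent:
    text: str
    track_participant: Optional[str] = None  # set when each speaker has their own WebRTC track
    diarization_label: Optional[str] = None  # set by the STT model on a shared mic, e.g. "S1"


def speaker_for(event: TranscriptEvent) -> str:
    # Prefer the strongest signal available: a dedicated audio track is
    # unambiguous; otherwise fall back to the STT diarization label.
    if event.track_participant:
        return event.track_participant
    if event.diarization_label:
        return event.diarization_label
    return "unknown"


def to_llm_context(events: list[TranscriptEvent]) -> str:
    return "\n".join(f"[{speaker_for(e)}] {e.text}" for e in events)


events = [
    TranscriptEvent("Can you book the follow-up for Tuesday?", track_participant="doctor"),
    TranscriptEvent("Tuesday afternoon works better for me.", diarization_label="S2"),
]
print(to_llm_context(events))
```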
·
Yeah even in the basic example of one user on a call, when the wife walks by and asks a question, it’s lame when the agent can’t tell it was someone else and ignore it
·