Real-time AI sales coaching: engineering for sub-second latency
Live AI coaching during sales calls only works if guidance arrives in under a second. Here is the architecture that makes it possible: streaming transcription, prompt caching, and a deterministic fallback.
Post-call analysis tells you what went wrong yesterday. A sales team we work with wanted the harder thing: guidance during the call, on screen, while the customer is still talking. The entire engineering problem reduces to one number: how many milliseconds pass between the customer’s words and useful text in front of the rep.
Why latency is the whole product
Imagine the customer says “honestly, I’m worried this is too expensive for me.” A coaching hint that appears two seconds later is decoration; the rep has already answered. The same hint inside one second changes what the rep says next. Everything in the architecture serves that budget.
The architecture, end to end
- Audio ingestion. The phone platform streams call audio over a WebSocket as the call happens. Critically, we subscribe only to the customer’s track. The rep knows what they themselves said; we do not need to transcribe it. That single decision removes speaker-separation work from the hot path.
- Streaming transcription. A streaming speech-to-text service returns partial transcripts as words are spoken, not after sentences complete. Waiting for “final” transcripts adds 500 ms or more of delay, so the system works on partials.
- Two coaching engines in parallel. A trigger engine watches for high-stakes phrases (price worry, discount requests, trust concerns) and fires pre-written guidance instantly. These are deterministic and effectively free. A stage engine tracks where the conversation is (opening, discovery, presentation, close) and asks a small, fast language model what matters next. The model prompt is cached, so repeated calls reuse almost the entire context and respond in a few hundred milliseconds.
- Delivery. Guidance lands on a lightweight overlay the rep keeps beside their call window, pushed over a WebSocket. Each call gets a signed token so overlays cannot be hijacked across sessions.
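The trigger engine described above can be sketched in a few lines. This is a minimal illustration, not our production code: the `TRIGGERS` table, its regular expressions, the hint text, and the `TriggerEngine` class are all hypothetical. It shows the one property that matters on partial transcripts, which grow word by word: each trigger fires once per utterance instead of on every update.

```python
import re
from dataclasses import dataclass, field

# Hypothetical trigger table: names, patterns, and hint text are illustrative only.
TRIGGERS = [
    ("price_worry",
     re.compile(r"\b(too expensive|can't afford|out of (my|our) budget)\b", re.I),
     "Acknowledge the concern, then ask what they are comparing the price to."),
    ("discount_request",
     re.compile(r"\b(discount|better price|cheaper)\b", re.I),
     "Hold the price for now; ask which part of the offer matters most."),
    ("trust_concern",
     re.compile(r"\b(is this a scam|sounds too good|not sure i can trust)\b", re.I),
     "Offer a reference customer or a written guarantee."),
]


@dataclass
class TriggerEngine:
    """Matches high-stakes phrases in partial transcripts, with no model call."""
    fired: set = field(default_factory=set)

    def on_partial(self, partial_text: str) -> list[str]:
        """Return hints for triggers newly matched in this partial transcript."""
        hints = []
        for name, pattern, hint in TRIGGERS:
            # Partials repeat earlier words, so skip triggers that already fired.
            if name not in self.fired and pattern.search(partial_text):
                self.fired.add(name)
                hints.append(hint)
        return hints

    def end_utterance(self) -> None:
        """Reset when the customer finishes speaking, so triggers can fire again."""
        self.fired.clear()
```

Because the check is a handful of regex searches over a short string, it adds negligible time to the hot path compared with a network call to a model.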
The result in production: 95th-percentile latency under one second, on commodity infrastructure.
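For readers who want to check a similar budget on their own system, here is a small sketch of a 95th-percentile check. It assumes you already record one end-to-end latency sample per hint (spoken word to on-screen text); the nearest-rank definition and the helper names `percentile` and `meets_budget` are our choices for illustration, and other percentile definitions give slightly different numbers.

```python
import math


def percentile(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no latency samples recorded")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_budget(samples_ms: list[float], budget_ms: float = 1000.0, pct: float = 95.0) -> bool:
    """True if the chosen percentile of observed latency is under the budget."""
    return percentile(samples_ms, pct) < budget_ms
```

A percentile is a better target than an average here: one slow hint in twenty is tolerable, but an average can hide a long tail of hints that arrive after the rep has moved on.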
The decisions that mattered most
- Use the smallest model that works. The fastest tier of a frontier model family, with prompt caching, beats a smarter-but-slower model for this job. Depth belongs in post-call analysis; speed belongs live.
- Make the deterministic path primary, not a fallback. The highest-value moments (discount asked, trust questioned) are predictable enough to script. The AI fills the gaps between them.
- Test like it is production telephony. The system carries 300+ automated tests, including concurrency tests that simulate simultaneous calls, because a coaching tool that drops guidance under load trains reps to ignore it.
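The concurrency testing mentioned above can take roughly this shape. The sketch below is an assumption-heavy stand-in, not our test suite: `GuidanceHub` is a hypothetical in-memory router with one queue per call, and `simulate_call` is a made-up driver. It runs many calls at once and lets a test assert that every call received exactly its own hints, in order, with none dropped or cross-delivered.

```python
import asyncio


class GuidanceHub:
    """Hypothetical in-memory router: one queue of guidance per call ID."""

    def __init__(self) -> None:
        self._queues: dict[str, asyncio.Queue] = {}

    def open_call(self, call_id: str) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self._queues[call_id] = queue
        return queue

    async def publish(self, call_id: str, hint: str) -> None:
        await self._queues[call_id].put(hint)


async def simulate_call(hub: GuidanceHub, call_id: str, n_hints: int) -> list[str]:
    """Publish n_hints for one call, yielding between hints so calls interleave."""
    queue = hub.open_call(call_id)
    for i in range(n_hints):
        await asyncio.sleep(0)  # hand control to the other simulated calls
        await hub.publish(call_id, f"{call_id}:hint-{i}")
    return [queue.get_nowait() for _ in range(n_hints)]


async def run_simulation(n_calls: int = 50, n_hints: int = 20) -> list[list[str]]:
    """Run all simulated calls concurrently and return what each one received."""
    hub = GuidanceHub()
    return await asyncio.gather(
        *(simulate_call(hub, f"call-{c}", n_hints) for c in range(n_calls))
    )
```

A real suite would drive the actual WebSocket delivery layer instead of an in-memory queue, but the assertion is the same: per-call isolation and no lost guidance under load.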
Where this fits in a sales organization
Live coaching is a force multiplier for the middle of the team. Top reps rarely glance at it; new reps treat it as training wheels that shorten ramp time; the middle 60% close measurably more because the right response shows up at the right moment. Paired with post-call scoring, the loop closes: analysis finds the patterns, live coaching enforces them.
Common questions
What latency does live AI sales coaching need to be useful?
Under one second from spoken word to on-screen guidance. Beyond that, the moment has passed and the rep has already responded. Sub-second latency requires streaming transcription and careful model selection, not just a short prompt.
Does live AI coaching listen to both sides of the call?
It only needs the customer’s audio track. Many modern phone platforms can deliver each side of the call as a separate stream, which removes the need for speaker diarization entirely and cuts both latency and cost.
What happens if the AI service goes down mid-call?
A well-built system degrades gracefully. Ours falls back to deterministic stage-based guidance (prepared prompts keyed to where the rep is in the call), so the overlay never goes blank even with no AI available.
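A minimal sketch of that graceful degradation, under stated assumptions: `ask_model` stands in for whatever async client calls the language model, the 0.4-second timeout is an illustrative value and not a measured setting, and the `STAGE_FALLBACK` text is invented for the example. The idea is to give the model a fixed time budget and return the prepared stage prompt if it fails or runs late.

```python
import asyncio

# Hypothetical prepared prompts keyed by call stage (text is illustrative only).
STAGE_FALLBACK = {
    "opening": "Confirm the agenda and how much time they have.",
    "discovery": "Ask an open question about their current process.",
    "presentation": "Tie the next feature to a need they stated.",
    "close": "Propose a concrete next step and a date.",
}


async def stage_guidance(stage, transcript, ask_model, timeout_s=0.4):
    """Try the model within a time budget; fall back to the prepared stage prompt."""
    try:
        return await asyncio.wait_for(ask_model(stage, transcript), timeout=timeout_s)
    except Exception:
        # Covers timeouts, network errors, and provider outages alike:
        # the overlay should show something useful in every case.
        return STAGE_FALLBACK[stage]
```

Treating a slow response the same as an outage is deliberate in this sketch: guidance that arrives after the budget has expired is no more useful to the rep than guidance that never arrives.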