Building a chatbot takes a weekend. Building a voice agent that handles legal intake in real time took us months. The difference is not complexity for its own sake. The difference is that voice has to work reliably in the moment, with little room for error, for people who are often at the worst point in their lives.
Here are the six engineering problems that made it hard.
1. Latency Is Brutal
In chat, a user expects a pause. In voice, even a short delay feels broken.
With Alira, that matters because the caller is often already stressed. If the system pauses too long after "Can you tell me what happened?" it stops feeling like a capable intake assistant and starts feeling unreliable. The caller doesn't think "the system is processing." They think "this isn't working."
So voice forces you to optimize the entire chain: speech-to-text, reasoning and prompt logic, tool calls, booking checks, response generation, and text-to-speech playback. Every one of those steps adds latency, and they all happen sequentially. In chat, you can hide complexity behind a spinner. In voice, every extra second hurts trust and conversion.
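The chain above can be made concrete with a simple latency budget. This is an illustrative sketch, not Alira's real numbers: the stage timings and the ~1-second "feels broken" threshold are assumptions, chosen only to show how sequential stages add up.

```python
# Illustrative latency budget for the sequential voice pipeline.
# Stage names come from the text; millisecond figures are assumptions.
SEQUENTIAL_STAGES_MS = {
    "speech_to_text": 300,
    "reasoning_and_prompt_logic": 700,
    "tool_calls_and_booking_checks": 400,
    "response_generation": 500,
    "text_to_speech_playback_start": 300,
}

# Assumed perceptual threshold: roughly where a pause on a phone
# call stops feeling like thinking and starts feeling like a failure.
FEELS_BROKEN_MS = 1000

def total_latency_ms(stages: dict) -> int:
    """Sequential stages: their latencies simply add."""
    return sum(stages.values())

total = total_latency_ms(SEQUENTIAL_STAGES_MS)
print(f"end-to-end: {total} ms")  # 2200 ms, more than double the budget
print(f"over budget: {total > FEELS_BROKEN_MS}")
```

Even modest per-stage numbers blow past the threshold once they run back to back, which is why streaming each stage's output into the next, rather than waiting for it to finish, is where most of the optimization effort goes.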
We spent more time on latency optimization than on any other single problem.
2. Interruptions and Turn-Taking
Chat is orderly. One person sends a message, the other responds. Voice is messy.
People interrupt themselves, change direction, talk over the assistant, pause mid-thought, restart, or answer a question you haven't fully finished asking yet. That is especially hard for legal intake because callers often speak like this:
"Well, it started last week, actually no, maybe earlier, and I already talked to insurance, and my son was in the car."
The system has to decide in real time: Is the caller done? Should it wait? Should it clarify? Should it cut in? Should it move to the next intake question? Should it attempt a warm transfer now because urgency is rising?
In chat, you wait for the send button. In voice, you're constantly reading silence and deciding when to respond.
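The "reading silence" decision can be sketched as a toy end-of-turn detector. Real systems combine a voice-activity detector, prosody, and the language model itself; the word lists and millisecond thresholds here are illustrative assumptions, not Alira's logic.

```python
# Toy end-of-turn detector: decide what to do from the silence so far
# and the transcript so far. Thresholds and word lists are assumptions.
from dataclasses import dataclass

@dataclass
class TurnState:
    silence_ms: int   # how long the caller has been silent
    utterance: str    # transcript of the current turn so far

# Trailing connectives suggest the caller is mid-thought, not done.
TRAILING_OFF = ("and", "but", "so", "because", "actually", "um", "uh")

def next_action(state: TurnState) -> str:
    text = state.utterance.strip().lower()
    # Mid-thought pause: keep listening, but don't wait forever.
    if text.endswith(TRAILING_OFF):
        return "wait" if state.silence_ms < 2000 else "clarify"
    # A question back to us ends the caller's turn immediately.
    if text.endswith("?"):
        return "respond"
    # Plain statement: a shorter silence is enough to take the turn.
    return "respond" if state.silence_ms >= 700 else "wait"
```

So `next_action(TurnState(400, "it started last week and"))` holds the floor, while the same sentence after two and a half seconds of silence triggers a clarifying question instead of dead air.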
3. Real-World Audio Is Messy
Everyone asks about accents first. Accents matter, but they're a fraction of the real problem.
The broader issue is real-world variability: cell phone connections cutting in and out, speakerphone echo, background noise from traffic or kids or a television, emotional speech where someone's voice is shaking, people spelling names quickly, code-switching mid-sentence when they get upset.
For most voice applications, getting a word slightly wrong is a minor inconvenience. For legal intake, it's serious. Getting a phone number wrong, misspelling an opposing party's name, recording the wrong incident date: those mistakes follow a case for months.
The question was never "can the model transcribe speech?" It was "can this system reliably capture intake-grade information from imperfect audio, from a stressed caller, on a bad connection, at 9pm on a Tuesday?"
4. Bad Phrasing Is Instantly Obvious
Here's something I didn't fully appreciate until we were deep into building: what reads fine on a screen can sound terrible out loud. Bad phrasing in voice is exposed immediately.
A voice agent that sounds repetitive, too formal, too eager, or too scripted feels unnatural within seconds. For law firms, that's especially risky because callers may be anxious, embarrassed, angry, or in crisis. You cannot sound like a customer service bot when someone is calling about a restraining order.
So prompt design for voice is fundamentally different from chat. You're not just designing for correctness. You're designing for trust, calmness, pacing, empathy, concise phrasing, and smooth handoffs. We learned that the hard way, multiple times.
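One concrete guardrail we can sketch is a "speakability" check on generated replies before they reach text-to-speech. The heuristics below (sentence length, list markers, formal boilerplate) are illustrative assumptions, not a real linter we're describing:

```python
# Toy speakability check: flag reply text that reads fine on screen
# but would sound wrong spoken aloud. Heuristics are assumptions.
import re

TOO_FORMAL = ("pursuant to", "aforementioned", "please be advised")

def speakability_issues(reply: str) -> list:
    issues = []
    for sentence in re.split(r"(?<=[.!?])\s+", reply.strip()):
        # Long sentences can't be said in one natural breath.
        if len(sentence.split()) > 20:
            issues.append(f"too long to speak: {sentence[:40]}...")
    # Bullets and numbered lists cannot be spoken as written.
    if re.search(r"^\s*[-*\d]", reply, re.MULTILINE):
        issues.append("contains list formatting")
    for phrase in TOO_FORMAL:
        if phrase in reply.lower():
            issues.append(f"too formal for a phone call: '{phrase}'")
    return issues
```

A reply like "How can I help you today?" passes clean; "Please be advised that your consultation has been scheduled" gets flagged before a stressed caller ever hears it.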
5. Tool Use Mid-Conversation Is Risky
Alira does real work during the call: practice-area screening, conflict-aware intake, calendar lookup, consultation booking, urgency detection, live transfer, and follow-up workflows.
In chat, a failed tool call is annoying. The user waits a few seconds and maybe tries again. In voice, a failed tool call can derail the entire conversation in front of the caller.
If booking takes too long, the call loses momentum. If transfer logic is clunky, a warm handoff feels cold. If urgency detection is too aggressive, the experience becomes noisy and confusing. If the system asks too many structured questions before actually helping, it feels insensitive, like filling out a form instead of talking to a person.
The orchestration layer that coordinates all of these tools in real time, mid-conversation, matters far more in voice than it does in chat. One bad handoff and you've lost the caller's trust.
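The difference between a stalled call and a smooth one often comes down to giving every tool call a speech-friendly deadline with a graceful fallback line. This is a minimal sketch: the function names, the simulated calendar tools, and the 1.5-second budget are all illustrative assumptions.

```python
# Sketch: run a tool call under a hard time budget; if it blows the
# budget (or fails), speak a fallback that keeps the call moving
# instead of leaving the caller in silence.
import asyncio

FALLBACK = ("I'm pulling up the calendar now. While I do, "
            "what days generally work best for you?")

async def call_tool_with_budget(tool, budget_s: float = 1.5) -> str:
    try:
        return await asyncio.wait_for(tool(), timeout=budget_s)
    except Exception:
        # Never surface the failure to the caller; recover out loud.
        return FALLBACK

async def slow_calendar_lookup() -> str:
    await asyncio.sleep(5)  # simulated overloaded booking system
    return "Tuesday at 2pm is open."

async def fast_calendar_lookup() -> str:
    await asyncio.sleep(0.1)
    return "Tuesday at 2pm is open."
```

The design choice worth noting: the timeout isn't set by how long the booking system needs, it's set by how long a caller will tolerate silence. The tool can keep running in the background while the agent buys time with a genuinely useful question.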
6. Recovery Is the Invisible Skill
Once something is misunderstood in a voice conversation, there's no scroll-up, no retype, no undo. You have to recover gracefully without making the interaction feel mechanical. The agent needs to confirm only the most important facts, not over-confirm every single detail. It needs to know when to apologize and restate. When to escalate to a human. When to stop sounding "smart" and just be useful.
Good recovery is invisible. Nobody notices when it works. Everyone notices when it doesn't.
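"Confirm only the most important facts" can be sketched as a simple policy: tag each intake field as critical or not, and read back only critical fields the transcription wasn't confident about. The field names and the 0.85 threshold below are illustrative assumptions:

```python
# Sketch: selective confirmation. Only critical, low-confidence
# fields earn a read-back; everything else is accepted silently.
CRITICAL_FIELDS = {"callback_number", "opposing_party", "incident_date"}
CONFIDENCE_THRESHOLD = 0.85

def fields_to_confirm(captured: dict) -> list:
    """captured maps field name -> (value, transcription confidence)."""
    return [
        name for name, (value, confidence) in captured.items()
        if name in CRITICAL_FIELDS and confidence < CONFIDENCE_THRESHOLD
    ]

captured = {
    "callback_number": ("5552013344", 0.62),
    "opposing_party": ("Hendricks", 0.91),
    "incident_date": ("March 3rd", 0.70),
    "preferred_time": ("afternoons", 0.55),
}
print(fields_to_confirm(captured))  # ['callback_number', 'incident_date']
```

Here the agent confirms two facts and lets two slide: a confidently heard opposing-party name needs no read-back, and a shaky "afternoons" isn't worth interrupting a distressed caller over. That asymmetry is what keeps recovery invisible.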
The Real Challenge Is All Six at Once
Any one of these problems is solvable in isolation. Latency optimization is well-understood. Turn-taking algorithms exist. Audio processing keeps improving. Prompt engineering is a growing discipline.
The hard part is doing all six simultaneously, on every call, in real time, for callers who are stressed and need help now. That's what makes legal voice AI genuinely difficult.
Alira has to think, listen, and act in real time. That's why we build it the way we do.
Alira AI is an AI-powered client intake and triage platform for law firms. Learn more at getalira.com.