Feature - Siri-like Voice Assistant for AI Chat

Hey everyone! Today we are going to build something really cool — a Siri-like Voice Assistant for our AI chat!

Right now the user types messages into the AI chat. But what if they could just SPEAK? Like Siri on iPhone: a floating circle (orb) appears, listens to your voice, sends the command to the same bot we already have, and speaks the response back to you!

Best part? NO backend changes needed! Everything is done with free browser APIs: no API keys, no external services, nothing to install.

What we will cover:

  • Big Picture - How the whole thing works
  • What are the Browser Speech APIs?
  • The State Machine - 5 states of the orb
  • Wake Word Detection - "Hai Strakly"
  • The Custom Hook - useVoiceAssistant.ts
  • The UI Component - Siri Orb (VoiceAssistant/index.tsx)
  • Connecting to Existing Bot (chatService)
  • How AI Chat triggers Voice Mode
  • Complete Flow - Step by Step
  • Files Changed

Big Picture - How the Whole Thing Works

Let's see the complete flow from start to finish:

User clicks mic icon on floating chat
        |
        v
Floating chat closes --> Siri orb appears (circular, floating, center-bottom)
        |
        v
Orb enters IDLE state (listening for wake word, subtle pulse animation)
        |
        v
User says "hai strakly"
        |
        v
Orb enters LISTENING state (bright glow, waveform animation)
        |
        v
User speaks command: "show me today's attendance"
        |
        v
User stops speaking (silence detected)
        |
        v
Orb enters PROCESSING state (spinning animation)
Transcript sent to existing chatService.sendMessageStream()
        |
        v
Bot responds with text via SSE stream
        |
        v
Orb enters SPEAKING state (pulsing animation)
SpeechSynthesis reads response aloud
        |
        v
Done --> back to IDLE state (listening for wake word again)

Key insight: The bot doesn't know (and doesn't care) if the message came from typing or voice. It receives text, it responds with text. The voice part is 100% frontend — we convert voice to text, send text to same bot, get text response, convert text back to voice.

What are the Browser Speech APIs?

We use TWO free browser APIs — no external service, no API key, nothing to install:

API                 What it does                                            Direction
SpeechRecognition   Listens to your microphone and converts voice to text   Voice --> Text
SpeechSynthesis     Takes text and reads it aloud through speakers          Text --> Voice

Both are built into Chrome, Edge, Safari. They work on mobile too (Chrome Android). Firefox has limited support.

// Voice to Text - SpeechRecognition
const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)()
recognition.continuous = true        // Keep listening, don't stop after one phrase
recognition.interimResults = true    // Show partial results as user speaks
recognition.lang = "en-US"          // Language

recognition.onresult = (event) => {
    // event.results holds everything recognized so far.
    // Simplest case: grab the first result. (With continuous mode you
    // should loop over event.results - the full handler later in this
    // post does exactly that.)
    const transcript = event.results[0][0].transcript
    console.log("User said:", transcript)
}

recognition.start()   // Start listening


// Text to Voice - SpeechSynthesis
const utterance = new SpeechSynthesisUtterance("Hello, how can I help you?")
speechSynthesis.speak(utterance)   // Browser speaks this text aloud!

Q: Why webkitSpeechRecognition?

A: Most browsers still ship this API under the legacy webkit prefix (webkitSpeechRecognition); some newer ones also expose the unprefixed SpeechRecognition. We check both so it works everywhere:

window.SpeechRecognition || window.webkitSpeechRecognition
// This gives us whichever one the browser supports
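As a small sketch (the helper name is ours, not from the codebase), that prefix check can be wrapped in a tiny function that takes the window-like object as a parameter, which also makes it testable outside a browser:

```typescript
// Hypothetical helper: picks whichever constructor the browser exposes.
// Taking "win" as a parameter keeps it testable outside a browser.
type SpeechRecognitionCtor = new () => unknown

function getSpeechRecognitionCtor(
    win: Record<string, unknown>,
): SpeechRecognitionCtor | null {
    return (win.SpeechRecognition ?? win.webkitSpeechRecognition ?? null) as
        SpeechRecognitionCtor | null
}

// In the app you would call: getSpeechRecognitionCtor(window as any)
```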

The State Machine - 5 States of the Orb

The voice assistant has 5 states. Think of it like a traffic light — it's always in exactly ONE state at a time:

type VoiceState = "off" | "idle" | "listening" | "processing" | "speaking"

State        What's Happening                                                      Orb Animation              What User Sees
off          Voice assistant not active, orb is hidden                             -                          Nothing (normal floating chat visible)
idle         Mic is ON, listening continuously, waiting for wake word              Slow gentle pulse          "Say 'Hai Strakly'..."
listening    Wake word detected! Now capturing the actual command                  Bright glow + scale pulse  Live transcript (what user is saying)
processing   Command captured, sent to bot, waiting for response                   Spinning ring animation    "Thinking..."
speaking     Bot responded, reading the response aloud                             Rhythmic pulse             Bot response text

The flow goes like this:

off --[user clicks mic]--> idle --[says "hai strakly"]--> listening
                            ^                                  |
                            |                          [stops speaking]
                            |                                  |
                            |                                  v
                         speaking <--[bot responds]-- processing
                            |
                     [done speaking]
                            |
                            v
                          idle  (loop! ready for next command)

It's a loop! After speaking the response, it goes back to idle and waits for the wake word again. User can keep giving commands without touching anything.
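The loop above can be sketched as a pure transition function. This is just an illustration (the event names here are hypothetical, not from the real hook), but it shows how every state reacts to exactly one "advance" event:

```typescript
type VoiceState = "off" | "idle" | "listening" | "processing" | "speaking"
// Hypothetical event names, for illustration only:
type VoiceEvent = "MIC_CLICK" | "WAKE_WORD" | "SILENCE" | "BOT_DONE" | "SPEECH_DONE" | "CLOSE"

function transition(state: VoiceState, event: VoiceEvent): VoiceState {
    if (event === "CLOSE") return "off"   // the X button works from any state
    switch (state) {
        case "off":        return event === "MIC_CLICK"   ? "idle"       : state
        case "idle":       return event === "WAKE_WORD"   ? "listening"  : state
        case "listening":  return event === "SILENCE"     ? "processing" : state
        case "processing": return event === "BOT_DONE"    ? "speaking"   : state
        case "speaking":   return event === "SPEECH_DONE" ? "idle"       : state
    }
}
```

Events that don't apply to the current state are simply ignored, which is exactly why saying the wake word during processing does nothing.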

Wake Word Detection - "Hai Strakly"

This is the most interesting part. The orb is always listening in idle state, but it only activates when you say the magic words: "hai strakly" (or "hey strakly").

Q: How does it know the difference between random speech and the wake word?

A: Simple string matching! SpeechRecognition gives us text for everything the user says. We just check if it contains "hai strakly":

// In idle state, recognition fires for every phrase user says:

recognition.onresult = (event) => {
    const transcript = getTranscript(event)
    const lower = transcript.toLowerCase()

    if (state === "idle") {
        // Check for wake word
        if (lower.includes("hi strakly") ||
            lower.includes("hey strakly") ||
            lower.includes("hai strakly")) {

            // WAKE WORD DETECTED! Switch to listening mode
            setState("listening")
            clearTranscript()
            // Now everything user says will be captured as a command
        } else {
            // Random speech, ignore it
            // Maybe show the transcript briefly, then clear it
        }
    }

    if (state === "listening") {
        // User is giving a command, accumulate the text
        setTranscript(transcript)
    }
}
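Since recognition output is noisy (varying capitalization, stray double spaces), the wake-word check is worth factoring into one small helper. A sketch, with the variant list as an assumption you'd tune for what recognition actually returns:

```typescript
// Variant list is an assumption - extend it with whatever misheard
// spellings SpeechRecognition produces in practice.
const WAKE_WORDS = ["hai strakly", "hey strakly", "hi strakly"]

function containsWakeWord(transcript: string): boolean {
    // normalize: lowercase, collapse runs of whitespace
    const normalized = transcript.toLowerCase().replace(/\s+/g, " ").trim()
    return WAKE_WORDS.some((word) => normalized.includes(word))
}
```

The onresult handler then just calls `containsWakeWord(transcript)` instead of chaining `lower.includes(...)` three times.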

Q: But how do we know when the user STOPPED talking?

A: SpeechRecognition fires speechend event when it detects silence. We also use a silence timer — if no new words come for 2 seconds, we assume the command is done:

// When user stops speaking in "listening" state:
// 1. speechend event fires OR
// 2. Silence timer (2 seconds of no new results)

// Then: take the accumulated transcript and send it to the bot
if (state === "listening" && transcript.length > 0) {
    setState("processing")
    sendToBot(transcript)   // Same chatService.sendMessageStream()!
}

Edge Case - Chrome Stops Listening!

Important thing: Chrome's SpeechRecognition doesn't keep listening forever. After a stretch of silence (the exact timeout varies by browser and version) it stops itself and fires the onend event.

But we want to keep listening forever (for the wake word)! So we just restart it:

recognition.onend = () => {
    // Chrome stopped listening on its own
    if (state === "idle" || state === "listening") {
        // We still want to listen! Restart immediately
        recognition.start()
    }
    // If state is "processing" or "speaking", don't restart
    // (we don't need mic while bot is thinking/speaking)
}

The Custom Hook - useVoiceAssistant.ts

All the speech logic lives in ONE custom hook. This keeps the UI component clean — it just calls the hook and renders based on the state.

// useVoiceAssistant.ts

function useVoiceAssistant() {
    const [state, setState] = useState<VoiceState>("off")
    const [transcript, setTranscript] = useState("")
    const [botResponse, setBotResponse] = useState("")

    const recognitionRef = useRef<SpeechRecognition | null>(null)

    // Check if browser supports Speech APIs
    const isSupported = typeof window !== "undefined" &&
        !!(window.SpeechRecognition || window.webkitSpeechRecognition)

    // START - activate voice assistant
    const start = () => {
        if (!isSupported) return

        const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)()
        recognition.continuous = true
        recognition.interimResults = true
        recognition.lang = "en-US"

        recognition.onresult = handleResult   // Voice detected
        recognition.onend = handleEnd         // Chrome stopped listening
        recognition.onerror = handleError     // Mic permission denied, etc.

        recognitionRef.current = recognition
        recognition.start()
        setState("idle")
    }

    // STOP - deactivate completely
    const stop = () => {
        recognitionRef.current?.stop()
        speechSynthesis.cancel()   // Stop speaking if it was
        setState("off")
    }

    return {
        state,          // "off" | "idle" | "listening" | "processing" | "speaking"
        transcript,     // What user is saying right now
        botResponse,    // What bot said
        start,          // Call this to activate
        stop,           // Call this to deactivate
        isSupported,    // false if browser doesn't support
    }
}

Q: Why useRef for recognition?

A: Two reasons. We need the SAME recognition instance across re-renders, and changing it must not trigger a re-render (which a useState setter would). useRef gives us one mutable object that lives as long as the component. The same trick applies to the stateRef and transcriptRef you'll see below: the recognition callbacks are long-lived closures, so they read .current (kept in sync with state by an effect) instead of a value they captured at creation time.
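Here's a framework-free sketch of the stale-closure problem the refs solve. `createRef` just mimics the `{ current }` object shape that React's useRef returns; the point is that a long-lived callback reading through the ref sees fresh values, while one reading a captured variable stays stale:

```typescript
// Minimal stand-in for React's useRef return value.
function createRef<T>(initial: T) {
    return { current: initial }
}

type VoiceState = "off" | "idle" | "listening" | "processing" | "speaking"

const stateRef = createRef<VoiceState>("idle")

// A long-lived callback (like recognition.onresult) reads through the ref...
const readViaRef = () => stateRef.current
// ...while a plain captured variable is frozen at creation time:
const captured = stateRef.current
const readCaptured = () => captured

// State changes AFTER the callbacks were created:
stateRef.current = "listening"
```

After the change, `readViaRef()` returns "listening" but `readCaptured()` still returns "idle". That stale read is exactly the bug you'd hit if onresult checked a captured `state` variable directly.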

The onresult Handler - Heart of the Hook

This is where all the magic happens. Every time the user says something, this fires:

const handleResult = (event: SpeechRecognitionEvent) => {
    // Get the latest transcript from speech recognition
    let finalTranscript = ""
    let interimTranscript = ""

    for (let i = event.resultIndex; i < event.results.length; i++) {
        const result = event.results[i]
        if (result.isFinal) {
            finalTranscript += result[0].transcript
        } else {
            interimTranscript += result[0].transcript
        }
    }

    const text = finalTranscript || interimTranscript

    if (stateRef.current === "idle") {
        // IDLE: Looking for wake word only
        const lower = text.toLowerCase()
        if (lower.includes("hi strakly") ||
            lower.includes("hey strakly") ||
            lower.includes("hai strakly")) {
            // Found it! Switch to listening mode
            setState("listening")
            setTranscript("")
            resetSilenceTimer()
        }
    }
    else if (stateRef.current === "listening") {
        // LISTENING: Capturing the actual command
        setTranscript(text)
        resetSilenceTimer()   // User is still talking, reset timer
    }
}

Q: What's isFinal vs interim?

User says: "show me today's attendance"

As user speaks, SpeechRecognition fires MULTIPLE times:

  interim: "show"
  interim: "show me"
  interim: "show me today's"
  interim: "show me today's attend"
  FINAL:   "show me today's attendance"

interim = partial (still speaking, might change)
isFinal = done (this phrase is finalized, won't change)

We show interim for live feedback, but use final for the actual command.

Silence Timer - Detecting When User Stops

// When user is in "listening" state, we start a silence timer
// Every time new words come in, we RESET the timer
// If 2 seconds pass with NO new words → user is done speaking

const silenceTimerRef = useRef<NodeJS.Timeout>()

const resetSilenceTimer = () => {
    clearTimeout(silenceTimerRef.current)
    silenceTimerRef.current = setTimeout(() => {
        // 2 seconds of silence!
        if (stateRef.current === "listening" && transcriptRef.current) {
            // User is done. Send the command to bot.
            setState("processing")
            sendToBot(transcriptRef.current)
        }
    }, 2000)  // 2 seconds
}

Why 2 seconds? It's a good balance — long enough that a short pause between words won't trigger it, short enough that it feels responsive. Think about Siri — you stop talking, and after a brief pause it starts processing.

Connecting to the Existing Bot

This is the beautiful part — ZERO backend changes. We use the exact same chatService.sendMessageStream() that the typing chat uses:

// In the voice assistant hook:

const sendToBot = async (message: string) => {
    setState("processing")
    let fullResponse = ""

    await chatService.sendMessageStream(
        {
            message: message,           // The voice transcript as text
            conversation_id: convId,
            branch_id: branchId,
        },
        // onChunk - accumulate response text
        (chunk: string) => {
            fullResponse += chunk
            setBotResponse(fullResponse)   // Update UI as text streams in
        },
        // onDone - response complete, now SPEAK it!
        () => {
            speakResponse(fullResponse)   // Text -> Voice!
        },
        // onError - something went wrong
        (error: string) => {
            setState("idle")   // Go back to listening
        },
        abortController.signal,
        // onAction - theme/navigate actions still work!
        handleAction,
    )
}

Think about it:

TYPING FLOW:
  User types "go to dashboard"  -->  chatService.sendMessageStream()  -->  Bot responds

VOICE FLOW:
  User says "go to dashboard"
  SpeechRecognition converts to text: "go to dashboard"
  Send text to chatService.sendMessageStream()  -->  Bot responds
  SpeechSynthesis reads response aloud

SAME API CALL! Bot receives the same text either way.

Actions still work! If user says "change theme to dark", the bot returns a change_theme action. If user says "go to dashboard", the bot returns a navigate action. The handleAction callback processes them exactly the same way.

Speaking the Response - SpeechSynthesis

When the bot finishes responding, we convert the text to voice:

const speakResponse = (text: string) => {
    setState("speaking")

    // First, clean the text - remove markdown formatting
    // Bot might return: "**Today's attendance** is 45 clients. [See details](/attendance)"
    // We clean it to: "Today's attendance is 45 clients. See details"
    const cleanText = stripMarkdown(text)

    const utterance = new SpeechSynthesisUtterance(cleanText)
    utterance.lang = "en-US"
    utterance.rate = 1.0       // Normal speed
    utterance.pitch = 1.0      // Normal pitch

    utterance.onend = () => {
        // Done speaking! Go back to idle, listen for wake word again
        setState("idle")
        setTranscript("")
        setBotResponse("")
        // Recognition restarts automatically in idle
    }

    speechSynthesis.speak(utterance)
}

Q: Why strip markdown?

A: The bot responds with markdown formatting. If we speak it as-is, the synthesized voice might literally read out the asterisks ("asterisk asterisk Today's attendance asterisk asterisk"), or at best stumble over the pipes and brackets. That's terrible! So we clean it first:

const stripMarkdown = (text: string): string => {
    return text
        .replace(/\*\*(.*?)\*\*/g, "$1")           // **bold** -> bold
        .replace(/\*(.*?)\*/g, "$1")               // *italic* -> italic
        .replace(/\[([^\]]+)\]\([^)]+\)/g, "$1")   // [link text](url) -> link text
        .replace(/#{1,6}\s/g, "")                  // ### heading -> heading
        .replace(/`([^`]+)`/g, "$1")               // `code` -> code
        .replace(/\|[^\n]+\|/g, "")                // Tables -> remove
        .replace(/^[-*]\s+/gm, "")                 // - bullet -> remove the dash
                                                   //   (line start only, so hyphens
                                                   //   inside sentences survive)
        .replace(/\n{2,}/g, ". ")                  // Multiple newlines -> period
        .replace(/\n/g, " ")                       // Single newline -> space
        .trim()
}

// Example:
// Input:  "**Today's attendance** is 45 clients.\n\n| Name | Status |\n|---|---|\n| Ali | Present |"
// Output: "Today's attendance is 45 clients."
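One caveat worth knowing: Chrome's speechSynthesis has a long-standing habit of going silent partway through very long utterances. A common workaround is to split the response into sentence-sized chunks and queue one utterance per chunk. A sketch (the `splitForSpeech` name and the 180-character default are our assumptions, not part of the codebase):

```typescript
// Split text into speech-friendly chunks of roughly maxLen characters,
// breaking at sentence boundaries so the voice doesn't pause mid-word.
function splitForSpeech(text: string, maxLen = 180): string[] {
    const sentences = text.match(/[^.!?]+[.!?]*/g) ?? [text]
    const chunks: string[] = []
    let current = ""
    for (const sentence of sentences) {
        if ((current + sentence).length > maxLen && current) {
            chunks.push(current.trim())
            current = ""
        }
        current += sentence
    }
    if (current.trim()) chunks.push(current.trim())
    return chunks
}
```

In `speakResponse` you would then queue `speechSynthesis.speak(new SpeechSynthesisUtterance(chunk))` for each chunk, and attach `onend` (to go back to idle) only to the last one.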

The UI Component - Siri Orb

The orb is a fixed-position circle at the bottom-center of the screen. It changes animation based on the state:

Layout:
=========

                 [transcript / response text]

                        ┌──────┐
                        │      │
                        │  ◉   │  <-- 80px glowing circle
                        │      │
                        └──────┘

                       [X] Close

         ═══════════════════════════════
         fixed position, bottom of screen

The component just reads from the hook and renders accordingly:

function VoiceAssistant() {
    const { state, transcript, botResponse, start, stop, isSupported } = useVoiceAssistant()

    // Listen for activation event from AI Chat
    useEffect(() => {
        const activate = () => start()
        window.addEventListener("activate-voice-assistant", activate)
        return () => window.removeEventListener("activate-voice-assistant", activate)
    }, [])

    if (state === "off") return null   // Hidden when not active

    return (
        <div className="fixed inset-0 z-50 flex flex-col items-center justify-end pb-20">

            {/* Backdrop */}
            <div className="absolute inset-0 bg-black/40 backdrop-blur-sm" />

            {/* Transcript / Response text */}
            <div className="relative z-10 text-center mb-8 text-white">
                {state === "idle" && "Say 'Hai Strakly'..."}
                {state === "listening" && transcript}
                {state === "processing" && "Thinking..."}
                {state === "speaking" && botResponse}
            </div>

            {/* The Orb */}
            <div className={`
                relative z-10 w-20 h-20 rounded-full
                ${state === "idle" ? "animate-pulse bg-accent/50" : ""}
                ${state === "listening" ? "animate-voice-glow bg-accent" : ""}
                ${state === "processing" ? "bg-accent" : ""}
                ${state === "speaking" ? "animate-voice-speak bg-accent" : ""}
            `}>
                {/* Spinning ring for processing */}
                {state === "processing" && (
                    <div className="absolute inset-[-4px] border-2 border-t-accent
                         border-transparent rounded-full animate-spin" />
                )}

                {/* Mic icon in center */}
                <MicIcon className="absolute inset-0 m-auto w-8 h-8 text-white" />
            </div>

            {/* Close button - handleClose calls stop() and fires
                "deactivate-voice-assistant" so AIChat shows its bubble again */}
            <button onClick={handleClose} className="relative z-10 mt-6">
                <XIcon className="w-6 h-6 text-white/70" />
            </button>
        </div>
    )
}

Animations - Making It Look Like Siri

Each state has a different animation to give visual feedback:

State        Animation                                  How
idle         Slow gentle pulse (breathing effect)       Tailwind animate-pulse (built-in)
listening    Bright glow with expanding box-shadow      Custom @keyframes voice-glow
processing   Spinning ring around the orb               Tailwind animate-spin on border ring
speaking     Rhythmic scale pulse (like music beats)    Custom @keyframes voice-speak

/* Custom animations in Tailwind config or CSS */

@keyframes voice-glow {
    0%, 100% {
        box-shadow: 0 0 20px rgba(var(--accent), 0.4);
        transform: scale(1);
    }
    50% {
        box-shadow: 0 0 40px rgba(var(--accent), 0.8);
        transform: scale(1.05);
    }
}

@keyframes voice-speak {
    0%, 100% { transform: scale(1); }
    25% { transform: scale(1.08); }
    75% { transform: scale(0.95); }
}

How AI Chat Triggers Voice Mode

The user activates the voice assistant from the floating chat (ai-chat.tsx). We add a mic icon button:

// ai-chat.tsx changes:

1. Add mic icon button (next to the chat bubble)
2. When clicked:
   - Close the floating chat panel
   - Fire custom DOM event to activate voice assistant

const handleMicClick = () => {
    setIsOpen(false)
    window.dispatchEvent(new CustomEvent("activate-voice-assistant"))
}

Q: Why custom DOM events?

A: Because AIChat and VoiceAssistant are sibling components — they don't share state directly. We use the same pattern as theme change: fire a CustomEvent on window, the other component listens for it.

Component Tree:
================

<Layout>
  <Sidebar />
  <MainContent />
  <AIChat />              <-- fires "activate-voice-assistant" event
  <VoiceAssistant />      <-- listens for "activate-voice-assistant" event
</Layout>

They're siblings. No parent-child relationship.
Custom DOM events = easy communication between siblings!

When user clicks the X close button on the orb, VoiceAssistant fires "deactivate-voice-assistant" event, and AIChat shows the floating bubble again.

Complete Flow - One Real Example

Let's trace "hai strakly, show me today's attendance" from start to finish:

STEP 1: User clicks mic icon on floating chat
=========================================
ai-chat.tsx:
  setIsOpen(false)                                    // Close chat panel
  window.dispatchEvent("activate-voice-assistant")    // Tell VoiceAssistant to start

VoiceAssistant hears the event:
  start()                                             // Initialize SpeechRecognition
  setState("idle")                                    // Orb appears, slow pulse


STEP 2: User says random words (before wake word)
=========================================
User says: "hey what time is it"

SpeechRecognition fires onresult:
  transcript = "hey what time is it"
  state === "idle", check for wake word...
  Does "hey what time is it" contain "hai strakly"? NO
  --> Ignore. Stay in idle. Orb keeps pulsing.


STEP 3: User says "hai strakly, show me today's attendance"
=========================================
SpeechRecognition fires onresult:
  transcript = "hai strakly"
  state === "idle", check for wake word...
  Does "hai strakly" contain "hai strakly"? YES!
  --> setState("listening")     // Orb starts glowing bright!
  --> clearTranscript()
  --> startSilenceTimer()

SpeechRecognition fires again:
  transcript = "show me today's attendance"
  state === "listening"
  --> setTranscript("show me today's attendance")  // Show live on orb
  --> resetSilenceTimer()                           // User still talking

...2 seconds of silence...

Silence timer fires:
  --> setState("processing")    // Orb starts spinning
  --> sendToBot("show me today's attendance")


STEP 4: Bot receives the text message
=========================================
chatService.sendMessageStream({
    message: "show me today's attendance",
    conversation_id, branch_id
})

Bot AI processes it just like a typed message.
Bot might call get_attendance_summary() tool.
Bot responds: "Today's attendance: 45 clients checked in, 12 absent."


STEP 5: Response streams back via SSE
=========================================
onChunk fires multiple times:
  "Today's"
  "Today's attendance:"
  "Today's attendance: 45 clients"
  "Today's attendance: 45 clients checked in, 12 absent."

  --> setBotResponse(fullText)   // Show on orb as it streams

onDone fires:
  --> speakResponse("Today's attendance: 45 clients checked in, 12 absent.")


STEP 6: Browser speaks the response
=========================================
const utterance = new SpeechSynthesisUtterance(
    "Today's attendance: 45 clients checked in, 12 absent."
)
speechSynthesis.speak(utterance)

setState("speaking")    // Orb does rhythmic pulse

User HEARS: "Today's attendance: 45 clients checked in, 12 absent."

utterance.onend fires:
  --> setState("idle")          // Back to listening for wake word!
  --> Orb goes back to gentle pulse
  --> "Say 'Hai Strakly'..." text appears

DONE! Ready for next command. Loop continues.

Actions Work Too!

Remember our theme change and navigation actions? They work through voice too!

User says: "hai strakly, change theme to dark"

Voice --> Text: "change theme to dark"
Send to bot --> Bot calls change_theme tool
Bot returns: { action: { type: "change_theme", value: "dark" } }

handleAction receives: { type: "change_theme", value: "dark" }
--> Theme changes to dark mode!
--> Bot also responds with text: "Done! Switched to dark theme."
--> SpeechSynthesis speaks: "Done! Switched to dark theme."


User says: "hai strakly, go to dashboard"

Voice --> Text: "go to dashboard"
Send to bot --> Bot calls navigate_to_page tool
Bot returns: { action: { type: "navigate", value: "/dashboard" } }

handleAction receives: { type: "navigate", value: "/dashboard" }
--> App navigates to dashboard!
--> Bot also responds: "Taking you to the dashboard."
--> SpeechSynthesis speaks: "Taking you to the dashboard."

Everything we built before (theme, navigation, accent color) works through voice automatically! Because the bot receives the same text — it doesn't matter if it was typed or spoken.

Edge Cases We Handle

Edge Case                                            How We Handle It
Browser doesn't support SpeechRecognition            isSupported flag = false, mic button hidden
User denies microphone permission                    recognition.onerror fires with "not-allowed", show error message
Chrome stops listening after prolonged silence       recognition.onend fires, we restart it if still in idle/listening
User speaks too softly                               SpeechRecognition handles sensitivity internally, we show whatever it detects
Bot response has markdown formatting                 stripMarkdown() cleans it before speaking
User closes orb while bot is speaking                speechSynthesis.cancel() stops speaking immediately
User says "hai strakly" during processing/speaking   Ignored — we only check wake word in "idle" state
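The permission-denied row deserves a little code. SpeechRecognition reports errors as string codes defined by the Web Speech spec ("not-allowed", "no-speech", "network", etc.); a sketch mapping the common ones to user-facing messages (the message wording is ours):

```typescript
// Map Web Speech API error codes to messages we can show on the orb.
// The codes are from the spec; the friendly wording is our own.
function describeRecognitionError(code: string): string {
    switch (code) {
        case "not-allowed":
        case "service-not-allowed":
            return "Microphone permission denied. Please allow mic access and try again."
        case "no-speech":
            return "I didn't hear anything. Try speaking again."
        case "audio-capture":
            return "No microphone was found."
        case "network":
            return "Speech recognition needs a network connection."
        default:
            return "Something went wrong with speech recognition."
    }
}
```

The hook's onerror handler would call this with `event.error` and render the result instead of the transcript.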

Files Changed - Summary

File                                  New / Modified   What it Does
VoiceAssistant/useVoiceAssistant.ts   NEW              Custom hook: SpeechRecognition, wake word detection, silence timer, SpeechSynthesis, bot integration, state machine
VoiceAssistant/index.tsx              NEW              Siri orb UI: fixed-position circle, animations per state, transcript display, close button
Layouts/ai-chat.tsx                   MODIFIED         Added mic icon button, fires "activate-voice-assistant" event, hides when voice is active
Layouts/index.tsx                     MODIFIED         Added <VoiceAssistant /> alongside <AIChat />

Key Architecture Decisions

1. SEPARATION OF CONCERNS
   useVoiceAssistant.ts = ALL speech logic (hook)
   VoiceAssistant/index.tsx = ONLY UI (component)
   The hook doesn't know what the orb looks like.
   The component doesn't know how speech works.

2. CUSTOM DOM EVENTS for sibling communication
   Same pattern as theme change:
   "activate-voice-assistant"   (AIChat --> VoiceAssistant)
   "deactivate-voice-assistant" (VoiceAssistant --> AIChat)
   No prop drilling, no context, no Redux for this.

3. REUSE EXISTING BOT
   Zero backend changes. Same chatService.sendMessageStream().
   Voice is just a different INPUT method (mic instead of keyboard)
   and a different OUTPUT method (speaker instead of screen text).

4. BROWSER-NATIVE APIs
   No external service. No API key. No cost.
   SpeechRecognition + SpeechSynthesis = free, built into Chrome/Edge/Safari.

Summary - What We Built

BEFORE:
  User types message --> Bot responds with text on screen

AFTER:
  User SPEAKS message --> Bot responds with text on screen AND voice!

  With a cool Siri-like orb interface:
  - Always listening for "hai strakly" wake word
  - Shows live transcript
  - Spins while processing
  - Speaks the response aloud
  - Loops back to listening
  - All existing features (theme, navigate, etc.) work through voice

  Zero backend changes. Two browser APIs. Two new files.