Architecture - Voicebot

Design Philosophy

"A chatbot answers questions. A butler anticipates needs."

Voicebot is intentionally narrow in scope: it owns the audio pipeline and the conversational experience. Complex tasks are delegated to an external agent over a stdin/stdout protocol.

This separation exists because response latency matters. A voice bot that only handles conversation can respond in under 1 second; adding shell commands, file access, and calendar operations to that path slows it down significantly.

The Pipeline

Voicebot runs as a single binary using a streaming pipeline where every stage is connected by tokio channels:

Input: Microphone → Audio Capture (CPAL) → VAD (Silero)

Process: Whisper STT → LLM (mlx-lm/oMLX) → Sentence Splitter

Output: TTS (AVSpeech/Kokoro) → Audio Output (CPAL) → Speaker
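
The stage-per-channel layout can be sketched roughly as follows. This is a minimal illustration, not Voicebot's code: it uses std::sync::mpsc and OS threads as a stand-in for tokio channels so that it runs without dependencies, and the stage bodies and the name run_pipeline are hypothetical placeholders.

```rust
use std::sync::mpsc;
use std::thread;

/// Pushes `chunks` through a three-stage pipeline and returns what the
/// last stage emitted. Each stage owns one receiver and one sender.
/// Assumption: std channels stand in for the tokio channels used by Voicebot.
pub fn run_pipeline(chunks: Vec<String>) -> Vec<String> {
    let (audio_tx, audio_rx) = mpsc::channel::<String>();
    let (text_tx, text_rx) = mpsc::channel::<String>();
    let (reply_tx, reply_rx) = mpsc::channel::<String>();

    // Placeholder "STT" stage: trims each chunk of fake audio.
    let stt = thread::spawn(move || {
        for chunk in audio_rx {
            if text_tx.send(chunk.trim().to_string()).is_err() {
                break; // downstream stage is gone
            }
        }
        // text_tx is dropped here, which closes the next stage's input.
    });

    // Placeholder "LLM" stage: produces one reply per transcript.
    let llm = thread::spawn(move || {
        for text in text_rx {
            if reply_tx.send(format!("reply to: {}", text)).is_err() {
                break;
            }
        }
    });

    // Feed the input, then close the first channel to signal end of stream.
    for chunk in chunks {
        audio_tx.send(chunk).expect("STT stage should be running");
    }
    drop(audio_tx);

    // Iteration ends once the last sender has been dropped.
    let replies: Vec<String> = reply_rx.iter().collect();
    stt.join().expect("STT stage panicked");
    llm.join().expect("LLM stage panicked");
    replies
}
```

Dropping a sender is what propagates shutdown from stage to stage, so no explicit stop message is needed in this sketch.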

Latency Optimizations

STT→LLM Trick

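Accumulate partial Whisper transcripts and send the full text to the LLM once VAD signals end-of-speech. The LLM server keeps its KV-cache implicitly across requests, so the conversation prefix shared with earlier requests does not have to be processed again.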
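
The accumulate-then-flush step might look like the sketch below. It assumes that partial transcripts are non-overlapping segments that can be joined with spaces; TranscriptAccumulator and its methods are hypothetical names, not Voicebot's API.

```rust
/// Collects partial transcripts for a single utterance.
/// Assumption: each partial is a new segment, not a revised full hypothesis.
#[derive(Default)]
pub struct TranscriptAccumulator {
    parts: Vec<String>,
}

impl TranscriptAccumulator {
    pub fn new() -> Self {
        Self::default()
    }

    /// Stores one partial transcript; blank partials are ignored.
    pub fn push_partial(&mut self, partial: &str) {
        let trimmed = partial.trim();
        if !trimmed.is_empty() {
            self.parts.push(trimmed.to_string());
        }
    }

    /// Call when VAD reports end-of-speech. Returns the full utterance and
    /// resets the accumulator, or None if nothing was transcribed.
    pub fn end_of_speech(&mut self) -> Option<String> {
        if self.parts.is_empty() {
            return None;
        }
        let full = self.parts.join(" ");
        self.parts.clear();
        Some(full)
    }
}
```

The value returned by end_of_speech is what would be sent to the LLM as the user message.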

LLM→TTS Streaming

Buffer tokens until a punctuation mark (. ! ? ; :) arrives, then synthesize the buffered sentence immediately. While sentence N plays, sentence N+1 is being generated.
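
A minimal sketch of the punctuation-based buffering, assuming tokens arrive as plain string fragments. SentenceSplitter is a hypothetical name, and the sketch deliberately ignores cases such as decimals ("3.14") and abbreviations, which would be split too early.

```rust
/// Buffers streamed LLM tokens and emits a sentence at each boundary mark.
#[derive(Default)]
pub struct SentenceSplitter {
    buffer: String,
}

impl SentenceSplitter {
    /// Boundary characters taken from the list above.
    const BOUNDARIES: [char; 5] = ['.', '!', '?', ';', ':'];

    pub fn new() -> Self {
        Self::default()
    }

    /// Feeds one token and returns every sentence it completed.
    pub fn push(&mut self, token: &str) -> Vec<String> {
        let mut done = Vec::new();
        for ch in token.chars() {
            self.buffer.push(ch);
            if Self::BOUNDARIES.contains(&ch) {
                let sentence = self.buffer.trim().to_string();
                self.buffer.clear();
                // Skip punctuation-only fragments, e.g. the tail of "...".
                if sentence.chars().any(|c| c.is_alphanumeric()) {
                    done.push(sentence);
                }
            }
        }
        done
    }

    /// Flushes any remaining text when the LLM stream ends.
    pub fn finish(&mut self) -> Option<String> {
        let rest = self.buffer.trim().to_string();
        self.buffer.clear();
        if rest.is_empty() {
            None
        } else {
            Some(rest)
        }
    }
}
```

Each string returned by push can be handed to TTS right away, which is what lets playback of one sentence overlap with generation of the next.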

GPU/CPU Isolation

All heavy work runs in tokio::task::spawn_blocking threads so the async event loop stays unblocked.

System Architecture

Voicebot (Fast Layer)
STT → LLM (7B) → TTS
Barge-in, conversation awareness
Proactive suggestions (inference daemon)
Voice-local tools + MCP tool proxy
EYES: periodic screen capture + vision analysis
Complex tasks → delegate to AGENT

↓ delegates complex tasks via stdin/stdout

External Agent (Power Layer)
Full tool suite
File system, calendar, web, email
Long-running tasks
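
The hand-off between the two layers could be sketched as below. The wire format is not specified here, so this assumes a simple exchange of one task line for one reply line; send_task, spawn_agent, and delegate are hypothetical helpers, and the agent binary name is supplied by the caller.

```rust
use std::io::{self, BufRead, BufReader, Write};
use std::process::{Child, Command, Stdio};

/// Writes one task line to the agent and blocks until one reply line arrives.
/// Assumption: line-delimited framing, chosen only for illustration.
pub fn send_task<W: Write, R: BufRead>(
    to_agent: &mut W,
    from_agent: &mut R,
    task: &str,
) -> io::Result<String> {
    writeln!(to_agent, "{}", task)?;
    to_agent.flush()?;
    let mut reply = String::new();
    if from_agent.read_line(&mut reply)? == 0 {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "agent closed stdout",
        ));
    }
    Ok(reply.trim_end().to_string())
}

/// Spawns an agent process with piped stdin and stdout.
pub fn spawn_agent(program: &str) -> io::Result<Child> {
    Command::new(program)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()
}

/// Convenience wrapper: spawn the agent, send one task, read one reply.
pub fn delegate(program: &str, task: &str) -> io::Result<String> {
    let mut child = spawn_agent(program)?;
    let mut stdin = child.stdin.take().expect("stdin was piped");
    let mut stdout = BufReader::new(child.stdout.take().expect("stdout was piped"));
    let reply = send_task(&mut stdin, &mut stdout, task);
    drop(stdin); // closing stdin lets a well-behaved agent exit
    child.wait()?;
    reply
}
```

Keeping send_task generic over Write and BufRead means the framing can be exercised with in-memory buffers, without starting a real agent.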

Project Structure

src/
├── main.rs              # Pipeline orchestration, VAD loop
├── lib.rs               # Library exports
├── config.rs            # Environment-based configuration
├── daemon.rs            # InferenceDaemon — proactive suggestions
├── eyes.rs              # EYES visual awareness daemon
│
├── audio/               # CPAL capture, Silero VAD, playback
│   ├── audio_capture.rs
│   ├── output.rs
│   ├── speaker.rs       # Speaker verification (ONNX)
│   └── ambient_buffer.rs
│
├── stt/                 # whisper-cpp-plus wrapper
│   └── mod.rs           # Integrated STT+VAD
│
├── llm/                 # OpenAI-compatible SSE client
│   ├── client.rs
│   ├── session.rs       # Message history management
│   └── manager.rs
│
├── tts/                 # Text-to-speech engines
│   ├── avspeech.rs      # macOS AVSpeechSynthesizer
│   ├── kokoro.rs        # ONNX Kokoro (Linux)
│   └── sentence.rs      # Sentence boundary splitting
│
├── pipeline/            # Pipeline state machine
│   ├── fsm.rs           # State FSM (Idle/Listening/Thinking/Speaking)
│   └── frames.rs        # PipelineFrame types
│
├── tools/               # Built-in tool implementations
│   ├── mod.rs
│   └── [tool_name].rs
│
├── db/                  # SQLite persistence
│   └── migrations/
│
├── memory/              # Context consolidation
├── profile/             # User profile extraction
├── agents/              # Agent delegation (ACP protocol)
└── remote/              # WebSocket server for remote devices
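
The four states listed for pipeline/fsm.rs suggest a transition function along the following lines. Only the state names come from the tree above; the Event variants and the transition table, including the barge-in rule, are assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Idle,
    Listening,
    Thinking,
    Speaking,
}

/// Hypothetical events; the real frame types live in pipeline/frames.rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    SpeechStarted, // VAD detected voice
    SpeechEnded,   // VAD detected silence, transcript is ready
    FirstSentence, // first sentence is ready for TTS
    PlaybackDone,  // all queued audio has been played
}

/// Returns the next state. Events that make no sense in the current
/// state are ignored and leave the state unchanged.
pub fn next_state(state: State, event: Event) -> State {
    use Event::*;
    use State::*;
    match (state, event) {
        (Idle, SpeechStarted) => Listening,
        (Listening, SpeechEnded) => Thinking,
        (Thinking, FirstSentence) => Speaking,
        (Speaking, PlaybackDone) => Idle,
        // Barge-in: the user talks over the reply, so go back to listening.
        (Thinking, SpeechStarted) | (Speaking, SpeechStarted) => Listening,
        (unchanged, _) => unchanged,
    }
}
```

A pure function like this keeps the transition rules testable without any audio hardware.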

Key Modules

Audio Pipeline

CPAL handles cross-platform audio I/O. The pipeline uses a pre-roll buffer so when VAD detects speech, the preceding audio is included for complete transcription.

audio_capture.rs — Microphone input with device selection
audio_transform.rs — Rubato resampling to 16kHz
buffer.rs — Circular VecDeque for audio buffering
output.rs — Speaker playback
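
A pre-roll buffer built on a circular VecDeque might look like this. It is a sketch under assumptions: samples are f32, the capacity is chosen by the caller (for example 16,000 Hz × 0.3 s = 4,800 samples, where the 0.3 s is an illustrative figure), and PreRollBuffer is a hypothetical name.

```rust
use std::collections::VecDeque;

/// Keeps the most recent `capacity` samples so they can be prepended
/// to an utterance once VAD detects speech.
pub struct PreRollBuffer {
    samples: VecDeque<f32>,
    capacity: usize,
}

impl PreRollBuffer {
    pub fn new(capacity: usize) -> Self {
        Self {
            samples: VecDeque::with_capacity(capacity),
            capacity,
        }
    }

    /// Appends a capture frame, discarding the oldest samples on overflow.
    pub fn push_frame(&mut self, frame: &[f32]) {
        if self.capacity == 0 {
            return; // a zero-length pre-roll keeps nothing
        }
        for &sample in frame {
            if self.samples.len() == self.capacity {
                self.samples.pop_front();
            }
            self.samples.push_back(sample);
        }
    }

    /// Takes the buffered audio, oldest sample first, and empties the buffer.
    pub fn drain(&mut self) -> Vec<f32> {
        self.samples.drain(..).collect()
    }
}
```

When VAD fires, the caller would drain the buffer and place its contents in front of the live audio before handing the utterance to STT.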

Speech-to-Text

whisper-cpp-plus provides true streaming Whisper.cpp bindings with integrated VAD support. It uses the Metal GPU on macOS, with an optional CoreML Neural Engine fallback.

State cached across utterances for faster repeated recognition
Language hinting based on VOICEBOT_LANGUAGE setting
Accumulates partial transcripts until VAD silence threshold

LLM Integration

OpenAI-compatible streaming client with session management and automatic context consolidation.

client.rs — SSE streaming to /v1/chat/completions
session.rs — Message history with summarization
Auto-consolidation at configurable context threshold (default 90%)
Idle consolidation after inactivity (default 30 min)
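
The two consolidation triggers can be expressed as a small policy check. The defaults (90% and 30 minutes) come from the list above; ConsolidationPolicy, Trigger, and the idea of measuring usage as used tokens divided by the context window are assumptions of this sketch.

```rust
use std::time::Duration;

/// Thresholds that decide when the message history is consolidated.
pub struct ConsolidationPolicy {
    pub context_threshold: f64, // fraction of the context window in use
    pub idle_timeout: Duration,
}

impl Default for ConsolidationPolicy {
    fn default() -> Self {
        Self {
            context_threshold: 0.90,
            idle_timeout: Duration::from_secs(30 * 60),
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Trigger {
    NotNeeded,
    ContextFull,
    Idle,
}

impl ConsolidationPolicy {
    /// Checks the context threshold first, then the idle timeout.
    pub fn check(&self, used_tokens: usize, context_window: usize, idle_for: Duration) -> Trigger {
        let usage = if context_window == 0 {
            0.0
        } else {
            used_tokens as f64 / context_window as f64
        };
        if usage >= self.context_threshold {
            Trigger::ContextFull
        } else if idle_for >= self.idle_timeout {
            Trigger::Idle
        } else {
            Trigger::NotNeeded
        }
    }
}
```

With an 8,192-token window, for example, the default threshold is crossed at 7,373 tokens, since 0.90 × 8,192 = 7,372.8.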

Text-to-Speech

Sentence-by-sentence synthesis with overlapping playback for natural conversation flow.

AVSpeech — macOS native, zero setup
Kokoro — High-quality ONNX, requires --features kokoro
sentence.rs — Buffers tokens, emits on punctuation boundaries

Memory & Context

Persistent storage via SQLite with intelligent consolidation.

Active consolidation: Announces memory reorganization, extracts profile facts and memories, summarizes old turns
Silent consolidation: Runs transparently during idle periods
Memories and profile persist across sessions

Tools & Integrations

Extensible tool system with built-in and dynamic MCP support.

Built-in: time, files, clipboard, screenshots, apps, shell, web search
MCP: Dynamically registers tools from any MCP stdio server
Agent delegation: Complex tasks via stdin/stdout subprocess
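
One way to hold built-in and dynamically registered tools behind a single lookup is a name-to-closure registry, sketched below. ToolRegistry, the string-in/string-out signature, and the "mcp." name prefix in the usage note are hypothetical; MCP discovery itself is not shown.

```rust
use std::collections::HashMap;

/// A tool takes its arguments as a string and returns a result or an error.
/// Assumption: real tools would likely use structured (JSON) arguments.
type ToolFn = Box<dyn Fn(&str) -> Result<String, String> + Send + Sync>;

#[derive(Default)]
pub struct ToolRegistry {
    tools: HashMap<String, ToolFn>,
}

impl ToolRegistry {
    pub fn new() -> Self {
        Self::default()
    }

    /// Registers a tool at runtime; a later registration replaces an earlier
    /// one that used the same name.
    pub fn register<F>(&mut self, name: &str, tool: F)
    where
        F: Fn(&str) -> Result<String, String> + Send + Sync + 'static,
    {
        self.tools.insert(name.to_string(), Box::new(tool));
    }

    /// Looks up a tool by name and runs it.
    pub fn call(&self, name: &str, args: &str) -> Result<String, String> {
        match self.tools.get(name) {
            Some(tool) => tool(args),
            None => Err(format!("unknown tool: {}", name)),
        }
    }

    /// Returns the registered tool names in sorted order.
    pub fn names(&self) -> Vec<&str> {
        let mut names: Vec<&str> = self.tools.keys().map(|k| k.as_str()).collect();
        names.sort();
        names
    }
}
```

A built-in could be registered once at startup, while a tool discovered from an MCP server could be added later under a prefixed name such as "mcp.upper" with the same register call.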

Conversation Modes

Active Mode (default)

Responds to all detected speech. Best for dedicated interaction sessions.

Ambient Mode (wake-word triggered)

Responds only after wake-word detection (default: "jarvis"). Auto-switches when a non-enrolled speaker is detected.
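
The difference between the two modes can be illustrated with a gate applied to each transcript. This sketch assumes the wake word is matched as a case-insensitive prefix of the transcribed text; Voicebot may detect it differently, and Mode and gate_utterance are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Active,
    Ambient,
}

/// Returns the text to act on, or None if the utterance should be ignored.
/// Assumption: the wake word is ASCII and must start the utterance.
pub fn gate_utterance<'a>(mode: Mode, wake_word: &str, transcript: &'a str) -> Option<&'a str> {
    let text = transcript.trim();
    if text.is_empty() {
        return None;
    }
    match mode {
        // Active mode responds to all detected speech.
        Mode::Active => Some(text),
        Mode::Ambient => {
            if wake_word.is_empty() {
                return None;
            }
            // `get` returns None if the text is too short or the cut would
            // fall inside a multi-byte character.
            let head = text.get(..wake_word.len())?;
            if !head.eq_ignore_ascii_case(wake_word) {
                return None;
            }
            let rest = &text[wake_word.len()..];
            // Reject longer words that merely start with the wake word.
            if rest.chars().next().map_or(false, |c| c.is_alphanumeric()) {
                return None;
            }
            let command =
                rest.trim_start_matches(|c: char| c.is_whitespace() || c.is_ascii_punctuation());
            if command.is_empty() {
                None
            } else {
                Some(command)
            }
        }
    }
}
```

In this sketch a bare wake word with no command after it is ignored; a real implementation might instead switch to listening for the command that follows.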

Control API

Enable with CONTROL_PORT=9001 cargo run --features control

GET   /control/events    SSE stream of live pipeline events
GET   /control/state     Current pipeline state (listening/thinking/speaking/idle)
GET   /control/history   Full conversation message history
POST  /control/mute      Mute/unmute TTS (body: {"muted": true|false})
POST  /control/barge_in  Interrupt current TTS playback
POST  /control/input     Inject text as user input (body: {"text": "..."})
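
A dependency-free client for these endpoints could be sketched as follows. The paths and JSON bodies are taken from the list above; the host 127.0.0.1, the helper names build_request and send, and the choice to return the raw response are assumptions, since the response format is not documented here.

```rust
use std::io::{self, Read, Write};
use std::net::TcpStream;

/// Builds a minimal HTTP/1.1 request for a control endpoint.
pub fn build_request(host: &str, method: &str, path: &str, json_body: Option<&str>) -> String {
    let mut req = format!(
        "{} {} HTTP/1.1\r\nHost: {}\r\nConnection: close\r\n",
        method, path, host
    );
    if let Some(body) = json_body {
        req.push_str("Content-Type: application/json\r\n");
        req.push_str(&format!("Content-Length: {}\r\n", body.len()));
        req.push_str("\r\n");
        req.push_str(body);
    } else {
        req.push_str("\r\n");
    }
    req
}

/// Sends the request and returns the raw response (status line, headers, body).
/// Not suitable for /control/events, whose SSE stream does not end.
pub fn send(addr: &str, request: &str) -> io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    stream.write_all(request.as_bytes())?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}
```

With the server started using CONTROL_PORT=9001, muting TTS would then be send("127.0.0.1:9001", &build_request("127.0.0.1:9001", "POST", "/control/mute", Some(r#"{"muted": true}"#))).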
