Design Philosophy
"A chatbot answers questions. A butler anticipates needs."
Voicebot is intentionally narrow in scope: it owns the audio pipeline and conversational experience. Complex tasks are delegated to an external agent via stdin/stdout protocol.
This separation exists because response latency matters. A voice bot that handles only conversation can respond in under 1 second; folding shell commands, file access, and calendar operations into the same loop slows it down significantly.
The Pipeline
Voicebot runs as a single binary built around a streaming pipeline in which every stage is connected to the next by tokio channels.
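To make the channel-connected layout concrete, here is a minimal sketch of two stages handing frames to each other. It deliberately swaps in std::sync::mpsc and OS threads for tokio channels and tasks so that it compiles without dependencies; the Frame type, the run_pipeline function, and the word-splitting "LLM" stage are hypothetical stand-ins, not Voicebot's actual types.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical frame type passed between stages (not Voicebot's PipelineFrame).
#[derive(Debug, Clone, PartialEq)]
pub enum Frame {
    Transcript(String),
    Token(String),
    Done,
}

/// Runs a two-stage pipeline and collects everything the last stage emits.
/// Stage 1 stands in for STT, stage 2 for the LLM.
pub fn run_pipeline(utterances: Vec<String>) -> Vec<Frame> {
    let (stt_tx, stt_rx) = mpsc::channel::<Frame>();
    let (llm_tx, llm_rx) = mpsc::channel::<Frame>();

    // Stage 1: emit one transcript frame per utterance, then signal completion.
    let stt = thread::spawn(move || {
        for utterance in utterances {
            stt_tx.send(Frame::Transcript(utterance)).unwrap();
        }
        stt_tx.send(Frame::Done).unwrap();
    });

    // Stage 2: split each transcript into word "tokens" as soon as it arrives
    // and forward any other frame unchanged.
    let llm = thread::spawn(move || {
        for frame in stt_rx {
            match frame {
                Frame::Transcript(text) => {
                    for word in text.split_whitespace() {
                        llm_tx.send(Frame::Token(word.to_string())).unwrap();
                    }
                }
                other => llm_tx.send(other).unwrap(),
            }
        }
    });

    // The receiving iterator ends once stage 2 drops its sender.
    let out: Vec<Frame> = llm_rx.iter().collect();
    stt.join().unwrap();
    llm.join().unwrap();
    out
}
```

Because each stage only sees its own receiver and sender, a stage can be replaced or tested in isolation, which is the main appeal of this layout.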
Latency Optimizations
STT→LLM Trick
Accumulate partial Whisper transcripts; send full text when VAD signals end-of-speech. The LLM server maintains KV-cache implicitly across requests.
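The accumulate-then-send step could look roughly like the sketch below. SttEvent and TranscriptAccumulator are hypothetical names, and the code assumes each partial transcript is a new segment rather than a revised hypothesis of the whole utterance.

```rust
/// Hypothetical events produced by the STT/VAD stage.
pub enum SttEvent {
    /// A newly transcribed segment of the current utterance.
    Partial(String),
    /// VAD reported end-of-speech.
    EndOfSpeech,
}

/// Collects partial transcripts until the utterance is complete.
#[derive(Default)]
pub struct TranscriptAccumulator {
    parts: Vec<String>,
}

impl TranscriptAccumulator {
    /// Returns the full text only when end-of-speech arrives and
    /// at least one non-empty partial has been collected.
    pub fn on_event(&mut self, event: SttEvent) -> Option<String> {
        match event {
            SttEvent::Partial(text) => {
                let trimmed = text.trim();
                if !trimmed.is_empty() {
                    self.parts.push(trimmed.to_string());
                }
                None
            }
            SttEvent::EndOfSpeech => {
                if self.parts.is_empty() {
                    return None;
                }
                let full = self.parts.join(" ");
                self.parts.clear();
                Some(full)
            }
        }
    }
}
```

Returning None for an empty utterance keeps a stray VAD trigger from sending an empty prompt to the LLM.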
LLM→TTS Streaming
Buffer tokens until punctuation (. ! ? ; :), synthesize immediately. While sentence N plays, sentence N+1 is being generated.
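A minimal sketch of this buffering, assuming tokens arrive as plain string fragments. SentenceBuffer is a hypothetical name and not necessarily how sentence.rs is written; the boundary set is the one listed above.

```rust
/// Buffers streamed LLM tokens and emits complete sentences.
#[derive(Default)]
pub struct SentenceBuffer {
    pending: String,
}

impl SentenceBuffer {
    /// Characters that end a sentence, as listed in the text above.
    const BOUNDARIES: [char; 5] = ['.', '!', '?', ';', ':'];

    /// Appends a token and returns every sentence it completes.
    pub fn push(&mut self, token: &str) -> Vec<String> {
        let mut out = Vec::new();
        for ch in token.chars() {
            self.pending.push(ch);
            if Self::BOUNDARIES.contains(&ch) {
                let sentence = self.pending.trim().to_string();
                self.pending.clear();
                if !sentence.is_empty() {
                    out.push(sentence);
                }
            }
        }
        out
    }

    /// Returns any trailing text once the token stream has ended.
    pub fn flush(&mut self) -> Option<String> {
        let rest = self.pending.trim().to_string();
        self.pending.clear();
        if rest.is_empty() {
            None
        } else {
            Some(rest)
        }
    }
}
```

This naive splitter also breaks on decimals such as "3.14" and on abbreviations, so a production version would likely need extra rules for those cases.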
GPU/CPU Isolation
All heavy work runs in tokio::task::spawn_blocking threads so the async event loop stays unblocked.
System Architecture
Voicebot (Fast Layer)
External Agent (Power Layer)
Project Structure
src/
├── main.rs # Pipeline orchestration, VAD loop
├── lib.rs # Library exports
├── config.rs # Environment-based configuration
├── daemon.rs # InferenceDaemon — proactive suggestions
├── eyes.rs # EYES visual awareness daemon
│
├── audio/ # CPAL capture, Silero VAD, playback
│ ├── audio_capture.rs
│ ├── output.rs
│ ├── speaker.rs # Speaker verification (ONNX)
│ └── ambient_buffer.rs
│
├── stt/ # Whisper-cpp-plus wrapper
│ └── mod.rs # Integrated STT+VAD
│
├── llm/ # OpenAI-compatible SSE client
│ ├── client.rs
│ ├── session.rs # Message history management
│ └── manager.rs
│
├── tts/ # Text-to-speech engines
│ ├── avspeech.rs # macOS AVSpeechSynthesizer
│ ├── kokoro.rs # ONNX Kokoro (Linux)
│ └── sentence.rs # Sentence boundary splitting
│
├── pipeline/ # Pipeline state machine
│ ├── fsm.rs # State FSM (Idle/Listening/Thinking/Speaking)
│ └── frames.rs # PipelineFrame types
│
├── tools/ # Built-in tool implementations
│ ├── mod.rs
│ └── [tool_name].rs
│
├── db/ # SQLite persistence
│ └── migrations/
│
├── memory/ # Context consolidation
├── profile/ # User profile extraction
├── agents/ # Agent delegation (ACP protocol)
└── remote/ # WebSocket server for remote devices
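The four states named for fsm.rs in the tree above suggest a small transition table. The sketch below is one plausible shape: the State variants come from the tree, while the Event names and the specific transitions (including barge-in returning to Listening) are assumptions made for illustration.

```rust
/// Pipeline states, as named in the project tree.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Idle,
    Listening,
    Thinking,
    Speaking,
}

/// Hypothetical events that drive the state machine.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    SpeechStarted,
    SpeechEnded,
    FirstAudioReady,
    PlaybackFinished,
    BargeIn,
}

/// Returns the next state; events that do not apply leave the state unchanged.
pub fn next_state(state: State, event: Event) -> State {
    use Event::*;
    use State::*;
    match (state, event) {
        (Idle, SpeechStarted) => Listening,
        (Listening, SpeechEnded) => Thinking,
        (Thinking, FirstAudioReady) => Speaking,
        (Speaking, PlaybackFinished) => Idle,
        // Assumption: interrupting a response returns control to the user.
        (Thinking, BargeIn) | (Speaking, BargeIn) => Listening,
        (unchanged, _) => unchanged,
    }
}
```

Keeping the transitions in one pure function makes every state change easy to unit test and to log.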
Key Modules
Audio Pipeline
CPAL handles cross-platform audio I/O. The pipeline uses a pre-roll buffer so when VAD detects speech, the preceding audio is included for complete transcription.
- audio_capture.rs: Microphone input with device selection
- audio_transform.rs: Rubato resampling to 16kHz
- buffer.rs: Circular VecDeque for audio buffering
- output.rs: Speaker playback
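The pre-roll idea can be sketched as a fixed-capacity ring buffer over a VecDeque. PreRollBuffer and its capacity are hypothetical; at 16 kHz mono, a capacity of 8,000 samples would hold the most recent 500 ms of audio.

```rust
use std::collections::VecDeque;

/// Keeps only the most recent `capacity` samples.
pub struct PreRollBuffer {
    samples: VecDeque<f32>,
    capacity: usize,
}

impl PreRollBuffer {
    pub fn new(capacity: usize) -> Self {
        Self {
            samples: VecDeque::with_capacity(capacity),
            capacity,
        }
    }

    /// Appends a chunk, discarding the oldest samples once the buffer is full.
    pub fn push(&mut self, chunk: &[f32]) {
        if self.capacity == 0 {
            return;
        }
        for &sample in chunk {
            if self.samples.len() == self.capacity {
                self.samples.pop_front();
            }
            self.samples.push_back(sample);
        }
    }

    /// Called when VAD reports speech start: returns the buffered audio,
    /// oldest sample first, and leaves the buffer empty.
    pub fn take(&mut self) -> Vec<f32> {
        self.samples.drain(..).collect()
    }
}
```

The audio returned by take() would be prepended to the live capture stream, so the first syllable spoken before VAD triggers is not lost.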
Speech-to-Text
whisper-cpp-plus provides true streaming Whisper.cpp bindings with integrated VAD support. Uses Metal GPU on macOS, with optional CoreML Neural Engine fallback.
- State cached across utterances for faster repeated recognition
- Language hinting based on the VOICEBOT_LANGUAGE setting
- Accumulates partial transcripts until the VAD silence threshold
LLM Integration
OpenAI-compatible streaming client with session management and automatic context consolidation.
- client.rs: SSE streaming to /v1/chat/completions
- session.rs: Message history with summarization
- Auto-consolidation at a configurable context threshold (default 90%)
- Idle consolidation after inactivity (default 30 min)
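The two consolidation triggers can be expressed as a small policy check. This is a sketch under assumptions: ConsolidationPolicy, Trigger, and the token-count inputs are hypothetical, and only the defaults (90% of the context window, 30 minutes idle) come from the list above.

```rust
use std::time::Duration;

/// Hypothetical policy holding the two documented defaults.
pub struct ConsolidationPolicy {
    /// Fraction of the context window that triggers consolidation.
    pub context_threshold: f64,
    /// Inactivity period that triggers consolidation.
    pub idle_after: Duration,
}

impl Default for ConsolidationPolicy {
    fn default() -> Self {
        Self {
            context_threshold: 0.90,
            idle_after: Duration::from_secs(30 * 60),
        }
    }
}

#[derive(Debug, PartialEq)]
pub enum Trigger {
    ContextFull,
    Idle,
}

impl ConsolidationPolicy {
    /// Returns which trigger fired, if any. Context pressure takes priority
    /// over idleness because it blocks the next request.
    pub fn check(
        &self,
        used_tokens: usize,
        context_window: usize,
        idle_for: Duration,
    ) -> Option<Trigger> {
        let usage = if context_window == 0 {
            0.0
        } else {
            used_tokens as f64 / context_window as f64
        };
        if usage >= self.context_threshold {
            Some(Trigger::ContextFull)
        } else if idle_for >= self.idle_after {
            Some(Trigger::Idle)
        } else {
            None
        }
    }
}
```

With an 8,192-token window, for example, the context trigger fires at 7,373 used tokens, since 90% of 8,192 is 7,372.8.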
Text-to-Speech
Sentence-by-sentence synthesis with overlapping playback for natural conversation flow.
- AVSpeech: macOS native, zero setup
- Kokoro: High-quality ONNX, requires --features kokoro
- sentence.rs: Buffers tokens, emits on punctuation boundaries
Memory & Context
Persistent storage via SQLite with intelligent consolidation.
- Active consolidation: Announces memory reorganization, extracts profile facts and memories, summarizes old turns
- Silent consolidation: Runs transparently during idle periods
- Memories and profile persist across sessions
Tools & Integrations
Extensible tool system with built-in and dynamic MCP support.
- Built-in: time, files, clipboard, screenshots, apps, shell, web search
- MCP: Dynamically registers tools from any MCP stdio server
- Agent delegation: Complex tasks via stdin/stdout subprocess
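Delegation over stdin/stdout can be sketched with std::process alone. The delegate function and the one-line-in, one-line-out framing are assumptions for illustration; the actual agent protocol is not specified here and may frame messages differently.

```rust
use std::io::{BufRead, BufReader, Write};
use std::process::{Command, Stdio};

/// Sends one request line to a child process and reads one reply line back.
/// Hypothetical helper: the line-based framing is an assumption.
pub fn delegate(program: &str, args: &[&str], request: &str) -> std::io::Result<String> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;

    // Write the request, then drop the handle so the child sees EOF on stdin.
    {
        let mut stdin = child.stdin.take().expect("stdin was piped");
        writeln!(stdin, "{}", request)?;
    }

    // Read a single reply line from the child's stdout.
    let stdout = child.stdout.take().expect("stdout was piped");
    let mut reply = String::new();
    BufReader::new(stdout).read_line(&mut reply)?;

    // Reap the child so it does not linger as a zombie process.
    child.wait()?;
    Ok(reply.trim_end().to_string())
}
```

On a Unix-like system the function can be tried with cat as a stand-in agent, since cat echoes its input: delegate("cat", &[], "ping") returns "ping".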
Conversation Modes
Active Mode
Default
Responds to all detected speech. Best for dedicated interaction sessions.
Ambient Mode
Wake-word triggered
Responds only after the wake word is detected (default: "jarvis"). Auto-switches when a non-enrolled speaker is detected.
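Wake-word gating on a finished transcript might be sketched as follows. The function name is hypothetical, and matching on transcript text is an assumption; the real detector may work on audio or use fuzzier matching.

```rust
/// Returns the words following the wake word, or None if the wake word
/// does not appear. Matching ignores ASCII case and surrounding punctuation.
pub fn command_after_wake_word(transcript: &str, wake_word: &str) -> Option<String> {
    let mut words = transcript.split_whitespace();
    while let Some(word) = words.next() {
        // Strip punctuation such as the comma in "Jarvis,".
        let bare = word.trim_matches(|c: char| !c.is_alphanumeric());
        if bare.eq_ignore_ascii_case(wake_word) {
            // Everything after the wake word is treated as the command.
            return Some(words.collect::<Vec<_>>().join(" "));
        }
    }
    None
}
```

A transcript without the wake word yields None, so in ambient mode the pipeline can drop it without calling the LLM.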
Control API
Enable with CONTROL_PORT=9001 cargo run --features control
- /control/events: SSE stream of live pipeline events
- /control/state: Current pipeline state (listening/thinking/speaking/idle)
- /control/history: Full conversation message history
- /control/mute: Mute/unmute TTS (body: {"muted": true|false})
- /control/barge_in: Interrupt current TTS playback
- /control/input: Inject text as user input (body: {"text": "..."})