Design Philosophy

"A chatbot answers questions. A butler anticipates needs."

Voicebot is intentionally narrow in scope: it owns the audio pipeline and the conversational experience. Complex tasks are delegated to an external agent over a stdin/stdout protocol.

This separation exists because response latency matters. A voice bot that only handles conversation responds in under 1 second. Adding shell commands, file access, and calendar operations slows it down significantly.

The Pipeline

Voicebot runs as a single binary using a streaming pipeline where every stage is connected by tokio channels:

Input: Microphone → Audio Capture (CPAL) → VAD (Silero)
Process: Whisper STT → LLM (mlx-lm/oMLX) → Sentence Splitter
Output: TTS (AVSpeech/Kokoro) → Audio Output (CPAL) → Speaker
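
The channel wiring between stages can be illustrated with a small sketch. It uses std::sync::mpsc and OS threads instead of tokio channels so that it compiles without dependencies; the Frame enum, run_pipeline, and the echo "LLM" stage are hypothetical stand-ins, not Voicebot's actual types.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical frame type; the real PipelineFrame in pipeline/frames.rs may differ.
#[derive(Debug, Clone, PartialEq)]
pub enum Frame {
    Transcript(String),
    Sentence(String),
}

/// Connects two stages with channels and returns everything the last stage emitted.
pub fn run_pipeline(transcripts: Vec<String>) -> Vec<Frame> {
    let (stt_tx, stt_rx) = mpsc::channel::<Frame>();
    let (tts_tx, tts_rx) = mpsc::channel::<Frame>();

    // Stage 1: stands in for STT and forwards finished transcripts.
    let stt = thread::spawn(move || {
        for text in transcripts {
            if stt_tx.send(Frame::Transcript(text)).is_err() {
                break; // downstream stage has gone away
            }
        }
        // stt_tx is dropped here, which closes the channel and ends stage 2.
    });

    // Stage 2: stands in for the LLM and turns each transcript into one sentence.
    let llm = thread::spawn(move || {
        for frame in stt_rx {
            if let Frame::Transcript(text) = frame {
                let reply = format!("You said: {text}.");
                if tts_tx.send(Frame::Sentence(reply)).is_err() {
                    break;
                }
            }
        }
    });

    stt.join().expect("stt stage panicked");
    llm.join().expect("llm stage panicked");

    // Both senders are dropped by now, so this iterator terminates.
    tts_rx.iter().collect()
}
```

Each stage owns only its receiving and sending ends, so a stage shuts down on its own once the stage before it closes its channel.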

Latency Optimizations

STT→LLM Trick

Accumulate partial Whisper transcripts and send the full text once VAD signals end-of-speech. The LLM server keeps its KV-cache across requests implicitly.
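
A minimal sketch of the accumulation step, assuming partial transcripts arrive as successive segments rather than as cumulative rewrites of the whole utterance. SttEvent and TranscriptAccumulator are hypothetical names chosen for illustration.

```rust
/// Hypothetical events produced by the STT/VAD stage.
pub enum SttEvent {
    /// A newly transcribed segment of the current utterance.
    Partial(String),
    /// VAD decided the speaker has finished.
    EndOfSpeech,
}

/// Collects partial transcripts and releases the full text at end-of-speech.
#[derive(Default)]
pub struct TranscriptAccumulator {
    parts: Vec<String>,
}

impl TranscriptAccumulator {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns Some(full_text) only when end-of-speech arrives with buffered text.
    pub fn push(&mut self, event: SttEvent) -> Option<String> {
        match event {
            SttEvent::Partial(text) => {
                let trimmed = text.trim();
                if !trimmed.is_empty() {
                    self.parts.push(trimmed.to_string());
                }
                None
            }
            SttEvent::EndOfSpeech => {
                if self.parts.is_empty() {
                    return None; // silence only, nothing to send to the LLM
                }
                let full = self.parts.join(" ");
                self.parts.clear();
                Some(full)
            }
        }
    }
}
```

The caller would forward the returned text to the LLM client and ignore None results.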

LLM→TTS Streaming

Buffer tokens until a punctuation mark (. ! ? ; :) arrives, then synthesize the buffered sentence immediately. While sentence N plays, sentence N+1 is being generated.
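
The buffering rule can be sketched as follows. SentenceSplitter here is an illustrative type, not the contents of tts/sentence.rs, and it splits naively on every listed punctuation mark, so decimals such as "3.14" and abbreviations would be cut in two.

```rust
/// Buffers streamed LLM tokens and emits text at punctuation boundaries.
pub struct SentenceSplitter {
    buffer: String,
}

impl SentenceSplitter {
    pub fn new() -> Self {
        Self { buffer: String::new() }
    }

    /// Feeds one token and returns any sentences completed by it.
    pub fn push(&mut self, token: &str) -> Vec<String> {
        let mut out = Vec::new();
        for ch in token.chars() {
            self.buffer.push(ch);
            if matches!(ch, '.' | '!' | '?' | ';' | ':') {
                let sentence = self.buffer.trim().to_string();
                self.buffer.clear();
                // Skip fragments that contain only punctuation, e.g. from "...".
                if sentence.chars().any(char::is_alphanumeric) {
                    out.push(sentence);
                }
            }
        }
        out
    }

    /// Returns whatever is left when the token stream ends without punctuation.
    pub fn flush(&mut self) -> Option<String> {
        let rest = self.buffer.trim().to_string();
        self.buffer.clear();
        if rest.is_empty() { None } else { Some(rest) }
    }
}
```

Each returned sentence can be handed to TTS right away while later tokens keep filling the buffer.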

GPU/CPU Isolation

All heavy work runs in tokio::task::spawn_blocking threads so the async event loop stays unblocked.

System Architecture

Voicebot (Fast Layer)

STT → LLM (7B) → TTS
Barge-in, conversation awareness
Proactive suggestions (inference daemon)
Voice-local tools + MCP tool proxy
EYES: periodic screen capture + vision analysis
Complex tasks → delegate to AGENT
↓ delegates complex tasks via stdin/stdout

External Agent (Power Layer)

Full tool suite
File system, calendar, web, email
Long-running tasks

Project Structure

src/
├── main.rs              # Pipeline orchestration, VAD loop
├── lib.rs               # Library exports
├── config.rs            # Environment-based configuration
├── daemon.rs            # InferenceDaemon — proactive suggestions
├── eyes.rs              # EYES visual awareness daemon
│
├── audio/               # CPAL capture, Silero VAD, playback
│   ├── audio_capture.rs
│   ├── output.rs
│   ├── speaker.rs       # Speaker verification (ONNX)
│   └── ambient_buffer.rs
│
├── stt/                 # whisper-cpp-plus wrapper
│   └── mod.rs           # Integrated STT+VAD
│
├── llm/                 # OpenAI-compatible SSE client
│   ├── client.rs
│   ├── session.rs       # Message history management
│   └── manager.rs
│
├── tts/                 # Text-to-speech engines
│   ├── avspeech.rs      # macOS AVSpeechSynthesizer
│   ├── kokoro.rs        # ONNX Kokoro (Linux)
│   └── sentence.rs      # Sentence boundary splitting
│
├── pipeline/            # Pipeline state machine
│   ├── fsm.rs           # State FSM (Idle/Listening/Thinking/Speaking)
│   └── frames.rs        # PipelineFrame types
│
├── tools/               # Built-in tool implementations
│   ├── mod.rs
│   └── [tool_name].rs
│
├── db/                  # SQLite persistence
│   └── migrations/
│
├── memory/              # Context consolidation
├── profile/             # User profile extraction
├── agents/              # Agent delegation (ACP protocol)
└── remote/              # WebSocket server for remote devices
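
The four states listed for pipeline/fsm.rs could be modeled as a plain transition function. The state names come from the tree above; the Event enum and every transition are assumptions made for illustration, not the project's actual FSM.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Idle,
    Listening,
    Thinking,
    Speaking,
}

/// Hypothetical events; the real pipeline may use different triggers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    SpeechStarted,
    SpeechEnded,
    ReplyReady,
    PlaybackFinished,
    BargeIn,
}

/// Returns the next state; unexpected events leave the state unchanged.
pub fn next_state(state: State, event: Event) -> State {
    use Event::*;
    use State::*;
    match (state, event) {
        (Idle, SpeechStarted) => Listening,
        (Listening, SpeechEnded) => Thinking,
        (Thinking, ReplyReady) => Speaking,
        (Speaking, PlaybackFinished) => Idle,
        // Assumed barge-in rule: the user interrupting puts the bot back into listening.
        (Thinking | Speaking, BargeIn) => Listening,
        (unchanged, _) => unchanged,
    }
}
```

Keeping the transitions in one pure function makes them easy to unit-test separately from audio I/O.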

Key Modules

Audio Pipeline

CPAL handles cross-platform audio I/O. The pipeline uses a pre-roll buffer so when VAD detects speech, the preceding audio is included for complete transcription.

  • audio_capture.rs — Microphone input with device selection
  • audio_transform.rs — Rubato resampling to 16kHz
  • buffer.rs — Circular VecDeque for audio buffering
  • output.rs — Speaker playback
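
A pre-roll buffer of the kind described above can be sketched with a bounded VecDeque. PreRollBuffer and its methods are hypothetical names, and the f32 sample type is an assumption.

```rust
use std::collections::VecDeque;

/// Keeps only the most recent `capacity` samples captured before speech starts.
pub struct PreRollBuffer {
    samples: VecDeque<f32>,
    capacity: usize,
}

impl PreRollBuffer {
    pub fn new(capacity: usize) -> Self {
        Self {
            samples: VecDeque::with_capacity(capacity),
            capacity,
        }
    }

    /// Appends a captured chunk, discarding the oldest samples once full.
    pub fn push(&mut self, chunk: &[f32]) {
        if self.capacity == 0 {
            return;
        }
        for &sample in chunk {
            if self.samples.len() == self.capacity {
                self.samples.pop_front();
            }
            self.samples.push_back(sample);
        }
    }

    /// Called when VAD reports speech: hands over the pre-roll and empties the buffer.
    pub fn drain(&mut self) -> Vec<f32> {
        self.samples.drain(..).collect()
    }
}
```

At 16 kHz, a capacity of 4800 samples would hold 300 ms of audio; that duration is only an example value, not a documented setting.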

Speech-to-Text

whisper-cpp-plus provides true streaming whisper.cpp bindings with integrated VAD support. It uses the Metal GPU backend on macOS, with an optional CoreML (Neural Engine) fallback.

  • State cached across utterances for faster repeated recognition
  • Language hinting based on VOICEBOT_LANGUAGE setting
  • Accumulates partial transcripts until the VAD silence threshold is reached

LLM Integration

OpenAI-compatible streaming client with session management and automatic context consolidation.

  • client.rs — SSE streaming to /v1/chat/completions
  • session.rs — Message history with summarization
  • Auto-consolidation at configurable context threshold (default 90%)
  • Idle consolidation after inactivity (default 30 min)
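
The two consolidation triggers can be expressed as a small policy check. The 90% and 30-minute defaults come from the list above; ConsolidationPolicy, Trigger, and the token-count inputs are hypothetical, and the percentage is kept as an integer to avoid floating-point comparisons.

```rust
use std::time::Duration;

/// Why a consolidation should run.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Trigger {
    ContextThreshold,
    Idle,
}

pub struct ConsolidationPolicy {
    pub context_window_tokens: usize,
    pub threshold_percent: usize,
    pub idle_timeout: Duration,
}

impl ConsolidationPolicy {
    /// Uses the documented defaults: 90% of the context window, 30 minutes idle.
    pub fn with_window(context_window_tokens: usize) -> Self {
        Self {
            context_window_tokens,
            threshold_percent: 90,
            idle_timeout: Duration::from_secs(30 * 60),
        }
    }

    /// Returns the trigger that applies, checking context usage before idleness.
    pub fn check(&self, used_tokens: usize, idle_for: Duration) -> Option<Trigger> {
        if used_tokens * 100 >= self.context_window_tokens * self.threshold_percent {
            Some(Trigger::ContextThreshold)
        } else if idle_for >= self.idle_timeout {
            Some(Trigger::Idle)
        } else {
            None
        }
    }
}
```

With an 8000-token window, for example, the context trigger fires once 7200 tokens are in use.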

Text-to-Speech

Sentence-by-sentence synthesis with overlapping playback for natural conversation flow.

  • AVSpeech — macOS native, zero setup
  • Kokoro — High-quality ONNX, requires --features kokoro
  • sentence.rs — Buffers tokens, emits on punctuation boundaries

Memory & Context

Persistent storage via SQLite with intelligent consolidation.

  • Active consolidation: Announces memory reorganization, extracts profile facts and memories, summarizes old turns
  • Silent consolidation: Runs transparently during idle periods
  • Memories and profile persist across sessions

Tools & Integrations

Extensible tool system with built-in and dynamic MCP support.

  • Built-in: time, files, clipboard, screenshots, apps, shell, web search
  • MCP: Dynamically registers tools from any MCP stdio server
  • Agent delegation: Complex tasks via stdin/stdout subprocess

Conversation Modes

Active Mode

Default

Responds to all detected speech. Best for dedicated interaction sessions.

Ambient Mode

Wake-word triggered

Responds only after wake-word detection (default: "jarvis"). Auto-switches when a non-enrolled speaker is detected.
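
One way to gate responses on a wake word is to look for it in the transcript text. This sketch assumes detection happens on transcribed text, which the description above does not confirm, and command_after_wake_word is a hypothetical helper.

```rust
/// Returns the text following the wake word, or None if the wake word is absent.
/// Matching is case-insensitive and ignores punctuation attached to the word.
pub fn command_after_wake_word(transcript: &str, wake_word: &str) -> Option<String> {
    let wake = wake_word.to_lowercase();
    let mut words = transcript.split_whitespace();
    while let Some(word) = words.next() {
        let normalized = word
            .trim_matches(|c: char| !c.is_alphanumeric())
            .to_lowercase();
        if normalized == wake {
            // Everything after the wake word is treated as the user's request.
            let rest: Vec<&str> = words.by_ref().collect();
            return Some(rest.join(" "));
        }
    }
    None
}
```

In ambient mode, a None result would mean the utterance is ignored; an empty string would mean the user said only the wake word.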

Control API

Enable with CONTROL_PORT=9001 cargo run --features control

GET   /control/events    SSE stream of live pipeline events
GET   /control/state     Current pipeline state (listening/thinking/speaking/idle)
GET   /control/history   Full conversation message history
POST  /control/mute      Mute or unmute TTS (body: {"muted": true|false})
POST  /control/barge_in  Interrupt current TTS playback
POST  /control/input     Inject text as user input (body: {"text": "..."})
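
As a usage sketch, the request for POST /control/input can be assembled by hand with the standard library. The endpoint path and body shape come from the list above; the helper names, the Content-Type header, and the Connection: close header are assumptions.

```rust
/// Escapes a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds a raw HTTP/1.1 request that injects `text` as user input.
pub fn input_request(host: &str, port: u16, text: &str) -> String {
    let body = format!("{{\"text\": \"{}\"}}", json_escape(text));
    format!(
        "POST /control/input HTTP/1.1\r\n\
         Host: {host}:{port}\r\n\
         Content-Type: application/json\r\n\
         Content-Length: {}\r\n\
         Connection: close\r\n\
         \r\n\
         {body}",
        body.len()
    )
}
```

Writing the returned string to a std::net::TcpStream connected to the control port (9001 in the command above) would deliver the request; an HTTP client crate would serve equally well.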