Building a Local Voice Agent with Whisper, Ollama, and Edge-TTS
How I wired together faster-whisper, kimi-k2.5 via Ollama, and edge-tts into a hands-free SEO script runner.
The goal was simple: say “Hey Jarvis, run morning summary” and have the server run the script and read back the results. No cloud APIs, no latency from a phone app.
Stack
- Wake word: openWakeWord (`hey_jarvis` ONNX model)
- STT: faster-whisper base model, CPU, int8
- LLM: kimi-k2.5:cloud via Ollama (OpenAI-compatible API)
- TTS: edge-tts (`en-US-GuyNeural`)
- Transport: WebSocket, binary WAV from Python client
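For concreteness, here is a minimal sketch of how those pieces chain together on the server. It assumes 16 kHz int16 mono audio throughout; the wake-model identifier and the 0.5 score threshold are assumptions that vary by install.

```python
import numpy as np
import edge_tts
from faster_whisper import WhisperModel
from openwakeword.model import Model

wake = Model(wakeword_models=["hey_jarvis"])  # model name/path varies by install
stt = WhisperModel("base", device="cpu", compute_type="int8")

def heard_wake_word(frame: np.ndarray) -> bool:
    # frame: ~80 ms of 16 kHz int16 PCM, the shape openWakeWord expects
    scores = wake.predict(frame)
    return max(scores.values()) > 0.5  # threshold is a guess; tune on your mic

def transcribe(audio: np.ndarray) -> str:
    # audio: float32 mono at 16 kHz, as faster-whisper expects
    segments, _ = stt.transcribe(audio)
    return " ".join(seg.text for seg in segments).strip()

async def speak(text: str, path: str = "reply.mp3") -> None:
    # edge-tts is async; save() writes the synthesized speech to disk
    await edge_tts.Communicate(text, "en-US-GuyNeural").save(path)
```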
The Audio Problem
The biggest challenge was audio quality. The laptop’s built-in mic through Windows MME gave RMS values of 40–70, and Whisper transcribed nothing. Switching to a Bluetooth headset (oraimo BoomPop) over WASAPI at its native 16 kHz fixed it: RMS jumped to 5000+ while speaking.
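If you want to check your own input chain before blaming Whisper, a quick RMS probe is enough. This sketch uses sounddevice (my choice, not necessarily what the project uses); pick the WASAPI input via sd.query_devices() and pass its index as device.

```python
import numpy as np
import sounddevice as sd

SR = 16000  # the headset's native rate

def mic_rms(seconds: float = 2.0, device=None) -> float:
    # Record a short int16 clip and report its RMS level
    clip = sd.rec(int(seconds * SR), samplerate=SR, channels=1,
                  dtype="int16", device=device)
    sd.wait()
    return float(np.sqrt(np.mean(clip.astype(np.float64) ** 2)))

print(mic_rms())  # 40-70 through MME was too quiet; 5000+ is healthy
```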
I also added a normalization step server-side before sending to Whisper:
samples = samples * (30000.0 / peak)
This way even quiet mics get amplified to a workable level before transcription.
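Spelled out with a silence guard and a clipping guard, since 30000 sits close to the int16 ceiling of 32767 (a sketch of my server-side step; `peak` is the max absolute sample in the buffer):

```python
import numpy as np

def normalize_to_peak(samples: np.ndarray, target: float = 30000.0) -> np.ndarray:
    # samples: int16 PCM buffer received over the WebSocket
    floats = samples.astype(np.float32)
    peak = np.abs(floats).max()
    if peak < 1.0:
        return samples  # all but silent: scaling would only amplify noise
    scaled = floats * (target / peak)
    return np.clip(scaled, -32768.0, 32767.0).astype(np.int16)
```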
LLM Tool Calling
Kimi handles imperfect transcriptions well: a garble like “Can’t compete or watch” still routes to competitor_watch, because the system prompt lists every available script and the LLM matches intent rather than exact words.
The tool loop runs up to 5 rounds, enough for any script call plus a summary reply.
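A trimmed sketch of that loop against Ollama’s OpenAI-compatible endpoint. The tool schema and the run_script stub are hypothetical stand-ins for the real nine-script runner; only the endpoint, model name, and 5-round cap come from the build above.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Hypothetical schema; the real system prompt lists all nine SEO scripts.
SCRIPT_TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_script",
        "description": "Run a named SEO script, e.g. competitor_watch.",
        "parameters": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
        },
    },
}]

def run_script(name: str) -> str:
    return f"{name} finished OK"  # stand-in for the real script runner

def answer(transcript: str) -> str:
    messages = [
        {"role": "system", "content": "You can run SEO scripts via run_script."},
        {"role": "user", "content": transcript},
    ]
    for _ in range(5):  # up to 5 rounds: script calls plus a summary reply
        resp = client.chat.completions.create(
            model="kimi-k2.5:cloud", messages=messages, tools=SCRIPT_TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content  # final text reply, handed off to TTS
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": run_script(args["name"]),
            })
    return "Script ran, but I ran out of reasoning rounds."
```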
What Works
Wake-word detection in a noisy room, voice triggers for all 9 SEO scripts, and spoken replies under 3 seconds end-to-end. The whole thing runs on a ThinkCentre M83 with no GPU.