Building a Local Voice Agent with Whisper, Ollama, and Edge-TTS
How I wired together faster-whisper, kimi-k2.5 via Ollama, and edge-tts into a hands-free SEO script runner.
The goal was simple: say “Hey Jarvis, run morning summary” and have the server run the script and read back the results. No cloud APIs, no latency from a phone app.
Stack
- Wake word: openWakeWord (`hey_jarvis` ONNX model)
- STT: faster-whisper base model, CPU, int8
- LLM: kimi-k2.5:cloud via Ollama (OpenAI-compatible API)
- TTS: edge-tts (`en-US-GuyNeural`)
- Transport: WebSocket, binary WAV from Python client
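For concreteness, here is a minimal sketch of how those pieces chain together on the server. It assumes 16 kHz int16 mono audio throughout; the wake-model identifier and the 0.5 score threshold are assumptions that vary by install.

```python
import numpy as np
import edge_tts
from faster_whisper import WhisperModel
from openwakeword.model import Model

wake = Model(wakeword_models=["hey_jarvis"])  # model name/path varies by install
stt = WhisperModel("base", device="cpu", compute_type="int8")

def heard_wake_word(frame: np.ndarray) -> bool:
    # frame: ~80 ms of 16 kHz int16 PCM, the shape openWakeWord expects
    scores = wake.predict(frame)
    return max(scores.values()) > 0.5  # threshold is a guess; tune on your mic

def transcribe(audio: np.ndarray) -> str:
    # audio: float32 mono at 16 kHz, as faster-whisper expects
    segments, _ = stt.transcribe(audio)
    return " ".join(seg.text for seg in segments).strip()

async def speak(text: str, path: str = "reply.mp3") -> None:
    # edge-tts is async; save() writes the synthesized speech to disk
    await edge_tts.Communicate(text, "en-US-GuyNeural").save(path)
```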
The Audio Problem
The biggest challenge was audio quality. The laptop’s built-in mic through Windows MME gave RMS values of 40–70, and Whisper transcribed nothing. Switching to a Bluetooth headset (oraimo BoomPop) over WASAPI at its native 16 kHz fixed it: RMS jumped to 5000+ while speaking.
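If you want to check your own input chain before blaming Whisper, a quick RMS probe is enough. This sketch uses sounddevice (my choice, not necessarily what the project uses); pick the WASAPI input via sd.query_devices() and pass its index as device.

```python
import numpy as np
import sounddevice as sd

SR = 16000  # the headset's native rate

def mic_rms(seconds: float = 2.0, device=None) -> float:
    # Record a short int16 clip and report its RMS level
    clip = sd.rec(int(seconds * SR), samplerate=SR, channels=1,
                  dtype="int16", device=device)
    sd.wait()
    return float(np.sqrt(np.mean(clip.astype(np.float64) ** 2)))

print(mic_rms())  # 40-70 through MME was too quiet; 5000+ is healthy
```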
I also added a normalization step server-side before sending to Whisper:
samples = samples * (30000.0 / peak)
This way even quiet mics get amplified to a workable level before transcription.
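Spelled out with a silence guard and a clipping guard, since 30000 sits close to the int16 ceiling of 32767 (a sketch of my server-side step; `peak` is the max absolute sample in the buffer):

```python
import numpy as np

def normalize_to_peak(samples: np.ndarray, target: float = 30000.0) -> np.ndarray:
    # samples: int16 PCM buffer received over the WebSocket
    floats = samples.astype(np.float32)
    peak = np.abs(floats).max()
    if peak < 1.0:
        return samples  # all but silent: scaling would only amplify noise
    scaled = floats * (target / peak)
    return np.clip(scaled, -32768.0, 32767.0).astype(np.int16)
```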
LLM Tool Calling
Kimi handles imperfect transcriptions well: a garble like “Can’t compete or watch” still routes to competitor_watch, because the system prompt lists every available script and the LLM matches intent rather than exact words.
The tool loop runs up to 5 rounds, enough for any script call plus a summary reply.
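A trimmed sketch of that loop against Ollama’s OpenAI-compatible endpoint. The tool schema and the run_script stub are hypothetical stand-ins for the real nine-script runner; only the endpoint, model name, and 5-round cap come from the build above.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Hypothetical schema; the real system prompt lists all nine SEO scripts.
SCRIPT_TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_script",
        "description": "Run a named SEO script, e.g. competitor_watch.",
        "parameters": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
        },
    },
}]

def run_script(name: str) -> str:
    return f"{name} finished OK"  # stand-in for the real script runner

def answer(transcript: str) -> str:
    messages = [
        {"role": "system", "content": "You can run SEO scripts via run_script."},
        {"role": "user", "content": transcript},
    ]
    for _ in range(5):  # up to 5 rounds: script calls plus a summary reply
        resp = client.chat.completions.create(
            model="kimi-k2.5:cloud", messages=messages, tools=SCRIPT_TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content  # final text reply, handed off to TTS
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": run_script(args["name"]),
            })
    return "Script ran, but I ran out of reasoning rounds."
```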
What Works
Wake-word detection in a noisy room, voice triggers for all 9 SEO scripts, and spoken replies under 3 seconds end-to-end. The whole thing runs on a ThinkCentre M83 with no GPU.