Voice Agent · Part 1

Building a Local Voice Agent with Whisper, Ollama, and Edge-TTS

How I wired together faster-whisper, kimi-k2.5 via Ollama, and edge-tts into a hands-free SEO script runner.

1 min read
Contents
  1. Stack
  2. The Audio Problem
  3. LLM Tool Calling
  4. What Works

The goal was simple: say “Hey Jarvis, run morning summary” and have the server run the script and read back the results. No cloud APIs, no latency from a phone app.

Stack

  • Wake word: openWakeWord (hey_jarvis ONNX model)
  • STT: faster-whisper base model, CPU, int8
  • LLM: kimi-k2.5:cloud via Ollama (OpenAI-compatible API)
  • TTS: edge-tts (en-US-GuyNeural)
  • Transport: WebSocket, binary WAV from the Python client (sketched below)
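
Here is a minimal sketch of the client's send path, assuming the websockets library; the ws://localhost:8765 URL and the text reply are illustrative, not the real server details.

import asyncio
import websockets

async def send_utterance(wav_bytes: bytes) -> str:
    # Connect to the agent server and ship the recorded utterance as binary WAV
    async with websockets.connect("ws://localhost:8765") as ws:
        await ws.send(wav_bytes)
        # The server answers with the reply text it is about to speak
        return await ws.recv()

if __name__ == "__main__":
    with open("utterance.wav", "rb") as f:
        print(asyncio.run(send_utterance(f.read())))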

The Audio Problem

The biggest challenge was audio quality. The laptop’s built-in mic through Windows MME gave RMS values of 40–70, and Whisper transcribed nothing. Switching to a Bluetooth headset (oraimo BoomPop) via WASAPI at its native 16kHz fixed it: RMS climbed to 5000+ when speaking.
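
For context, RMS here is the usual root-mean-square of the raw samples. A minimal sketch, assuming 16-bit PCM already sitting in a NumPy array:

import numpy as np

def rms(samples: np.ndarray) -> float:
    # Root-mean-square level of int16 PCM samples: 40-70 is effectively
    # silence to Whisper, 5000+ is comfortable speech on the headset
    return float(np.sqrt(np.mean(samples.astype(np.float64) ** 2)))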

I also added a normalization step server-side before sending to Whisper:

samples = samples * (30000.0 / peak)

Even quiet mics get amplified to a workable level before the audio reaches Whisper.
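
Fleshed out, the peak normalization looks roughly like this; a sketch assuming int16 samples in a NumPy array, with a silence guard and clipping added here for safety rather than taken from the actual server code:

import numpy as np

def normalize(samples: np.ndarray, target_peak: float = 30000.0) -> np.ndarray:
    # Scale so the loudest sample lands near target_peak, leaving headroom
    # below the int16 ceiling of 32767
    peak = float(np.max(np.abs(samples)))
    if peak < 1.0:
        # Pure silence: nothing to normalize, and it avoids dividing by zero
        return samples
    scaled = samples.astype(np.float64) * (target_peak / peak)
    return np.clip(scaled, -32768, 32767).astype(np.int16)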

LLM Tool Calling

Kimi handles imperfect transcriptions well. “Can’t compete or watch” gets routed to competitor_watch correctly because the system prompt lists all available scripts and the LLM matches intent rather than exact words.

The tool loop runs up to 5 rounds — enough for any script call plus a summary reply.
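
A minimal sketch of that loop, assuming the openai Python client pointed at Ollama's OpenAI-compatible endpoint; the run_script tool schema and the script list in the system prompt are illustrative:

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_script",
        "description": "Run one of the available SEO scripts by name",
        "parameters": {
            "type": "object",
            "properties": {"script": {"type": "string"}},
            "required": ["script"],
        },
    },
}]

def answer(transcript: str, run_script) -> str:
    messages = [
        {"role": "system", "content": "You can run these scripts: morning_summary, competitor_watch, ..."},
        {"role": "user", "content": transcript},
    ]
    for _ in range(5):  # tool loop capped at 5 rounds
        resp = client.chat.completions.create(
            model="kimi-k2.5:cloud", messages=messages, tools=TOOLS,
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content or ""  # final reply text to hand to edge-tts
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = run_script(args["script"])
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    return "Sorry, I could not finish that."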

What Works

Wake word detection works in a noisy room, voice triggers cover all 9 SEO scripts, and replies come back in under 3 seconds end-to-end. The whole thing runs on a ThinkCentre M83 with no GPU.

Ted

Web Developer · SEO Operator · Security

I build web applications, run SEO campaigns, and dig into offensive security on the side. Most of what ends up here started as something I was already doing — agentic workflows, local LLMs, automation systems. This blog is the documentation layer.