Voice-to-text is everywhere, but making it work seamlessly on Linux—so that your spoken words appear as keystrokes in any app—turns out to be a surprisingly deep rabbit hole. This post documents my journey building a robust, real-time voice-to-text “virtual keyboard” script, the pitfalls I hit, and the lessons learned along the way. If you want to skip the story and grab the final script, jump to the end.
The Goal
I wanted a Python script that would:
- Listen to my microphone.
- Transcribe my speech to text using OpenAI’s Whisper (via faster-whisper).
- “Type” the transcribed text into whatever window is focused, as if I were using the keyboard.
- Optionally, copy the result to the clipboard instead.
Sounds simple, right? Not quite.
The First Steps: Naive Recording
I started with the basics: record audio until I hit Enter, transcribe it, and copy the result to the clipboard. This worked, but it wasn’t “real-time” and didn’t type into the active window. I wanted a hands-free, always-on experience.
Enter the Virtual Keyboard
I explored several ways to create a virtual keyboard on Linux:
- uinput: The most “real” way, but requires kernel modules, udev rules, and system Python packages.
- evdev: More flexible, pip-installable, but still needs kernel/udev setup.
- pynput: Cross-platform, pip-installable, and types into the focused window—perfect for prototyping.
I settled on pynput for its simplicity and broad compatibility.
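The typing path with pynput is only a few lines. A minimal sketch, assuming pynput is installed; the `normalize` helper is my own addition, not something the original script necessarily does:

```python
# Type transcribed text into whichever window currently has focus.
def normalize(text: str) -> str:
    """Collapse runs of whitespace so typed output stays tidy."""
    return " ".join(text.split())

def type_text(text: str, send_enter: bool = False) -> None:
    """Emit text as synthetic keystrokes via pynput."""
    from pynput.keyboard import Controller, Key  # lazy: needs a display
    kb = Controller()
    kb.type(normalize(text) + " ")  # trailing space separates segments
    if send_enter:
        kb.press(Key.enter)
        kb.release(Key.enter)
```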
Real-Time Transcription: The VAD Challenge
To avoid transcribing silence or background noise, I integrated voice activity detection (VAD) using webrtcvad. The idea: only buffer and transcribe audio when actual speech is detected, and type out each segment as soon as you finish speaking.
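The tricky part is that webrtcvad only accepts 10, 20, or 30 ms frames of 16-bit mono PCM, so the audio stream has to be sliced before each frame is classified. A sketch of that framing, assuming webrtcvad is installed (the helper names are mine):

```python
# Frame raw PCM for webrtcvad, which only accepts 10/20/30 ms frames.
SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 480 samples * 2 bytes

def split_frames(pcm: bytes):
    """Yield complete 30 ms frames; a trailing partial frame is dropped."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield pcm[i:i + FRAME_BYTES]

def speech_frames(pcm: bytes, aggressiveness: int = 2):
    """Yield (frame, is_speech) pairs for a raw PCM buffer."""
    import webrtcvad  # lazy import: optional C-extension dependency
    vad = webrtcvad.Vad(aggressiveness)  # 0 = permissive, 3 = strict
    for frame in split_frames(pcm):
        yield frame, vad.is_speech(frame, SAMPLE_RATE)
```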
But this introduced new problems:
- Missed beginnings: The first word or syllable was often lost, because VAD only starts buffering after it detects speech.
- Complex state: Managing buffers, ring buffers, and state transitions made the code harder to reason about and debug.
- Reliability: Sometimes, nothing was captured at all, or the script would miss entire segments.
The Hardware Hurdle: Why My Laptop Mic Wasn’t Enough
One of the most surprising challenges wasn’t in the code—it was in the hardware. My laptop’s built-in omnidirectional mic picked up everything: keyboard clacks, fan noise, even the faintest background hum. This made it nearly impossible for the VAD to reliably distinguish actual speech from ambient noise. The result? Either it would never trigger, or it would constantly “hear” phantom voices.
After a lot of frustration, I realized I needed a mic I could physically control. So I made a quick run to Best Buy and picked up a $25 USB headset with a hardware mute switch. This simple upgrade made a world of difference:
- Physical mute: I could instantly cut the mic when not dictating, ensuring VAD only “heard” me when I wanted it to.
- Directional pickup: The headset’s mic focused on my voice, not the room.
- Plug-and-play: No drivers, no fuss—just worked.
Lesson: Sometimes, the right hardware is the missing piece for seamless software.
Iteration and Debugging
I tried several fixes:
- Adding a pre-speech ring buffer to capture the start of speech.
- Pre-filling the buffer with silence to avoid missing the first utterance.
- Adding debug prints to trace VAD detection and buffer states.
Each change brought new insights, but also new edge cases. Sometimes, the “fix” made things worse!
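The pre-speech ring buffer idea is easy to sketch in pure Python with `collections.deque`; the class and its names are my own illustration, not the script's actual structure:

```python
# Keep a short pre-roll of frames so the start of an utterance isn't lost
# when VAD only triggers mid-word.
from collections import deque

class PreSpeechBuffer:
    """Ring-buffer the last few frames until speech is detected."""

    def __init__(self, maxlen: int = 10):  # ~300 ms of 30 ms frames
        self.ring = deque(maxlen=maxlen)
        self.capturing = False
        self.segment: list = []

    def feed(self, frame: bytes, speech: bool) -> None:
        if self.capturing:
            self.segment.append(frame)
        elif speech:
            # Speech just started: prepend the buffered pre-roll.
            self.capturing = True
            self.segment = list(self.ring) + [frame]
        else:
            self.ring.append(frame)  # oldest frame falls off automatically
```

The deque's `maxlen` does the bookkeeping: silence frames rotate through it, and the moment VAD fires, the buffered pre-roll is stitched onto the front of the new segment.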
The Pragmatic Solution
After much experimentation, I realized that simplicity wins:
- For the “clipboard” mode, I reverted to the original, naive approach: record until Enter, then transcribe and copy. It’s reliable and easy to use.
- For the “typing” mode, I kept the VAD and buffer logic, so the script listens in real time and types out each segment as you speak.
This hybrid approach gives the best of both worlds: reliability when you need it, and hands-free typing when you want it.
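The two modes can hang off a single CLI with stdlib argparse. A sketch matching the flags the script exposes; the actual mic.py may wire this differently:

```python
# One entry point, two modes: --clipboard for batch, default for real-time.
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Voice-to-text virtual keyboard")
    p.add_argument("--clipboard", action="store_true",
                   help="record until Enter, then copy the transcript")
    p.add_argument("--send", action="store_true",
                   help="press Enter after each typed segment")
    return p
```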
The End Result
You can download the final script here: Download mic.py
- Clipboard mode: python3 mic.py --clipboard (Press Enter to stop recording, then your text is copied.)
- Typing mode (default): python3 mic.py (Speak, and your words are typed into the focused window in real time.)
- Send Enter after each segment: python3 mic.py --send
Lessons Learned
- Start simple. The naive approach is often the most reliable, especially for “batch” use cases.
- Iterate and debug. Don’t be afraid to add debug prints and try different strategies.
- Hybrid solutions work. Sometimes, the best answer is to combine approaches and let the user choose.
Conclusion
Building a robust voice-to-text tool for Linux is a journey through audio APIs, system permissions, and the quirks of real-time processing. But with the right tools and a willingness to iterate, you can create something that feels like magic.
Try it out, and let me know how it works for you!