Teams Voice Bridge
Microsoft’s real-time Teams media stack is built for Windows: native DLLs, a .NET SDK, and no Linux runtime. Instead of adding a Windows sidecar to a Linux system, I reverse-engineered the media protocol and rebuilt the bridge in Python. The bot can join a live meeting, receive audio, and speak back from Linux.
- Role: Engineer
- Period: 2025 to 2026
- Status: Production
How the system fits together.
The calls that shaped it.
- 01
The core constraint was staying Linux-native. Official paths meant the .NET media SDK, a Windows sidecar, or a third-party meeting bot, so I chose the harder route: understand the Teams wire protocol well enough to implement it directly.
- 02
To learn the protocol I took Microsoft’s own binaries apart — decompiling the native and .NET DLLs in Ghidra and ILSpy, live-debugging the Teams client with x64dbg and Frida, and reading the decrypted traffic in Wireshark. I wired the decompiler into an LLM over MCP so I could interrogate thousands of lines of decompiled code and get through the dead ends faster.
- 03
Then I rebuilt it on Linux: from-scratch Python implementations of Microsoft’s legacy binary message framing and binary-SOAP protocols, an ICE agent to negotiate the UDP media path, and SRTP key derivation. Along the way I found and fixed two hidden four-second timers that were silently killing every connection attempt.
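To give a feel for what "binary framing" means here, the following is a minimal sketch of one small piece: the variable-length size prefix and the sized-envelope record as I understand them from Microsoft's published .NET Message Framing specification. The function names are my own for illustration, and this is not the bridge's actual code; the real protocol has more record types (preamble, acknowledgement, end, fault) that are left out.

```python
SIZED_ENVELOPE = 0x06  # record type byte for a sized envelope (per the published spec)


def encode_size(n: int) -> bytes:
    """Encode a record size as a little-endian base-128 integer (7 bits per byte)."""
    if not 0 <= n <= 0x7FFFFFFF:
        raise ValueError("size out of range")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)  # high bit clear: last byte
            return bytes(out)


def decode_size(buf: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode a size field; return (size, offset of the first byte after it)."""
    value = 0
    for i in range(5):  # a 31-bit size needs at most 5 bytes
        if offset + i >= len(buf):
            raise ValueError("truncated size field")
        byte = buf[offset + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, offset + i + 1
    raise ValueError("size field longer than 5 bytes")


def frame_envelope(payload: bytes) -> bytes:
    """Wrap an already-encoded message body in a sized-envelope record."""
    return bytes([SIZED_ENVELOPE]) + encode_size(len(payload)) + payload


def read_envelope(buf: bytes) -> bytes:
    """Extract the payload from a single sized-envelope record."""
    if not buf or buf[0] != SIZED_ENVELOPE:
        raise ValueError("not a sized envelope record")
    size, start = decode_size(buf, 1)
    if start + size > len(buf):
        raise ValueError("truncated envelope payload")
    return buf[start:start + size]
```

A round trip such as `read_envelope(frame_envelope(body))` should return `body` unchanged, which makes this layer easy to test against captured bytes before anything above it exists.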
- 04
The bridge now joins a real meeting, plays audio people can hear, and captures the returning speech byte-for-byte after decryption. For the codec path, I drove Microsoft’s native decoder from Python and validated it against captured frames.
- 05
I kept the process disciplined for something this fiddly: I versioned every experiment, kept real captured calls as test fixtures, and refused to ship guesses for parts I couldn’t yet name. So that nothing was blocked on the hard path, I also shipped a simpler working voice agent alongside it, which any company agent can call to join a meeting and answer by voice.
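The captured-calls-as-fixtures habit can be sketched as a small golden check. This is an illustration under assumptions, not the project's test suite: the fixture layout (name, encoded bytes, expected SHA-256 of the decoded output) and the `check_golden` helper are hypothetical, and the decoder is passed in as a plain callable so the same check works for a pure-Python stage or a native decoder wrapped in ctypes.

```python
import hashlib
from typing import Callable, Iterable, List, Tuple

# Hypothetical fixture shape: (name, captured input bytes, SHA-256 of expected output)
Fixture = Tuple[str, bytes, str]


def digest(data: bytes) -> str:
    """Hex SHA-256 of a byte string, used as the golden fingerprint."""
    return hashlib.sha256(data).hexdigest()


def check_golden(decode: Callable[[bytes], bytes],
                 fixtures: Iterable[Fixture]) -> List[str]:
    """Run every fixture through `decode` and return the names that mismatch."""
    failures = []
    for name, encoded, expected in fixtures:
        if digest(decode(encoded)) != expected:
            failures.append(name)
    return failures


def stand_in_decoder(frame: bytes) -> bytes:
    """Placeholder for a real decoder: reverses the bytes."""
    return frame[::-1]


fixtures = [
    ("frame-ok", b"\x01\x02\x03", digest(b"\x03\x02\x01")),
    ("frame-stale", b"\x0a\x0b", digest(b"\x0a\x0b")),  # deliberately wrong golden
]
failed = check_golden(stand_in_decoder, fixtures)
```

Comparing digests rather than raw buffers keeps fixture files small, at the cost of not showing where two outputs diverge; a byte-level diff is the natural next step when a fixture fails.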
The interesting work isn't the stack. It's the boundaries.
What it runs on.
- 01 Pure-Linux Python media bridge — no Windows, no .NET SDK, no third-party meeting bot
- 02 RE toolchain: Ghidra + ILSpy to decompile, x64dbg + Frida to live-debug the Teams client, Wireshark to capture — with Ghidra wired to an LLM over MCP
- 03 From-scratch Python implementations of Microsoft’s MC-NMF framing and NBFS binary-SOAP protocols
- 04 ICE / STUN media negotiation and SDES-SRTP decryption (RFC 3711 key derivation)
- 05 Microsoft’s native audio codec (SATIN / SilkWide) driven directly through Python ctypes, validated against golden captured frames
- 06 A parallel production agent on Azure Communication Services (ACS) with Deepgram (STT) and ElevenLabs (TTS), registered as an OpenClaw skill
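For the RFC 3711 key derivation named in item 04, here is a minimal sketch of the first step only: building the input block that is fed to AES in counter mode to produce each session key. It uses the standard library alone, so the AES step itself is left out and would come from a crypto library. The function and constant names are my own, and the example salt is the master salt I recall from the test vectors in RFC 3711 Appendix B.3; whether the Teams media path follows the RFC exactly in every detail is not something this sketch claims.

```python
# Key-derivation labels from RFC 3711, section 4.3.1 (SRTP, not SRTCP)
LABEL_ENCRYPTION = 0x00
LABEL_AUTH = 0x01
LABEL_SALT = 0x02


def kdf_input(master_salt: bytes, label: int, index: int = 0, kdr: int = 0) -> bytes:
    """Return x = (label || r) XOR master_salt as 14 bytes.

    r = index DIV kdr, defined as 0 when the key derivation rate is 0.
    """
    if len(master_salt) != 14:
        raise ValueError("SRTP master salt must be 112 bits (14 bytes)")
    if not 0 <= label <= 0xFF:
        raise ValueError("label must fit in one byte")
    r = index // kdr if kdr else 0
    if not 0 <= r < (1 << 48):
        raise ValueError("r must fit in 48 bits")
    key_id = (label << 48) | r  # 8-bit label followed by 48-bit r
    x = key_id ^ int.from_bytes(master_salt, "big")  # key_id is right-aligned
    return x.to_bytes(14, "big")


def aes_cm_iv(x: bytes) -> bytes:
    """Pad x to a 16-byte AES-CM input block, i.e. x * 2^16."""
    return x + b"\x00\x00"


salt = bytes.fromhex("0EC675AD498AFEEBB6960B3AAAD6")
enc_iv = aes_cm_iv(kdf_input(salt, LABEL_ENCRYPTION))
auth_iv = aes_cm_iv(kdf_input(salt, LABEL_AUTH))
salt_iv = aes_cm_iv(kdf_input(salt, LABEL_SALT))
```

With a key derivation rate of 0, only the label byte changes between the three blocks, so a wrong label or a misaligned XOR shows up as a one-byte difference that is easy to spot against a capture.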