— Project 05
Voice + RE

Teams Voice Bridge

Reverse-engineering Teams’ Windows-only voice stack and rebuilding it on Linux

Microsoft’s real-time Teams media stack is built for Windows: native DLLs, a .NET SDK, and no Linux runtime. Instead of adding a Windows sidecar to a Linux system, I reverse-engineered the media protocol and rebuilt the bridge in Python. The bot can join a live meeting, receive audio, and speak back from Linux.

Role
Engineer
Period
2025 to 2026
Status
Production
Python · Ghidra · x64dbg · Frida · ctypes · ICE / SRTP · MS-NMF / NBFS · MCP
— Chapter 01
System shape

How the system fits together.

Microsoft’s real-time media stack for Teams runs only on Windows, so I rebuilt it on Linux.
Fig. 01 — Teams Voice Bridge architecture
— Chapter 02
Decisions and outcomes

The calls that shaped it.

  1. 01

    The core constraint was staying Linux-native. Official paths meant the .NET media SDK, a Windows sidecar, or a third-party meeting bot, so I chose the harder route: understand the Teams wire protocol well enough to implement it directly.

  2. 02

    To learn the protocol I took Microsoft’s own binaries apart — decompiling the native and .NET DLLs in Ghidra and ILSpy, live-debugging the Teams client with x64dbg and Frida, and reading the decrypted traffic in Wireshark. I wired the decompiler into an LLM over MCP so I could interrogate thousands of lines of decompiled code and get through the dead ends faster.

  3. 03

    Then I rebuilt it on Linux, with from-scratch Python implementations of Microsoft’s MS-NMF binary framing and NBFS binary-SOAP protocols, an ICE agent to negotiate the UDP media path, and SRTP key derivation. Along the way I fixed two hidden four-second timers that had been silently killing every connection attempt.

  4. 04

    The bridge now joins a real meeting, plays audio people can hear, and captures the returning speech byte-for-byte after decryption. For the codec path, I drove Microsoft’s native decoder from Python and validated it against captured frames.

  5. 05

    I kept the process disciplined for something this fiddly: I versioned every experiment, kept real captured calls as test fixtures, and refused to ship guesses for parts I couldn’t yet name. So that nothing depended on the hard path being finished, I also shipped a simpler working voice agent alongside it, which any company agent can call to join a meeting and answer by voice.
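
To give a feel for the framing layer mentioned above, here is a minimal sketch of how a length-prefixed record can be built and parsed in Python. It is not the bridge's code. It assumes the size encoding described in the public MS-NMF specification (7 bits per byte, least significant group first, high bit as a continuation flag) and assumes 0x06 as the sized-envelope record type. The function names are hypothetical, and the payload is treated as opaque bytes rather than real NBFS content.

```python
from typing import Tuple

# Assumption: record type for a sized envelope, as read from the public
# MS-NMF specification. Not taken from the project's code.
SIZED_ENVELOPE_RECORD = 0x06


def encode_size(n: int) -> bytes:
    """Encode a non-negative size as 7-bit groups, least significant first."""
    if n < 0:
        raise ValueError("size must be non-negative")
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7
        if n:
            out.append(group | 0x80)  # high bit set: more bytes follow
        else:
            out.append(group)  # high bit clear: last byte of the size
            return bytes(out)


def decode_size(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (size, next_offset) for a size field starting at buf[offset]."""
    value = 0
    shift = 0
    while True:
        if offset >= len(buf):
            raise ValueError("truncated size field")
        byte = buf[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, offset
        shift += 7


def frame_sized_envelope(payload: bytes) -> bytes:
    """Prefix an opaque payload with the record type and its encoded size."""
    return bytes([SIZED_ENVELOPE_RECORD]) + encode_size(len(payload)) + payload


def parse_sized_envelope(buf: bytes) -> bytes:
    """Extract the payload from a single sized-envelope record."""
    if not buf or buf[0] != SIZED_ENVELOPE_RECORD:
        raise ValueError("not a sized envelope record")
    size, start = decode_size(buf, 1)
    payload = buf[start:start + size]
    if len(payload) != size:
        raise ValueError("truncated payload")
    return payload
```

Round-tripping captured records through a pair of functions like these is one way real call captures can serve as test fixtures: `parse_sized_envelope(frame_sized_envelope(p))` should always return `p`.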

— Aside
The interesting work isn't the stack. It's the boundaries.
— Chapter 03
How it runs

What it runs on.

  • 01
    Pure-Linux Python media bridge — no Windows, no .NET SDK, no third-party meeting bot
  • 02
    RE toolchain: Ghidra + ILSpy to decompile, x64dbg + Frida to live-debug the Teams client, Wireshark to capture — with Ghidra wired to an LLM over MCP
  • 03
    From-scratch Python implementations of Microsoft’s MS-NMF framing and NBFS binary-SOAP protocols
  • 04
    ICE / STUN media negotiation and SDES-SRTP decryption (RFC 3711 key derivation)
  • 05
    Microsoft’s native audio codec (SATIN / SilkWide) driven directly through Python ctypes, validated against golden captured frames
  • 06
    A parallel production agent on Azure ACS with Deepgram (STT) and ElevenLabs (TTS), registered as an OpenClaw skill
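
The RFC 3711 key derivation mentioned above can be partly sketched with the standard library alone. The snippet below is an illustration, not the bridge's implementation: it only builds the 128-bit input block that RFC 3711 section 4.3.1 feeds to AES in counter mode, and it leaves the AES step out because Python's standard library has no AES. The function and constant names are hypothetical, and whether Teams follows the RFC exactly at this step is not something the sketch claims.

```python
# Key-derivation labels defined by RFC 3711 for SRTP session keys.
SRTP_LABEL_ENCRYPTION = 0x00
SRTP_LABEL_AUTH = 0x01
SRTP_LABEL_SALT = 0x02


def kdf_input_block(master_salt: bytes, label: int,
                    index: int = 0, kdr: int = 0) -> bytes:
    """Build the AES-CM input block for one derived key (hypothetical helper).

    RFC 3711 section 4.3.1:
      r      = index DIV kdr            (defined as 0 when kdr is 0)
      key_id = label || r               (8-bit label, 48-bit r)
      x      = key_id XOR master_salt   (right-aligned, 112 bits)
      block  = x * 2^16                 (128 bits, low 16 bits are the counter)
    """
    if len(master_salt) != 14:
        raise ValueError("master salt must be 112 bits (14 bytes)")
    if not 0 <= label <= 0xFF:
        raise ValueError("label must fit in 8 bits")
    r = index // kdr if kdr else 0
    if not 0 <= r < 1 << 48:
        raise ValueError("index DIV kdr must fit in 48 bits")
    key_id = (label << 48) | r
    x = key_id ^ int.from_bytes(master_salt, "big")
    return (x << 16).to_bytes(16, "big")
```

Each session key (encryption, authentication, salt) gets its own block by changing only the label. The keystream produced by encrypting successive counter values of that block under the master key is then truncated to the required key length.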