Teams Voice Bridge

Reverse-engineering Teams’ Windows-only voice onto Linux

Microsoft’s real-time Teams media stack is built for Windows: native DLLs, a .NET SDK, and no Linux runtime. Instead of adding a Windows sidecar to a Linux system, I reverse-engineered the media protocol and rebuilt the bridge in Python. The bot can join a live meeting, receive audio, and speak back from Linux.

Role: Engineer
Period: 2025 to 2026
Status: Production

Python, Ghidra, x64dbg, Frida, ctypes, ICE / SRTP, MS-NMF / NBFS, MCP

— Chapter 01

System shape

How the system fits together.

Teams only supports live meeting audio on Windows, so I rebuilt the media stack on Linux.

Fig. 01 — Teams Voice Bridge architecture

— Chapter 02

Decisions and outcomes

The calls that shaped it.

01

The core constraint was staying Linux-native. Official paths meant the .NET media SDK, a Windows sidecar, or a third-party meeting bot, so I chose the harder route: understand the Teams wire protocol well enough to implement it directly.
02

To learn the protocol I took Microsoft’s own binaries apart — decompiling the native and .NET DLLs in Ghidra and ILSpy, live-debugging the Teams client with x64dbg and Frida, and reading the decrypted traffic in Wireshark. I wired the decompiler into an LLM over MCP so I could interrogate thousands of lines of decompiled code and get through the dead ends faster.
03

Then I rebuilt it on Linux: from-scratch Python implementations of Microsoft’s legacy binary framing (MS-NMF) and binary-SOAP encoding (NBFS), an ICE agent to negotiate the UDP media path, SRTP key derivation, and a fix for two hidden four-second timers that were silently killing every connection attempt.
04

The bridge now joins a real meeting, plays audio people can hear, and captures the returning speech byte-for-byte after decryption. For the codec path, I drove Microsoft’s native decoder from Python and validated it against captured frames.
05

I kept the process disciplined for something this fiddly: I versioned every experiment, kept real captured calls as test fixtures, and refused to ship guesses for parts I couldn’t yet name. So that nothing was blocked on the hard path, I also shipped a simpler working voice agent alongside it, one that any company agent can call to join a meeting and answer by voice.
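
To give a feel for what "binary framing" means in step 03, here is a minimal sketch of length-prefixed record framing. It assumes the variable-length size encoding (7 bits per byte, least-significant group first, high bit set on every byte except the last) and the sized-envelope record type 0x06 as I understand them from Microsoft's published .NET Message Framing specification. The function names are hypothetical, the 32-bit bound is an illustrative choice, and this is not the bridge's actual code.

```python
SIZED_ENVELOPE = 0x06  # assumed record type for a length-prefixed envelope


def encode_size(n: int) -> bytes:
    """Encode n in 7-bit groups, low group first; high bit marks 'more bytes'."""
    if not 0 <= n < (1 << 32):  # illustrative bound, keeps the field within 5 bytes
        raise ValueError("size out of range")
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7
        if n:
            out.append(group | 0x80)  # more groups follow
        else:
            out.append(group)         # final group, high bit clear
            return bytes(out)


def decode_size(buf: bytes, offset: int = 0) -> tuple[int, int]:
    """Return (value, offset just past the size field)."""
    value = 0
    for i in range(5):  # a size field is at most 5 bytes here
        if offset + i >= len(buf):
            raise ValueError("truncated size field")
        byte = buf[offset + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, offset + i + 1
    raise ValueError("size field longer than 5 bytes")


def frame_envelope(payload: bytes) -> bytes:
    """Wrap a payload as: record type, encoded size, payload bytes."""
    return bytes([SIZED_ENVELOPE]) + encode_size(len(payload)) + payload


def read_envelope(buf: bytes, offset: int = 0) -> tuple[bytes, int]:
    """Return (payload, offset just past the record)."""
    if offset >= len(buf) or buf[offset] != SIZED_ENVELOPE:
        raise ValueError("not a sized envelope record")
    size, body_start = decode_size(buf, offset + 1)
    body_end = body_start + size
    if body_end > len(buf):
        raise ValueError("truncated payload")
    return bytes(buf[body_start:body_end]), body_end
```

Returning the next offset from each reader makes it straightforward to walk a captured byte stream record by record, which is also what makes real captures usable as test fixtures.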

— Aside

The interesting work isn't the stack. It's the boundaries.

— Chapter 03

How it runs

What it runs on.

01
Pure-Linux Python media bridge — no Windows, no .NET SDK, no third-party meeting bot
02
RE toolchain: Ghidra + ILSpy to decompile, x64dbg + Frida to live-debug the Teams client, Wireshark to capture — with Ghidra wired to an LLM over MCP
03
From-scratch Python implementations of Microsoft’s MS-NMF framing and NBFS binary-SOAP protocols
04
ICE / STUN media negotiation and SDES-SRTP decryption (RFC 3711 key derivation)
05
Microsoft’s native audio codec (SATIN / SilkWide) driven directly through Python ctypes, validated against golden captured frames
06
A parallel production agent on Azure ACS with Deepgram (STT) and ElevenLabs (TTS), registered as an OpenClaw skill
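
As an illustration of the SDES half of item 04, the sketch below pulls the SRTP master key and master salt out of an SDP crypto attribute. It assumes the standard `a=crypto:` syntax from RFC 4568 and the AES_CM_128_HMAC_SHA1 suites, where the inline value is base64 for a 16-byte key followed by a 14-byte salt. The function and type names are hypothetical, and the source does not state which suite Teams negotiates.

```python
import base64
from typing import NamedTuple


class SdesKey(NamedTuple):
    tag: int
    suite: str
    master_key: bytes   # 16 bytes for the AES-128 suites
    master_salt: bytes  # 14 bytes


# Suites assumed here; both use a 128-bit key and a 112-bit salt.
SUPPORTED_SUITES = {"AES_CM_128_HMAC_SHA1_80", "AES_CM_128_HMAC_SHA1_32"}


def parse_crypto_attribute(line: str) -> SdesKey:
    """Parse 'a=crypto:<tag> <suite> inline:<base64>[|lifetime][|MKI:len]'."""
    prefix = "a=crypto:"
    if not line.startswith(prefix):
        raise ValueError("not a crypto attribute")
    fields = line[len(prefix):].split()
    if len(fields) < 3:
        raise ValueError("expected tag, suite and key parameters")
    tag, suite = int(fields[0]), fields[1]
    if suite not in SUPPORTED_SUITES:
        raise ValueError(f"unsupported suite: {suite}")
    key_param = fields[2].split(";")[0]  # use the first key if several are listed
    if not key_param.startswith("inline:"):
        raise ValueError("only the inline key method is handled")
    # Drop the optional lifetime and MKI fields that follow the key material.
    encoded = key_param[len("inline:"):].split("|")[0]
    raw = base64.b64decode(encoded)
    if len(raw) != 30:
        raise ValueError("expected 16-byte key followed by 14-byte salt")
    return SdesKey(tag, suite, raw[:16], raw[16:])
```

The master key and salt returned here are the inputs to the RFC 3711 key derivation, which expands them into the per-session encryption, authentication, and salting keys.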
