1 Architectural Foundations: Mesh vs. MCU vs. SFU
Building a real-time video platform that supports 1,000-person meetings or tens of millions of daily participants forces you to confront architectural limits very early. The biggest decision is how media flows between users. WebRTC gives you encryption, congestion control, NAT traversal, and everything needed to move packets—but it does not tell you how your system should be structured. That’s your job. Pick the wrong topology and the whole platform collapses long before traffic reaches production scale.
This section explains, in practical terms, why Selective Forwarding Units (SFUs) are the only approach that makes sense for Zoom-level scale. And because this article targets a .NET implementation, the reasoning includes how these choices affect your network stack, memory footprint, and CPU usage.
1.1 The Topology Dilemma
1.1.1 Why P2P Mesh Collapses After 4–5 Users
Peer-to-peer mesh looks appealing because it feels simple: every participant connects directly to every other participant. All media stays encrypted end-to-end, and you don’t need a media server in the middle. Unfortunately, the math breaks almost immediately.
In a mesh network with N users, each person sends N–1 outgoing video streams and receives N–1 incoming streams, so the mesh as a whole carries N × (N–1) streams. That’s quadratic growth.
The upstream requirement alone becomes impossible:
Upstream usage = (number of peers) × (bitrate per stream)
At just 2 Mbps per HD stream:
- 4 users → 6 Mbps (barely workable)
- 6 users → ~10 Mbps (many home networks fail here)
- 10 users → ~18 Mbps (unusable for most of the world)
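The arithmetic above is simple enough to sketch directly; the 2 Mbps figure is the per-stream bitrate assumed in the example, and real encoders that adapt per receiver only make the totals worse.

```csharp
using System;

// Mesh bandwidth arithmetic. Assumes a flat bitrate per stream.
double MeshUpstreamMbps(int participants, double mbpsPerStream) =>
    (participants - 1) * mbpsPerStream;

// Every participant pays the same upstream cost, so the stream count
// across the whole mesh grows quadratically: N * (N - 1).
int TotalMeshStreams(int participants) =>
    participants * (participants - 1);

Console.WriteLine(MeshUpstreamMbps(10, 2.0)); // prints 18 (Mbps, upstream alone)
```

At 10 users the mesh as a whole is moving 90 streams, which is why the topology fails long before any single link saturates.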
The CPU hit is just as damaging. Each participant must run multiple video encoders simultaneously—one per peer—because each receiver might need a different resolution or bitrate. Browser encoders simply can’t keep up. Chrome can handle maybe two or three parallel encoders before frames start dropping.
Mesh also leaves you with no place to insert server-side recording, moderation, transcription, or analytics because the server never sees the media. Everything happens browser-to-browser.
Mesh is a fair choice for tiny group calls. But it breaks quickly and permanently once you try to scale beyond a handful of users.
1.1.2 Why MCU Transcoding Doesn’t Scale Economically
A Multipoint Control Unit (MCU) centralizes everything. All users send one high-quality stream to the MCU. The MCU decodes all of them, composites them into a single mixed layout, and re-encodes per participant.
This fixes the mesh bandwidth problem, but creates a new one: compute cost.
For 1,000 users, the MCU must:
- Decode 1,000 video streams
- Composite all frames into a layout (often multiple layouts)
- Re-encode 1,000 outbound streams
Even with dedicated GPU encoders, the math is painful. The video pipeline becomes hundreds of times more expensive than SFU routing. And the end-to-end latency through decode → composite → encode often exceeds acceptable limits for live interaction. Anything beyond 150–200 ms starts to feel laggy.
MCUs excel at specific scenarios—call centers, webinars, or fixed-layout broadcasts—but they cannot support one-to-many or many-to-many collaboration at global scale without an enormous hardware budget. That’s why platforms like Zoom, Meet, and Teams avoid full transcoding for general meetings.
1.1.3 Why SFU Is the Only Viable Choice for 1,000-User Interactive Meetings
A Selective Forwarding Unit (SFU) sits between participants but does almost no media processing. It receives encrypted SRTP packets, examines the headers, and forwards them to the right subscribers. It does not decode, mix, or re-encode video. The payload stays encrypted end-to-end.
An SFU:
- Looks at RTP headers to identify the stream (SSRC)
- Tracks which participants are subscribed to which streams
- Forwards packets exactly as they arrived
This makes the server extremely efficient. The cost is mostly network I/O and lightweight header parsing.
In practice:
- Each participant uploads a single video stream (or up to three if using simulcast)
- The SFU chooses the right quality layer for each recipient
- Routing scales linearly with the number of users
This is why an SFU can support thousands of participants per node while keeping CPU low and latency predictable. The SFU only routes packets and handles congestion signals—it never touches video pixels.
In one sentence: mesh collapses, MCU overheats, and SFU scales.
1.2 High-Level System Design
1.2.1 The Control Plane: .NET Web API and SignalR
WebRTC needs signaling, but it doesn’t specify how. You must provide a signaling channel to:
- Create, join, and leave rooms
- Exchange SDP Offers and Answers
- Exchange ICE candidates
- Handle permissions (host, co-host, guest)
- Send meeting events (mute, unmute, hand raise, etc.)
SignalR is a strong fit on .NET because it provides:
- A persistent, real-time WebSocket connection
- Automatic fallback handling
- Tight integration with ASP.NET Core authentication
- Simple group broadcasting
- Horizontal scaling with Redis or Azure SignalR
This control channel carries only metadata—not audio or video. It orchestrates the WebRTC handshake and room logic.
1.2.2 The Data Plane: Custom .NET UDP Listeners
Media transport is completely separate from signaling. Audio and video flow through SRTP over UDP, so the SFU must run a dedicated high-performance UDP router.
A simplified flow:
Browser → ICE/DTLS/SRTP → SFU UDP Router → Other Participants
On the server, the SFU must:
- Receive SRTP packets
- Decrypt only the header (not the video payload)
- Map SSRC values to users and tracks
- Forward packets to subscribed transports
Because media forwarding is time-critical, the SFU must avoid overhead. .NET 8/9 provides the right tools: SocketAsyncEventArgs, System.IO.Pipelines, and Span<T> allow you to build a zero-allocation packet router.
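A sketch of what allocation-free parsing looks like in practice: the helper reads the fixed 12-byte RTP header (per RFC 3550) straight out of a pooled receive buffer through a ReadOnlySpan<byte>, copying nothing.

```csharp
using System;
using System.Buffers.Binary;

// Parse the fixed RTP header in place. The span points into the
// receive buffer; no intermediate arrays are allocated.
static bool TryReadRtpHeader(ReadOnlySpan<byte> pkt,
    out ushort seq, out uint timestamp, out uint ssrc)
{
    seq = 0; timestamp = 0; ssrc = 0;
    if (pkt.Length < 12) return false;
    if ((pkt[0] >> 6) != 2) return false;            // RTP version must be 2
    seq       = BinaryPrimitives.ReadUInt16BigEndian(pkt.Slice(2));
    timestamp = BinaryPrimitives.ReadUInt32BigEndian(pkt.Slice(4));
    ssrc      = BinaryPrimitives.ReadUInt32BigEndian(pkt.Slice(8));
    return true;
}
```

Because the span is a view over the pooled buffer, the hot path performs zero heap allocations per packet.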
1.2.3 The Storage Plane
The SFU itself stays stateless. All dynamic room and participant state lives in Redis.
Redis stores:
- Room membership
- Active streams (SSRC → user mapping)
- Per-user bandwidth estimates
- Track metadata (camera, screen share, simulcast layers)
For recordings, you stream output directly to Azure Blob Storage. Block blobs allow chunked uploads, automatic cleanup rules, and long-term durability without impacting the SFU’s performance.
This three-plane model—control, data, storage—keeps the SFU lean and resilient.
1.3 Technology Selection & The .NET Advantage
1.3.1 Why .NET 8/9 Works for Real-Time Media Routing
Modern .NET is well-suited for high-throughput networking. Several improvements make it competitive with low-level C/C++ routers:
- Span<T> and ReadOnlySpan<T> allow in-place packet parsing
- Memory<T> supports reusable buffers
- System.IO.Pipelines handles continuous streaming workloads
- Native AOT gives lightweight, fast-starting binaries for edge nodes
- Thread-pool scheduling improvements reduce context switching
A well-designed .NET SFU can push tens of gigabits per second of RTP traffic while maintaining stable latency—more than enough for large meetings.
1.3.2 Using SIPSorcery for DTLS and WebRTC Handshake
Implementing DTLS-SRTP from scratch is risky and unnecessary. SIPSorcery provides a mature, open-source WebRTC stack that handles:
- DTLS handshakes
- ICE connectivity checks
- STUN/TURN message parsing
- SRTP key derivation
You get a clean API to react to DTLS data and extract key material. In outline (exact type and callback names vary by SIPSorcery version):
var dtls = new DtlsHandshake(
certificate,
dtlsRole, // client or server
OnDtlsData, // callback to send outgoing packets
OnKeyMaterial // callback for SRTP keys
);
Once keys are available, you construct SRTP transforms and connect them to your incoming and outgoing transport pipelines. This frees you to focus on routing logic, simulcast, congestion control, and multi-node clustering—the real work of building a Zoom-scale platform.
2 The Signaling Layer: Orchestrating Connections with SignalR
Once the architecture is in place, the next challenge is coordinating how participants actually connect. WebRTC handles media transport, but it relies on a separate signaling channel for exchanging SDP, ICE candidates, permissions, and room events. This signaling path must be reliable, ordered, and low latency. SignalR fits well here because it gives you a persistent WebSocket connection with simple group broadcasting and clean integration with ASP.NET Core.
The signaling layer is not responsible for handling audio or video. Its job is to coordinate the WebRTC handshake and keep everyone in the room aware of what’s happening.
2.1 SignalR Hub Design for SDP Exchange
2.1.1 Designing a Clear and Predictable Hub API
A straightforward MeetingHub typically exposes a handful of methods that reflect the WebRTC negotiation steps. None of these methods manipulate SDP—they simply relay messages to the right users or SFU nodes.
public class MeetingHub : Hub
{
public async Task JoinRoom(string roomId) { ... }
public async Task Offer(string roomId, string sdp) { ... }
public async Task Answer(string roomId, string sdp) { ... }
public async Task IceCandidate(string roomId, object candidate) { ... }
}
The usual signaling sequence looks like this:
- A participant calls JoinRoom.
- The server adds their connection to the appropriate SignalR group.
- The participant generates an SDP Offer and sends it to the hub.
- The hub relays the offer to the target peer or to the SFU worker managing that participant.
- The responding endpoint sends an SDP Answer back.
- ICE candidates flow through the hub until the two endpoints establish a direct transport.
The key rule: the server never inspects or modifies SDP. It just moves messages between parties so the SFU and browsers can complete their negotiation.
2.1.2 Handling Glare with the Perfect Negotiation Pattern
Glare happens when both sides send an SDP Offer at the same time. WebRTC’s perfect negotiation pattern avoids deadlock by designating one side as “polite” (usually the browser) and the other as “impolite” (often the SFU).
When glare occurs:
- The polite side rolls back its local description.
- It accepts the incoming offer instead of sending its own.
- It generates a matching Answer once the rollback completes.
Your SignalR layer doesn’t resolve glare directly—it just needs to support rollback messages and allow peers to retry cleanly. Browsers already provide:
pc.setLocalDescription({ type: "rollback" });
With this approach, your signaling layer stays simple and predictable even when multiple participants join or leave at the same time.
2.2 Scaling SignalR to 50 Million Users
Large-scale deployments rarely run on a single SignalR server. When thousands of meetings start simultaneously, and millions of clients connect from around the world, horizontal scaling becomes essential. The challenge is keeping room membership consistent when connections can land on any server in your cluster.
2.2.1 Using Redis to Distribute Groups Without Sticky Sessions
SignalR groups are the backbone of your meeting logic. Every broadcast—SDP messages, room updates, notifications—uses groups under the hood. For a multi-node deployment to work correctly:
- Each SignalR node must know which users belong to which rooms.
- Messages must reach all relevant users, regardless of which server holds their connection.
Redis acts as a shared backplane:
builder.Services.AddSignalR()
.AddStackExchangeRedis("redis:6379", options =>
{
options.Configuration.ChannelPrefix = "signalr";
});
Once enabled, you can remove sticky sessions from your load balancer. Any request can land on any server. Redis keeps group membership synchronized across all nodes.
This model is essential for large-scale meetings where you may have hundreds of SignalR servers operating at once.
2.2.2 Choosing Between Azure SignalR Service and Self-Hosted Kubernetes
For deployments at Zoom-like scale, the signaling layer must be both reliable and cost-efficient. There are two common approaches.
Azure SignalR Service
- Pros: Fully managed, automatically scales, resilient
- Cons: Higher cost, less flexibility if you need custom routing or low-level control
Self-Hosted SignalR on Kubernetes
- Pros: Full control over scaling, logging, and configuration
- Cons: You must manage Redis, cluster stability, and connection surges yourself
Most real-time platforms adopt a hybrid model: Kubernetes SignalR for flexibility + Managed Redis for reliability + Regional edge nodes for WebRTC.
This balances cost, performance, and operational simplicity.
2.3 Room Management and State
Room state is the glue that holds everything together—who’s in the call, who’s publishing video, who’s screen-sharing, and what features are enabled. The challenge is making room state consistent across hundreds of servers without slowing the system down.
2.3.1 Local State for Small Deployments, Redis for Scale
If you’re running a small or single-node deployment, a simple in-memory structure like:
ConcurrentDictionary<string, RoomState>
works well. It’s fast and avoids any network latency. But this only works when all clients connect to the same process.
For a distributed system:
- Redis becomes the source of truth.
- Every node reads and writes room state through Redis.
- You avoid “split brain” scenarios where different servers have different views of the same room.
Room state typically includes fields such as:
{
"userId": "123",
"video": true,
"audio": true,
"handRaised": false,
"streams": [ "camera", "screen" ]
}
This metadata allows the SFU to know which SSRC belongs to which track and ensures that subscribers receive the right streams.
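A minimal sketch of how that metadata might be serialized before being written to Redis. The key scheme in the comment is an assumption, not a fixed convention; any per-room hash works.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Per-user track metadata, mirroring the JSON shape above.
var state = new Dictionary<string, object>
{
    ["userId"]     = "123",
    ["video"]      = true,
    ["audio"]      = true,
    ["handRaised"] = false,
    ["streams"]    = new[] { "camera", "screen" }
};

string json = JsonSerializer.Serialize(state);
// e.g. await redis.HashSetAsync($"room:{roomId}:users", "123", json);
//      (hypothetical key layout)
```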
2.3.2 Keeping Late Joiners in Sync
One of the trickiest parts of large meetings is making sure late joiners see the room exactly as it is. Without proper synchronization, a client may receive RTP packets for tracks it doesn’t know about yet.
The signaling layer solves this by sending a room snapshot immediately after JoinRoom:
- Server loads room state from Redis.
- Server sends a complete participant list and stream metadata to the new user.
- Client creates or updates transceivers based on this state.
- Only after this setup does the user begin receiving media.
This avoids race conditions and ensures the user’s PeerConnection is ready before media starts arriving.
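A sketch of the snapshot payload itself; field names and the "RoomSnapshot" event name are illustrative. SignalR would push this to the caller immediately after JoinRoom completes.

```csharp
using System;
using System.Text.Json;

// Room snapshot for a late joiner: the full participant list plus
// stream metadata, sent before any RTP flows toward the new client.
var snapshot = new
{
    roomId = "room-42",
    participants = new[]
    {
        new { userId = "123", video = true,  audio = true, streams = new[] { "camera", "screen" } },
        new { userId = "456", video = false, audio = true, streams = Array.Empty<string>() }
    }
};

string payload = JsonSerializer.Serialize(snapshot);
// e.g. await Clients.Caller.SendAsync("RoomSnapshot", payload);  (assumed event name)
```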
3 The Core: Building the Selective Forwarding Unit (SFU) in .NET
The SFU is the heart of the system. Everything depends on its ability to receive thousands of SRTP packets per second, determine who should receive each packet, and forward those packets with as little delay as possible. There’s no room for unnecessary allocations, blocking I/O, or complex transformations. The SFU must stay lean, predictable, and extremely fast.
This section focuses on how to build that core in .NET—how packets arrive, how they’re parsed, how they’re mapped to users, and how the router decides what goes where.
3.1 UDP Transport and Socket Management
3.1.1 Setting Up UDP for High-Throughput Media Traffic
Media traffic for large meetings flows through UDP, and you need full control over how those packets are handled. A typical listener looks simple:
var socket = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);
socket.Bind(new IPEndPoint(IPAddress.Any, listenPort));
socket.ReceiveBufferSize = 8 * 1024 * 1024;
socket.SendBufferSize = 8 * 1024 * 1024;
Increasing buffer sizes helps the socket absorb short bursts without dropping packets. But the real performance gain comes from avoiding allocations. That’s where SocketAsyncEventArgs fits in:
var args = new SocketAsyncEventArgs();
args.SetBuffer(buffer);
args.RemoteEndPoint = new IPEndPoint(IPAddress.Any, 0); // required before ReceiveFromAsync
args.Completed += OnPacketReceived;
socket.ReceiveFromAsync(args);
Each callback delivers a raw UDP packet. Before the SFU can forward it, it must first figure out what type of packet it is. RFC 5764 defines how to demultiplex:
- STUN → first byte 0x00–0x03
- DTLS → first byte 0x14–0x3F (20–63)
- RTP/RTCP → first byte 0x80–0xBF (128–191)
This lets the SFU route the packet into the right processing pipeline without expensive inspection.
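The resulting dispatch is a single switch on the first byte, using the RFC 5764 ranges (0–3 STUN, 20–63 DTLS, 128–191 RTP/RTCP). Strings stand in for pipeline handles here.

```csharp
using System;

// RFC 5764 demultiplexing: one switch on the first byte decides
// which processing pipeline a datagram enters.
string Classify(byte firstByte) => firstByte switch
{
    <= 3              => "STUN",    // 0x00-0x03
    >= 20 and <= 63   => "DTLS",    // 0x14-0x3F
    >= 128 and <= 191 => "RTP",     // 0x80-0xBF, RTP or RTCP
    _                 => "UNKNOWN"
};
```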
3.1.2 Mapping SSRC Values to Users and Their Tracks
Every RTP packet contains a 32-bit SSRC (Synchronization Source) that identifies the source track. That’s how the SFU knows which user and which media track the packet belongs to.
The RTP header layout (first 12 bytes):
0: version, padding, extension, CSRC count
1: marker bit, payload type
2–3: sequence number
4–7: timestamp
8–11: SSRC
Your SFU keeps a mapping like:
ConcurrentDictionary<uint, MediaStreamRoute> _routes;
A MediaStreamRoute entry typically contains:
- The publisher’s transport (where the packets came from)
- A list of subscribers who should receive the packets
- The active simulcast layer for this track
This SSRC map is essentially the SFU’s routing table. Every incoming packet uses it.
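The table itself can be as small as a concurrent dictionary. In this sketch a tuple stands in for the MediaStreamRoute type, and the SSRC value and user names are illustrative; a real route would also carry the publisher's transport and simulcast state.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// SSRC -> route. A tuple stands in for MediaStreamRoute here.
var routes = new ConcurrentDictionary<uint,
    (string Publisher, List<string> Subscribers, string ActiveLayer)>();

// Alice publishes a camera track; Bob subscribes to it.
routes[0x1234ABCD] = ("alice", new List<string>(), "h");
routes[0x1234ABCD].Subscribers.Add("bob");

// Hot path: exactly one lookup per incoming packet.
if (routes.TryGetValue(0x1234ABCD, out var route))
{
    foreach (var sub in route.Subscribers)
    {
        // forward the packet to sub's transport here
    }
}
```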
3.2 DTLS and SRTP
3.2.1 Performing the DTLS Handshake in .NET
Before the SFU can read RTP headers, it must complete the DTLS handshake to derive SRTP keys. SIPSorcery keeps this straightforward; in outline:
var dtls = new DtlsSrtpTransport(isServer: true);
dtls.OnData += OnDtlsData;
dtls.OnKeyingMaterial += OnKeysReady;
Once the handshake completes, the callback provides key material. From there, you initialize SRTP:
var srtp = new SrtpTransform(dtls.LocalKey, dtls.RemoteKey);
Now the SFU can authenticate and decrypt only what it needs—the header extensions—without touching the video payload.
3.2.2 Decrypting Only What the Router Needs
SRTP fully encrypts and authenticates packets. But an SFU never needs access to the audio or video itself. It only needs:
- SSRC
- Sequence numbers
- Timestamps
- Codec-specific header extension fields (e.g., VP8 PictureID)
SIPSorcery exposes minimal decryption via:
if (_srtp.UnprotectRtp(packetBuffer, out int len))
{
ParseRtpHeader(packetBuffer.Span.Slice(0, len));
}
The payload remains encrypted and is forwarded as-is. This preserves end-to-end media confidentiality and keeps CPU cost low.
3.3 The Router Logic (The “Selective” Part)
3.3.1 A Simple Publisher–Subscriber Routing Model
Every incoming packet runs through the router. The router checks which subscribers are interested in this publisher’s track and forwards the packet to each one.
var route = _routes[ssrc];
foreach (var sub in route.Subscribers)
{
sub.Srtp.SendRtp(packet);
}
There is no transcoding, no frame interpolation, no decoding. The SFU acts like a post office—fast, efficient, and predictable.
3.3.2 Implementing Simulcast for Bandwidth Flexibility
Modern browsers often send three versions of the same video stream—high, medium, and low—using Restriction Identifiers (RIDs):
a=simulcast:send h;m;l
a=rid:h send
a=rid:m send
a=rid:l send
The SFU decides which layer each subscriber gets:
- High layer → strong connections
- Medium layer → moderate
- Low layer → limited bandwidth
A simple rule-of-thumb might look like:
if (subscriber.Bandwidth < 400_000)
layer = "l";
else if (subscriber.Bandwidth < 900_000)
layer = "m";
else
layer = "h";
The SFU typically maintains rolling bandwidth estimates using RTT, packet loss, and incoming traffic rate. When the estimate changes significantly, the SFU switches layers. This keeps video smooth while preventing congestion collapse.
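One way to keep switching stable is a little hysteresis on upgrades. The thresholds match the rule-of-thumb above; the 10% headroom margin is an assumed tuning value, not a standard.

```csharp
using System;

// Pick a simulcast layer from a bandwidth estimate (bps). Downgrades
// apply immediately; upgrades require 10% headroom so a noisy estimate
// hovering near a threshold doesn't flap between layers.
string SelectLayer(double estimateBps, string currentLayer)
{
    int Rank(string layer) => layer == "h" ? 2 : layer == "m" ? 1 : 0;
    double[] floors = { 0, 400_000, 900_000 };   // minimum estimate per layer

    int target = estimateBps >= 900_000 ? 2 : estimateBps >= 400_000 ? 1 : 0;
    if (target > Rank(currentLayer) && estimateBps < floors[target] * 1.10)
        target = Rank(currentLayer);             // not enough headroom yet

    return target == 2 ? "h" : target == 1 ? "m" : "l";
}
```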
3.3.3 Applying Temporal Scalability (SVC) When Congestion Occurs
Scalable Video Coding (SVC) is a more advanced form of adaptability. Instead of sending multiple encodings, the encoder sends multiple temporal layers in the same stream:
- T0 – Base layer (lowest frame rate)
- T1 – Additional frames
- T2 – Full frame rate
When congestion appears:
- The SFU drops T2 and T1 packets
- The subscriber receives only the minimal T0 layer
- Playback remains uninterrupted, just lower frame rate
To do this safely, the SFU must inspect codec-specific metadata in the RTP header extensions. It cannot simply drop random packets; it must drop whole layers to avoid breaking decode order.
This gives you fine-grained control: users with bad networks still see motion, while users on good connections get full quality.
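The forwarding decision itself is tiny once the temporal ID has been extracted from the codec's payload descriptor (for VP8, the TID field); that parsing is codec-specific and omitted here.

```csharp
using System;

// Forward a packet only if its temporal layer fits the subscriber's
// current budget. Dropping whole T1/T2 layers this way leaves the T0
// base layer independently decodable.
bool ShouldForward(int temporalId, int maxTemporalLayer) =>
    temporalId <= maxTemporalLayer;

// Under congestion: clamp the subscriber to the base layer.
int maxLayer = 0;
Console.WriteLine(ShouldForward(0, maxLayer)); // True  -> T0 passes
Console.WriteLine(ShouldForward(2, maxLayer)); // False -> T2 dropped
```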
4 Network Resilience: NAT Traversal and Adaptive Bitrate
A large portion of real-time video failures happen before a single frame of video is ever shown. Most issues come from the network path: users sitting behind strict firewalls, double NAT home routers, enterprise proxies, or mobile carrier networks that aggressively block UDP. Even when the path is established, bandwidth can jump up and down quickly depending on Wi-Fi interference, 4G/5G transitions, or congestion at the ISP level.
To build a Zoom-scale system, the SFU must handle two things reliably:
- Establish a working UDP route for every participant, regardless of how restrictive their network is.
- Adapt media rates continuously so video remains smooth even when the network fluctuates.
The following sections show how STUN, TURN, and bandwidth estimation fit into a .NET-based SFU architecture.
4.1 NAT Traversal Strategy
4.1.1 Using STUN to Discover Public Addresses
STUN is the lightweight, fast way for a browser to learn how its traffic is mapped by the user’s NAT. When the browser gathers ICE candidates, STUN produces “server reflexive” candidates that represent the client’s public-facing IP address.
The SFU doesn’t need to modify STUN messages—it just needs to route them correctly and associate them with the right DTLS transport once ICE checks start.
Typical client configuration:
const pc = new RTCPeerConnection({
iceServers: [
{ urls: "stun:stun.l.google.com:19302" },
{ urls: "stun:your-sfu-edge.example.com:3478" }
]
});
As ICE candidates come in, the client sends them to the SFU through SignalR. The SFU relays them to the correct peer or media node. No parsing is required; it’s just signaling traffic.
4.1.2 TURN with Coturn for Locked-Down Networks
STUN alone doesn’t help in environments where outbound UDP is blocked or where symmetric NATs prevent peer-to-peer routing. When the browser cannot establish a direct UDP path, it falls back to TURN.
TURN servers relay all media traffic between client and SFU, using UDP, TCP, or TLS depending on what’s allowed. Coturn is the standard open-source choice and integrates smoothly with .NET.
A minimal turnserver.conf might look like:
realm=example.com
use-auth-secret
static-auth-secret=YOUR_SHARED_SECRET
cert=/etc/ssl/certs/fullchain.pem
pkey=/etc/ssl/private/privkey.pem
no-tcp-relay
no-loopback-peers
A simple Docker run:
docker run -d --network=host \
-v /etc/turnserver.conf:/etc/turnserver.conf \
instrumentisto/coturn:latest
TURN ensures that even users behind the toughest network conditions can join meetings—at the cost of additional bandwidth on the server side. This becomes critical for enterprise and education environments.
4.1.3 Generating Per-Session TURN Credentials from .NET
For security, TURN credentials should not be static. Coturn supports short-lived credentials derived using HMAC-SHA1. This means the client receives credentials that automatically expire after a set time window.
A lightweight .NET generator:
public class TurnCredentialsService
{
private readonly string _secret = "YOUR_SHARED_SECRET";
public (string Username, string Password) Generate()
{
var timestamp = DateTimeOffset.UtcNow.ToUnixTimeSeconds() + 3600;
var username = $"{timestamp}";
var key = Encoding.UTF8.GetBytes(_secret);
var msg = Encoding.UTF8.GetBytes(username);
using var hmac = new HMACSHA1(key);
var passwordBytes = hmac.ComputeHash(msg);
var password = Convert.ToBase64String(passwordBytes);
return (username, password);
}
}
SignalR sends these credentials to authenticated clients during the join workflow:
iceServers: [
{
urls: "turn:turn.example.com:3478?transport=udp",
username: turn.username,
credential: turn.password
}
]
If UDP fails, the browser automatically retries with TCP or TLS. This is slower, but it keeps the session alive, which is the goal.
4.2 Bandwidth Estimation (BWE)
Even once the SFU has a working path, the network may change dramatically during a meeting. A user can move rooms, switch Wi-Fi networks, or lose connectivity for a moment. The SFU must react to bandwidth changes fast enough to avoid visible quality drops.
4.2.1 Using GCC Concepts for Live Bandwidth Tracking
WebRTC uses Google Congestion Control (GCC), which analyzes packet arrival times, loss, and delay. While the SFU doesn’t encode video, it still shapes traffic by:
- Switching simulcast layers
- Dropping temporal layers
- Sending RTCP feedback asking the sender to slow down
A simple estimator might look like:
public class BandwidthEstimator
{
private double _estimate = 1_200_000; // 1.2 Mbps initial
private const double Gain = 0.05;
public double Update(long packetSizeBytes, long arrivalDeltaMs)
{
var current = (packetSizeBytes * 8.0) / (arrivalDeltaMs / 1000.0); // instantaneous bps
_estimate = (_estimate * (1 - Gain)) + (current * Gain);
return _estimate;
}
public double GetEstimate() => _estimate;
}
Real implementations incorporate RTT, loss rate, TWCC timing, and more—but the principle remains the same: use recent packet behavior to adjust routing decisions.
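The loss-based half of GCC can be layered on top of the delay-based estimate; the 10% and 2% thresholds below are the ones from the GCC draft, applied here in a deliberately bare sketch.

```csharp
using System;

// Loss-based rate control per the GCC draft: back off when loss exceeds
// 10%, probe upward by 5% when loss is under 2%, otherwise hold.
double ApplyLossControl(double estimateBps, double lossRate) =>
    lossRate > 0.10 ? estimateBps * (1 - 0.5 * lossRate)
  : lossRate < 0.02 ? estimateBps * 1.05
  : estimateBps;
```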
4.2.2 Reading RTCP RR and TWCC for Accurate Feedback
Feedback from clients is just as important as traffic observations on the server. RTCP Receiver Reports tell the SFU:
- How many packets were lost
- Jitter as measured by the receiver
- Highest sequence number received
TWCC (Transport-Wide Congestion Control) gives even more detail: it includes timing information for each packet, allowing near real-time detection of congestion.
Reading TWCC, sketched against SIPSorcery-style types:
if (RtcpPacket.GetRTCPPacketType(packet) == RtcpPacketTypesEnum.TransportLayerFeedback)
{
var feedback = TransportWideCCFeedbackPacket.Parse(packet);
OnTwccFeedback(feedback);
}
This data helps the SFU react quickly—often within a few RTTs—before the user notices video stutter.
4.2.3 Using REMB or TMMBR to Tell Clients to Slow Down
WebRTC allows the SFU to ask senders to reduce their bitrate. Even though the SFU doesn’t touch the media payload, it can influence encoding behavior.
Two main options:
- REMB (Receiver Estimated Maximum Bitrate): still widely supported
- TMMBR (Temporary Maximum Media Bitrate Request): the newer alternative
A simple REMB construction (packet type names illustrative):
var remb = new RtcpRembPacket
{
SenderSSRC = ssrc,
Bitrate = targetBitrate,
SsrcFeedbacks = new[] { ssrc }
};
srtp.SendRtcp(remb.GetBytes());
The browser’s encoder immediately adjusts its output rate. When combined with simulcast switching, the result is smooth, stable video even on unstable networks.
5 Advanced Features: Breakout Rooms, Waiting Rooms, and Screen Sharing
As soon as you move beyond basic audio and video, a real meeting platform needs richer features—waiting rooms, promotions, breakout rooms, and screen sharing. These features don’t require changes to the core SFU architecture, but they do rely heavily on clean signaling and consistent state management. The goal is to handle these features through the control plane and Redis while letting the SFU continue doing what it does best: routing media.
5.1 JWT-Based Admission and Waiting Rooms
5.1.1 Validating Tokens Before the WebSocket Opens
Large meeting systems must control who can join a room. The easiest place to enforce this is during the SignalR handshake. If a user isn’t authenticated, the WebSocket should never open. ASP.NET Core middleware gives you this hook.
A simple JWT validation middleware:
public class JwtConnectionMiddleware : IMiddleware
{
private readonly TokenValidationParameters _parameters;
public JwtConnectionMiddleware(TokenValidationParameters parameters)
{
_parameters = parameters;
}
public async Task InvokeAsync(HttpContext context, RequestDelegate next)
{
var token = context.Request.Query["access_token"];
if (!string.IsNullOrEmpty(token))
{
try
{
var handler = new JwtSecurityTokenHandler();
context.User = handler.ValidateToken(token, _parameters, out _);
}
catch (SecurityTokenException)
{
// Invalid or expired token: fall through to the 401 below.
}
}
if (context.User?.Identity?.IsAuthenticated != true)
{
context.Response.StatusCode = 401;
return;
}
await next(context);
}
}
You plug it into the pipeline:
app.UseMiddleware<JwtConnectionMiddleware>();
app.MapHub<MeetingHub>("/ws/meeting");
By the time the hub runs, you know the connection belongs to a valid user. This matches the pattern from previous sections—SignalR handles signaling, Redis holds state, and the SFU focuses on media.
5.1.2 Waiting Room and Promotion Logic
Not everyone joins the active meeting immediately. Many platforms route new users to a waiting room until a host approves them. Redis is a natural place to track this because all nodes in the cluster need a consistent view of room state.
Waiting and active sets:
waiting:{roomId}
active:{roomId}
When a user calls JoinRoom:
public async Task JoinRoom(string roomId)
{
var userId = Context.UserIdentifier;
await _redis.SetAddAsync($"waiting:{roomId}", userId);
await Clients.Caller.SendAsync("WaitingRoomStatus");
}
The host sees the waiting list, selects a user, and promotes them:
public async Task Promote(string roomId, string userId)
{
await _redis.SetMoveAsync($"waiting:{roomId}", $"active:{roomId}", userId);
var connectionId = await _redis.HashGetAsync("connections", userId);
await Groups.AddToGroupAsync(connectionId, roomId);
await Clients.Client(connectionId).SendAsync("Promoted");
}
Promotion does a few things:
- Moves the user into the active room set
- Adds them to the SignalR group
- Triggers a room state sync
- Allows WebRTC negotiation to start
This fits cleanly into the system you’ve built so far—SignalR stays the orchestration layer, Redis tracks membership, and the SFU starts routing media once the user publishes streams.
5.2 Screen Sharing and Dual-Stream Architecture
5.2.1 Treating Screen Share as a Second Video Track
Screen sharing is more than just sending a camera feed. It often requires higher resolution, lower frame rate, and different encoding hints. Browsers expose screen sharing as a completely separate track, which means the SFU sees a new SSRC.
Typical browser code:
const displayStream = await navigator.mediaDevices.getDisplayMedia({
video: { frameRate: 10 }
});
pc.addTrack(displayStream.getVideoTracks()[0], displayStream);
The SFU registers this just like any other published track. It becomes a second route in the SSRC map, with its own subscription list. Because SFU forwarding is track-based, screen sharing fits naturally into the model.
5.2.2 Renegotiation Without Complicating the Server
When a new track appears, the browser enters renegotiation mode. The SFU doesn’t need to interpret these changes; it only relays the fresh Offer/Answer.
Client-side:
pc.onnegotiationneeded = async () => {
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
hub.invoke("Offer", roomId, offer.sdp);
};
Receiving participants automatically fire ontrack once the renegotiation completes. The SFU’s job is only to update SSRC routing and forward packets as usual. No media pipeline changes are required.
5.2.3 Giving Screen Sharing Higher Priority in the Router
Screen content—slides, spreadsheets, diagrams—becomes unreadable quickly if resolution drops too low. To avoid this, the SFU can detect screen tracks and treat them differently when selecting simulcast layers.
For example:
if (stream.Type == StreamType.Screen)
{
subscriber.Layer = "high"; // always route full resolution
}
This simple rule ensures text remains sharp even under moderate network fluctuation. Camera video may shift between high, medium, or low layers, but screen share usually stays at the highest viable layer.
5.3 Breakout Rooms & Logical Partitioning
5.3.1 Splitting Participants Without Dropping Connections
Breakout rooms allow a large meeting to split into multiple smaller conversations. A naïve approach would disconnect users and force them to rejoin new rooms—but that triggers renegotiation and ICE restarts, slowing down the experience.
A cleaner approach is to treat breakout rooms as logical partitions. Everyone stays connected to the same SignalR hub and the same SFU instance, but routing rules change.
Redis groups:
active:{roomId}
breakout:{roomId}:{breakoutId}
This keeps the control plane stable while the data plane adjusts.
5.3.2 Updating Subscriptions at the SFU Level
When a user moves into a breakout room:
- They are removed from the main room subscriber lists.
- They are added to the breakout-specific subscriber lists.
- Their client receives an updated room state so it knows who to expect.
On the SFU:
public void MoveToBreakout(string userId, string breakId)
{
    foreach (var stream in _streams)
    {
        // Remove main room subscription
        stream.Value.Subscribers.Remove(userId);

        // Add breakout subscription if relevant
        if (_breakoutGroups[breakId].Contains(stream.Value.Publisher))
        {
            stream.Value.Subscribers.Add(userId);
        }
    }
}
This immediately changes which packets the user receives. No renegotiation is needed because the tracks themselves haven’t changed—only which users are subscribed.
5.3.3 Returning to the Main Room Smoothly
When breakout rooms close:
- Users are removed from the breakout group
- They rejoin the main group
- The SFU updates their subscriptions
- The client receives a new participant list and continues receiving packets
Because the SFU hasn’t modified transport or track definitions, the transition feels instant. No ICE restarts, no reconnects, no visible interruption.
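The subscription bookkeeping described in 5.3.2 and 5.3.3 can be modeled in a few lines. This in-memory Python sketch is illustrative only; in production the partition state lives in Redis as shown earlier.

```python
class Room:
    """In-memory sketch of breakout subscription bookkeeping.
    Class and method names are illustrative, not a fixed API."""

    def __init__(self, users):
        self.users = set(users)
        self.breakouts = {}  # breakout id -> member set
        # publisher -> subscriber set; main room: everyone sees everyone
        self.streams = {u: self.users - {u} for u in users}

    def _partition_of(self, user):
        # A user's partition is their breakout, or the main room.
        for members in self.breakouts.values():
            if user in members:
                return members
        in_breakout = set().union(*self.breakouts.values())
        return self.users - in_breakout

    def _resync(self):
        # Rebuild every publisher's subscriber list from the partitions.
        for publisher in self.streams:
            self.streams[publisher] = self._partition_of(publisher) - {publisher}

    def move_to_breakout(self, user, breakout_id):
        for members in self.breakouts.values():
            members.discard(user)
        self.breakouts.setdefault(breakout_id, set()).add(user)
        self._resync()

    def return_to_main(self, user):
        for members in self.breakouts.values():
            members.discard(user)
        self._resync()
```

Note that only subscriber sets change; no track or transport state is touched, which is why the moves need no renegotiation.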
6 Scaling to 1000 Users: Cascading and Clustering
At some point, even the most optimized SFU reaches the limits of a single machine. Zero-allocation parsing, efficient UDP loops, and simulcast logic get you far, but hardware constraints eventually take over. Once meetings grow into the hundreds and thousands, one node can no longer handle all traffic safely. Latency becomes unstable, CPU bursts create jitter, and NIC queues start dropping packets.
To support Zoom-scale meetings, the SFU must scale horizontally. That requires a design where multiple SFU nodes cooperate, share stream availability, and route packets across the cluster without forcing clients to reconnect or renegotiate. This section explains why a single-node SFU eventually breaks down and how a cascading SFU architecture solves the problem cleanly.
6.1 The Physical Limit of a Single Node
6.1.1 CPU Load and Forwarding Pressure
Even though SFUs avoid transcoding, they still perform significant work for every packet:
- DTLS header decryption
- SRTP authentication
- Parsing RTP headers
- Running congestion logic
- Forwarding packets to multiple subscribers
A 1,000-person meeting generates a huge number of packets per second. A typical mid-tier VM (8–16 vCPUs) can forward roughly 10–15 Gbps of SRTP traffic before CPU becomes saturated. Large meetings can exceed 25 Gbps when you include retransmissions and outgoing streams.
As CPU load approaches its limit:
- Packet forwarding delays increase
- Scheduling jitter grows
- SRTP authentication slows down
- Congestion logic becomes less responsive
Even with the optimizations described in earlier sections, a single node eventually struggles to stay ahead of the packet rate.
6.1.2 Bandwidth Saturation and NIC Bottlenecks
Most cloud VMs provide NICs rated between 10 Gbps and 25 Gbps. These limits are easy to hit in a large meeting. Remember:
- A participant publishes one simulcast stream.
- The SFU may forward that stream to hundreds of others.
Aggregate outbound bandwidth rises extremely quickly. Once the NIC queue fills, packets drop, triggering:
- NACK messages
- PLI storms
- Layer switching
- Bandwidth collapse
Even powerful bare-metal servers start struggling once aggregate outbound traffic crosses ~30 Gbps.
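The arithmetic behind these ceilings is easy to sketch. The bitrate and packet-size figures below are illustrative assumptions, not measurements:

```python
# Back-of-envelope forwarding load for one SFU node. Bitrate and packet
# size are illustrative assumptions, not measurements.

def forwarding_load(publishers, viewers_per_stream,
                    bitrate_bps=1_000_000, packet_bytes=1200):
    """Return (egress_gbps, packets_per_second) for a single node."""
    egress_bps = publishers * viewers_per_stream * bitrate_bps
    pps = egress_bps / (packet_bytes * 8)
    return egress_bps / 1e9, pps

# A 1,000-person meeting where 25 active speakers reach everyone else:
gbps, pps = forwarding_load(publishers=25, viewers_per_stream=975)
print(f"{gbps:.1f} Gbps egress, ~{pps / 1e6:.1f}M packets/s")
```

At roughly 24 Gbps of egress, this one meeting already exceeds a typical 10–25 Gbps NIC, before retransmissions are counted. A single node cannot hold it.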
6.1.3 UDP Port Pressure and Thread Context Switching
Each WebRTC transport needs UDP sockets for RTP and RTCP: a pair per transport, or a single socket when rtcp-mux is negotiated (the modern default). With hundreds of participants:
- The OS juggles thousands of UDP endpoints
- Ephemeral ports approach their limits
- The scheduler thrashes as transports compete for CPU
Even with SocketAsyncEventArgs, IO polling, and thread-pool tuning, context switching overhead grows significantly.
All three bottlenecks—CPU, NIC bandwidth, and port pressure—create a hard ceiling. A single-node SFU cannot safely support a thousand interactive participants. The solution is a cluster of SFUs working together.
6.2 Cascading SFU Architecture (Pipe-and-Filter)
6.2.1 Edge Nodes and Origin Nodes
A scalable SFU cluster separates responsibilities across two tiers:
Edge Nodes
- Handle WebRTC handshakes
- Terminate DTLS/SRTP
- Maintain publisher-to-subscriber mappings
- Route local streams to local users
Origin Nodes (Core Nodes)
- Aggregate traffic across edges
- Maintain global knowledge of active streams
- Relay cross-region or cross-edge traffic
Each participant connects to the closest Edge Node. Only traffic that needs to reach participants on other nodes goes through the Origin Node. This limits cross-node bandwidth and spreads the load evenly.
A typical flow looks like:
Client → Local Edge → Origin → Remote Edge → Remote Client
This resembles a pipe-and-filter architecture: each SFU stage handles a small portion of the routing pipeline.
6.2.2 Relaying Streams Between SFU Nodes
When a user on Edge A publishes a stream and someone on Edge B subscribes, the two nodes must exchange packets. They set up a long-lived relay channel. A simple choice is a raw TCP socket between nodes because:
- Datacenter links between nodes are low-loss, so TCP head-of-line blocking is rarely a problem in practice
- It reduces overhead compared to WebSockets or message queues
- It allows multiplexing multiple streams over one connection
For lossy cross-region links, a UDP- or QUIC-based relay avoids head-of-line blocking entirely.
A relay loop might look like this:
Sender:
public async Task RelayLoop(NetworkStream stream)
{
    var lengthPrefix = new byte[2];
    while (true)
    {
        var packet = await _relayQueue.Reader.ReadAsync();
        // Length-prefix each packet so the receiver can find frame boundaries
        BinaryPrimitives.WriteUInt16LittleEndian(lengthPrefix, (ushort)packet.Length);
        await stream.WriteAsync(lengthPrefix);
        await stream.WriteAsync(packet);
    }
}
Receiver:
public async Task ReceiveRelay(NetworkStream stream)
{
    var lengthBytes = new byte[2];
    while (true)
    {
        await stream.ReadExactlyAsync(lengthBytes); // .NET 7+
        int length = BinaryPrimitives.ReadUInt16LittleEndian(lengthBytes);
        var buffer = ArrayPool<byte>.Shared.Rent(length);
        try
        {
            await stream.ReadExactlyAsync(buffer.AsMemory(0, length));
            ProcessRelayedPacket(buffer, length);
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
The relay path never decrypts payloads. It simply moves encrypted SRTP packets from one SFU to another, which keeps the media encrypted in transit and avoids unnecessary processing.
6.2.3 Publishing Router Capabilities Across the Cluster
To make correct routing decisions, each node must know:
- Which users are connected
- Which streams they publish (SSRC, RID, type)
- Available simulcast layers
- Which node holds each stream
- Health metrics (CPU, bandwidth, packet drop trends)
Nodes share this information using Redis, gRPC, or a message bus. A capabilities payload might look like:
{
"nodeId": "edge-1",
"streams": [
{ "ssrc": 1091234, "publisher": "u12", "layers": ["h","m","l"] }
]
}
Using this information, nodes decide whether they need a relay route for a specific subscriber. For example:
public void UpdateRouting(string publisher, string newNode)
{
    foreach (var subscriber in _subscribers)
    {
        if (subscriber.EdgeNode != newNode)
        {
            _relays[newNode].AddSubscriber(subscriber);
        }
    }
}
No client renegotiation is required because streams are routed between SFUs, not re-created.
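To make the directory concrete, here is a minimal sketch of merging capability payloads and deciding when a relay route is needed. The field names follow the JSON payload above; the function names are illustrative.

```python
import json

# Merge per-node capability payloads (shaped like the JSON above) into a
# cluster-wide stream directory keyed by SSRC.

def build_directory(payloads):
    directory = {}
    for raw in payloads:
        node = json.loads(raw)
        for s in node["streams"]:
            directory[s["ssrc"]] = {
                "publisher": s["publisher"],
                "layers": s["layers"],
                "node": node["nodeId"],
            }
    return directory

def needs_relay(directory, ssrc, subscriber_node):
    # A relay is required only when the publisher lives on a different node.
    return directory[ssrc]["node"] != subscriber_node
```

A subscriber on edge-2 asking for a stream published on edge-1 triggers a relay route; a subscriber on edge-1 is served locally.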
6.2.4 Handling Node Failures Without Breaking the Meeting
One of the biggest advantages of a multi-node SFU architecture is failure isolation. If an edge node goes down:
- Clients automatically reconnect to another edge via DNS or load balancer
- WebRTC renegotiation is minimal because they reconnect quickly
- Stream routing updates through Redis
- Relay paths are rebuilt internally
The rest of the meeting continues without interruption. There is no global “meeting reset.” This is essential for large events, where a node failure must not impact hundreds of participants.
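The reconnect target can be chosen with a simple health-weighted selection. In this sketch, the cpu and drop_rate fields mirror the health metrics shared in 6.2.3, and the thresholds are assumptions:

```python
# Illustrative failover target selection. Thresholds are assumptions.

def pick_edge(edges, failed_node):
    """Pick the healthiest surviving edge for a reconnecting client."""
    candidates = [e for e in edges
                  if e["nodeId"] != failed_node
                  and e["cpu"] < 0.85          # leave headroom for the influx
                  and e["drop_rate"] < 0.02]   # avoid already-congested nodes
    if not candidates:
        return None
    # Prefer low CPU first, then the lowest packet-drop trend.
    return min(candidates, key=lambda e: (e["cpu"], e["drop_rate"]))["nodeId"]
```

In practice this logic sits behind the load balancer or DNS layer, so clients simply reconnect and land on the chosen node.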
7 The Egress: Distributed Cloud Recording (Replacing AMS)
Recording is a different problem than routing live media. Live streams pass through the SFU, but recordings need a stable, predictable visual representation of the meeting. Azure Media Services used to solve this, but with AMS retired, teams now need a portable, cloud-friendly alternative.
The most practical pattern is to use a “silent participant”—a headless client that joins the meeting like any other user but exists only to capture the rendered meeting view. This keeps the recording pipeline decoupled from the SFU and avoids the need for server-side compositing or video mixing.
This section explains how to build a distributed recorder using headless browsers, Xvfb, and FFmpeg, with output written directly to Azure Blob Storage.
7.1 The “Silent Participant” Pattern
7.1.1 Using a Headless Browser as a Hidden User
Instead of trying to mix media server-side, the system launches a lightweight browser instance inside a container. That browser:
- Authenticates like a regular participant
- Joins the meeting via WebRTC
- Renders whatever layout your frontend uses
- Never transmits audio or video
- Only receives incoming tracks
This approach matches how the SFU already works—routing streams to subscribers—and avoids adding layout logic to the backend.
A minimal Python + Pyppeteer example:
import asyncio
from pyppeteer import launch

async def start():
    browser = await launch(args=['--no-sandbox', '--disable-gpu'])
    page = await browser.newPage()
    # The token in the query string logs the recorder in automatically
    await page.goto("https://your-app/recording?meetingId=123")

asyncio.run(start())
The recorder is simply another subscriber from the SFU’s perspective. No special signaling is required.
7.1.2 Rendering With a Virtual Framebuffer (Xvfb)
Headless containers typically do not provide GPU-accelerated rendering. Xvfb solves this by creating a virtual display that Chromium or Firefox can draw into. FFmpeg then captures frames directly from that display.
Launching Xvfb:
Xvfb :99 -screen 0 1920x1080x24 &
export DISPLAY=:99
The headless browser binds to DISPLAY=:99. Everything else works the same way as in a normal user session. The SFU sends packets, the browser decodes them, and Xvfb renders the output.
This gives you a stable, deterministic video surface to record.
7.2 FFmpeg Composition Pipeline
7.2.1 Capturing the Virtual Display Output
FFmpeg is the workhorse for real-time recording. With X11 capture, it reads directly from the virtual framebuffer; a PulseAudio input supplies the meeting audio, since x11grab captures video only:
ffmpeg -f x11grab -r 30 -s 1920x1080 -i :99 \
       -f pulse -i default \
       -vcodec libx264 -preset veryfast -pix_fmt yuv420p \
       -acodec aac -ar 48000 -ac 2 \
       output.mp4
You can adjust frame rate, resolution, and bitrate based on meeting type:
- High frame rate for active speaker mode
- Lower frame rate for screen share–focused sessions
- Lower bitrate for long-running trainings or webinars
The key advantage is repeatability: whichever layout the browser displays is exactly what gets recorded.
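The per-meeting-type tuning above can be captured in a small argument builder. The profile values in this sketch follow the guidance in the bullets and are starting points, not tuned production settings:

```python
# Build an x11grab capture command from the meeting type. Profile values
# are illustrative starting points.

PROFILES = {
    "speaker": {"fps": 30, "bitrate": "3000k"},  # active-speaker view
    "screen":  {"fps": 10, "bitrate": "1500k"},  # screen-share focused
    "webinar": {"fps": 15, "bitrate": "1000k"},  # long-running sessions
}

def capture_args(meeting_type, display=":99", size="1920x1080",
                 output="output.mp4"):
    p = PROFILES[meeting_type]
    return ["ffmpeg",
            "-f", "x11grab", "-r", str(p["fps"]), "-s", size, "-i", display,
            "-vcodec", "libx264", "-preset", "veryfast",
            "-b:v", p["bitrate"], "-pix_fmt", "yuv420p",
            output]
```

The recorder container picks a profile when it joins, so a screen-share-heavy session spends its bitrate on resolution instead of frame rate.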
7.2.2 Streaming FFmpeg Output Directly to Azure Blob Storage
Writing to local disk inside a container adds unnecessary overhead, especially for long meetings. Instead, run FFmpeg in pipe mode and upload the bytes as they’re produced.
A .NET example:
var process = new Process
{
    StartInfo = new ProcessStartInfo
    {
        FileName = "ffmpeg",
        // frag_keyframe+empty_moov lets the MP4 muxer write to a non-seekable pipe
        Arguments = "-i ... -movflags frag_keyframe+empty_moov -f mp4 pipe:1",
        RedirectStandardOutput = true,
        UseShellExecute = false // required when redirecting streams
    }
};
process.Start();

var container = blobClient.GetBlobContainerClient("recordings");
var blockBlob = container.GetBlockBlobClient($"{meetingId}.mp4");
await blockBlob.UploadAsync(process.StandardOutput.BaseStream);
This ensures:
- No intermediate files
- Streamed writes for long sessions
- Reliable uploads even with large recordings
If the recorder node restarts, it simply replays from the beginning or picks up at the next segment (depending on format).
7.2.3 Using HLS for Long or Live-View Recordings
For multi-hour meetings, MP4 can become risky: if the recorder dies before the file is finalized, the index (the moov atom) is never written and the whole recording may be unplayable. HLS avoids this by splitting the recording into segments:
ffmpeg -i input -c copy \
       -hls_time 4 -hls_list_size 0 \
       /mnt/output/seg.m3u8
Each .ts segment is upload-friendly and resilient to interruptions. Users can even watch the recording while the meeting is still running.
The manifest (seg.m3u8) updates continuously, and Azure Blob Storage serves the files as soon as they land.
7.3 Storage and Lifecycle
7.3.1 Uploading Blocks Directly to Azure Blob Storage
Azure Block Blobs support high-throughput streaming with block staging. A recorder can upload blocks as they arrive from FFmpeg:
await blockBlob.StageBlockAsync(blockId, stream);
Once all blocks are uploaded:
await blockBlob.CommitBlockListAsync(blockList);
This approach handles:
- Large files
- Partial uploads
- Network interruptions
- Session checkpoints
It fits nicely into a distributed recording model where multiple recorders may run simultaneously.
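One detail worth sketching: Azure requires every block ID to be Base64-encoded, and all block IDs within one blob must have the same encoded length. A zero-padded counter satisfies both rules (Python, illustrative):

```python
import base64

# Azure block IDs must be Base64 strings of equal length within a blob.
# A zero-padded counter gives a stable, ordered scheme.

def block_id(index, width=6):
    return base64.b64encode(f"{index:0{width}d}".encode()).decode()

def chunk(data, block_size):
    """Yield (block_id, payload) pairs ready to stage block by block."""
    for i in range(0, len(data), block_size):
        yield block_id(i // block_size), data[i:i + block_size]
```

Each pair maps directly onto a StageBlockAsync call, and the ordered ID list is what gets committed at the end.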
7.3.2 Finalizing and Indexing With Azure Functions
When a recording finishes, your recording service signals Azure Functions with metadata:
- Meeting ID
- Participants present during each segment
- Duration
- File size and path
- Resolution and codecs
Azure Functions can then:
- Generate thumbnails
- Build a searchable index
- Save metadata to Cosmos DB or SQL
- Notify the frontend that the recording is ready
This creates a fully automated egress pipeline that mirrors what AMS offered, but with more flexibility and no dependency on proprietary services.
8 Security and Production Operations
Security and operations are never “add-ons” for a real-time video platform. When you’re moving audio, video, and chat data across a global cluster, a single security gap can compromise user privacy. And when you’re routing thousands of packets per second per meeting, operational visibility becomes just as important as correctness.
This section focuses on the areas that matter most for a distributed SFU system: strong encryption, clean key distribution, deep observability, and realistic load testing.
8.1 End-to-End Encryption (E2EE)
8.1.1 Understanding Hop-by-Hop Encryption vs. True E2EE
WebRTC gives you encryption by default, but it’s not the same as true end-to-end encryption. Standard WebRTC encrypts media between each client and the SFU; the SFU terminates that encryption, unprotecting incoming SRTP so it can read what it needs for routing before re-protecting packets toward each subscriber.
Hop-by-hop encryption protects the media from external attackers, but not from the server itself.
True E2EE adds another encryption layer above SRTP. With that layer:
- Browsers encrypt video/audio before it ever leaves the client
- The SFU only sees opaque ciphertext
- Only authorized participants can decrypt it
This matches the expectations of large enterprises and regulated industries.
8.1.2 Using Insertable Streams to Encrypt Frames in the Browser
The Insertable Streams API lets JavaScript access encoded media frames before they’re passed into the WebRTC pipeline. You can then wrap those frames using AES-GCM (or another cipher) and send them through the SFU without modification.
Example:
const key = await crypto.subtle.generateKey(
  { name: "AES-GCM", length: 256 },
  true,
  ["encrypt", "decrypt"]
);

const sender = pc.getSenders()[0];
// createEncodedStreams() may only be called once per sender
const { readable, writable } = sender.createEncodedStreams();

const senderTransform = new TransformStream({
  async transform(encoded, controller) {
    const iv = crypto.getRandomValues(new Uint8Array(12));
    const ciphertext = await crypto.subtle.encrypt(
      { name: "AES-GCM", iv },
      key,
      encoded.data
    );
    // Prepend the IV so receivers can decrypt each frame
    const framed = new Uint8Array(iv.byteLength + ciphertext.byteLength);
    framed.set(iv);
    framed.set(new Uint8Array(ciphertext), iv.byteLength);
    encoded.data = framed.buffer;
    controller.enqueue(encoded);
  }
});

readable.pipeThrough(senderTransform).pipeTo(writable);
The SFU doesn’t need new logic for E2EE—its routing stays exactly the same. It forwards encrypted SRTP packets like any other payload.
8.1.3 Secure Key Rotation Through SignalR
Keys should rotate periodically to limit exposure if a key ever leaks. The server never sees plaintext keys. It only relays encrypted key blobs to authenticated members of the room.
C# hub example:
public async Task DistributeKey(string roomId, string encryptedKey)
{
await Clients.Group(roomId).SendAsync("E2EEKeyUpdate", encryptedKey);
}
Clients decrypt key updates using a pre-existing shared secret or a ratcheting protocol. Unauthorized users never receive the key message because SignalR group membership is controlled by your existing authentication and Redis-backed room state.
This ties neatly into the architecture described earlier—SignalR handles control, Redis manages state, and the SFU continues routing without touching media.
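A minimal illustration of the rotation idea is a one-way hash ratchet. This Python sketch uses SHA-256 and is only a teaching device; real deployments use a full ratcheting protocol such as MLS or a Double Ratchet variant:

```python
import hashlib

# One-way key ratchet: each epoch's key is a hash of the previous one,
# so a leaked current key cannot reveal earlier epochs' keys.

def ratchet(key):
    return hashlib.sha256(b"e2ee-ratchet" + key).digest()

def key_for_epoch(root, epoch):
    k = root
    for _ in range(epoch):
        k = ratchet(k)
    return k
```

Note the limitation: a plain hash ratchet protects past epochs but not future ones, which is why production protocols combine ratcheting with fresh key exchanges when membership changes.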
8.2 Observability with OpenTelemetry
8.2.1 Tracing the Path of a Packet Through the SFU
When something goes wrong in a real-time system—packet drops, jitter spikes, audio glitches—you need to understand the path a packet took through the cluster. OpenTelemetry allows you to create spans for every important action inside the SFU.
Example:
using var span = _tracer.StartActiveSpan("RTP.Forward");
span.SetAttribute("ssrc", ssrc);
span.SetAttribute("subscriberCount", subscribers.Count);
These traces can be pushed to Jaeger, Zipkin, Grafana Tempo, or Azure Monitor. You can follow a packet from:
- Edge node ingress
- SRTP unprotect
- Router lookup
- Forward to subscribers
- Possible relay hop
- Egress
This level of visibility is essential for debugging large multi-node deployments.
8.2.2 The Metrics That Actually Matter
It’s easy to drown in metrics, so focus on the ones tied directly to user experience:
- Round Trip Time (RTT) from RTCP reports
- Packet loss (sender → SFU → receiver)
- Jitter measurements
- PLI frequency (how often clients ask for a keyframe)
- NACK rate (loss recovery requests)
Example Prometheus metric:
_rttGauge.WithLabels(roomId, userId).Set(rttMs);
These metrics show which nodes are saturated, which edges are congested, and where to scale out.
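RTT computation from RTCP follows RFC 3550: the receiver report carries the last SR timestamp (LSR) and the delay since that SR (DLSR), both as 32-bit values in units of 1/65536 seconds, and the sender subtracts both from the report's arrival time:

```python
# RTT from an RTCP receiver report, per RFC 3550. LSR, DLSR, and the
# arrival time are 32-bit "middle NTP" values in 1/65536-second units.

def rtcp_rtt_ms(arrival, lsr, dlsr):
    rtt_units = (arrival - lsr - dlsr) & 0xFFFFFFFF  # wraparound-safe
    return rtt_units / 65536 * 1000
```

For example, an SR echoed back 0.75 s after it was sent, with 0.25 s of holding time at the receiver, yields 500 ms of actual network round trip.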
8.2.3 Error Annotation and Diagnostic Context
Whenever something fails—DTLS handshake, STUN timeout, SRTP authentication—you annotate the trace with structured error details. This gives operators a direct link between symptoms (pixelation, freezes) and root causes (network congestion, relay failures, CPU saturation).
The operations team should be able to inspect a single participant and see their full call history across the SFU cluster.
8.3 Load Testing with K6
8.3.1 Simulating Heavy Room Joins and WebRTC Handshakes
To ensure your signaling layer, SFU, and load balancer behave correctly under pressure, you need automated stress tests. K6’s browser module can drive dozens or hundreds of headless clients joining rooms at once.
A simple K6 script:
import { chromium } from 'k6/x/browser';

export default function () {
  const browser = chromium.launch({ headless: true });
  const page = browser.newPage();
  page.goto("https://app/join?room=123&token=test");
  page.waitForSelector("#video-ready");
  page.close();
  browser.close();
}
This validates the full handshake:
- DNS routing
- SignalR connection
- WebRTC offer/answer
- ICE gathering and connectivity checks
8.3.2 Checking Media Flow Through WebRTC Stats
Once clients join, you can inspect WebRTC stats to confirm that packets actually flow from SFU to browser:
const stats = page.evaluate(async () => {
  const report = await pc.getStats();
  // RTCStatsReport is not serializable; flatten it before returning
  return [...report.values()];
});
You can assert:
- Incoming bitrate
- Outgoing bitrate
- Frame decode rate
- Packet loss
- Jitter buffer fullness
Media-level assertions are critical. A system that “connects” is not necessarily a system that streams smoothly.
8.3.3 Scaling the Test to 1000 Parallel Clients
To simulate real-world load, you run multiple K6 containers in parallel. This gives you insight into:
- Sudden join spikes
- Distributed client locations (with synthetic latency/jitter)
- Packet forwarding across SFU clusters
- Node failover behavior
These tests help confirm that your cascading SFU architecture (Section 6) holds up under heavy pressure, and that routing remains stable even when nodes rotate or relays start up.