Concurrency Model

Threading architecture, lock-free patterns, and async producer/consumer design for high-throughput event processing

Threading Model Overview

Thread Pool Dedicated Consumer ───────────────────────── ───────────────────── [Producer 1: NASDAQ Feed] ──┐ [Producer 2: NYSE Feed] ──┼──▶ Channel<TradeSignal> ──▶ [Consumer: QueueProcessor] [Producer 3: Crypto Feed] ──┤ (lock-free write) (single reader loop) [Producer 4: Replay Engine]─┘ │ ▼ [SignalR Hub Broadcast] │ ┌─────┼─────┐ ▼ ▼ ▼ [WS] [WS] [WS] Client Connections

Key insight: Producers never block. The Channel's TryWrite returns immediately (true/false). If the channel is full, the oldest message is dropped — this is intentional for market data where freshness matters more than completeness.

The consumer is a single async loop using IAsyncEnumerable. This guarantees message ordering without locks, while still allowing efficient batching.

System.Threading.Channels<T>

Channels are the modern .NET primitive for high-performance producer/consumer scenarios. They replace BlockingCollection<T> with a fully async, allocation-efficient implementation.

Why not ConcurrentQueue? ConcurrentQueue has no backpressure, no async wait, and no notion of "full." Channels provide all three with better performance characteristics.

CHANNEL CONFIGURATION
// Bounded channel: prevents unbounded memory growth // DropOldest: stale market data is worthless in trading var channel = Channel.CreateBounded<TradeSignal>(new BoundedChannelOptions(10_000) { FullMode = BoundedChannelFullMode.DropOldest, SingleReader = true, // enables internal optimizations SingleWriter = false // multiple exchange feeds write concurrently });
PRODUCER SIDE (LOCK-FREE ENQUEUE)
// TryWrite is non-blocking and allocation-free // Returns false if channel is full (DropOldest already ejected oldest) public ValueTask EnqueueAsync(TradeSignal signal) { if (!_channel.Writer.TryWrite(signal)) { // Lock-free counter increment — no mutex needed Interlocked.Increment(ref _droppedCount); } return ValueTask.CompletedTask; }
CONSUMER SIDE (ASYNC ENUMERABLE + BATCH DRAIN)
// IAsyncEnumerable: cooperative cancellation, no busy-wait await foreach (var signal in _channel.Reader.ReadAllAsync(stoppingToken)) { batch.Add(signal); // Batch drain: amortizes overhead across multiple messages // TryRead is lock-free — drains whatever is immediately available while (batch.Count < 50 && _channel.Reader.TryRead(out var extra)) { batch.Add(extra); } // Process entire batch — reduces syscall count by up to 50x foreach (var item in batch) { await _hubContext.Clients.All.SendAsync("TradeSignal", item); Interlocked.Increment(ref _processedCount); } batch.Clear(); }

Lock-Free Patterns (Interlocked Operations)

Traditional locks (lock, Mutex, SemaphoreSlim) are unacceptable on the hot path in trading systems. They introduce:

  • Thread contention under high load
  • Priority inversion (low-priority thread holds lock needed by high-priority)
  • Unpredictable latency spikes
  • Potential deadlocks in complex systems

Instead, we use Interlocked operations — CPU-level atomic instructions (CMPXCHG, LOCK XADD) that complete in a single cycle without OS involvement.

LOCK-FREE PEAK TRACKING (CAS LOOP)
// Compare-And-Swap loop for tracking peak latency // No lock, no allocation, no contention — just retry on conflict public void RecordLatency(long elapsedTicks) { var idx = Interlocked.Increment(ref _sampleIndex) & (BufferSize - 1); _latencySamples[idx] = elapsedTicks; // single-writer per slot // CAS loop to update peak (only if our value is higher) long current; do { current = Interlocked.Read(ref _peakLatencyTicks); if (elapsedTicks <= current) break; } while (Interlocked.CompareExchange( ref _peakLatencyTicks, elapsedTicks, current) != current); }

Why power-of-2 buffer size (4096)? Bitwise AND (& 0xFFF) replaces integer modulo (% 4096). Modulo requires an expensive DIV instruction; AND completes in one cycle. This matters when recording millions of samples per second.

ArrayPool<T> — Zero-Allocation Buffers

Every new T[] allocates on the managed heap, eventually triggering GC. In latency-critical code, we rent pre-allocated buffers from the shared pool and return them after use.

ZERO-ALLOCATION PERCENTILE COMPUTATION
public PerformanceSnapshot GetSnapshot() { // Rent from the pool — no allocation, no GC pressure var buffer = ArrayPool<long>.Shared.Rent((int)sampleCount); try { Array.Copy(_latencySamples, buffer, (int)sampleCount); Array.Sort(buffer, 0, (int)sampleCount); var p50 = buffer[(int)(sampleCount * 0.50)]; var p99 = buffer[(int)(sampleCount * 0.99)]; // ... compute snapshot ... } finally { // Return to pool — buffer will be reused next time ArrayPool<long>.Shared.Return(buffer); } }

Async Producer/Consumer Architecture

The entire pipeline is async/await-based with no blocking calls. This is crucial for scalability — blocking calls consume thread pool threads, which are a limited resource.

Producer Threads (N) Channel Buffer Consumer (1) ━━━━━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━ ━━━━━━━━━━━━ TryWrite() ─────────┐ TryWrite() ─────────┤ ┌────────────────┐ TryWrite() ─────────┼────────▶│ [■■■■■■■□□□□□] │────────────▶ ReadAllAsync() TryWrite() ─────────┤ │ 10K capacity │ │ TryWrite() ─────────┘ └────────────────┘ ▼ Batch Drain Non-blocking Backpressure (up to 50) O(1) per write DropOldest policy │ ▼ SignalR Broadcast (per-client WebSocket)
PATTERNUSED HEREALTERNATIVEWHY THIS IS BETTER
Channel<T> Queue between producer/consumer BlockingCollection<T> Fully async; no thread blocking; bounded with backpressure
IAsyncEnumerable Consumer loop (ReadAllAsync) while(true) + WaitToReadAsync Cleaner code; cooperative cancellation built-in
ValueTask EnqueueAsync return type Task Avoids Task allocation when completing synchronously (99.9% of writes)
Interlocked All counters and metrics lock { counter++; } CPU atomic; no OS involvement; no contention
BackgroundService Long-running consumer + simulator Task.Run / Thread Graceful shutdown; health monitoring; DI integration

Memory & Allocation Strategy

The goal is near-zero allocation on the hot path. Every allocation eventually becomes GC work, and GC work means latency jitter.

COMPONENTALLOCATION STRATEGY
Latency sample buffer Fixed long[4096] — allocated once at startup
Batch processing list Pre-allocated List<T>(50) — reused via Clear()
Percentile computation ArrayPool<long>.Shared — rented and returned
Trade signal records record struct candidates for future optimization (currently record class for SignalR compat)
SignalR serialization System.Text.Json with source generators (future: MessagePack)