Concurrency Model — Trade Signal Terminal

Threading Model Overview

Thread Pool Dedicated Consumer ───────────────────────── ───────────────────── [Producer 1: NASDAQ Feed] ──┐ [Producer 2: NYSE Feed] ──┼──▶ Channel<TradeSignal> ──▶ [Consumer: QueueProcessor] [Producer 3: Crypto Feed] ──┤ (lock-free write) (single reader loop) [Producer 4: Replay Engine]─┘ │ ▼ [SignalR Hub Broadcast] │ ┌─────┼─────┐ ▼ ▼ ▼ [WS] [WS] [WS] Client Connections

Key insight: Producers never block. The Channel's TryWrite returns immediately (true/false). If the channel is full, the oldest message is dropped — this is intentional for market data where freshness matters more than completeness.

The consumer is a single async loop using IAsyncEnumerable. This guarantees message ordering without locks, while still allowing efficient batching.

System.Threading.Channels<T>

Channels are the modern .NET primitive for high-performance producer/consumer scenarios. They replace BlockingCollection<T> with a fully async, allocation-efficient implementation.

Why not ConcurrentQueue? ConcurrentQueue has no backpressure, no async wait, and no notion of "full." Channels provide all three with better performance characteristics.

CHANNEL CONFIGURATION

// Bounded channel: prevents unbounded memory growth
// DropOldest: stale market data is worthless in trading
var channel = Channel.CreateBounded<TradeSignal>(new BoundedChannelOptions(10_000)
{
    FullMode = BoundedChannelFullMode.DropOldest,
    SingleReader = true,    // enables internal optimizations
    SingleWriter = false   // multiple exchange feeds write concurrently
});

PRODUCER SIDE (LOCK-FREE ENQUEUE)

// TryWrite is non-blocking and allocation-free
// Returns false if channel is full (DropOldest already ejected oldest)
public ValueTask EnqueueAsync(TradeSignal signal)
{
    if (!_channel.Writer.TryWrite(signal))
    {
        // Lock-free counter increment — no mutex needed
        Interlocked.Increment(ref _droppedCount);
    }
    return ValueTask.CompletedTask;
}

CONSUMER SIDE (ASYNC ENUMERABLE + BATCH DRAIN)

// IAsyncEnumerable: cooperative cancellation, no busy-wait
await foreach (var signal in _channel.Reader.ReadAllAsync(stoppingToken))
{
    batch.Add(signal);
    
    // Batch drain: amortizes overhead across multiple messages
    // TryRead is lock-free — drains whatever is immediately available
    while (batch.Count < 50 && _channel.Reader.TryRead(out var extra))
    {
        batch.Add(extra);
    }
    
    // Process entire batch — reduces syscall count by up to 50x
    foreach (var item in batch)
    {
        await _hubContext.Clients.All.SendAsync("TradeSignal", item);
        Interlocked.Increment(ref _processedCount);
    }
    batch.Clear();
}

Lock-Free Patterns (Interlocked Operations)

Traditional locks (lock, Mutex, SemaphoreSlim) are unacceptable on the hot path in trading systems. They introduce:

Thread contention under high load
Priority inversion (low-priority thread holds lock needed by high-priority)
Unpredictable latency spikes
Potential deadlocks in complex systems

Instead, we use Interlocked operations — CPU-level atomic instructions (CMPXCHG, LOCK XADD) that complete in a single cycle without OS involvement.

LOCK-FREE PEAK TRACKING (CAS LOOP)

// Compare-And-Swap loop for tracking peak latency
// No lock, no allocation, no contention — just retry on conflict
public void RecordLatency(long elapsedTicks)
{
    var idx = Interlocked.Increment(ref _sampleIndex) & (BufferSize - 1);
    _latencySamples[idx] = elapsedTicks;  // single-writer per slot
    
    // CAS loop to update peak (only if our value is higher)
    long current;
    do
    {
        current = Interlocked.Read(ref _peakLatencyTicks);
        if (elapsedTicks <= current) break;
    } while (Interlocked.CompareExchange(
        ref _peakLatencyTicks, elapsedTicks, current) != current);
}

Why power-of-2 buffer size (4096)? Bitwise AND (& 0xFFF) replaces integer modulo (% 4096). Modulo requires an expensive DIV instruction; AND completes in one cycle. This matters when recording millions of samples per second.

ArrayPool<T> — Zero-Allocation Buffers

Every new T[] allocates on the managed heap, eventually triggering GC. In latency-critical code, we rent pre-allocated buffers from the shared pool and return them after use.

ZERO-ALLOCATION PERCENTILE COMPUTATION

public PerformanceSnapshot GetSnapshot()
{
    // Rent from the pool — no allocation, no GC pressure
    var buffer = ArrayPool<long>.Shared.Rent((int)sampleCount);
    try
    {
        Array.Copy(_latencySamples, buffer, (int)sampleCount);
        Array.Sort(buffer, 0, (int)sampleCount);
        
        var p50 = buffer[(int)(sampleCount * 0.50)];
        var p99 = buffer[(int)(sampleCount * 0.99)];
        // ... compute snapshot ...
    }
    finally
    {
        // Return to pool — buffer will be reused next time
        ArrayPool<long>.Shared.Return(buffer);
    }
}

Async Producer/Consumer Architecture

The entire pipeline is async/await-based with no blocking calls. This is crucial for scalability — blocking calls consume thread pool threads, which are a limited resource.

Producer Threads (N) Channel Buffer Consumer (1) ━━━━━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━ ━━━━━━━━━━━━ TryWrite() ─────────┐ TryWrite() ─────────┤ ┌────────────────┐ TryWrite() ─────────┼────────▶│ [■■■■■■■□□□□□] │────────────▶ ReadAllAsync() TryWrite() ─────────┤ │ 10K capacity │ │ TryWrite() ─────────┘ └────────────────┘ ▼ Batch Drain Non-blocking Backpressure (up to 50) O(1) per write DropOldest policy │ ▼ SignalR Broadcast (per-client WebSocket)

PATTERN	USED HERE	ALTERNATIVE	WHY THIS IS BETTER
Channel<T>	Queue between producer/consumer	BlockingCollection<T>	Fully async; no thread blocking; bounded with backpressure
IAsyncEnumerable	Consumer loop (ReadAllAsync)	while(true) + WaitToReadAsync	Cleaner code; cooperative cancellation built-in
ValueTask	EnqueueAsync return type	Task	Avoids Task allocation when completing synchronously (99.9% of writes)
Interlocked	All counters and metrics	lock { counter++; }	CPU atomic; no OS involvement; no contention
BackgroundService	Long-running consumer + simulator	Task.Run / Thread	Graceful shutdown; health monitoring; DI integration

Memory & Allocation Strategy

The goal is near-zero allocation on the hot path. Every allocation eventually becomes GC work, and GC work means latency jitter.

COMPONENT	ALLOCATION STRATEGY
Latency sample buffer	Fixed `long[4096]` — allocated once at startup
Batch processing list	Pre-allocated `List<T>(50)` — reused via Clear()
Percentile computation	`ArrayPool<long>.Shared` — rented and returned
Trade signal records	`record struct` candidates for future optimization (currently record class for SignalR compat)
SignalR serialization	System.Text.Json with source generators (future: MessagePack)