Real-Time Recommendation Engines in .NET: Combining Collaborative Filtering, Deep Learning, and Vector Databases


1 The Modern Recommendation Landscape: Beyond Simple Filtering

A recommendation engine must predict what a user is likely to want at the moment they need it. Early systems relied heavily on collaborative filtering—compare users, find similar behaviors, and suggest items based on interaction histories. That approach worked when catalogs were small, interaction logs were dense, and latency wasn’t as critical. Today none of those assumptions hold. Catalogs contain millions of items, user behavior changes quickly, and recommendations must be generated in real time.

Modern systems reach their limit when they rely solely on historical similarity. Sparse data, large item counts, and dynamic user intent demand architectures that combine multiple retrieval and ranking techniques. Real-time recommendation systems run under strict performance budgets, often inside user-facing APIs where latency directly affects engagement and revenue. This pushes the design toward multi-stage pipelines that blend fast retrieval methods with more precise machine learning models. And because these systems must run efficiently at scale, .NET becomes a practical platform thanks to its predictable performance and strong tooling for production workloads.

1.1 The “Sub-100ms” Challenge

Real-time recommendation engines operate under one of the tightest latency budgets in consumer applications. A single recommendation request involves several steps:

  1. Reading user or session features
  2. Selecting candidates from a large catalog
  3. Scoring them using a machine learning model
  4. Applying business logic or constraints
  5. Returning a high-quality ordered list

If this process takes more than 100ms, users perceive UI delays, and engagement drops. To keep latency under control, teams divide the budget across internal operations:

  • 5–10ms: Retrieve fast-moving features from cache
  • 10–30ms: Execute vector search or collaborative filtering retrieval
  • 2–4ms: Run ONNX-powered ranking inference
  • <2ms: Apply re-ranking logic
  • <1ms: Serialize and send response

Hitting these targets consistently requires a strict separation between recall and precision. Retrieval focuses on finding a wide set of plausible items quickly; ranking focuses on ordering them accurately. Running a precise ranking model directly over a full catalog of 1M–10M items is not feasible, so a multi-stage pipeline becomes the only practical architecture.
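The stages compose into a single request path. The sketch below is illustrative only: the stage bodies are stubs, and the names (`Retrieve`, `Score`, `ReRank`) are assumptions rather than a framework API, but the shape of the funnel is the point.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Funnel sketch with stubbed stages. In a real system, Retrieve would hit a
// vector store or factorization index, Score would call an ONNX session, and
// ReRank would apply business rules; the stubs here only show the shape.
List<string> Retrieve(string userId) =>
    Enumerable.Range(0, 500).Select(i => $"item-{i}").ToList();       // wide recall

Dictionary<string, float> Score(string userId, List<string> candidates) =>
    candidates.ToDictionary(id => id, id => (float)id.Length);        // model stand-in

List<string> ReRank(Dictionary<string, float> scored, int take) =>
    scored.OrderByDescending(kv => kv.Value)
          .Take(take)
          .Select(kv => kv.Key)
          .ToList();

var candidates = Retrieve("user-42");           // millions -> hundreds
var scored     = Score("user-42", candidates);  // hundreds, ordered precisely
var response   = ReRank(scored, 20);            // final constrained list

Console.WriteLine($"{candidates.Count} -> {response.Count}"); // 500 -> 20
```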

1.2 The Funnel Architecture

Modern recommendation engines follow a funnel structure to balance speed with accuracy. Each stage narrows the candidate list while enriching the signal available to the next step.

1.2.1 Candidate Generation (Retrieval)

Retrieval reduces the search space from millions of items to a few hundred or thousand. The emphasis is on coverage—ensuring that the right items make it into the next stage. Retrieval typically uses fast techniques such as:

  • Collaborative filtering with matrix factorization
  • Approximate nearest neighbor (ANN) vector search
  • Co-view or co-click heuristics
  • Trending or popular content signals

These methods are deliberately lightweight. They must respond in 10–20ms, even under heavy load, because every additional millisecond affects the ranking stage’s available budget.

1.2.2 Scoring & Ranking

After retrieval, the system has a manageable set of candidates that can be scored more accurately. Ranking models focus on precision—predicting which items the user is most likely to engage with. Deep learning architectures, gradient-boosted trees, or neural collaborative filtering models are common here.

Because ranking operates on roughly 500–1,000 items, using a more computationally expensive model becomes practical. With ONNX Runtime, batch scoring takes only a few milliseconds, enabling real-time ranking without sacrificing quality.

1.2.3 Re-ranking

Re-ranking adds real-world constraints that pure ML models do not capture. Common adjustments include:

  • Ensuring diversity of item types
  • Avoiding near-duplicates
  • Filtering based on safety or policy rules
  • Boosting new or strategic content
  • Applying business-specific constraints

This final layer fine-tunes the output before sending it back to the client. When the retrieval and ranking stages are well designed, re-ranking becomes a lightweight, near-zero-cost step.
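A minimal sketch of such a re-ranking pass, assuming a hypothetical `Candidate` record: it deduplicates by ID and caps items per category while preserving score order, which covers two of the constraints listed above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var ranked = new[]
{
    new Candidate("a", "shoes", 0.9f),
    new Candidate("a", "shoes", 0.9f),   // near-duplicate
    new Candidate("b", "shoes", 0.8f),
    new Candidate("c", "shoes", 0.7f),
    new Candidate("d", "hats",  0.6f),
};

var final = ReRank(ranked, perCategory: 2, take: 3);
Console.WriteLine(string.Join(",", final.Select(c => c.Id))); // a,b,d

// Cap items per category and drop duplicate IDs while preserving score order.
static List<Candidate> ReRank(IEnumerable<Candidate> ranked, int perCategory, int take)
{
    var seen = new HashSet<string>();
    var perCategoryCount = new Dictionary<string, int>();
    var output = new List<Candidate>();

    foreach (var c in ranked.OrderByDescending(c => c.Score))
    {
        if (!seen.Add(c.Id)) continue;                       // duplicate guard
        var count = perCategoryCount.GetValueOrDefault(c.Category);
        if (count >= perCategory) continue;                  // diversity constraint
        perCategoryCount[c.Category] = count + 1;
        output.Add(c);
        if (output.Count == take) break;
    }
    return output;
}

record Candidate(string Id, string Category, float Score);
```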

1.3 Why .NET for RecSys?

Many engineers assume recommendation systems must be written entirely in Python. Python is ideal for experimentation and model training because of its ML ecosystem, but it is not ideal for high-throughput, low-latency serving. Real-time recommendation systems often run under sustained load, and inference bottlenecks quickly surface in Python environments.

.NET addresses these challenges directly by offering:

1.3.1 Performance and Predictability

ASP.NET Core and the Kestrel server consistently rank among the fastest production web frameworks. Their performance characteristics remain stable under load, which is essential when serving vector-search results, Redis lookups, and ONNX inference in a tight loop.

The .NET runtime’s GC and threadpool behave predictably when tuned correctly, which helps maintain latency even during heavy concurrency.

1.3.2 Strong Typing and Tooling

Recommendation engines involve multiple moving parts—feature stores, caching layers, vector retrieval services, ranking models, and telemetry pipelines. Static typing, analyzers, and advanced tooling help maintain correctness and reduce regressions. This becomes increasingly important as the system grows.

1.3.3 The ONNX Runtime Ecosystem

Using Python for training and .NET for serving works smoothly when the model is exported to ONNX. ONNX Runtime provides:

  • Low-latency CPU/GPU acceleration
  • Thread-safe sessions for parallel inference
  • Highly optimized operators
  • Predictable performance under production load

This hybrid workflow—Python for training, .NET for real-time serving—is now common across large-scale recommendation systems that prioritize reliability and speed.


2 Designing the Real-Time Data Pipeline

Accurate recommendations depend on having up-to-date behavioral data. User intent changes quickly; relying only on daily batch jobs leads to stale suggestions that miss what users care about right now. A real-time pipeline captures every interaction—clicks, views, dwell time, add-to-cart actions—and turns them into usable features within seconds. These signals drive both retrieval and ranking, so the pipeline must be fast, fault-tolerant, and scalable.

A clean architecture separates the pipeline into three layers: ingestion, stream processing, and feature retrieval. Each layer solves a specific problem and can scale independently without impacting the others. This pattern mirrors the multi-stage approach described earlier: keep each step lightweight, predictable, and optimized for low latency.

2.1 Event Ingestion Architecture

The ingestion layer’s role is straightforward: capture user events as they happen and deliver them to downstream consumers reliably. A recommendation system may ingest tens of thousands of events per second, so the ingestion layer must be durable and horizontally scalable. Azure Event Hubs and Apache Kafka are the two most common choices when using .NET in production.
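The `jsonPayload` used in the producer snippets below is assumed to be a serialized interaction event. A minimal sketch of such an event (the field names are hypothetical, not a standard schema):

```csharp
using System;
using System.Text.Json;

// Hypothetical interaction event; field names are illustrative.
var evt = new UserEvent("user-42", "item-7", "view", DateTimeOffset.UtcNow);
var jsonPayload = JsonSerializer.Serialize(evt);

Console.WriteLine(jsonPayload);

record UserEvent(string UserId, string ItemId, string EventType, DateTimeOffset EventTime);
```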

2.1.1 Azure Event Hubs

Event Hubs fits naturally when the system is hosted in Azure. It offers high throughput, partitioning, and built-in integration with Azure Stream Analytics. The .NET client is lightweight, and sending events requires only a few lines of code.

var client = new EventHubProducerClient(connectionString, hubName);

using var batch = await client.CreateBatchAsync();

// TryAdd returns false when the event does not fit in the batch,
// so the result should not be ignored.
if (!batch.TryAdd(new EventData(Encoding.UTF8.GetBytes(jsonPayload))))
    throw new InvalidOperationException("Event too large for the batch.");

await client.SendAsync(batch);

Event Hubs handles distribution across partitions automatically, making it easy to scale without additional operational configuration.

2.1.2 Apache Kafka (Confluent .NET)

Kafka is often chosen when teams need multi-cloud support, custom retention policies, or when multiple microservices consume the same stream. The Confluent .NET client provides a performant producer implementation.

var config = new ProducerConfig { BootstrapServers = "kafka:9092" };
using var producer = new ProducerBuilder<string, string>(config).Build();

await producer.ProduceAsync(
    "events",
    new Message<string, string> { Key = userId, Value = jsonPayload }
);

Kafka excels in ecosystems where long-lived historical streams are needed for training data or advanced analytics.

2.1.3 Choosing Between Them

The choice usually depends on constraints:

  • Event Hubs works best for Azure-native systems with minimal operational overhead.
  • Kafka is ideal when retention, cross-cloud portability, or streaming flexibility matter.

For a .NET-based recommendation engine, both integrate cleanly with the rest of the data pipeline.

2.2 Stream Processing with Azure Stream Analytics

Once events enter the pipeline, the next step is to transform raw interactions into meaningful signals. Examples include: top trending items, category-level popularity, or real-time engagement scores. These aggregates must be updated continuously, not computed at request time, because the serving layer cannot afford heavy computations within its sub-100ms budget.

Azure Stream Analytics (ASA) provides a declarative, SQL-like way to process streams without managing infrastructure. Its windowing functions are particularly useful for recommendation features.

2.2.1 Windowing Strategies

Recommendation engines typically rely on two window types:

Sliding Windows

Continuously evaluate the most recent time slice (e.g., the last 5 minutes), emitting updated results whenever the window’s contents change. Useful for fast-moving signals like “surging interest.”

Hopping Windows

Advance in fixed increments (e.g., 5-minute window hopping every minute). Useful for stable aggregates that don’t need continuous recalculation.

Both produce time-bounded signals that reflect fresh user behavior without requiring per-request computation.
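To make the overlap concrete, the following in-process sketch reproduces hopping-window counts over a few hand-written events. This is illustrative only; in production ASA performs this aggregation server-side.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: a 5-minute window closing at every 1-minute hop
// boundary, so consecutive windows overlap and share events.
var windowSize = TimeSpan.FromMinutes(5);
var start = DateTimeOffset.Parse("2024-01-01T00:00:00Z");

var events = new[]
{
    (ItemId: "A", Time: start.AddSeconds(30)),
    (ItemId: "A", Time: start.AddSeconds(130)),
    (ItemId: "B", Time: start.AddSeconds(160)),
};

// Count views per (windowEnd, itemId); a window ending at T covers (T - 5min, T].
var counts = new Dictionary<(DateTimeOffset WindowEnd, string ItemId), int>();
foreach (var end in Enumerable.Range(1, 6).Select(m => start.AddMinutes(m)))
    foreach (var g in events.Where(e => e.Time > end - windowSize && e.Time <= end)
                            .GroupBy(e => e.ItemId))
        counts[(end, g.Key)] = g.Count();

Console.WriteLine(counts[(start.AddMinutes(3), "A")]); // 2
```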

2.2.2 Example SAQL Query

The query below maintains a rolling count of item views. The results can then be stored in Redis or a database and used by candidate generation layers.

SELECT
    ItemId,
    System.Timestamp AS WindowEnd,
    COUNT(*) AS ViewCount
INTO
    TrendingViews
FROM
    UserEvents TIMESTAMP BY EventTime
WHERE
    EventType = 'view'
GROUP BY
    ItemId,
    HoppingWindow(minute, 5, 1)

This pattern ensures “Trending Now” features always reflect recent activity while keeping the serving layer lightweight.

2.3 The Feature Store Concept

The serving API relies on a mix of long-term and short-term signals. Managing these directly inside the API leads to inconsistencies and duplicated logic. A feature store provides a single, consistent layer to read and update features used by retrieval and ranking.

2.3.1 Slow vs. Fast Features

Different types of features require different storage strategies:

  • Slow Features — update infrequently. Examples: user demographics, region, subscription tier, long-term preferences. These can sit in databases or distributed caches because they change slowly.

  • Fast Features — update frequently. Examples: recent clicks, last viewed item, session search terms, rolling engagement counts. These must be updated in near real time and retrieved quickly.

Separating them avoids unnecessary load on the serving layer and reduces cache churn.
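One way to keep that separation out of the serving code is a small facade over the two layers. The sketch below is an assumption, not a library API: the delegates stand in for a database lookup (slow) and a Redis lookup (fast), and fresh session data overrides the long-lived profile on conflicting keys.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// The delegates stand in for a database lookup (slow) and a Redis lookup
// (fast); the FeatureStore type itself is an assumption, not a library API.
var store = new FeatureStore(
    slow: _ => Task.FromResult(new Dictionary<string, string>
        { ["region"] = "us", ["tier"] = "free" }),
    fast: _ => Task.FromResult(new Dictionary<string, string>
        { ["lastItem"] = "item-7", ["tier"] = "trial" }));

var features = await store.GetAsync("user-42");
Console.WriteLine(features["tier"]); // fresh session data overrides: trial

class FeatureStore
{
    private readonly Func<string, Task<Dictionary<string, string>>> _slow;
    private readonly Func<string, Task<Dictionary<string, string>>> _fast;

    public FeatureStore(
        Func<string, Task<Dictionary<string, string>>> slow,
        Func<string, Task<Dictionary<string, string>>> fast)
        => (_slow, _fast) = (slow, fast);

    // Merge both layers; the fast layer wins on conflicting keys.
    public async Task<Dictionary<string, string>> GetAsync(string userId)
    {
        var merged = await _slow(userId);
        foreach (var kv in await _fast(userId))
            merged[kv.Key] = kv.Value;
        return merged;
    }
}
```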

2.3.2 Using Redis for Feature Retrieval

Redis works well for real-time recommendation engines because it offers sub-millisecond access and simple data structures. Inference paths inside the serving API benefit from Redis’ low latency.

var db = redis.GetDatabase();
var key = $"user:{userId}:features";
var json = await db.StringGetAsync(key);

// A missing key means no profile yet; fall back to non-personalized defaults.
var features = json.HasValue
    ? JsonSerializer.Deserialize<UserFeatures>((string)json!)
    : null;

To keep Redis stable when handling thousands of updates per second, teams often:

  • Apply TTLs to fast-changing session features
  • Use hashes when only partial updates are needed
  • Configure memory-optimized instance types
  • Shard large keyspaces across multiple instances

At inference time, Redis becomes the authoritative source for short-term behavior, enabling the ranking model to use the most relevant context.


3 Stage 1: Candidate Generation (The Retrieval Layer)

Candidate generation is the first major step in the recommendation pipeline. Its role is not to find the best items, but to narrow millions of possibilities into a focused set of candidates that the ranking model can process within the sub-100ms budget. Even the most advanced ranking model cannot fix poor retrieval, so this stage effectively sets the upper limit of recommendation quality.

Retrieval relies on two complementary strategies: collaborative filtering, which learns patterns from user behavior, and semantic search, which matches items and users based on embeddings. Using both ensures the system handles long-term preference patterns and short-term contextual signals. This hybrid approach also mitigates sparsity issues and improves recall coverage, especially in large catalogs.

3.1 Collaborative Filtering with ML.NET

Collaborative filtering identifies relationships between users and items by analyzing historical interactions. ML.NET implements matrix factorization through the LIBMF library, which decomposes the interaction matrix into latent user and item vectors. These vectors encode preference patterns that are not apparent from raw data.

When integrated into the retrieval stage, matrix factorization provides a fast, behavior-driven method to surface items the user is likely to engage with. This aligns with the overall funnel: retrieval must be computationally light but still meaningful.

3.1.1 Preparing Data

The model requires user–item interactions—clicks, purchases, or other implicit feedback. The training structure is straightforward:

public class RatingData
{
    public float UserId { get; set; }
    public float ItemId { get; set; }
    public float Label { get; set; }
}

In large systems, this data usually comes from the event pipeline described in Section 2. For training, you load it directly into ML.NET:

var data = ml.Data.LoadFromEnumerable(trainingData);

3.1.2 Training a Matrix Factorization Model

Matrix factorization learns latent factors by iteratively refining user and item vectors. ML.NET exposes this through the MatrixFactorizationTrainer.

var ml = new MLContext(seed: 1);

var options = new MatrixFactorizationTrainer.Options
{
    MatrixColumnIndexColumnName = "UserIdKey",
    MatrixRowIndexColumnName = "ItemIdKey",
    LabelColumnName = nameof(RatingData.Label),
    NumberOfIterations = 50,
    ApproximationRank = 128
};

// The trainer requires key-typed index columns, so map the raw IDs first.
var pipeline = ml.Transforms.Conversion
    .MapValueToKey("UserIdKey", nameof(RatingData.UserId))
    .Append(ml.Transforms.Conversion.MapValueToKey("ItemIdKey", nameof(RatingData.ItemId)))
    .Append(ml.Recommendation().Trainers.MatrixFactorization(options));

var model = pipeline.Fit(data);

Matrix factorization models are fast to train and easy to deploy, making them a practical choice for the retrieval layer.

3.1.3 Making Predictions

Once trained, the model predicts affinity scores—higher scores indicate stronger expected relevance.

public class RatingPrediction
{
    public float Score { get; set; }
}

var predictionEngine =
    ml.Model.CreatePredictionEngine<RatingData, RatingPrediction>(model);

var result = predictionEngine.Predict(
    new RatingData { UserId = 42, ItemId = 12345 }
);

Console.WriteLine(result.Score);

In production, instead of scoring every item for every request, item vectors are precomputed and stored. During retrieval, users’ latent vectors are compared against precomputed item vectors to quickly return the top-N candidates.
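The precompute-and-compare step reduces to a dot product between latent vectors. A self-contained sketch with made-up vectors and illustrative item IDs:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up latent vectors standing in for trained factorization output.
float[] userVector = { 0.2f, 0.7f, 0.1f };
var itemVectors = new Dictionary<string, float[]>
{
    ["item-1"] = new[] { 0.1f, 0.9f, 0.0f },
    ["item-2"] = new[] { 0.9f, 0.1f, 0.1f },
    ["item-3"] = new[] { 0.2f, 0.8f, 0.2f },
};

float Dot(float[] a, float[] b)
{
    float sum = 0;
    for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
    return sum;
}

// Affinity = dot(user, item); the top-N become retrieval candidates.
var topN = itemVectors
    .Select(kv => (Id: kv.Key, Score: Dot(userVector, kv.Value)))
    .OrderByDescending(x => x.Score)
    .Take(2)
    .ToList();

Console.WriteLine(topN[0].Id); // item-1 has the highest affinity
```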

3.1.4 Dealing with Sparsity

Behavior data in large catalogs is sparse—most users interact with only a tiny fraction of items. Sparse matrices degrade the quality of learned latent factors. To reduce this impact:

  • Implicit feedback gives weight to non-rating interactions (views, clicks).
  • User grouping clusters similar users to warm-start the model.
  • Rank tuning adjusts factor dimensionality to avoid overfitting.
  • Hybrid retrieval merges matrix factorization results with vector-based or heuristic retrieval.

More complex systems almost always use semantic embeddings to fill the gaps left by sparse behavioral data.

3.2 Semantic Search with Vector Embeddings

Semantic search addresses limitations of collaborative filtering by analyzing the meaning of items instead of only behavior patterns. Embeddings convert text, metadata, or even images into dense numerical vectors. Similar items end up close in vector space, allowing the system to find relevant items even without explicit interaction history.

This is essential in retrieval, especially when catalogs are large or items are newly added.

3.2.1 Generating Embeddings in .NET

Embedding generation can run fully within .NET or rely on external services. Both fit well into real-time or batch preprocessing workflows.

Option 1: Bert.NET (On-Prem or Cloud-Agnostic)

Bert.NET allows local generation of sentence embeddings:

var bert = new BertSentenceEmbedder("modelPath");
float[] vector = bert.GetSentenceEmbedding("Sample item description");

This option is useful when external API calls are restricted.

Option 2: OpenAI Embedding API

For higher-quality embeddings or multilingual support, teams often use models like text-embedding-3-large.

var client = new EmbeddingClient("text-embedding-3-large", apiKey);
var embedding = await client.GenerateEmbeddingAsync(description);

float[] vector = embedding.Value.ToFloats().ToArray();

These vectors are typically stored in a vector database alongside the item metadata.

3.2.2 Vector Database Integration

The retrieval layer needs fast approximate nearest neighbor (ANN) search. Both Azure AI Search and Qdrant support ANN through HNSW or IVF indexes.

Azure AI Search works well when you want hybrid search—mixing full-text filters with vector scoring.

Qdrant with C#

Qdrant is often used for self-hosted or cost-optimized deployments.

var client = new QdrantClient("localhost", 6334); // the .NET client uses Qdrant's gRPC port

var searchResult = await client.SearchAsync(
    collectionName: "items",
    vector: vector,
    limit: 10
);

Both systems return top-N candidates efficiently, staying within the 10–20ms retrieval window.

3.2.3 Hybrid Search Example

Hybrid search combines metadata filtering, keyword search, and vector similarity—this usually gives the best recall coverage.

var searchClient = new SearchClient(
    new Uri(endpoint),
    indexName,
    new AzureKeyCredential(apiKey)
);

var options = new SearchOptions
{
    Size = 20,
    VectorSearch = new()
    {
        Queries =
        {
            new VectorizedQuery(vector)
            {
                KNearestNeighborsCount = 20,
                Fields = { "vectorEmbedding" }
            }
        }
    }
};

var results = await searchClient.SearchAsync<SearchDocument>(
    "optional keyword filter",
    options
);

foreach (var hit in results.GetResults())
{
    Console.WriteLine($"{hit.Document["id"]}: {hit.Score}");
}

Hybrid retrieval ensures that both semantic similarity and textual relevance influence the result. This yields stronger recall across diverse item sets compared to using collaborative filtering or keyword search alone.


4 Stage 2: Deep Learning Based Scoring & Ranking

Candidate generation provides a broad list of potentially relevant items, but it cannot determine the final ordering on its own. Ranking is where the system evaluates each candidate more carefully, using richer features and more expressive models. The goal is simple: estimate which items the user is most likely to interact with right now and return them in the correct order—all within a few milliseconds.

To stay within the overall <100ms budget described earlier, ranking models must be fast, batch-friendly, and easy to optimize in production. Deep learning architectures like Two-Tower models are ideal for this stage because they capture relationships between users and items while keeping inference lightweight. Exporting these models to ONNX and running them inside .NET ensures consistent performance under real-time load.

4.1 The “Two-Tower” Neural Network Architecture

The Two-Tower model is widely used for large-scale ranking because it predicts user–item affinity efficiently. Instead of passing both user and item features into one large network, the architecture builds two separate neural networks—one for the user, one for the item. Each tower outputs an embedding vector. The similarity between the two vectors (typically a dot product) becomes the predicted relevance score.

This design fits nicely with the retrieval→ranking pipeline:

  • Retrieval (Section 3) can precompute or store item embeddings.
  • Ranking only needs to compute the user embedding and then compare it to a batch of item embeddings.
  • The dot product operation is extremely fast and predictable.

A typical Two-Tower implementation includes categorical features (region, device type), numerical aggregates (interaction counts), and embedding inputs (text or image embeddings). Both towers produce vectors in the same dimension so they can be compared directly.

Example training architecture in PyTorch:

import torch
import torch.nn as nn

class TwoTowerModel(nn.Module):
    def __init__(self, user_dim, item_dim, hidden_dim):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Linear(user_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )
        self.item_tower = nn.Sequential(
            nn.Linear(item_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )

    def forward(self, user_features, item_features):
        u = self.user_tower(user_features)
        i = self.item_tower(item_features)
        return (u * i).sum(dim=1)

In production, item embeddings are often generated offline and stored in Redis, a vector store, or memory-mapped arrays. When a request arrives, the serving layer computes the user embedding once and scores all candidate items through efficient batch operations.

4.2 Leveraging ONNX Runtime

While Python is ideal for model training, real-time serving requires stability and predictable latency. This is why teams export trained models to ONNX and run inference inside .NET. ONNX Runtime provides highly optimized operators, multicore parallelism, and optional hardware acceleration.

4.2.1 Exporting a Model to ONNX

Exporting Two-Tower models from PyTorch is straightforward:

dummy_user = torch.randn(1, user_dim)
dummy_item = torch.randn(1, item_dim)

torch.onnx.export(
    model,
    (dummy_user, dummy_item),
    "two_tower.onnx",
    input_names=["user", "item"],
    output_names=["score"],
    dynamic_axes={
        "user": {0: "batch"},
        "item": {0: "batch"}
    },
    opset_version=15
)

Dynamic axes allow batching multiple items during inference—crucial for scoring dozens or hundreds of candidates at once.

4.2.2 Using ONNX Runtime from .NET

ONNX Runtime integrates cleanly with .NET, making inference fast and predictable. Loading and running the model:

using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

var session = new InferenceSession("two_tower.onnx");

var userTensor = new DenseTensor<float>(userVector, new[] { 1, userVector.Length });
var itemTensor = new DenseTensor<float>(itemVectors, new[] { batchCount, itemDim });

var inputs = new List<NamedOnnxValue>
{
    NamedOnnxValue.CreateFromTensor("user", userTensor),
    NamedOnnxValue.CreateFromTensor("item", itemTensor)
};

using var results = session.Run(inputs);
var scores = results.First().AsEnumerable<float>().ToArray();

Because the ranking stage uses batches, ONNX Runtime’s optimized kernels keep the cost low—typically just a few milliseconds per request.

4.3 Inference Optimization in C#

Running real-time inference inside an API must account for high request volume and short time budgets. Even small inefficiencies accumulate quickly. The main considerations are memory allocation, pooling, and batching.

4.3.1 Memory Pooling

Array allocations inside the hot path create GC pressure, which leads to latency spikes. Using ArrayPool<T> avoids repeated allocations:

using System.Buffers;

var pool = ArrayPool<float>.Shared;
var buffer = pool.Rent(userVectorLength);

Buffer.BlockCopy(userVector, 0, buffer, 0, userVectorLength * sizeof(float));

try
{
    // Rented arrays may be longer than requested, so slice to the exact size.
    var userTensor = new DenseTensor<float>(
        buffer.AsMemory(0, userVectorLength), new[] { 1, userVectorLength });
    // Use userTensor here
}
finally
{
    pool.Return(buffer);
}

This keeps the memory footprint stable and reduces GC churn.

4.3.2 Tensor Reuse

Rather than creating new tensor objects, teams reuse wrappers and update only the underlying buffer:

var tensor = new DenseTensor<float>(new Memory<float>(buffer), new[] { 1, dim });
// Overwrite buffer before each inference

This pattern cuts down object creation and stabilizes latency under load.

4.3.3 Batch Scoring

Scoring items one by one wastes compute. Batch scoring keeps ONNX Runtime busy and minimizes overhead:

var itemTensor = new DenseTensor<float>(itemBatchBuffer, new[] { batchSize, itemDim });

Most systems score batches of 32–256 items at a time depending on hardware and latency constraints.
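The chunking loop itself is simple. In the sketch below, `scoreBatch` is a stand-in for the `session.Run` call shown earlier, so the loop stays runnable on its own; the batch size and item names are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Chunk candidates into fixed-size batches; scoreBatch stands in for the
// ONNX session call, keeping this sketch self-contained.
const int batchSize = 64;
var candidates = Enumerable.Range(0, 150).Select(i => $"item-{i}").ToList();

Func<IReadOnlyList<string>, float[]> scoreBatch =
    batch => batch.Select(id => (float)id.Length).ToArray();

var scores = new Dictionary<string, float>();
for (int offset = 0; offset < candidates.Count; offset += batchSize)
{
    var batch = candidates.Skip(offset).Take(batchSize).ToList();
    var batchScores = scoreBatch(batch);   // one model invocation per batch
    for (int i = 0; i < batch.Count; i++)
        scores[batch[i]] = batchScores[i];
}

Console.WriteLine(scores.Count); // 150 items scored in 3 batches
```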

4.4 Alternative: LightGBM for Ranking

Not every ranking workload requires deep learning. When features are mostly structured—price, category, popularity counts, inventory—gradient-boosted trees often work better than neural models. ML.NET includes a LightGBM ranking trainer that fits nicely into the same serving pipeline.

Example ranking pipeline:

public class RankingInput
{
    [LoadColumn(0)] public float Label { get; set; }
    [LoadColumn(1)] public float GroupId { get; set; }
    [LoadColumn(2)] public float Price { get; set; }
    [LoadColumn(3)] public float CategoryId { get; set; }
    [LoadColumn(4)] public float UserAffinity { get; set; }
}

var ml = new MLContext();

var data = ml.Data.LoadFromTextFile<RankingInput>("rankingData.csv", separatorChar: ',');

// LightGBM ranking requires a key-typed group column, so map GroupId first.
var pipeline = ml.Transforms
    .Concatenate("Features",
        nameof(RankingInput.Price),
        nameof(RankingInput.CategoryId),
        nameof(RankingInput.UserAffinity))
    .Append(ml.Transforms.Conversion.MapValueToKey("GroupId"))
    .Append(ml.Ranking.Trainers.LightGbm(new LightGbmRankingTrainer.Options
    {
        LabelColumnName = "Label",
        FeatureColumnName = "Features",
        RowGroupColumnName = "GroupId"
    }));

var model = pipeline.Fit(data);

LightGBM models score very quickly and integrate well with the feature store described in Section 2. They often serve as either a baseline or as part of a hybrid ranking approach alongside Two-Tower models.


5 Handling the “Cold Start” Problem

Cold start appears whenever the system lacks historical interactions for either a user or an item. Because retrieval models like ALS rely on past behaviors, and ranking models rely on feature completeness, the system must handle these scenarios explicitly. The goal is not to simulate full personalization on day one. The goal is to produce reasonable, high-quality defaults that gradually adapt as signals accumulate.

A well-designed cold start strategy fits into the same architecture used throughout the pipeline: fast feature lookups, vector-based retrieval, and lightweight ranking. It relies on the components already described—Redis features, Stream Analytics aggregates, vector search, and ONNX scoring—but uses them differently depending on the situation.

5.1 New User Strategy

New users enter the system with no interaction history, so collaborative filtering and behavior-driven ranking cannot activate yet. Instead, the system depends on fast-moving global and regional trends, combined with early session signals. This ensures the user sees relevant items immediately while the system waits for the first meaningful interactions.

5.1.1 Regional Popularity Example

Trending signals computed by the streaming layer (Section 2) form the backbone of new-user recommendations. These pre-aggregated lists live in Redis, where they can be accessed in under a millisecond.

Common keys might include:

  • trending:global
  • trending:region:us
  • trending:region:uk

At request time, the serving API pulls from the appropriate list:

var db = redis.GetDatabase();
var regionKey = $"trending:region:{regionCode}";
var items = await db.ListRangeAsync(regionKey);

// Convert Redis entries to recommendation objects

Because the retrieval stage can produce these candidates instantly, the ranking stage still runs as normal. This ensures consistent behavior across the pipeline, even without personalization signals.
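The fallback logic is a simple chain: personalized history when it exists, otherwise regional trending, otherwise global trending. A sketch with in-memory lists standing in for the Redis lookups:

```csharp
using System;
using System.Collections.Generic;

// Fallback chain sketch; the lists are stand-ins for Redis lookups.
var personalized = new List<string>();                      // empty: brand-new user
var regionalTrending = new List<string> { "r1", "r2" };
var globalTrending = new List<string> { "g1", "g2", "g3" };

List<string> CandidatesForNewUser() =>
    personalized.Count > 0 ? personalized
    : regionalTrending.Count > 0 ? regionalTrending
    : globalTrending;

Console.WriteLine(string.Join(",", CandidatesForNewUser())); // r1,r2
```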

5.1.2 Session-Based Recommendations

Even within a few page views, the system can start building a short-term profile. Recent clicks serve as strong indicators of immediate intent. The fast feature store already captures these signals, so retrieval can use them without additional orchestration.

For example, if the last viewed item has an embedding stored in Redis, semantic retrieval can use it directly:

var lastItemVector = await db.StringGetAsync($"session:{sessionId}:lastVector");
var queryVector = JsonSerializer.Deserialize<float[]>(lastItemVector);

var similarItems = await qdrant.SearchAsync("items", queryVector, limit: 10);

This pattern mirrors the hybrid retrieval workflow in Section 3, but applies it to session data instead of long-term preferences. It often produces surprisingly accurate short-term recommendations long before collaborative filtering becomes effective.

5.2 New Item Strategy

New items present the opposite problem: they have no interactions, so models like ALS cannot assign them latent factors. Without a fallback strategy, these items would never appear in recommendations, hurting catalog coverage.

The solution is to initialize new items using content-based representations—typically text, image, or metadata embeddings—and insert them into the vector retrieval layer immediately.

5.2.1 Content-Based Mapping

The system generates embeddings for new items using the same embedding model used during retrieval. This guarantees all items occupy the same vector space, which maintains retrieval quality and consistency.

var vector = await embeddingService.GenerateAsync(itemDescription);
await qdrant.UpsertAsync("items", itemId, vector);

Once stored, the item becomes immediately discoverable through vector search. This removes any delay between item onboarding and first-time visibility.

5.2.2 Using Image and Text Embeddings Together

Many catalogs benefit from multimodal representations. For example, e-commerce products may include rich descriptions and multiple product images. Combining these signals produces more stable embeddings.

A typical approach:

text_vec = text_encoder.encode(description)
image_vec = image_encoder.encode(image_tensor)

combined = (text_vec + image_vec) / 2.0

These mixed embeddings help new items appear alongside semantically similar items—even before the first view or click occurs.
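The same fusion can be sketched in C#. L2-normalizing each modality before averaging is an assumption here, not part of the original recipe; without it, the modality with the larger magnitude dominates the combined vector. The toy values are made up.

```csharp
using System;
using System.Linq;

// L2-normalize each modality before averaging so neither dominates.
float[] Normalize(float[] v)
{
    var norm = MathF.Sqrt(v.Sum(x => x * x));
    return v.Select(x => x / norm).ToArray();
}

float[] textVec  = Normalize(new[] { 3f, 4f });   // toy embedding values
float[] imageVec = Normalize(new[] { 0f, 1f });

// Element-wise average of the normalized modalities.
float[] combined = textVec.Zip(imageVec, (t, i) => (t + i) / 2f).ToArray();

Console.WriteLine(combined.Length);
```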

5.2.3 Hybrid Inclusion in Ranking

Ranking models often expect a complete set of behavioral features—view counts, click rates, purchase funnel metrics. New items have none of these, but they still need to enter the ranking stage smoothly.

The system handles this by assigning default or median values to missing features, allowing the ranking model to score the item without errors:

item.FeatureViews       = item.FeatureViews       ?? 0;
item.FeatureCarts       = item.FeatureCarts       ?? 0;
item.FeatureCategoryId  = item.FeatureCategoryId  ?? defaultCategory;

This approach preserves ranking stability and ensures new items can still compete fairly. As the item begins accumulating interactions, the feature store (Section 2) gradually replaces these defaults with real metrics.


6 Building the High-Performance Serving Layer

The serving layer is where every part of the recommendation system comes together: feature retrieval, candidate generation, vector search, ranking, re-ranking, and final assembly. By the time the request reaches this layer, the system must execute all remaining steps within the sub-100ms constraint outlined in Section 1. A serving API must therefore be predictable under heavy load, tolerant to slow dependencies, and flexible enough to scale each subsystem independently. .NET’s async model, runtime performance, and hosting capabilities make it a strong fit for these requirements.

A consistent serving design follows the same principle used throughout this article: break the work into focused components, keep each dependency isolated, and use optimized libraries (Redis, ONNX Runtime, Qdrant, Azure AI Search) where they make sense.

6.1 API Architecture

The API surface typically has two audiences:

  • Internal microservices that need to communicate with the retrieval engine, ranking service, feature store, or vector search components.
  • External clients that expect a simple REST-style endpoint for fetching recommendations.

Separating these concerns leads to predictable performance and clearer deployments. Internal calls benefit from strongly typed contracts, while external calls should remain lightweight and easy to consume.

6.1.1 Internal gRPC Services

Internal components—ranking, vector retrieval, or feature enrichment—often run on separate compute nodes. gRPC works well here because it provides low-latency RPC communication and clear, strongly typed contracts.

Example .proto definition used by the ranking service:

syntax = "proto3";

service RankingService {
  rpc ScoreItems (ScoreRequest) returns (ScoreResponse);
}

message ScoreRequest {
  repeated float userVector = 1;
  repeated Item items = 2;
}

message Item {
  string id = 1;
  repeated float vector = 2;
}

message ScoreResponse {
  repeated ScoredItem results = 1;
}

message ScoredItem {
  string id = 1;
  float score = 2;
}

The implementation in .NET stays small because the ranking logic (Section 4) is encapsulated inside a dedicated engine:

public class RankingServiceImpl : RankingService.RankingServiceBase
{
    private readonly RankingEngine _engine;

    public RankingServiceImpl(RankingEngine engine)
    {
        _engine = engine;
    }

    public override Task<ScoreResponse> ScoreItems(ScoreRequest request, ServerCallContext context)
    {
        var results = _engine.Score(request);
        return Task.FromResult(results);
    }
}

This mirrors the overall architecture: retrieval, vector search, and ranking each live behind focused services that can scale independently.

6.1.2 Minimal APIs for External Clients

External callers—mobile apps, web clients, or other services—benefit from a simple, clean endpoint. ASP.NET Core’s Minimal APIs fit this requirement and align with the low overhead needed for high-throughput traffic.

app.MapGet("/recommendations/{userId}", async (string userId, RecommendationService svc) =>
{
    var results = await svc.GetRecommendationsAsync(userId);
    return Results.Ok(results);
});

The RecommendationService orchestrates feature retrieval (Section 2), candidate generation (Section 3), and ranking (Section 4). Keeping the controller thin is essential for maintainability and performance.

6.1.3 Asynchronous Patterns and I/O-Bound Work

Most operations in the serving layer are I/O-bound: Redis reads, vector searches, ONNX inference calls, and internal RPC. Blocking threads here would cripple throughput under load. Using async methods prevents thread starvation and maximizes CPU availability for ranking computations.

Example feature retrieval:

public async Task<UserContext> GetUserContextAsync(string userId)
{
    var db = _redis.GetDatabase();
    var json = await db.StringGetAsync($"user:{userId}:context");

    return json.HasValue
        ? JsonSerializer.Deserialize<UserContext>(json)
        : new UserContext();
}

This matches the same pattern used throughout earlier stages: non-blocking, fast lookups, minimal allocations.

6.2 Caching Strategies

Caching reduces repeated work in the serving layer and prevents expensive calls to vector databases or ranking services. A multi-layer cache—following the same fast/slow separation described in Section 2—is essential for predictable performance.

6.2.1 Multi-Layer Caching

Two levels of caching are common:

  • In-memory (IMemoryCache) for global or semi-static features
  • Redis for per-user or per-session data

Global trending lists or precomputed embedding groups can live directly in memory:

if (!_memoryCache.TryGetValue("top-global", out List<Item> globalItems))
{
    globalItems = await _metadataService.GetTopGlobalAsync();
    _memoryCache.Set("top-global", globalItems, TimeSpan.FromMinutes(5));
}

User-specific features or short-term signals belong in Redis because they change frequently and must be shared across instances.

6.2.2 Cache-Aside Pattern

The cache-aside pattern keeps the logic consistent:

  1. Check the cache.
  2. If present, return the cached value.
  3. If missing, fetch from the source and cache the result.

This mirrors the structure used in the feature store (Section 2).

public async Task<UserFeatures> GetUserFeaturesAsync(string userId)
{
    var db = _redis.GetDatabase();
    var json = await db.StringGetAsync($"features:{userId}");

    if (!json.HasValue)
    {
        var features = await _slowStore.GetUserFeaturesAsync(userId);
        await db.StringSetAsync($"features:{userId}", JsonSerializer.Serialize(features), TimeSpan.FromMinutes(10));
        return features;
    }

    return JsonSerializer.Deserialize<UserFeatures>(json);
}

Using short TTLs for user-level data keeps recommendations responsive while avoiding heavy invalidation logic.

6.2.3 Invalidating Stale Recommendations

Because recommendations depend on rapidly evolving signals—session data, recent behavior, global trends—caches must expire aggressively:

  • User-level: 1–5 minutes
  • Global items: 5–15 minutes

This matches the freshness expectations established in the retrieval and ranking layers. Setting TTLs this way prevents outdated results without requiring complex back-channel invalidation processes.
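The mechanics of TTL-based expiry are simple enough to sketch. The following Python toy cache (an illustration of the concept, not what Redis or IMemoryCache does internally) evicts entries lazily on read; the injectable clock exists only to make expiry testable:

```python
import time

class TtlCache:
    """Minimal TTL cache sketch: entries silently expire after their
    time-to-live, mirroring the aggressive expiry described above."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, self._clock() + ttl_seconds)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # lazily evict expired entries on read
            return default
        return value
```

With per-user entries set to a 1–5 minute TTL, a stale recommendation list simply ages out on the next read; no invalidation message ever needs to be sent.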

6.3 Resilience

The serving layer must remain stable even when dependencies fail or slow down. Retrieval engines, vector databases, Redis, and ranking services may all experience intermittent issues. Without proper resilience patterns, these failures cascade and cause system-wide latency spikes or outages.

6.3.1 Circuit Breakers with Polly

Circuit breakers protect the system by detecting failing dependencies and temporarily halting traffic to them. Instead of repeatedly waiting for timeouts, the API quickly falls back to simpler logic such as trending lists or cached embeddings.

var circuitPolicy = Policy
    .Handle<Exception>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromSeconds(30)
    );

This mirrors the fallback strategies discussed in the cold-start section.

6.3.2 Bulkheads for Resource Isolation

Bulkheads ensure that one dependency cannot consume all available threads. Each subsystem—Redis, Qdrant, ONNX inference—can be wrapped in its own bulkhead.

var bulkhead = Policy.BulkheadAsync(
    maxParallelization: 50,
    maxQueuingActions: 200
);

This provides predictable behavior under peak load and protects core services.

6.3.3 Timeouts and Fallbacks

Timeouts prevent slow calls from blocking the pipeline. Paired with fallbacks, they create a graceful degradation path.

var timeout = Policy.TimeoutAsync(
    TimeSpan.FromMilliseconds(50),   // the int overload is seconds, so use a TimeSpan for a 50ms budget
    TimeoutStrategy.Pessimistic);

var fallback = Policy<UserRecommendations>
    .Handle<Exception>()
    .FallbackAsync(ct => _fallbackService.GetTrendingAsync());

These techniques keep the API responsive even when parts of the system are degraded, preserving the user experience.


7 Continuous Improvement: A/B Testing and MLOps

A real-time recommendation engine is never “finished.” User behavior shifts, inventory changes, and new data arrives every second. Models that performed well last month gradually drift as patterns evolve. Continuous evaluation and controlled experimentation are therefore part of the core system architecture, not an optional add-on. The same principles used earlier—clear separation of responsibilities, fast feedback loops, and predictable infrastructure—apply here as well.

A/B testing ensures that new retrieval or ranking strategies do not harm engagement metrics in production. Meanwhile, MLOps pipelines keep the underlying models fresh by retraining them regularly and deploying updates safely. Together, these capabilities ensure the system adapts to changing behavior without introducing instability.

7.1 Infrastructure for Experimentation

Experimentation requires a controlled way to expose a subset of users to new retrieval models, ranking configurations, or feature store variations. Because the serving layer (Section 6) already routes most logic through the RecommendationService, inserting experimentation logic here keeps the design clean and testable.

Azure App Configuration and Feature Management integrate well with .NET, allowing traffic routing logic to remain simple and centralized.

7.1.1 Using App Configuration for Model Routing

Feature flags provide an easy way to toggle model behavior in real time. For example, a new deep-ranking model exported through ONNX might be enabled for only 5% of traffic initially.

bool useNewModel = await _featureManager.IsEnabledAsync("UseDeepRanking");

This toggle can determine whether the system calls:

  • The existing LightGBM-based ranking pipeline
  • The new Two-Tower ONNX model described in Section 4

This allows gradual rollout without redeploying the API.

7.1.2 User Bucketing with Hashing

Feature flags alone are not enough—users must receive consistent experiences. Bucketing solves this by assigning a user to a variant using a deterministic hash.

public int GetBucket(string userId)
{
    // string.GetHashCode is randomized per process on modern .NET, so a
    // stable hash (here MD5) is required for consistent bucket assignment
    // across machines, restarts, and sessions.
    var bytes = MD5.HashData(Encoding.UTF8.GetBytes(userId));
    return (int)(BitConverter.ToUInt64(bytes, 0) % 100);
}

Bucket rules might be:

  • Buckets 0–9 → Experimental ONNX ranking
  • Buckets 10–99 → Existing production ranking

Since the assignment is based on the user ID, the user remains in the same bucket across sessions and devices.
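The same idea is easy to verify in a language-agnostic sketch (shown here in Python; `get_bucket` and `variant_for` are illustrative names). A stable hash such as MD5 guarantees identical buckets across machines and restarts, and its output is uniform enough that bucket sizes track the intended traffic split:

```python
import hashlib

def get_bucket(user_id: str, buckets: int = 100) -> int:
    """Deterministically map a user ID to a bucket in [0, buckets).
    A stable hash (MD5) yields the same bucket on every machine and
    every run, unlike runtime-randomized hash functions."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def variant_for(user_id: str) -> str:
    # Buckets 0-9 (10% of users) see the experimental ranking model.
    return "experimental" if get_bucket(user_id) < 10 else "production"
```

Because the mapping depends only on the ID, increasing the rollout from 10% to 20% (threshold 10 to 20) keeps every existing experimental user in the experiment, which preserves metric continuity.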

7.1.3 Telemetry-Based Monitoring

Each variant logs engagement metrics—impressions, clicks, dwell time—through the same ingestion pipeline described in Section 2. Because these metrics are streamed in real time, analytics dashboards can detect regressions early.

This provides a closed-loop measurement system:

  • Retrieval and ranking logic → Serving layer → User interaction
  • User interaction → Event ingestion → Stream processing → A/B analysis

This symmetry keeps experimentation reliable and consistent.

7.2 The Feedback Loop

A recommendation system must constantly compare what it predicted versus what actually happened. This feedback drives both feature updates and model retraining.

7.2.1 Logging Predictions

When the serving layer returns ranked items, it logs the model output alongside metadata like user ID, request context, and scores:

_logger.LogInformation("RankedItems: {Items}", JsonSerializer.Serialize(results));

These logs travel through Event Hubs or Kafka (from Section 2), eventually landing in storage where offline training jobs can read them.

7.2.2 Training Pipeline with Azure ML

Azure ML pipelines automate retraining using the newest interaction data. A typical nightly workflow reuses the same components described earlier:

  1. Pull interaction events captured through the ingestion layer
  2. Recompute fast-moving item statistics
  3. Generate updated embeddings (Sections 2 and 3)
  4. Retrain retrieval models like ALS
  5. Retrain ranking models such as Two-Tower networks
  6. Export updated models to ONNX
  7. Register them for deployment

Example Python entry point:

def train():
    data = load_interaction_data()
    model = train_two_tower(data)
    export_to_onnx(model, "model.onnx")

This keeps behavior-driven models synchronized with current user patterns.

7.2.3 Automated Deployment

Once a model passes offline evaluation (Section 7.3), CI/CD pipelines deploy it to a staging environment. During rollout:

  • gRPC ranking workers load the new ONNX model
  • Feature flags route a small percentage of traffic to the new model
  • Real-time metrics validate performance

If metrics degrade, the feature flag flips back, rolling the system to a stable version immediately.

This mirrors the resilience principles discussed in Section 6.

7.3 Offline Evaluation

Before exposing a model to real users, the team must understand its expected performance. Offline evaluation uses historical ground truth to test ranking quality. It’s not a substitute for online A/B testing, but it ensures only promising models move forward.

7.3.1 Computing nDCG

Normalized Discounted Cumulative Gain rewards models that rank relevant items near the top. It is one of the most widely used metrics for recommendation systems.

Simple Python implementation:

import numpy as np

def ndcg(pred, truth, k):
    pred = np.asarray(pred)
    truth = np.asarray(truth, dtype=float)
    gains = 1 / np.log2(np.arange(2, k + 2))   # positional discounts for ranks 1..k
    idx = np.argsort(pred)[::-1][:k]           # top-k items by predicted score
    dcg = (truth[idx] * gains[: len(idx)]).sum()
    ideal = (np.sort(truth)[::-1][:k] * gains[: min(k, len(truth))]).sum()
    return dcg / ideal if ideal > 0 else 0.0

This directly reflects how well the ranking stage (Section 4) orders candidates.
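A quick worked example makes the arithmetic concrete. With k = 3 and two relevant items (gain 1) ranked first and third, the discounts are 1/log2(rank + 1):

```python
import numpy as np

# Relevance of the items in the order the model predicted them.
# DCG  = 1/log2(2) + 0/log2(3) + 1/log2(4) = 1.0 + 0.0 + 0.5 = 1.5
# IDCG = 1/log2(2) + 1/log2(3) + 0/log2(4) ~ 1.0 + 0.6309     ~ 1.6309
relevance_in_predicted_order = np.array([1.0, 0.0, 1.0])
gains = 1 / np.log2(np.arange(2, 5))  # discounts for ranks 1..3
dcg = (relevance_in_predicted_order * gains).sum()
idcg = (np.sort(relevance_in_predicted_order)[::-1] * gains).sum()
ndcg_value = dcg / idcg  # ~ 0.9197: penalized for the relevant item at rank 3
```

Swapping the items at ranks 2 and 3 would raise the score to exactly 1.0, which is the intuition behind the metric: the same hits, placed higher, are worth more.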

7.3.2 Precision@K

Precision@K is easier to interpret and measures the fraction of top-K results that are relevant:

def precision_k(pred, truth, k):
    idx = np.argsort(pred)[::-1][:k]
    return sum(truth[i] for i in idx) / k

This gives a quick snapshot of how many of the top recommendations are correct.

In practice, teams compute nDCG, Precision@K, Recall@K, and coverage metrics before promoting a model to A/B testing.


8 Conclusion

A real-time recommendation engine in .NET brings together multiple specialized components—event ingestion, stream processing, feature stores, retrieval models, deep ranking, and resilient API design—into a single coordinated system. Each part addresses a specific step in the pipeline: capturing intent, transforming signals, generating candidates, ranking them efficiently, and serving results predictably. When combined, the architecture balances freshness, accuracy, and sub-100ms latency in a way that fits naturally with .NET’s strengths in performance, tooling, and operational stability.

The overall pattern mirrors the funnel introduced earlier: fast recall using collaborative filtering and semantic search, precise scoring using ONNX-based ranking models, and a resilient serving layer that hides complexity from the client. This layered design makes it possible to evolve individual components—retrieval engines, ranking models, feature stores—without disrupting the entire system.

8.1 Summary of the Stack

The full system integrates technologies that complement each other and align with the architecture illustrated throughout this article:

  • ML.NET for matrix factorization and classical ranking baselines
  • Azure AI Search or Qdrant for vector-based retrieval powered by embeddings
  • ONNX Runtime for low-latency scoring of Two-Tower and other deep ranking models
  • Redis as the real-time feature store for user and session signals
  • Azure Stream Analytics for windowed aggregation and trending metrics
  • ASP.NET Core Minimal APIs and gRPC for high-throughput serving and internal RPC
  • Polly for circuit breakers, bulkheads, and fallback logic

Each component directly supports the constraints discussed earlier: fast feature retrieval, efficient batch inference, reliable vector search, and predictable API latency.

8.2 The Cost of Real-Time

Real-time personalization provides measurable lift in engagement, but it comes with infrastructure trade-offs. Every low-latency dependency contributes to operational load:

  • Redis clusters must handle constant writes from streaming updates
  • Vector databases require memory-optimized nodes and ANN indexes
  • Ranking services must support high concurrency with ONNX inference
  • Frequent retraining increases storage and compute requirements

Teams must choose the right balance between personalization granularity and infrastructure footprint. Techniques described earlier—TTL-based caching, batch scoring, fallback retrieval, and selective feature updates—help keep costs under control without sacrificing user experience.

Cost is not only financial; it also includes operational simplicity. Systems that separate retrieval, ranking, and orchestration remain easier to scale and debug than monolithic ML services.

8.3 The Future

Recommendation systems are shifting toward richer, more expressive representations of user behavior. Several trends stand out:

8.3.1 Generative AI as a Reasoning Layer

LLMs can provide clearer explanations and support mixed retrieval strategies:

  • “We recommended this because you watched similar content yesterday.”
  • “Users with similar browsing patterns interacted with these items.”

These explanations enhance user trust and enable debugging of ranking decisions.

8.3.2 Sequential and Transformer-Based Models

Modern architectures treat a user’s behavior as a sequence rather than isolated events. Transformer-based models capture short-term intent more effectively than static embeddings and adapt quickly to fast-changing tastes. These models can also be exported to ONNX and integrated into the same .NET inference pipeline described in Section 4.

8.3.3 Deeper Integration of Vector and Graph Signals

As vector databases become faster and support hybrid filtering at scale, more systems will rely on semantic representations rather than sparse behavior alone. Combined with graph-based relationships—co-view graphs, item similarity graphs—retrieval will become richer and more dynamic.

8.3.4 Convergence of Retrieval, Ranking, and Large Models

Future pipelines will blend:

  • Classical retrieval (ALS, heuristics)
  • Vector retrieval (embeddings, multimodal signals)
  • Deep ranking (Two-Tower, transformer-based scoring)
  • Generative reasoning for explanations or fallback decisions

Because .NET can host ONNX Runtime, vector clients, and high-throughput APIs efficiently, it remains a strong platform for these emerging patterns.
