1 Architectural Blueprint: Moving Beyond Simple Scraping
Most movie review aggregators begin as small utilities: fetch a page, scrape a number, store it somewhere. That works for a prototype, but it collapses as soon as you try to operate at real scale. Once you ingest data from multiple providers, deal with different scoring systems, and respect strict rate limits, a single scraping script becomes fragile and expensive to maintain.
At scale, movie review aggregation is no longer “scraping.” It is a distributed ingestion system that continuously pulls data from many sources, cleans and normalizes it, resolves identity conflicts, and makes the results available in near real time. This section describes the architectural foundation typically used by senior .NET teams building systems comparable to Rotten Tomatoes or Metacritic, using .NET 8/9, ASP.NET Core, MassTransit, Redis, PostgreSQL, and OpenTelemetry.
1.1 The Domain Model
A review aggregator lives or dies by the quality of its domain model. The moment you integrate IMDb, Metacritic, TMDb, and independent critic sites, you encounter conflicting identifiers, inconsistent naming, and partial metadata. If the model is unclear or unstable, every downstream process becomes harder.
The goal is to define a small set of core entities that remain stable even as ingestion sources, parsing logic, and scoring algorithms change over time.
1.1.1 Entities: Movie, Critic, Review, AggregateScore
A clean domain model allows ingestion, normalization, and scoring to evolve independently. Each entity has a clear responsibility and boundary.
Movie Represents the platform’s internal, canonical view of a movie. This is the identity everything else attaches to.
Typical fields:
- MovieId (GUID or long)
- Title
- ReleaseYear
- Directors (collection)
- CanonicalSlug
- CreatedAt, UpdatedAt
This entity does not try to mirror any external source exactly. Instead, it represents the platform’s best understanding of a movie.
Critic Represents a reviewer or publication, not a single review.
Fields:
- CriticId
- Name
- Tier (Top Critic, Certified Reviewer, User)
The Tier field becomes important later when weighting scores and separating professional reviews from audience feedback.
Review Represents an individual opinion about a movie. Reviews are ingested in raw form and then normalized.
Fields:
- ReviewId
- MovieId
- CriticId
- Source (IMDb, Metacritic, Independent Blog)
- OriginalScore (as displayed by the source)
- NormalizedScore (converted to a 0–100 scale)
- ReviewText
- Url
Keeping both original and normalized scores allows auditing and recalculation when scoring rules change.
AggregateScore Represents the computed outcome users actually care about.
Fields:
- MovieId
- CriticScore
- AudienceScore
- NumCriticReviews
- NumAudienceReviews
- BayesianWeightedScore
This entity is derived data. It can be recalculated at any time from underlying reviews.
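As a concrete sketch, the four entities might be expressed as C# records. The property names follow the field lists above; the concrete types (Guid, int, double) are assumptions, not requirements.

```csharp
using System;
using System.Collections.Generic;

// Sketch of the core domain entities. Property names follow the text;
// the chosen types are illustrative.
public record Movie(
    Guid MovieId, string Title, int ReleaseYear,
    IReadOnlyList<string> Directors, string CanonicalSlug,
    DateTime CreatedAt, DateTime UpdatedAt);

public record Critic(Guid CriticId, string Name, string Tier);

public record Review(
    Guid ReviewId, Guid MovieId, Guid CriticId, string Source,
    string OriginalScore, double NormalizedScore, string ReviewText, string Url);

public record AggregateScore(
    Guid MovieId, double CriticScore, double AudienceScore,
    int NumCriticReviews, int NumAudienceReviews, double BayesianWeightedScore);
```

Records fit well here because the entities are compared by value and mutated only by producing new versions, which keeps normalization and scoring code side-effect free.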
1.1.2 The Canonical Movie Problem
Every data source identifies movies differently:
- IMDb uses IDs like tt0133093
- TMDb uses numeric IDs
- Metacritic uses human-readable slugs
- Many publications link movies only by title text
These differences lead to common conflicts:
- “The Matrix (1999)” vs. “Matrix, The”
- Director naming differences, such as “Wachowski” vs. “The Wachowskis”
- Regional title changes, like “Edge of Tomorrow” vs. “Live Die Repeat”
To handle this, the system needs a canonical mapping layer that connects all external identifiers to a single internal MovieId.
A simple PostgreSQL schema looks like this:
CREATE TABLE movie_canonical_map (
movie_id UUID NOT NULL,
source VARCHAR(50) NOT NULL,
external_id VARCHAR(255) NOT NULL,
PRIMARY KEY (source, external_id)
);
This table is central to the entire platform. Ingestion services write mappings as they discover them. Normalization and scoring services rely on it to ensure every review attaches to the correct movie. Without this layer, duplicate movies and fragmented scores are unavoidable.
1.2 High-Level System Design
A production-grade aggregator does not run as a single scraping process. It is composed of focused services that communicate through events and shared storage. This separation keeps the system resilient and easier to scale.
At a high level, the architecture consists of three core services.
1.2.1 Ingestion Service
The ingestion service is responsible for interacting with external systems.
Its responsibilities are narrow and deliberate:
- Fetch raw HTML, JSON, or API responses
- Respect rate limits and provider rules
- Apply retries and backoff
- Publish raw payloads for downstream processing
This service is stateless and optimized for network I/O. It does not parse or interpret content beyond what is required to route it correctly.
1.2.2 Normalization Service
The normalization service turns raw payloads into structured data.
Its responsibilities include:
- Parsing HTML or JSON into structured reviews
- Converting scores into a unified 0–100 scale
- Deduplicating reviews
- Resolving movie identity through canonical mapping
- Publishing “MovieUpdated” events
This service is CPU-heavy and usually runs as background workers. It is isolated from external dependencies so parsing failures or schema changes do not impact ingestion.
1.2.3 API Gateway (YARP)
The API gateway is the single entry point for clients.
Its responsibilities are:
- Routing requests to internal APIs
- Applying caching using HybridCache and Redis
- Enforcing rate limits and throttling
- Forwarding observability context for tracing
A minimal YARP configuration might look like this:
{
"ReverseProxy": {
"Routes": {
"movies": {
"ClusterId": "movie-api",
"Match": { "Path": "/api/movies/{**catch-all}" }
}
},
"Clusters": {
"movie-api": {
"Destinations": {
"d1": { "Address": "https://movie-api.internal/" }
}
}
}
}
}
The gateway keeps client-facing concerns separate from internal service logic.
1.3 Communication Patterns
Services must exchange data efficiently without becoming tightly coupled. The communication model determines how well the system scales and recovers from failures.
1.3.1 Synchronous vs. Asynchronous Ingestion
Synchronous ingestion—fetching data on demand during API requests—leads to predictable problems:
- Slow user responses
- Frequent rate-limit violations
- Cascading failures when external providers are slow or unavailable
Asynchronous ingestion avoids these issues by decoupling data collection from user traffic. Fetch operations are queued and processed independently.
Benefits include:
- Horizontal scalability
- Built-in retry and backoff behavior
- Consistent throughput even under load
1.3.2 Event-Driven Architecture with MassTransit
MassTransit provides a clean abstraction for building event-driven workflows using RabbitMQ, Azure Service Bus, or AWS SQS.
A simple message contract might look like this:
public record RawPayloadFetched(
string Source,
string ExternalId,
string Payload,
DateTime FetchedAt);
Publishing a message:
await _publishEndpoint.Publish(new RawPayloadFetched(
source, externalId, payload, DateTime.UtcNow));
Consuming the message:
public class RawPayloadConsumer : IConsumer<RawPayloadFetched>
{
public async Task Consume(ConsumeContext<RawPayloadFetched> context)
{
// Normalize, extract reviews, store, etc.
}
}
This approach cleanly separates concerns. Ingestion does not need to know how parsing works, and normalization does not care where the data came from. The result is higher throughput, better fault isolation, and a system that can evolve without constant rewrites.
2 The Ingestion Engine: Resilience and Rate Limiting
If a movie aggregation platform breaks, it usually breaks in the ingestion layer. This is the part of the system that talks directly to the outside world—APIs you don’t control, websites that change without warning, and providers that will block you if you behave poorly. Under light usage, almost any approach works. Under real traffic, weak ingestion design shows up immediately as timeouts, bans, missing data, or cascading failures.
A production ingestion engine has two core goals. First, it must be polite: respect rate limits, avoid unnecessary requests, and recover gracefully from failures. Second, it must be resilient: temporary outages or slow providers should not take down the rest of the platform. Everything in this section is about meeting those goals while still keeping throughput high.
2.1 Managing External Dependencies
Movie review data does not come from one clean, uniform source. Some providers expose structured APIs. Others publish static HTML pages. Many modern critic sites render content dynamically using JavaScript. A single ingestion service must handle all of these without becoming fragile or overly complex.
The key design principle is this: treat every external dependency as unreliable and potentially hostile, even when it’s well-documented.
2.1.1 Refit for API Consumption
When a provider offers an API, use it. APIs are usually faster, more stable, and less likely to trigger anti-bot defenses. Refit works well in .NET because it keeps HTTP concerns explicit while avoiding repetitive boilerplate.
A typical API client definition looks like this:
public interface IMetacriticApi
{
[Get("/movie/{slug}")]
Task<MetacriticMovieDto> GetMovie(string slug);
}
The client configuration stays simple and readable:
services.AddRefitClient<IMetacriticApi>()
.ConfigureHttpClient(c =>
{
c.BaseAddress = new Uri("https://api.metacritic.com");
});
Refit gives you strongly typed requests and responses, which reduces parsing errors and makes schema changes easier to detect. When combined with Polly, it also integrates cleanly with retries and circuit breakers without hiding HTTP behavior.
2.1.2 Playwright for Non-API and Dynamic HTML Sources
Not all review sources provide APIs. Many professional critics publish reviews on sites that rely heavily on client-side rendering. In those cases, traditional HTTP clients are insufficient because the HTML you receive does not contain the actual content.
Playwright for .NET fills this gap by running a real browser engine and executing JavaScript before extracting content.
using var playwright = await Playwright.CreateAsync();
await using var browser = await playwright.Chromium.LaunchAsync();
var page = await browser.NewPageAsync();
await page.GotoAsync(url);
var html = await page.ContentAsync();
In practice, Playwright should be treated as a scarce resource. Browser instances are expensive, and careless usage will overwhelm CPU and memory.
Operational lessons that matter:
- Run Playwright headless in isolated worker pods.
- Pre-warm browser contexts to reduce startup latency.
- Queue page visits instead of launching them concurrently.
This keeps dynamic scraping predictable and prevents ingestion workers from starving the rest of the system.
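These lessons can be captured in a small gate that queues page visits behind a fixed number of slots. PageVisitQueue and its limit are illustrative names, not Playwright APIs; a worker would wrap each GotoAsync/ContentAsync pair in RunAsync.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Gate that queues Playwright page visits instead of launching them all at
// once. The class and its limit are illustrative, not part of Playwright.
public sealed class PageVisitQueue
{
    private readonly SemaphoreSlim _gate;

    public PageVisitQueue(int maxConcurrentPages) =>
        _gate = new SemaphoreSlim(maxConcurrentPages);

    // Runs one page visit once a slot frees up; callers simply await the result.
    public async Task<T> RunAsync<T>(Func<Task<T>> visit)
    {
        await _gate.WaitAsync();
        try { return await visit(); }
        finally { _gate.Release(); }
    }
}
```

Keeping the limit low (a handful of pages per pod) is usually enough to stop browser work from starving ingestion CPU and memory.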
2.2 Advanced Rate Limiting Strategies
Rate limiting is not optional. Every provider enforces limits, even if they don’t publish them clearly. The challenge is enforcing limits consistently across multiple ingestion workers and multiple deployment instances.
A local rate limiter is not enough once the system scales horizontally.
2.2.1 Token Bucket with .NET RateLimiting
.NET’s built-in rate limiting primitives make it easier to express provider-specific limits in code. The token bucket algorithm works well because it allows short bursts while enforcing a steady long-term rate.
var limiter = PartitionedRateLimiter.Create<HttpRequestMessage, string>(req =>
{
var key = req.RequestUri.Host;
return RateLimitPartition.GetTokenBucketLimiter(key, _ => new TokenBucketRateLimiterOptions
{
TokenLimit = 10,
TokensPerPeriod = 10,
ReplenishmentPeriod = TimeSpan.FromSeconds(1),
AutoReplenishment = true
});
});
You attach this limiter to the ingestion HttpClient through a small custom DelegatingHandler (here called RateLimitingHandler) that acquires a permit before each request is sent:
services.AddHttpClient("scraper")
.AddHttpMessageHandler(() => new RateLimitingHandler(limiter));
This ensures that even under high load, requests to a single provider stay within acceptable bounds.
2.2.2 Distributed Rate Limiting with Redis
Once you deploy multiple ingestion instances, local rate limiting is no longer sufficient. Ten pods, each making ten requests per second, still look like a denial-of-service attack to an upstream provider.
Redis solves this by acting as a shared coordination point.
A simple fixed-window approach (note that the INCR and EXPIRE calls should ideally run atomically, for example in a Lua script; a crash between them leaves a counter that never expires):
var count = await db.StringIncrementAsync("imdb:rate:window");
if (count == 1)
await db.KeyExpireAsync("imdb:rate:window", TimeSpan.FromSeconds(1));
if (count > maxRequestsPerSecond)
throw new RateLimitExceededException();
Every ingestion worker checks the same counter. This keeps aggregate traffic within limits regardless of how many instances are running.
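For clarity, the same window logic can be sketched in-process. In production the counter lives in Redis, as above, so every worker shares it; this local mirror only shows the reset-per-second behavior.

```csharp
using System;

// In-process mirror of the Redis INCR/EXPIRE pattern: one counter per
// one-second window, reset whenever the clock crosses a second boundary.
public sealed class FixedWindowLimiter
{
    private readonly int _maxPerSecond;
    private long _window;
    private int _count;

    public FixedWindowLimiter(int maxPerSecond) => _maxPerSecond = maxPerSecond;

    public bool TryAcquire(DateTime nowUtc)
    {
        long second = nowUtc.Ticks / TimeSpan.TicksPerSecond;
        if (second != _window) { _window = second; _count = 0; } // fresh window
        return ++_count <= _maxPerSecond;
    }
}
```

The trade-off of fixed windows is a possible burst at the boundary between two windows; if that matters, a sliding-window or token-bucket variant in Redis is the usual next step.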
2.3 Retry Pattern with Polly V8
Failures are inevitable. Networks fail, providers throttle aggressively, and temporary outages happen. Retries are necessary, but uncontrolled retries make problems worse instead of better.
The goal is to retry only when it makes sense and to stop quickly when a provider is clearly unhealthy.
2.3.1 Jittered Backoff
If all workers retry at the same fixed interval, they create synchronized retry storms. Adding jitter spreads retries over time and reduces load spikes.
var retryOptions = new RetryStrategyOptions
{
    ShouldHandle = new PredicateBuilder().Handle<HttpRequestException>(),
    MaxRetryAttempts = 5,
    BackoffType = DelayBackoffType.Exponential,
    UseJitter = true // randomizes each delay so workers do not retry in lockstep
};

var retry = new ResiliencePipelineBuilder().AddRetry(retryOptions).Build();
This simple change significantly improves stability during partial outages.
2.3.2 Circuit Breaker Pattern
When a provider consistently fails, retries should stop altogether. Circuit breakers detect repeated failures and short-circuit requests for a cooling-off period.
var breakerOptions = new CircuitBreakerStrategyOptions
{
    ShouldHandle = new PredicateBuilder().Handle<HttpRequestException>(),
    FailureRatio = 0.5,                      // open once half of recent calls fail...
    MinimumThroughput = 5,                   // ...after at least five calls were sampled
    BreakDuration = TimeSpan.FromSeconds(30)
};
In Polly V8, retries and breakers are combined by chaining both strategies into a single resilience pipeline:
var policy = new ResiliencePipelineBuilder()
    .AddRetry(retryOptions)
    .AddCircuitBreaker(breakerOptions)
    .Build();
This ensures the ingestion engine fails fast when necessary and recovers automatically once the provider stabilizes.
2.4 IP Rotation and Proxy Management
Some providers block traffic based on IP reputation, regardless of how carefully you rate-limit. This is common with HTML scraping and smaller sites.
The safest strategy is always to prefer official APIs. When scraping is unavoidable, IP management becomes part of ingestion design.
Practical guidelines:
- Separate proxy pools by provider to isolate risk.
- Use residential proxies only for high-risk sources.
- Cache aggressively to reduce repeated requests.
- Monitor ban rates and error patterns continuously.
A basic proxy configuration for HttpClient:
var handler = new HttpClientHandler
{
Proxy = new WebProxy(proxyUrl)
};
In production, proxies are selected dynamically using round-robin or health-based scoring. Failed proxies are removed automatically, preventing widespread ingestion failures.
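A minimal round-robin selector might look like the sketch below; the health-based scoring and automatic removal mentioned above are deliberately left out.

```csharp
using System.Threading;

// Minimal round-robin proxy selector. Health scoring and automatic removal
// of failed proxies, described in the text, are omitted from this sketch.
public sealed class ProxyPool
{
    private readonly string[] _proxies;
    private int _next = -1;

    public ProxyPool(params string[] proxies) => _proxies = proxies;

    // Thread-safe rotation; the unsigned cast keeps the index positive
    // even after the counter overflows.
    public string Next() =>
        _proxies[(uint)Interlocked.Increment(ref _next) % (uint)_proxies.Length];
}
```

Each ingestion request then builds its handler with new WebProxy(pool.Next()), keeping proxy choice out of the scraping logic itself.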
3 Intelligent Data Hygiene: Entity Resolution and Deduplication
Once ingestion is reliable, the next major source of errors appears in data hygiene. At this stage, the system is pulling reviews successfully, but it still needs to answer a deceptively simple question: are all these reviews actually talking about the same movie? Even small mismatches here lead to duplicated entries, split scores, or incorrect rankings.
Entity resolution is not a one-time task. It is an ongoing process that runs continuously as new sources, reviews, and metadata arrive. A production aggregator treats this as a first-class concern, not a cleanup step.
3.1 The Challenge of Fuzzy Data
Movie titles are not stable identifiers. They change across regions, distributors, re-releases, and time. Consider just a single well-known example:
- “Star Wars Episode IV: A New Hope”
- “Star Wars: A New Hope”
- “Star Wars (1977)”
These are all the same movie, but string equality alone cannot tell you that. Differences come from many directions:
- punctuation and formatting
- subtitles being added or removed
- word ordering changes
- release year included or omitted
- localized or alternate titles
If the normalization pipeline compares titles too strictly, it creates duplicates. If it compares them too loosely, it merges unrelated movies.
Some common real-world pitfalls:
- “Alien” (1979) vs. “Aliens” (1986)
- Remakes that reuse the same title decades later
- Extended or director’s cuts listed as separate releases
The goal is not perfection. The goal is a deterministic, explainable process that produces consistent results and can be tuned when edge cases appear.
3.2 String Metrics and Algorithms
String similarity algorithms provide a foundation for comparing titles, but none of them work well in isolation. Each one captures a different aspect of similarity, and understanding their strengths and limits is critical.
3.2.1 Levenshtein Distance
Levenshtein distance measures how many single-character edits are needed to turn one string into another. It works well for catching typos and small formatting differences.
public int Levenshtein(string a, string b)
{
var dp = new int[a.Length + 1, b.Length + 1];
for (int i = 0; i <= a.Length; i++) dp[i, 0] = i;
for (int j = 0; j <= b.Length; j++) dp[0, j] = j;
for (int i = 1; i <= a.Length; i++)
for (int j = 1; j <= b.Length; j++)
{
var cost = a[i - 1] == b[j - 1] ? 0 : 1;
dp[i, j] = Math.Min(
Math.Min(dp[i - 1, j] + 1, dp[i, j - 1] + 1),
dp[i - 1, j - 1] + cost);
}
return dp[a.Length, b.Length];
}
This metric is useful, but it does not understand semantics. “Alien” and “Aliens” are very close by this measure, even though they refer to different movies. That makes Levenshtein a supporting signal, not a deciding one.
3.2.2 Jaro-Winkler
Jaro-Winkler favors matches that share common prefixes. This is useful when titles start the same but diverge later, which is common with subtitles. Using an implementation such as the one in the F23.StringSimilarity package:
double score = new JaroWinkler().Similarity(
"Star Wars",
"Star Wars: A New Hope");
Because early characters are weighted most heavily, titles that differ only in a trailing subtitle score higher than titles that diverge near the beginning.
3.2.3 FuzzySharp Weighted Ratios
FuzzySharp combines multiple comparison strategies into a single weighted score. It considers partial matches, token ordering, and string normalization.
var ratio = Fuzz.WeightedRatio(
"Star Wars",
"Star Wars: A New Hope");
In practice, this tends to outperform raw distance metrics for movie titles. It is especially effective when combined with additional metadata like release year or director.
3.3 Multi-Factor Matching Logic
Relying on title similarity alone leads to unacceptable error rates. A robust resolver combines multiple weak signals into a single confidence score. Each signal contributes context that reduces ambiguity.
Common factors include:
- Title similarity using Jaro-Winkler or WeightedRatio
- Release year where an exact match is a strong indicator
- Director name matched exactly or fuzzily
- Runtime as a secondary check when available
- Genre overlap for additional confidence
A simple weighted scoring function might look like this:
double ComputeConfidence(MovieSourceA a, MovieSourceB b)
{
var titleScore = Fuzz.WeightedRatio(a.Title, b.Title) / 100.0;
var yearScore = a.ReleaseYear == b.ReleaseYear ? 1.0 : 0.0;
var directorScore = Fuzz.Ratio(a.Director, b.Director) / 100.0;
return 0.6 * titleScore +
0.25 * yearScore +
0.15 * directorScore;
}
The weights reflect how reliable each signal is in practice. A threshold—often around 0.85—determines whether two records are considered the same movie. Scores below that threshold are either flagged for manual review or treated as separate entries.
This approach keeps the system explainable. When a match is wrong, you can see why it happened and adjust the weights accordingly.
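The threshold logic can be written as a small decision function. The 0.85 cutoff comes from the discussion above; routing mid-confidence pairs to manual review is an assumed policy to be tuned.

```csharp
public enum MatchDecision { SameMovie, ManualReview, Distinct }

public static class MatchRules
{
    // The 0.85 cutoff follows the text; the 0.70-0.85 manual-review band
    // is an assumption, not a universal constant.
    public static MatchDecision Decide(double confidence) => confidence switch
    {
        >= 0.85 => MatchDecision.SameMovie,
        >= 0.70 => MatchDecision.ManualReview,
        _ => MatchDecision.Distinct
    };
}
```

Making the decision a pure function keeps it trivially testable, and threshold changes show up in exactly one place.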
3.4 Canonical ID Mapping
Once two records are determined to refer to the same movie, the system must remember that decision. Canonical mapping ensures future ingestion is fast and consistent.
A typical PostgreSQL schema:
CREATE TABLE movies (
movie_id UUID PRIMARY KEY,
title TEXT,
release_year INT,
director TEXT
);
CREATE TABLE movie_source_links (
movie_id UUID NOT NULL REFERENCES movies(movie_id),
source VARCHAR(50) NOT NULL,
external_id VARCHAR(255) NOT NULL,
UNIQUE(source, external_id)
);
When a new association is discovered, it is persisted immediately:
INSERT INTO movie_source_links (movie_id, source, external_id)
VALUES ($1, 'imdb', $2)
ON CONFLICT DO NOTHING;
This design guarantees:
- Deterministic identity resolution
- Fast lookups for future ingestions
- Safe concurrent writes from multiple workers
In practice, normalization workers follow a consistent flow:
- Attempt a direct lookup by external ID
- If not found, run fuzzy matching against known movies
- If confidence is high, create a mapping
- If no match exists, create a new canonical movie
By enforcing this process, the platform maintains a single, stable identity for each movie. That stability is what makes accurate aggregation, scoring, caching, and real-time updates possible at scale.
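The worker flow above can be sketched in memory. Dictionaries stand in for the movies and movie_source_links tables, and the 0.85 confidence threshold follows section 3.3; a real worker would back each step with PostgreSQL.

```csharp
using System;
using System.Collections.Generic;

// In-memory sketch of the resolve-or-create flow.
public sealed class CanonicalResolver
{
    private readonly Dictionary<(string Source, string ExternalId), Guid> _links = new();

    // fuzzyMatch returns the best candidate among known movies, or null.
    public Guid Resolve(string source, string externalId,
                        Func<(Guid MovieId, double Confidence)?> fuzzyMatch)
    {
        if (_links.TryGetValue((source, externalId), out var known))
            return known;                           // 1. direct external-ID lookup

        var match = fuzzyMatch();                   // 2. fuzzy match against the catalog
        var movieId = match is (var candidate, >= 0.85)
            ? candidate                             // 3. confident: map to the existing movie
            : Guid.NewGuid();                       // 4. otherwise: create a new canonical movie

        _links[(source, externalId)] = movieId;     // persist the mapping either way
        return movieId;
    }
}
```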
4 The Scoring Engine: Normalization and Bayesian Averaging
Once reviews are correctly attached to the right movies, the next problem is turning those reviews into scores people can trust. This is harder than it looks. Review sources use different scales, different conventions, and very different sample sizes. A score based on three reviews should not carry the same weight as a score based on three hundred.
A production scoring engine does three things well. First, it converts all incoming scores into a single, consistent scale. Second, it avoids ranking movies too aggressively when review counts are low. Third, it balances professional critic opinions with audience feedback in a predictable way. Without all three, rankings become noisy and easy to game.
4.1 Scale Normalization
Before any averaging happens, every score must be converted into a common numeric range. Most aggregators settle on a 0–100 scale because it is easy to reason about and works well with weighted formulas.
There is no perfect conversion, especially for subjective formats like letter grades. The goal is consistency, not mathematical purity. As long as the rules are deterministic and applied everywhere, the system behaves predictably.
A typical normalization pipeline covers the common cases:
- 5-star scale → (stars / 5) * 100
- 10-point scale → (points / 10) * 100
- Letter grades → fixed lookup table
- Binary “Fresh/Rotten” or “Thumbs Up/Down” → 100 or 0
Letter grades require explicit mapping. A practical and commonly accepted mapping looks like this:
private static readonly Dictionary<string, double> LetterGradeMap = new()
{
["A+"] = 100,
["A"] = 95,
["A-"] = 90,
["B+"] = 85,
["B"] = 80,
["B-"] = 75,
["C+"] = 70,
["C"] = 65,
["C-"] = 60,
["D+"] = 55,
["D"] = 50,
["D-"] = 45,
["F"] = 20
};
Normalization logic then becomes straightforward and easy to audit:
public double NormalizeScore(ReviewRaw raw)
{
return raw.Type switch
{
ReviewType.FiveStar => (raw.Value / 5.0) * 100.0,
ReviewType.TenPoint => (raw.Value / 10.0) * 100.0,
ReviewType.LetterGrade => LetterGradeMap[raw.Grade],
ReviewType.Thumbs => raw.Value == 1 ? 100.0 : 0.0,
_ => throw new NotSupportedException()
};
}
At this point, every review—regardless of origin—can be treated the same way by the rest of the system.
4.2 The Flaw of Arithmetic Mean
Once scores are normalized, the next instinct is to average them. That works fine in small datasets, but it breaks badly at scale. A simple arithmetic mean ignores how many reviews contributed to the score.
This leads to obvious ranking problems:
- A movie with one 10/10 review outranks a movie with hundreds of 9/10 reviews
- New releases with minimal feedback spike to the top
- Rankings fluctuate wildly as new reviews trickle in
Arithmetic mean fails in this context because:
- It heavily overweights small sample sizes
- A single outlier dominates early scores
- It produces unstable rankings that feel arbitrary to users
When people look at a “Top Movies” list, they expect confidence and consistency. They expect movies with a long review history to be rewarded for that history.
4.3 Implementing Bayesian Approximation
To fix this, most large aggregators use a Bayesian-style weighted average. The idea is simple: blend a movie’s own average with a global baseline until enough reviews exist to trust it fully.
The standard formula looks like this:
WeightedScore = (v / (v + m)) * R + (m / (v + m)) * C
Where:
- R is the movie’s average normalized score
- v is the number of reviews
- m is the minimum review count required for full confidence
- C is the global average score across all movies
Movies with many reviews lean heavily toward their own average. Movies with very few reviews stay closer to the global mean.
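A quick worked example shows the pull toward the global mean. The values C = 70 and m = 50 are purely illustrative.

```csharp
using System;

// WeightedScore = (v / (v + m)) * R + (m / (v + m)) * C
// C = 70 and m = 50 are illustrative values, not taken from the text.
static double Weighted(double R, int v, double C, int m) =>
    (v / (double)(v + m)) * R + (m / (double)(v + m)) * C;

// Three glowing reviews barely lift a new film off the global mean...
Console.WriteLine(Weighted(95, 3, 70, 50));   // ~71.4
// ...while three hundred reviews let the film's own average dominate.
Console.WriteLine(Weighted(95, 300, 70, 50)); // ~91.4
```

This is exactly the behavior a "Top Movies" list needs: early hype is dampened until the review count earns the score credibility.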
4.3.1 Dynamic calculation of C and m
Hardcoding C and m works for demos but not for real systems. As the catalog grows and user behavior changes, these values should adapt automatically.
A practical approach is to compute:
- C as the average score across all reviews
- m as a percentile of review counts (for example, the 80th percentile)
public class BayesianCalculator
{
private readonly MovieDbContext _db;
public BayesianCalculator(MovieDbContext db)
{
_db = db;
}
public async Task<double> ComputeGlobalAverageAsync()
{
return await _db.Reviews
.Select(r => r.NormalizedScore)
.DefaultIfEmpty()
.AverageAsync();
}
public async Task<int> ComputeMinimumVotesAsync()
{
var counts = await _db.Reviews
.GroupBy(r => r.MovieId)
.Select(g => g.Count())
.ToListAsync();
return (int)MathNet.Numerics.Statistics.Statistics
.Percentile(counts.Select(c => (double)c), 80);
}
}
This approach automatically scales as the dataset grows, without manual tuning.
4.3.2 Computing the Bayesian score
Once C and m are known, computing the score for an individual movie is straightforward:
public async Task<double> ComputeMovieScoreAsync(Guid movieId)
{
var reviews = await _db.Reviews
.Where(r => r.MovieId == movieId)
.ToListAsync();
if (!reviews.Any())
return 0;
double R = reviews.Average(r => r.NormalizedScore);
int v = reviews.Count;
double C = await ComputeGlobalAverageAsync();
int m = await ComputeMinimumVotesAsync();
return (v / (double)(v + m)) * R +
(m / (double)(v + m)) * C;
}
In production systems, these calculations typically run in background workers and are cached aggressively. The API layer simply reads the precomputed results.
4.4 Weighted Critic Tiers
Not all reviews should influence the score equally. A professional critic writing for a major publication carries a different signal than an anonymous audience rating. Most aggregators reflect this by assigning weights to different reviewer types.
A simple and effective weighting model:
- Top Critics → weight 1.5
- Standard Critics → weight 1.0
- Audience Reviews → weight 0.5
Weighted averages are easy to compute:
public double ComputeWeightedAverage(IReadOnlyCollection<Review> reviews)
{
    var weightedSum = reviews.Sum(r => r.NormalizedScore * r.Critic.Weight);
    var totalWeight = reviews.Sum(r => r.Critic.Weight);
    return totalWeight == 0 ? 0 : weightedSum / totalWeight; // guard against an empty set
}
This weighted average replaces R in the Bayesian formula. The result is a score that reflects both review quality and quantity. It remains stable as audience sentiment fluctuates and avoids letting early hype or review bombing dominate rankings.
At this point, the scoring engine produces results that feel fair, explainable, and consistent—exactly what users expect from a Rotten Tomatoes–style platform.
5 Change Detection and Real-Time Updates
Movie review data never sits still. Critics publish new reviews throughout a film’s release cycle, audience scores drift as more people watch, and external platforms occasionally revise their numbers or fix errors. A large aggregator cannot afford to re-fetch everything on a fixed schedule. That approach wastes bandwidth, burns CPU, and increases the risk of being rate-limited or blocked.
Instead, the system needs to answer two questions efficiently. First: has anything meaningful changed? Second: if it has, how urgently do we need to act on it? Change detection and prioritization are what make near–real-time updates possible without constantly scraping the entire internet.
5.1 Change Detection Patterns
The ingestion layer should avoid work whenever possible. Parsing HTML, executing JavaScript, and running normalization logic are all expensive. Before doing any of that, the system should confirm that the underlying content has actually changed in a way that affects reviews or scores.
5.1.1 Content Hashing
The simplest and most reliable technique is hashing the raw payload. If the content is identical to what was processed last time, there is nothing new to do.
A SHA-256 hash works well for this purpose:
public static string ComputeSha256(string content)
{
using var sha = SHA256.Create();
var bytes = Encoding.UTF8.GetBytes(content);
var hash = sha.ComputeHash(bytes);
return Convert.ToHexString(hash);
}
When a worker fetches new content, it compares the hash to the previously stored value:
if (newHash == oldHash)
{
// No meaningful change; skip normalization
return;
}
This small check saves a surprising amount of compute. Most pages do not change between polling intervals, especially for older movies. Hashing allows the system to skip parsing, DOM traversal, and regex extraction entirely.
5.1.2 ETag and Last-Modified Support
When providers expose proper HTTP caching headers, the ingestion engine should always use them. Conditional requests dramatically reduce bandwidth and response times.
A typical request looks like this:
var request = new HttpRequestMessage(HttpMethod.Get, url);
request.Headers.IfNoneMatch.Add(new EntityTagHeaderValue(storedEtag));
request.Headers.IfModifiedSince = storedLastModified;
If the server responds with 304 Not Modified, the body is never downloaded. This is especially valuable for large JSON responses from APIs that update infrequently.
ETag and Last-Modified checks do not replace content hashing. They complement it. When available, they prevent unnecessary downloads; when not available or unreliable, hashing still protects the normalization pipeline.
5.1.3 Partial Document Comparison
For HTML-heavy sources, full-page hashing can be too sensitive. Minor layout or ad changes can alter the HTML without affecting review data.
A more precise approach is to hash only the relevant section of the page, such as the reviews container. Playwright makes this practical:
var reviewHtml = await page.QuerySelectorAsync("#reviews-section");
var segment = await reviewHtml.InnerHTMLAsync();
var hash = ComputeSha256(segment);
By focusing on the parts of the page that actually contain reviews and scores, the system avoids false positives and unnecessary reprocessing.
5.2 Priority Queues for Updates
Even when changes are detected, not all updates are equally important. A new review for a movie opening this weekend matters far more than a score change on a film released ten years ago. The system needs a way to process urgent updates first without starving less popular titles forever.
RabbitMQ priority queues provide this capability without complicating worker logic.
5.2.1 Declaring a Priority Queue
MassTransit exposes RabbitMQ priority support through endpoint configuration:
cfg.ReceiveEndpoint("movie-updates", e =>
{
    e.ConfigureConsumeTopology = false;
    // Message priorities require a classic queue; RabbitMQ quorum queues
    // do not honor the x-max-priority argument.
    e.SetQueueArgument("x-max-priority", 10);
});
When publishing messages, producers assign a priority:
await _publishEndpoint.Publish(
new RefreshMovie(movieId),
ctx => ctx.SetPriority(9));
Lower-priority updates use smaller values, typically in the 0–2 range. Workers always consume higher-priority messages first, ensuring that popular or time-sensitive movies stay fresh.
5.2.2 Heat-Based Prioritization
To decide priority consistently, the system calculates a simple “heat score” for each movie. This score reflects how likely users are to care about changes right now.
Common inputs include:
- Days since release
- Recent review volume
- API request frequency
- Search or trending signals
A straightforward implementation might look like this:
int ComputeHeat(Movie m)
{
int recency = Math.Max(0, 30 - (DateTime.UtcNow - m.ReleaseDate).Days);
int popularity = m.ApiHitsLast24h / 100;
int recentReviews = m.NewReviewsLast48h;
return recency * 2 + popularity + recentReviews;
}
The resulting heat value maps directly to queue priority. Movies with high heat are refreshed aggressively; low-heat titles drift toward slower update cycles.
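The translation from heat to queue priority can be a simple scale-and-clamp. This sketch is illustrative, assuming the 0–10 priority range declared on the queue; the divisor is a tuning knob, not a prescribed value:

```csharp
// Illustrative mapping from an unbounded heat score to the
// 0-10 RabbitMQ priority range declared via x-max-priority.
int MapHeatToPriority(int heat)
{
    // Scale down, then clamp so extreme heat cannot exceed the queue maximum.
    return Math.Clamp(heat / 10, 0, 10);
}
```

A movie with heat 95 maps to priority 9, while archive titles with near-zero heat land at 0 and are processed only when nothing hotter is waiting.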
5.3 Webhooks vs. Polling
In an ideal world, every provider would send webhooks when review data changes. In reality, most movie data sources do not. Polling is unavoidable. The challenge is doing it intelligently.
5.3.1 Adaptive Polling
Fixed schedules lead to waste. Polling everything every hour is too slow for new releases and far too frequent for archive titles. Instead, polling intervals adapt based on movie heat.
A typical schedule:
- High-priority movies → every 5–15 minutes
- Medium priority → every 2–6 hours
- Low priority → daily or weekly
This logic can be expressed directly in code:
TimeSpan ComputeNextPoll(int heat)
{
return heat switch
{
> 80 => TimeSpan.FromMinutes(10),
> 40 => TimeSpan.FromHours(2),
_ => TimeSpan.FromDays(1)
};
}
Adaptive polling keeps ingestion responsive without overwhelming external providers.
5.3.2 Simulated Webhook Workflow
To approximate webhook behavior, the system relies on internal change detection. Content hashes act as triggers rather than timers.
The effective workflow looks like this:
- Poll the source
- Compute the content hash
- Compare it to the previous value
- If it changed, publish a MovieContentChanged event
- Normalization and scoring services react to the event
From the rest of the system’s perspective, this behaves very much like a webhook-driven architecture.
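The workflow can be sketched as a single polling routine. The `_fetcher` and `_hashStore` abstractions here are assumptions for illustration; any HTTP client and key-value store will do:

```csharp
public async Task PollAndDetectAsync(Guid movieId, string url)
{
    // 1. Poll the source
    var html = await _fetcher.FetchAsync(url);

    // 2-3. Compute the content hash and compare it to the stored value
    var hash = ComputeSha256(html);
    var previous = await _hashStore.GetAsync(movieId);
    if (hash == previous)
        return; // nothing changed, no downstream work

    // 4. Persist the new hash and publish the change event
    await _hashStore.SetAsync(movieId, hash);
    await _publishEndpoint.Publish(new MovieContentChanged(movieId));
    // 5. Normalization and scoring consumers react to the event
}
```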
5.3.3 Metadata Checks for Semi-Dynamic Sources
Some APIs expose lightweight metadata that indicates when content last changed, even if they don’t support webhooks. When available, this metadata should be checked before fetching full payloads.
var meta = await _api.GetMovieMeta(slug);
if (meta.LastUpdated <= cached.LastUpdated)
{
return; // No fetch needed
}
This pattern is common with TMDb and community-maintained APIs. It further reduces unnecessary traffic and keeps ingestion efficient.
6 Smart Caching Strategies: The “Theatrical Window” Approach
Caching is where a movie aggregation platform either feels fast and stable or constantly struggles under load. Movie data does not change at a steady pace. Activity spikes around release, settles during a theatrical run, and eventually flattens out. A good caching strategy follows this lifecycle instead of fighting it.
The “theatrical window” approach treats caching as a function of time and relevance. New releases need frequent refreshes. Older movies do not. By aligning cache behavior with how reviews actually arrive, the system stays responsive without constantly recalculating scores or hammering the database. .NET 9’s HybridCache makes this pattern practical without introducing complex cache coordination logic.
6.1 Dynamic TTL (Time-To-Live) Strategy
Using the same TTL for every movie is one of the most common caching mistakes. It either causes unnecessary churn for older titles or stale data for new releases. A release-aware TTL model avoids both problems.
During opening weekend, a popular movie may receive dozens of critic and audience reviews in a single hour. A decade-old film may not change at all for months. The cache should reflect that reality.
A practical TTL strategy looks like this:
- Pre-release / Opening weekend (high volatility) TTL: 5–10 minutes Scores change quickly as early reviews arrive and audience ratings fluctuate.
- In theaters (medium volatility, roughly 2–4 weeks) TTL: 1–6 hours Review velocity slows, but updates still matter.
- Archive titles (low volatility) TTL: 7–30 days Scores are stable and rarely change.
That logic can be captured cleanly in code:
public TimeSpan GetTtlFor(Movie movie)
{
var age = (DateTime.UtcNow - movie.ReleaseDate).Days;
return age switch
{
<= 3 => TimeSpan.FromMinutes(10), // Opening weekend
<= 30 => TimeSpan.FromHours(2), // In theaters
_ => TimeSpan.FromDays(14) // Archive
};
}
This simple rule eliminates a large percentage of unnecessary cache invalidations and recalculations. More advanced signals—such as traffic or review velocity—can be layered in later, but time-based TTL already delivers most of the benefit.
6.2 Implementing Tiered Caching
A single cache layer is rarely enough at scale. In-memory caches are fast but isolated to one instance. Distributed caches are consistent but slower. HybridCache combines both approaches in a way that fits naturally into ASP.NET applications.
With HybridCache:
- L1 cache lives in memory on each API node
- L2 cache lives in Redis and keeps nodes in sync
This setup avoids custom cache orchestration while still delivering low-latency reads.
6.2.1 Setting Up HybridCache
Configuration typically happens in the API layer. Payload size limits are important to prevent accidental caching of large objects.
services.AddHybridCache(options =>
{
options.MaximumPayloadBytes = 1024 * 32;
});
Redis provides the shared backing store:
services.AddStackExchangeRedisCache(o =>
{
o.Configuration = "redis.internal:6379";
});
With this in place, HybridCache handles L1 and L2 coordination automatically.
6.2.2 Using HybridCache in API Handlers
Caching logic works best when it stays close to the data it protects. A typical handler that serves aggregated movie scores might look like this:
public class MovieScoreHandler
{
private readonly HybridCache _cache;
private readonly MovieService _service;
public MovieScoreHandler(HybridCache cache, MovieService service)
{
_cache = cache;
_service = service;
}
public async Task<MovieAggregateDto> GetMovieAggregate(Guid movieId)
{
    // The TTL must be known before the cache call, so derive it from
    // lightweight movie metadata rather than from the aggregate being cached.
    var movie = await _service.GetMovieAsync(movieId);
    var options = new HybridCacheEntryOptions
    {
        Expiration = GetTtlFor(movie)
    };

    return await _cache.GetOrCreateAsync(
        $"movie:agg:{movieId}",
        async cancellationToken =>
            await _service.GetAggregateAsync(movieId, cancellationToken),
        options);
}
}
This keeps caching concerns localized. The handler does not care whether data comes from memory, Redis, or the database—it just receives a valid result.
6.2.3 Cache Coherence
When normalization workers detect a meaningful change—such as a new review or updated score—they publish an event. The API layer responds by invalidating the affected cache entry.
await _cache.RemoveAsync($"movie:agg:{movieId}");
This event-driven invalidation keeps data fresh without periodic cache sweeps or aggressive TTLs.
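On the API side this is typically a small MassTransit consumer. The `MovieAggregateChanged` event name is an assumption; any change event carrying the movie ID works:

```csharp
public class MovieAggregateChangedConsumer : IConsumer<MovieAggregateChanged>
{
    private readonly HybridCache _cache;

    public MovieAggregateChangedConsumer(HybridCache cache) => _cache = cache;

    public async Task Consume(ConsumeContext<MovieAggregateChanged> context)
    {
        // Evicts both the L1 (in-memory) and L2 (Redis) entries for this movie
        await _cache.RemoveAsync($"movie:agg:{context.Message.MovieId}");
    }
}
```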
6.3 Cache Stampede Prevention
A cache stampede happens when many requests hit an expired entry at the same time. Without protection, every request triggers a recalculation, overwhelming the backend. This is especially common for popular movies.
HybridCache includes built-in stampede protection. Only one request executes the factory to recompute the value; concurrent callers for the same key wait for that result instead of hitting the backend.
6.3.1 Using GetOrCreate with Locking
The stampede protection is automatic, but expiration can be tuned per entry:
var result = await _cache.GetOrCreateAsync(
    $"movie:{movieId}",
    async token =>
    {
        // Executed once per key; concurrent callers await the same in-flight task
        return await LoadMovieAsync(movieId, token);
    },
    new HybridCacheEntryOptions
    {
        Expiration = ttl,                              // L2 (Redis) lifetime
        LocalCacheExpiration = TimeSpan.FromMinutes(1) // shorter L1 lifetime
    });
Because concurrent requests for the same key share a single factory invocation, a burst of traffic hitting an expired entry produces one backend load instead of hundreds.
6.3.2 Background Refresh
Serving slightly stale data for a short period is usually better than blocking all callers. Movie aggregation is a read-heavy workload, and users rarely notice a few minutes of delay in score updates.
A refresh-ahead pattern—proactively refreshing hot entries shortly before they expire—smooths traffic spikes, reduces latency variance, and protects downstream services. HybridCache does not ship this behavior out of the box, so teams typically implement it as a small background job driven by the same heat scores. Combined with dynamic TTLs, it ensures that popular titles stay fast without putting constant pressure on the scoring and normalization pipeline.
7 Data Persistence and Vector Search
By the time data reaches storage, a movie aggregator has already done a lot of work: ingestion fetched it, normalization cleaned it up, identity resolution linked it correctly, and scoring turned it into something meaningful. Storage now has two jobs. First, it must keep this data consistent and durable. Second, it must make the data easy to query—both in traditional ways and in more modern, discovery-oriented ways.
These goals pull the system in different directions. Relational databases excel at consistency and structured queries. Unstructured review text does not fit neatly into rigid schemas. And search increasingly needs to understand meaning, not just keywords. The result is a deliberate polyglot persistence model.
7.1 Polyglot Persistence
No single storage technology works well for every type of data in a movie aggregator. Trying to force everything into one model usually leads to slow queries, awkward schemas, or fragile migrations. Instead, each category of data is stored in the format that best matches how it is accessed.
7.1.1 PostgreSQL for Structured Data
Canonical entities—movies, critics, mappings, and computed scores—belong in a relational database. This data is highly structured, frequently queried, and must remain consistent even under concurrent updates.
PostgreSQL fits this role well because of its strong transactional guarantees and mature tooling. A simplified schema might look like this:
CREATE TABLE movies (
movie_id UUID PRIMARY KEY,
title TEXT NOT NULL,
release_year INT,
director TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE reviews (
review_id UUID PRIMARY KEY,
movie_id UUID REFERENCES movies(movie_id),
critic_id UUID,
normalized_score REAL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
This schema supports common queries such as “get all reviews for a movie” or “recompute aggregate scores” efficiently and safely.
PostgreSQL’s JSONB support is also useful here. It allows limited flexibility without abandoning relational structure, especially for metadata that evolves over time.
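As a small illustration of that flexibility, evolving metadata can live in a single JSONB column and still be queried efficiently. The column, index, and query below are illustrative, assuming genres are stored inside the JSONB payload:

```sql
ALTER TABLE movies ADD COLUMN metadata JSONB;

-- A GIN index makes containment queries fast
CREATE INDEX movies_metadata_idx ON movies USING gin (metadata);

-- Find movies tagged with a given genre inside the JSONB blob
SELECT movie_id, title
FROM movies
WHERE metadata @> '{"genres": ["sci-fi"]}';
```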
7.1.2 MongoDB or JSONB for Raw Review Documents
Raw review payloads are very different from canonical entities. They tend to be large, loosely structured, and rarely updated after ingestion. They are mostly read by background workers or analytics jobs, not by user-facing APIs.
Storing these documents separately avoids polluting the core relational schema. One common approach is to store them in PostgreSQL using JSONB:
CREATE TABLE review_raw (
review_id UUID PRIMARY KEY,
movie_id UUID,
source TEXT,
payload JSONB,
fetched_at TIMESTAMPTZ
);
This keeps raw data close to the rest of the system while preserving flexibility. Some teams prefer MongoDB for this layer, but JSONB works well when transactional consistency and simpler operations matter more than extreme scale.
7.1.3 Cross-Database Coordination
The key rule is that raw data is immutable. Once stored, it is never modified. Normalization workers read raw documents, extract structured data, and write the results into relational tables.
This append-only model has several advantages:
- It simplifies concurrency
- It preserves an audit trail
- It allows reprocessing when parsing or scoring rules change
Because raw documents are never updated, failures during normalization do not corrupt source data.
7.2 Semantic Search Implementation
Traditional keyword search works for known titles, but it breaks down when users search by theme or emotion. Queries like “sad movies about space” or “feel-good movies after a breakup” do not map cleanly to structured fields.
Semantic search solves this by embedding meaning into numeric vectors. Instead of matching words, the system compares concepts.
7.2.1 pgvector Setup
pgvector extends PostgreSQL with native vector support. It allows embeddings to live alongside structured data without introducing a separate search system.
Enable the extension:
CREATE EXTENSION IF NOT EXISTS vector;
Add a vector column to the movies table:
ALTER TABLE movies
ADD COLUMN embedding vector(1536);
The embedding represents the semantic meaning of the movie, derived from summaries, reviews, or descriptions.
7.2.2 Generating Embeddings
Embeddings are generated asynchronously by a background worker. The worker sends text to an embedding provider and stores the resulting vector.
public async Task<float[]> EmbedAsync(string text)
{
var response = await _client.GetEmbeddingAsync(new EmbeddingRequest
{
Input = text
});
return response.Vector;
}
Once generated, the embedding is persisted:
UPDATE movies
SET embedding = $2
WHERE movie_id = $1;
This work happens outside the request path, ensuring search enrichment does not affect API latency.
7.2.3 Vector Similarity Queries
When a user submits a natural-language query, the system generates an embedding for the query and finds nearby vectors.
SELECT movie_id, title
FROM movies
ORDER BY embedding <-> $1
LIMIT 20;
The <-> operator computes distance between vectors. To keep queries fast at scale, an HNSW index is added:
CREATE INDEX movie_embedding_idx
ON movies USING hnsw (embedding vector_l2_ops);
This allows semantic search to scale to large catalogs while remaining fully integrated with PostgreSQL.
7.3 Database Concurrency
Aggregation is inherently concurrent. Multiple normalization workers may process new reviews for the same movie at the same time. Without a clear concurrency strategy, updates collide and aggregates become inconsistent.
7.3.1 Optimistic Concurrency with Row Versioning
Optimistic concurrency works well because conflicts are relatively rare and short-lived. PostgreSQL exposes a system column (xmin) that Entity Framework can use as a concurrency token.
modelBuilder.Entity<Movie>()
.Property<uint>("xmin")
.IsRowVersion();
When saving changes:
try
{
await _db.SaveChangesAsync();
}
catch (DbUpdateConcurrencyException)
{
// Retry logic
}
If another worker updated the same row, the operation fails cleanly instead of silently overwriting data.
7.3.2 Conflict Resolution Strategy
The retry pattern is simple and predictable:
- Read the current state
- Recompute aggregates
- Attempt the update
- If a conflict occurs, repeat
Most systems retry two or three times before logging and moving on. This is usually enough to handle bursts of concurrent updates during peak review periods.
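A minimal sketch of that loop, assuming a `RecomputeAggregate` helper and an `AggregateScore` property on the entity (both illustrative names):

```csharp
const int maxAttempts = 3;

for (var attempt = 1; attempt <= maxAttempts; attempt++)
{
    // 1. Read the current state (fresh load each attempt)
    var movie = await _db.Movies.FirstAsync(m => m.MovieId == movieId);

    // 2. Recompute aggregates from the latest reviews
    movie.AggregateScore = await RecomputeAggregate(movieId);

    try
    {
        // 3. Attempt the update; xmin acts as the concurrency token
        await _db.SaveChangesAsync();
        return;
    }
    catch (DbUpdateConcurrencyException) when (attempt < maxAttempts)
    {
        // 4. Conflict: another worker won the race; discard stale state and retry
        _db.ChangeTracker.Clear();
    }
}

_logger.LogWarning("Gave up updating movie {MovieId} after {Attempts} attempts",
    movieId, maxAttempts);
```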
7.3.3 Bulk Updates
Some operations affect many rows at once, such as reprocessing all reviews for a movie after scoring logic changes. These cases bypass the ORM and use SQL directly.
PostgreSQL’s UPDATE ... FROM syntax allows efficient batch updates without row-by-row overhead. This keeps maintenance jobs fast and minimizes lock contention.
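A sketch of that pattern, assuming recomputed scores have first been staged in a `recomputed_scores` table (a hypothetical name for illustration):

```sql
-- Apply all recomputed scores in one set-based statement
-- instead of issuing one UPDATE per review.
UPDATE reviews AS r
SET    normalized_score = s.new_score
FROM   recomputed_scores AS s
WHERE  r.review_id = s.review_id;
```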
8 Observability and Maintenance
Once a movie aggregation platform is live, most problems do not appear as clean exceptions or obvious crashes. Instead, they show up indirectly: reviews stop updating for certain sources, scores lag behind reality, cache hit rates drop, or background workers silently stall. Without strong observability, these issues can go unnoticed for hours or days.
A production system needs visibility across every layer: ingestion, normalization, scoring, caching, search, and the public API. Observability is what turns a collection of distributed services into something you can reason about. It allows operators to answer simple but critical questions: Is data fresh? Where is time being spent? What broke, and when did it start?
8.1 Distributed Tracing with OpenTelemetry
The aggregation pipeline spans multiple services and communication styles. A single update might start as an HTTP fetch, flow through a message queue, trigger database writes, invalidate caches, and finally affect an API response. Distributed tracing is the only practical way to see this end-to-end.
OpenTelemetry provides a consistent tracing model regardless of whether traffic flows over HTTP, MassTransit, Redis, or PostgreSQL.
8.1.1 OpenTelemetry Setup
Each .NET service participates in tracing by registering instrumentation at startup. This captures request lifecycles automatically without requiring manual tracing everywhere.
services.AddOpenTelemetry()
.WithTracing(builder =>
{
builder
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddEntityFrameworkCoreInstrumentation()
.AddSource("MassTransit") // MassTransit emits spans via its own ActivitySource
.AddOtlpExporter(o =>
{
o.Endpoint = new Uri("http://collector:4317");
});
});
With this configuration, traces from API requests, background workers, database queries, and message handlers all share the same context.
8.1.2 Propagating Context Across Workers
Event-driven systems often lose visibility when execution jumps across queues. MassTransit propagates trace context automatically, so a message published by one service continues the same trace when consumed elsewhere.
public async Task Consume(ConsumeContext<MovieUpdated> context)
{
using var activity = MyActivitySource.StartActivity("NormalizeMovie");
// normalization and persistence work
}
In tracing tools like Jaeger or Tempo, this appears as a single timeline showing how long each step took. When updates feel slow or inconsistent, this view quickly reveals where the bottleneck is.
8.2 Health Checks
External dependencies are a constant source of instability. APIs go down, rate limits tighten, proxies fail, and databases occasionally become unreachable. Health checks provide an early warning system before these failures impact users.
8.2.1 Custom Health Checks
ASP.NET Core’s health check framework makes it easy to expose the status of both internal and external dependencies.
services.AddHealthChecks()
    .AddCheck("metacritic-api", new UrlCheck("https://api.metacritic.com/ping"))
    .AddRedis(redisConnection, name: "redis")
    .AddNpgSql(connectionString, name: "postgres");
A simple URL-based check might look like this:
public class UrlCheck : IHealthCheck
{
    // Reuse a single HttpClient; allocating one per check risks socket exhaustion
    private static readonly HttpClient Client = new();
    private readonly string _url;

    public UrlCheck(string url) => _url = url;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken token = default)
    {
        try
        {
            var response = await Client.GetAsync(_url, token);
            return response.IsSuccessStatusCode
                ? HealthCheckResult.Healthy()
                : HealthCheckResult.Unhealthy($"Status {(int)response.StatusCode}");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy(exception: ex);
        }
    }
}
These checks feed load balancers, orchestration platforms, and alerting systems. When a dependency becomes unhealthy, traffic can be rerouted or throttled before data quality degrades.
8.2.2 Worker-Level Failure Reporting
Not all failures are binary. Parsing logic may break when a provider changes HTML structure. Normalization may start dropping reviews silently. Workers should emit metrics when error rates increase or when expected fields disappear.
This kind of signal is often the first indication that a scraping rule needs to be updated.
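A parse-failure counter, labeled by source, is often enough to surface this, using the same prometheus-net API as the ingestion counters in 8.3.1. The metric and label names here are illustrative:

```csharp
// Counter labeled by source so dashboards can isolate the broken provider
private static readonly Counter ParseFailures =
    Metrics.CreateCounter(
        "review_parse_failures_total",
        "Reviews that failed to parse",
        "source");

// In the worker, on a missing or malformed field:
ParseFailures.WithLabels("metacritic").Inc();
```

An alert on the rate of this counter typically fires hours before users notice stale scores.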
8.3 Metrics Dashboard
Tracing explains why something is slow. Metrics explain what is happening over time. Together, they provide a complete operational picture.
Prometheus and Grafana are commonly used to collect and visualize these metrics.
8.3.1 Key Ingestion Metrics
Ingestion throughput is a core indicator of platform health. A sudden drop usually points to upstream failures or rate-limiting issues.
public static readonly Counter ReviewsIngested =
Metrics.CreateCounter(
"reviews_ingested_total",
"Total number of reviews ingested");
Workers increment the counter as they process data:
ReviewsIngested.Inc();
Viewing this over time makes it easy to spot stalls or regressions.
8.3.2 Deduplication Quality
Entity resolution is probabilistic. Tracking the average confidence score provides insight into how well matching logic is performing.
var gauge = Metrics.CreateGauge(
"dedupe_confidence_avg",
"Average resolution confidence");
gauge.Set(avgConfidence);
A sudden drop often means a source changed naming conventions or metadata structure. Catching this early prevents widespread misclassification.
8.3.3 Cache Hit Ratios
Caching directly affects both performance and cost. Poor hit rates usually indicate overly aggressive invalidation or TTLs that are too short.
HybridCache does not expose hit/miss counters directly, so the cache wrapper records them with the same prometheus-net API:
private static readonly Counter CacheHits =
    Metrics.CreateCounter("movie_cache_hits_total", "Cache hits");
private static readonly Counter CacheMisses =
    Metrics.CreateCounter("movie_cache_misses_total", "Cache misses");
Monitoring hit ratios alongside API latency helps teams tune cache behavior based on real usage patterns instead of guesswork.