1 Introduction: The “Sidecar” Strategy for Modern BI
Most enterprise BI platforms were designed for repeatable reporting, not for open-ended questions asked in the middle of a business conversation. Teams invest heavily in .rdlc reports, PowerBI datasets, and curated dashboards. These assets work well when the questions are known in advance. They break down when someone asks a question that cuts across time ranges, regions, or metrics in a way no report anticipated. Executives expect answers during the meeting, not after a backlog item is created. A Text-to-SQL layer fills this gap. Implementing that layer as a Python “sidecar” microservice lets .NET teams add Generative AI without destabilizing existing systems.
1.1 The Problem
Enterprise BI environments are built on the assumption that analysts define the questions ahead of time. Dashboards show totals, trends, and comparisons that were agreed on during design. Real business discussions don’t follow that structure. A conversation often moves from “How did we perform last quarter?” to “How does that compare to the same period last year?” to “Which regions drove the change?” These follow-up questions usually fall outside the original report scope.
Questions like “Show me revenue by region for last quarter compared to 2023” or “Which customers increased spending by more than 20% in the last two months?” are common. Traditional BI tooling struggles with these scenarios for several reasons:
- Dashboards are rigid. Adding a new view often requires changing data models, recalculating measures, and republishing datasets.
- Report definitions move slowly. .rdlc and similar report formats require development, testing, and deployment even for small changes.
- Self-service BI assumes modeling expertise. Most business users are not comfortable writing DAX, adjusting joins, or reasoning about row-level security.
- Direct SQL access is tightly restricted. Security teams rarely allow executives or managers to run ad-hoc queries against production databases.
The outcome is predictable. BI requests pile up. Analysts write custom SQL for one-off questions. The data team becomes a bottleneck, and decision-making slows down. In many cases, users export data to Excel and work around the system entirely, creating governance and compliance risks. The inability to answer ad-hoc questions quickly becomes a real business constraint.
1.2 The Solution: A Generative AI Text-to-SQL Layer
A Text-to-SQL system lets users ask questions in plain language and receive answers backed by real data. The system translates the question into SQL, validates it, executes it using a read-only database connection, and returns structured results. Those results can be rendered in a UI, displayed in a dashboard, or consumed by another service.
This changes how BI works in practice:
- Business users state intent instead of navigating reports.
- The LLM generates SQL that aligns with the existing schema.
- Validation and execution guardrails enforce safety.
- Results flow through the same APIs and applications already in use.
Instead of building new dashboards for every variation of a question, organizations expose a single analytics endpoint. That endpoint can answer thousands of questions without UI changes or schema rewrites.
The impact goes beyond convenience. Analysts spend less time answering repetitive questions. Engineering teams stop building custom endpoints for every reporting request. Conversations move faster because answers are available when the question is asked, not days later. BI shifts from static reporting to conversational access.
1.3 The Architectural Decision: Why Python?
A common question from .NET architects is why this logic should live in Python instead of C# with Semantic Kernel. The answer is practical rather than ideological. Today, the Python ecosystem is significantly more mature for Text-to-SQL workloads. The tooling for schema retrieval, SQL validation, vector indexing, and multi-step LLM pipelines is deeper and more battle-tested.
1.3.1 Python Libraries Fit the Problem Domain
Several Python libraries map directly to the requirements of enterprise Text-to-SQL:
- Haystack 2.x provides a pipeline-based model designed for retrieval, prompting, validation, and execution.
- LangChain offers reusable SQL generation and correction patterns.
- sqlglot parses and validates SQL at a structural level before execution.
- sqlalchemy manages SQL Server connections, pooling, and timeouts reliably.
- Chroma, FAISS, and InMemoryDocumentStore support fast retrieval of schema metadata.
Using these libraries avoids building and maintaining large amounts of custom infrastructure. They solve problems that are already well understood in the Python ecosystem.
1.3.2 Faster Iteration Cycle
Most LLM tooling evolves first in Python. New model features, SDK updates, and reference implementations usually appear there before other platforms. Teams can experiment quickly using notebooks and scripts, then harden the logic once it stabilizes. Converting that logic into a FastAPI service is straightforward and predictable.
1.3.3 The Sidecar Pattern Works Better Than Rewriting .NET Apps
Enterprise .NET systems are often large, stable, and business-critical. Rewriting them to embed LLM logic introduces unnecessary risk. A sidecar approach keeps responsibilities clear:
- .NET remains the system of record for authentication, authorization, and business workflows.
- Python handles the AI-specific concerns such as prompt construction, schema retrieval, and SQL generation.
The boundary between the two is explicit. Each system does what it is best at.
1.3.4 REST or gRPC Makes Integration Predictable
From the .NET side, integration is simple. The application sends a request like:
{
"natural_language_query": "sales in Europe last quarter",
"user_context_id": 12345
}
The service responds with validated SQL and structured results. There is no shared runtime, no embedded scripting, and no hidden coupling. If the AI layer needs to evolve or be replaced, the contract stays the same.
1.4 Scope of this Article
This article focuses on building a production-grade Text-to-SQL pipeline, not a demo. The stack includes:
- Haystack 2.x for pipeline orchestration
- OpenAI or Azure OpenAI for SQL generation
- SQL Server as the authoritative data source
- sqlalchemy and pyodbc for secure database access
- Chroma or InMemoryDocumentStore for schema metadata retrieval
- FastAPI for the microservice layer
The goal is to give .NET architects a concrete, realistic pattern for adding conversational analytics while respecting enterprise requirements around permissions, row-level security, observability, and zero-trust data access.
2 The Foundation: Infrastructure and Data Safety
Before any prompt engineering or pipeline logic, the foundation has to be right. In an enterprise environment, you cannot let an AI system interact freely with production data. Even a well-designed Text-to-SQL pipeline will fail if the database layer itself is not locked down. The goal of this section is simple: make it impossible for the AI to do harm, even when it makes mistakes.
2.1 The “Read-Only” Principle
The first non-negotiable rule in production is that the LLM must never be able to change data. Prompts alone are not a sufficient safety mechanism. Even when explicitly instructed to generate only SELECT statements, LLMs can occasionally produce invalid or dangerous SQL. In an enterprise system, those mistakes must be harmless by default.
The correct approach is to enforce read-only behavior at the database level.
2.1.1 Create a Dedicated SQL Login
Create a separate SQL Server login used only by the Text-to-SQL service. This login should exist solely for query execution and nothing else:
CREATE LOGIN ai_reader WITH PASSWORD = 'StrongPassword#123';
USE YourDatabase;
CREATE USER ai_reader FOR LOGIN ai_reader;
EXEC sp_addrolemember 'db_datareader', 'ai_reader';
This account has no ability to modify data, alter schema, or execute administrative commands. Even if a generated query contains a DELETE or DROP, SQL Server will reject it before execution. This is intentional. Safety is enforced by permissions, not by trust in the model.
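On the Python side, the service should connect exclusively with this account. A minimal sketch of building the SQLAlchemy connection URL for it, assuming hypothetical server and database names (`sql.internal.example.com`, `YourDatabase`):

```python
from urllib.parse import quote_plus

# Hypothetical host and credentials; in production, load the password
# from Kubernetes secrets or Azure Key Vault rather than hardcoding it.
odbc_params = quote_plus(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=sql.internal.example.com;"
    "DATABASE=YourDatabase;"
    "UID=ai_reader;"
    "PWD=StrongPassword#123;"
    "Encrypt=yes;"
)

# SQLAlchemy URL for the read-only service account.
conn_str = f"mssql+pyodbc:///?odbc_connect={odbc_params}"
```

Because the URL embeds the `ai_reader` login, every query the pipeline executes inherits the db_datareader restrictions automatically.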
2.1.2 Limit Schema Visibility
In most enterprise databases, not all schemas are intended for analytics. You can explicitly deny access to sensitive schemas:
DENY SELECT ON SCHEMA::SensitiveData TO ai_reader;
This turns schema access into an allow-list. The SQL validator later in the pipeline can enforce the same rules, but the database remains the final authority. Defense in depth matters.
2.1.3 Disable Risky Capabilities
The AI login should not have access to features like ad-hoc distributed queries, xp_cmdshell, or elevated stored procedures. Even if these are disabled globally, double-check that the service account cannot use them. The database connection must be physically incapable of performing destructive actions.
2.2 Row-Level Security (RLS)
Most enterprises already rely on Row-Level Security to enforce tenant boundaries or user-specific visibility. Introducing an AI layer should not weaken these controls. A common concern is whether a service account will bypass RLS rules.
The answer is no, if you wire it correctly.
2.2.1 Pass User Context Explicitly
The .NET application already knows who the user is. That context should be passed to the Python service and forwarded into SQL Server using session context:
EXEC sp_set_session_context @key = 'tenant_id', @value = @UserContextId;
Any existing RLS predicates that reference SESSION_CONTEXT values continue to apply automatically.
2.2.2 Keep RLS Out of Prompts
The LLM does not need to understand tenant boundaries or security logic. Those rules live in SQL Server, where they belong. The model simply generates SQL, and RLS ensures that only authorized rows are returned. This keeps prompts simpler and avoids subtle data-leak risks.
2.2.3 Avoid User Filters in Generated SQL
Do not instruct the model to add WHERE TenantId = ... clauses. Hard-coding security logic into generated SQL makes reuse unsafe and error-prone. Centralizing access control in the database keeps the architecture predictable and auditable.
2.3 Python Environment Setup
Once the database layer is secured, the runtime environment needs to be equally predictable. The Text-to-SQL service should behave the same in development, testing, and production. That consistency comes from a minimal, well-defined dependency set and containerized deployment.
2.3.1 Essential Libraries
Install only the libraries required for the pipeline:
pip install haystack-ai \
    sqlalchemy \
    pyodbc \
    fastapi \
    uvicorn \
    chroma-haystack \
    sqlglot
Note that Haystack 2.x is published as the haystack-ai package; the older farm-haystack package is the 1.x series. The Chroma document store ships as the chroma-haystack integration, which pulls in chromadb itself.
Each dependency has a clear role, from pipeline orchestration to SQL validation. Avoid pulling in unnecessary packages. Smaller images start faster and are easier to secure.
2.3.2 ODBC Driver
For SQL Server connectivity on Linux, install:
ODBC Driver 18 for SQL Server
This driver handles modern TLS requirements and avoids connection issues common with older versions.
2.3.3 Docker Containerization
A minimal Docker image keeps deployment consistent across environments:
FROM python:3.11-slim

RUN apt-get update && \
    apt-get install -y --no-install-recommends unixodbc unixodbc-dev curl gnupg && \
    curl https://packages.microsoft.com/keys/microsoft.asc \
        | gpg --dearmor -o /usr/share/keyrings/microsoft-prod.gpg && \
    echo "deb [signed-by=/usr/share/keyrings/microsoft-prod.gpg] https://packages.microsoft.com/debian/12/prod bookworm main" \
        > /etc/apt/sources.list.d/mssql-release.list && \
    apt-get update && \
    ACCEPT_EULA=Y apt-get install -y msodbcsql18 && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Note that the apt-key tool was removed in Debian 12, which python:3.11-slim is based on, so the Microsoft signing key is installed with gpg and referenced via signed-by instead.
This container runs cleanly in Kubernetes, either as a sidecar next to a .NET service or as a shared internal API.
2.3.4 Deployment Considerations
A few operational practices make a real difference:
- Store secrets in Kubernetes secrets or Azure Key Vault.
- Place the service in the same virtual network as SQL Server to reduce latency.
- Configure liveness and readiness probes for FastAPI.
- Set CPU and memory limits to prevent runaway LLM usage.
With these foundations in place, the system is ready for the next challenge: supplying the model with just enough schema context to generate correct, reliable SQL.
3 Context Engineering: The Schema Is Your Context
LLMs do not understand your database by default. Enterprise schemas often contain hundreds of tables and thousands of columns, many of which were named years ago for reasons no one remembers. If you paste that entire schema into a prompt, you overwhelm the model and still don’t get reliable SQL. The model needs context, but only the right context. The practical solution is retrieval-based schema grounding: store metadata as searchable documents and supply only what matters for the current question.
3.1 The “Semantic Layer” (DDL Store)
This layer is not a semantic model in the PowerBI or Analysis Services sense. It does not calculate measures or enforce relationships at runtime. Instead, it acts as a searchable catalog of schema knowledge. It gives the LLM just enough information to generate correct SQL without guessing.
Think of it as a lightweight documentation system optimized for retrieval rather than human reading.
3.1.1 What to Store
Each document in the DDL store represents a table or closely related set of tables. At a minimum, capture:
- Table name
- Column names and data types
- Column descriptions (from extended properties where available)
- Primary and foreign keys
- Business synonyms (for example, “Revenue” → TotalAmount)
- Two or three “Golden Queries” that demonstrate correct joins and common filters
An example document stored in the vector index might look like this:
{
"id": "table_sales",
"type": "table_ddl",
"content": "Table: Sales\nColumns:\n - SaleId (int, PK)\n - CustomerId (int, FK)\n - RegionId (int, FK)\n - TotalAmount (decimal)\n - SaleDate (datetime)\nDescription: Stores transactional sales data.\nSynonyms: revenue, order amount, sale total\nGoldenQueries:\n SELECT r.Name, SUM(s.TotalAmount)\n FROM Sales s JOIN Region r ON s.RegionId = r.Id\n WHERE YEAR(s.SaleDate) = 2023\n GROUP BY r.Name;"
}
This document is not meant to be perfect documentation. It exists to anchor the model in reality and prevent it from inventing schema details.
3.1.2 Creating the Index
Haystack provides multiple document store options. For most enterprise Text-to-SQL workloads, Chroma is sufficient:
from haystack_integrations.document_stores.chroma import ChromaDocumentStore

store = ChromaDocumentStore(collection_name="ddl_store")
Once created, load the metadata documents:
store.write_documents(ddl_documents)
This step is typically done during deployment or as part of a scheduled refresh. The store becomes the authoritative source of schema context for the pipeline.
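Generating those metadata documents can be automated. A minimal sketch that renders one document per table from a metadata dictionary (the table and column names are illustrative; real metadata would come from INFORMATION_SCHEMA queries and extended properties):

```python
def build_ddl_document(table, columns, description, synonyms, golden_queries):
    # Render one searchable DDL-store document for a single table.
    lines = [f"Table: {table}", "Columns:"]
    lines += [f"  - {name} ({dtype})" for name, dtype in columns]
    lines.append(f"Description: {description}")
    lines.append(f"Synonyms: {', '.join(synonyms)}")
    lines.append("GoldenQueries:")
    lines += [f"  {q}" for q in golden_queries]
    return {
        "id": f"table_{table.lower()}",
        "type": "table_ddl",
        "content": "\n".join(lines),
    }

# Illustrative metadata only; not a real schema export.
doc = build_ddl_document(
    table="Sales",
    columns=[("SaleId", "int, PK"), ("TotalAmount", "decimal"), ("SaleDate", "datetime")],
    description="Stores transactional sales data.",
    synonyms=["revenue", "order amount", "sale total"],
    golden_queries=["SELECT SUM(TotalAmount) FROM Sales;"],
)
```

Each resulting dict can then be wrapped in a Haystack Document (content plus metadata) before calling write_documents.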
3.1.3 Why the DDL Store Matters
This layer solves several common failure modes:
- Context window pressure – Only relevant schema fragments are injected into the prompt.
- Hallucinated columns – The model sees real column names and uses them consistently.
- Incorrect joins – Golden Queries show the correct join paths and key relationships.
Without a DDL store, even strong models fail frequently once schemas grow beyond a few tables.
3.2 Dynamic Schema Retrieval
When a user asks “sales in Europe last quarter,” the system has to decide which tables matter. Hardcoding table lists or relying on the model alone does not scale. Retrieval provides a deterministic way to narrow the schema.
For this example, the retrieval step typically identifies:
- Sales for transactional data
- Region for geographic filtering
- Customers only if customer attributes are referenced
3.2.1 Retrieval Flow
The retrieval process follows a consistent sequence:
- Convert the user question into an embedding.
- Search the DDL store for the closest matching documents.
- Select the top results (usually three to five).
- Pass their contents into the prompt builder.
A simplified retrieval call looks like this:
from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever

retriever = ChromaQueryTextRetriever(document_store=store, top_k=5)
results = retriever.run(query="sales in Europe last quarter")
This step replaces guesswork with repeatable behavior.
3.2.2 Injecting Retrieved Metadata into Prompts
Only the retrieved documents are injected into the prompt:
schema_context = "\n\n".join([doc.content for doc in results["documents"]])
The prompt builder then uses that context:
prompt = f"""
You are a T-SQL expert. Use only the schema below.
Schema:
{schema_context}
User question:
{user_question}
Return a single SELECT query.
"""
This approach keeps the model focused on a small, accurate slice of the schema.
3.2.3 Selecting a Minimal Context Set
More context is not always better. Too many tables increase ambiguity and reduce confidence. Too few tables lead to missing joins or incorrect filters. In practice, three to five tables is the sweet spot for most enterprise analytical queries.
3.3 Handling Ambiguity
Business language rarely matches database naming conventions. Users say “revenue,” “customers,” or “regions,” while schemas contain TotalAmount, CustomerId, or RegionCode. Bridging that gap is essential for reliable SQL generation.
3.3.1 Synonym Definitions
Synonyms belong in the DDL store, not in prompts or code. Examples include:
- “Region” → RegionName, RegionId
- “Revenue” → TotalAmount, Amount
- “Date” → SaleDate, CreatedOn
These mappings improve retrieval accuracy and reduce the chance of incorrect column selection.
3.3.2 Multi-Word Business Concepts
Some concepts map to logic rather than a single column. For example:
- “Active customers” implies IsActive = 1
- “High value customers” implies TotalLifetimeValue >= 100000
Capture these as short notes or examples in the DDL documents. The model uses them as guidance when constructing WHERE clauses.
3.3.3 Keep Business Meaning Out of Code
Do not hardcode business terminology in Python. Store it alongside schema metadata so BI teams can update definitions without redeploying the service. This separation keeps the system flexible and aligns with how enterprise data teams already manage metadata.
With schema context handled through retrieval instead of prompts alone, the pipeline is ready for structured SQL generation, validation, and execution in the next section.
4 The Core Pipeline: Building with Haystack 2.x
A Text-to-SQL system only becomes dependable when each step is clearly defined and isolated. In early prototypes, it’s common to see a single function that builds a prompt, calls the model, runs SQL, and handles errors all at once. That approach works briefly and then collapses under real usage. Haystack 2.x encourages a different model: a directed acyclic graph where each component has one job and a clear contract. This structure makes the system easier to reason about, test, and extend as requirements grow.
4.1 Pipeline Design
A Haystack pipeline is a sequence of connected nodes. Each node receives well-defined inputs, performs a specific task, and emits structured outputs. For enterprise Text-to-SQL, the pipeline usually follows a predictable path:
- Retrieve relevant DDL and Golden Queries
- Build a structured prompt
- Generate SQL using the LLM
- Validate the SQL
- Execute the query
- Shape the response
This explicit ordering matters. It prevents hidden dependencies and makes it clear where failures occur. When you later add features like auto-correction, caching, or auditing, those become new nodes rather than conditional logic scattered across the codebase.
A minimal pipeline definition looks like this:
from haystack import Pipeline

pipeline = Pipeline()
pipeline.add_component("retriever", retriever)
pipeline.add_component("prompt_builder", prompt_builder)
pipeline.add_component("generator", generator)
pipeline.add_component("validator", validator)
pipeline.add_component("sql_executor", sql_executor)
pipeline.add_component("response_builder", response_builder)

pipeline.connect("retriever", "prompt_builder")
pipeline.connect("prompt_builder", "generator")
pipeline.connect("generator", "validator")
pipeline.connect("validator", "sql_executor")
pipeline.connect("sql_executor", "response_builder")

In Haystack 2.x, components are registered with add_component and wired with connect; when a component exposes more than one input or output, name the sockets explicitly (for example, "retriever.documents").
This layout mirrors how engineers already think about data flow. Each node can be logged, tested, and replaced independently, which is essential once the service is used by multiple applications.
4.2 Component 1: The PromptBuilder
The PromptBuilder is where intent becomes instruction. It takes the user’s question and the retrieved schema context and turns them into a prompt the model can reliably follow. The goal is not creativity. The goal is precision.
4.2.1 Core System Prompt
The system prompt should be short, stable, and unambiguous. It defines the model’s role and constraints:
You are a T-SQL expert. You generate safe, correct, read-only SQL Server queries.
Return only a single SELECT statement without comments or explanation.
This prompt does not describe business rules or data meaning. Those come from the schema context. Keeping the system prompt minimal reduces unexpected behavior when the schema evolves.
4.2.2 Injecting Schema Context
The PromptBuilder receives the DDL documents selected during retrieval and formats them consistently:
schema_text = "\n\n".join([doc.content for doc in retrieved_docs])
They are then injected into the prompt as a clearly marked section:
Schema:
{schema_text}
This step anchors the model in the actual database structure. Without it, the model will often guess column names or infer relationships that do not exist.
4.2.3 Few-Shot Prompting with Similar Queries
Golden Queries serve as concrete examples of how the schema is meant to be used. When available, include two or three that resemble the user’s intent:
Here are examples of valid SQL queries similar to the task:
1. SELECT r.Name, SUM(s.TotalAmount)
   FROM Sales s JOIN Region r ON s.RegionId = r.Id
   GROUP BY r.Name;

2. SELECT c.SegmentCode, COUNT(*)
   FROM Customers c
   WHERE c.IsActive = 1
   GROUP BY c.SegmentCode;

3. SELECT p.Category, SUM(o.Quantity)
   FROM Orders o JOIN Products p ON o.ProductId = p.Id
   GROUP BY p.Category;
These examples gently guide the model toward correct join paths and grouping patterns without hardcoding logic. Because they come from the DDL store, they evolve alongside the schema.
4.2.4 PromptBuilder Component Example
A simplified implementation looks like this:
from haystack import component

@component
class PromptBuilder:
    @component.output_types(prompt=str)
    def run(self, user_question: str, documents: list):
        schema_section = "\n\n".join([d.content for d in documents])
        golden_examples = [d.meta.get("golden") for d in documents if "golden" in d.meta]
        examples_text = "\n\n".join(golden_examples[:3])
        prompt = f"""
You are a T-SQL expert. Generate a single read-only SQL SELECT query.

Schema:
{schema_section}

Examples:
{examples_text}

User question:
{user_question}

Return only the SQL query.
"""
        return {"prompt": prompt}
The component remains simple by design. Changes to schema or examples do not require changes to this code.
4.3 Component 2: The Generator
The Generator sends the constructed prompt to the LLM and returns raw SQL. Haystack abstracts away provider-specific details, which makes switching between OpenAI and Azure OpenAI straightforward.
4.3.1 Generator Configuration
A typical OpenAI configuration:
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

generator = OpenAIGenerator(
    model="gpt-4o",
    api_key=Secret.from_env_var("OPENAI_API_KEY"),
    generation_kwargs={"temperature": 0, "max_tokens": 400},
)
For Azure-hosted deployments:
import os

from haystack.components.generators import AzureOpenAIGenerator
from haystack.utils import Secret

generator = AzureOpenAIGenerator(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=Secret.from_env_var("AZURE_OPENAI_KEY"),
    azure_deployment="gpt4o-sql",
    generation_kwargs={"temperature": 0},
)

Note that in Haystack 2.x both generators take API keys as Secret objects rather than raw strings, and the Azure variant identifies the model through azure_deployment.
4.3.2 Why Temperature Must Be Zero
Generating SQL is not a creative task. Randomness increases the risk of subtle errors, such as invented columns or reordered joins. Setting temperature to zero ensures that the same input produces the same SQL, which is critical for debugging and auditing.
4.3.3 Generator Output
The generator returns a string that represents the SQL query:
result = generator.run(prompt=prompt_text)
sql_query = result["replies"][0].strip()
At this point, the SQL has not been trusted or executed. It moves directly to validation.
4.4 Component 3: The SQLExecutor
The SQLExecutor is the boundary between AI output and the database. Its responsibility is to run validated queries safely and return structured results.
4.4.1 Executor Responsibilities
The executor must:
- Execute SQL using the read-only connection
- Apply session context for RLS
- Enforce execution timeouts
- Capture database errors
- Return results in a predictable format
Separating this logic from generation and validation keeps the blast radius small if something goes wrong.
4.4.2 SQLExecutor Example
from haystack import component
from sqlalchemy import create_engine, text

@component
class SQLExecutor:
    def __init__(self, conn_str):
        self.engine = create_engine(conn_str, pool_pre_ping=True)

    @component.output_types(rows=list, error=str)
    def run(self, sql: str, user_context_id: int):
        try:
            with self.engine.connect() as conn:
                # Apply the user context so RLS predicates take effect.
                conn.execute(
                    text("EXEC sp_set_session_context @key='tenant_id', @value=:tid"),
                    {"tid": user_context_id},
                )
                result = conn.execute(text(sql))
                rows = [dict(r._mapping) for r in result]
                return {"rows": rows, "error": ""}
        except Exception as ex:
            return {"error": str(ex), "rows": []}
4.4.3 Handling Empty Results
An empty result set is not an error. Returning an empty array is clearer for BI consumers and prevents unnecessary retries from client applications.
With the core pipeline in place, the system is ready for the most important step: validating and constraining AI-generated SQL before execution. That is the focus of the next section.
5 Enterprise Guardrails: “Trust, but Verify”
Text-to-SQL only works in an enterprise setting when every generated query is treated as untrusted input. Even with a read-only database login, a poorly formed query can still cause problems. A Cartesian join can scan millions of rows. A query against system tables can expose metadata that was never meant to be visible. Guardrails exist to catch these issues before they reach the database and before results are returned to the caller.
This section focuses on three layers of protection: static SQL analysis, execution limits, and output sanitization.
5.1 Static Analysis with sqlglot
Before a generated query is ever executed, it should be parsed and inspected. sqlglot turns raw SQL text into a structured representation that can be reasoned about programmatically. This allows the system to enforce rules that prompts alone cannot guarantee.
5.1.1 Parsing the SQL
The first step is validating that the output is syntactically valid T-SQL:
import sqlglot
from haystack import component

@component
class SQLValidator:
    def __init__(self, allowed_schemas):
        self.allowed_schemas = allowed_schemas

    @component.output_types(sql=str, error=str)
    def run(self, sql: str):
        try:
            parsed = sqlglot.parse_one(sql, read="tsql")
        except Exception:
            return {"error": "SQL parse failed", "sql": sql}
        return self._validate(parsed, sql)
If parsing fails, the query never reaches SQL Server. At this point, the pipeline can either return a clear error to the caller or attempt a controlled auto-correction.
5.1.2 Validation Check 1: Must Be a SELECT
The most important rule is that only SELECT statements are allowed:
if not isinstance(parsed, sqlglot.expressions.Select):
    return {"error": "Only SELECT statements are allowed.", "sql": sql}
This check blocks destructive or state-changing operations, such as:
- DELETE FROM Users
- DROP TABLE Sales
- UPDATE Orders SET TotalAmount = 0
Even though the database login is read-only, rejecting these statements early simplifies auditing and avoids unnecessary database round trips.
5.1.3 Validation Check 2: Table Allow-Listing
The next step is verifying which tables the query touches. Using sqlglot, you can inspect all referenced tables and ensure they belong to approved schemas:
for table in parsed.find_all(sqlglot.expressions.Table):
    schema = table.db
    if schema and schema.lower() not in self.allowed_schemas:
        return {"error": f"Schema '{schema}' is not allowed.", "sql": sql}
This prevents queries against areas such as:
- sys.users
- INFORMATION_SCHEMA.COLUMNS
- Internal or operational schemas
The model does not understand organizational boundaries. This check enforces them consistently.
5.1.4 Example Response on Violation
When a query violates a rule, the validator returns a structured response:
{
"error": "Schema 'sys' is not allowed.",
"sql": "SELECT * FROM sys.objects"
}
The .NET application can log this, display a user-friendly message, or trigger an automated retry with a corrected prompt.
5.2 Execution Timeout
Even valid SELECT queries can cause trouble if they run too long. A missing join condition or an unfiltered table scan can overwhelm the database. Timeouts ensure the system fails fast instead of degrading overall performance.
5.2.1 Client-Side Timeout Configuration
At the SQLAlchemy level, set a conservative timeout:
from sqlalchemy import create_engine, event

engine = create_engine(conn_str, connect_args={"timeout": 5})  # login timeout

@event.listens_for(engine, "connect")
def set_query_timeout(dbapi_connection, connection_record):
    # pyodbc uses this attribute as the per-query timeout, in seconds.
    dbapi_connection.timeout = 5

Note that pyodbc treats the timeout connect argument as a login timeout; the per-query limit is the timeout attribute on the connection itself, set here through an engine connect event.
This prevents individual requests from blocking connections indefinitely.
5.2.2 Server-Side Safeguards
Client-side timeouts are not enough on their own. SQL Server should also enforce limits. One option is the query governor:
EXEC sp_configure 'query governor cost limit', 1500;
RECONFIGURE;
In more advanced setups, Resource Governor can be used to constrain CPU and memory usage for the AI service login.
5.2.3 Failing Fast and Clearly
When a timeout occurs, return a clear, structured response:
{
"error": "Execution timeout exceeded.",
"sql": "...",
"rows": []
}
This makes it obvious to downstream systems what happened and avoids repeated retries that only make the problem worse.
5.3 PII Masking
Even read-only analytics queries can expose sensitive data. Customer emails, phone numbers, or identifiers may appear in results even when the query itself is valid. Before returning data to the caller, the system should scan and sanitize the output.
5.3.1 Detecting PII with Regular Expressions
For common cases, lightweight pattern matching works well:
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_value(value):
    if isinstance(value, str):
        if EMAIL.search(value):
            return "[email masked]"
        if SSN.search(value):
            return "[ssn masked]"
    return value
This approach is fast and easy to reason about.
5.3.2 Applying Masking to Result Sets
Apply masking row by row before serialization:
def mask_rows(rows):
    masked = []
    for row in rows:
        masked.append({k: mask_value(v) for k, v in row.items()})
    return masked
5.3.3 Integration with Execution
The executor applies masking just before returning results:
rows = [dict(r._mapping) for r in result]
rows = mask_rows(rows)
5.3.4 Example Masked Response
{
"rows": [
{ "CustomerName": "Alice", "Email": "[email masked]" }
]
}
This keeps the analytics experience useful while respecting privacy and compliance requirements.
Together, static validation, execution limits, and output masking turn AI-generated SQL from a risk into a controlled capability. With these guardrails in place, the system is ready to expose results safely to client applications through a well-defined API layer.
6 The API Layer: Integration with .NET Apps
The Text-to-SQL service only becomes valuable once it is easy for .NET applications to call and trust. At this point in the architecture, all of the complexity—retrieval, prompting, validation, execution, and masking—already lives behind the scenes. The API layer’s job is to expose that capability through a simple, stable contract that feels familiar to .NET developers. If the API is predictable, teams can integrate it quickly without understanding how the AI works internally.
6.1 Designing the Contract (DTOs)
The API contract defines the boundary between .NET applications and the Text-to-SQL sidecar. The request should contain only what the service truly needs: the user’s question and the user context identifier used for RLS. Everything else stays inside the service. The response should provide enough information for observability and trust without leaking internal implementation details.
6.1.1 Request DTO
The request payload is intentionally small:
{
  "natural_language_query": "sales in Europe last quarter",
  "user_context_id": 1032
}
From the .NET side, this maps cleanly to a simple DTO:
public sealed class TextToSqlRequest
{
    public string NaturalLanguageQuery { get; set; } = default!;
    public int UserContextId { get; set; }
}
This model works equally well with HttpClient, Refit, minimal APIs, or generated clients. There is no coupling to prompts, schema details, or pipeline behavior.
6.1.2 Response DTO
The response includes four things: the generated SQL, the resulting data, execution timing, and any safety flags raised during processing. A typical response looks like this:
{
  "sql_generated": "SELECT r.Name, SUM(s.TotalAmount) ...",
  "data": [
    { "Region": "Europe", "Total": 1293300.22 },
    { "Region": "North America", "Total": 1582200.13 }
  ],
  "execution_time_ms": 124.32,
  "safety_flags": []
}
In C#, the corresponding model might look like:
public sealed class TextToSqlResponse
{
    public string SqlGenerated { get; set; } = string.Empty;
    public List<Dictionary<string, object?>> Data { get; set; } = new();
    public double ExecutionTimeMs { get; set; }
    public List<string> SafetyFlags { get; set; } = new();
}
Using a dictionary for result rows avoids forcing predefined DTOs for every possible query shape. This aligns well with how BI tools and dynamic grids consume data.
6.1.3 Extensibility Considerations
Over time, additional flags may be added, such as timeout, masking_applied, or auto_correction_used. The response model should allow this without breaking existing clients. In practice, this means only adding fields, never removing or renaming them. Generated .NET clients handle additive changes cleanly.
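The tolerance this requires of clients can be illustrated directly: deserialize by keeping only known fields, so a payload carrying a new flag such as masking_applied still parses. A Python sketch of the principle (the from_dict helper is illustrative, not part of the service):

```python
from dataclasses import dataclass, field, fields

@dataclass
class TextToSqlResponse:
    sql_generated: str = ""
    data: list = field(default_factory=list)
    execution_time_ms: float = 0.0
    safety_flags: list = field(default_factory=list)

def from_dict(cls, payload: dict):
    """Build a response object, silently ignoring fields this client
    version does not know about yet."""
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in payload.items() if k in known})

# A newer server adds masking_applied; an older client still parses cleanly.
resp = from_dict(TextToSqlResponse, {
    "sql_generated": "SELECT 1",
    "data": [],
    "execution_time_ms": 12.5,
    "safety_flags": [],
    "masking_applied": True,  # unknown to this client version
})
```

Generated .NET clients behave the same way by default: unknown JSON properties are ignored rather than treated as errors.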
6.2 FastAPI Implementation
FastAPI fits well as the boundary layer for this service. It provides strong typing, async support, and automatic OpenAPI generation with very little code. The endpoint itself should stay thin and delegate all business logic to the Haystack pipeline.
6.2.1 Defining Pydantic Models
The Python models mirror the DTOs exactly:
from pydantic import BaseModel
from typing import List, Dict, Any

class TextToSqlRequest(BaseModel):
    natural_language_query: str
    user_context_id: int

class TextToSqlResponse(BaseModel):
    sql_generated: str
    data: List[Dict[str, Any]]
    execution_time_ms: float
    safety_flags: List[str]
Keeping these models aligned with the .NET DTOs avoids confusion and simplifies debugging.
6.2.2 Creating the Endpoint
The endpoint accepts the request, invokes the pipeline, and returns the result. Because LLM calls and database access are I/O-bound, the endpoint should remain asynchronous:
from fastapi import FastAPI
import time

app = FastAPI()

@app.post("/text-to-sql", response_model=TextToSqlResponse)
async def text_to_sql(req: TextToSqlRequest):
    start = time.perf_counter()
    result = await pipeline.run_async(
        query=req.natural_language_query,
        user_context_id=req.user_context_id
    )
    execution_time = (time.perf_counter() - start) * 1000
    return TextToSqlResponse(
        sql_generated=result.get("sql", ""),
        data=result.get("rows", []),
        execution_time_ms=execution_time,
        safety_flags=result.get("flags", [])
    )
The endpoint does not change when new pipeline steps are added. That stability is important for client teams.
6.2.3 Async and Timeout Behavior
LLM calls can take seconds, especially under load. Using async endpoints prevents thread starvation and improves throughput. When deployed behind Kubernetes or an API gateway, configure request timeouts that comfortably exceed expected LLM latency but still fail predictably.
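Inside the service itself, a hard ceiling can also be enforced with asyncio.wait_for, so a stalled model call fails predictably instead of hanging. A minimal sketch, with slow_pipeline standing in for the real pipeline call:

```python
import asyncio

async def run_with_timeout(coro, timeout_s: float = 30.0):
    """Fail predictably instead of hanging on a slow LLM call."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        # Surface a flag the client can act on, mirroring safety_flags.
        return {"sql": "", "rows": [], "flags": ["timeout"]}

async def slow_pipeline():
    await asyncio.sleep(10)  # stand-in for a slow model call
    return {"sql": "SELECT 1", "rows": [], "flags": []}

result = asyncio.run(run_with_timeout(slow_pipeline(), timeout_s=0.05))
```

Returning a flagged result rather than raising keeps the response shape stable for .NET clients even on timeout.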
6.2.4 Logging and Traceability
Every request should log:
- the incoming question
- the generated SQL
- any validation or safety flags
- total execution time
These logs are essential for audits, debugging, and performance tuning. Most enterprises forward them via OpenTelemetry into centralized logging systems.
6.3 Swagger/OpenAPI
FastAPI automatically publishes an OpenAPI specification. This removes friction for .NET teams by enabling client generation instead of manual HTTP wiring.
6.3.1 Accessing the Specification
Once the service is running, the OpenAPI document is available at:
/openapi.json
This file describes all endpoints, request models, and response shapes.
6.3.2 Generating a C# Client
Using NSwag, a client can be generated in seconds:
nswag openapi2csclient \
    /input:https://ai-service.example.com/openapi.json \
    /output:AiClient.cs
The generated client follows standard .NET conventions and integrates cleanly into existing solutions.
6.3.3 Calling the Service from .NET
From the application’s perspective, usage is straightforward:
var client = new TextToSqlClient("https://ai-service.example.com", httpClient);

var request = new TextToSqlRequest
{
    NaturalLanguageQuery = "sales in Europe last quarter",
    UserContextId = 1032
};

var response = await client.TextToSqlAsync(request);
Console.WriteLine(response.SqlGenerated);
At this point, the .NET team does not manage prompts, models, or SQL safety. They interact with a typed API that behaves like any other internal service. This clean separation is what makes the sidecar pattern viable in real enterprise systems.
7 Optimization: Self-Correction and Caching
Once the core pipeline is stable, the focus shifts from correctness to efficiency and resilience. In real usage, two things happen quickly. First, the model occasionally produces SQL that is almost right but fails on execution. Second, users ask the same questions repeatedly. Left unaddressed, these issues lead to poor user experience and unnecessary cost. A self-correction loop and a caching layer solve most of this without adding complexity to the client.
7.1 The “Self-Correction” Loop
Even with strong schema grounding, small mistakes still happen. A column name might be slightly off. A join alias might be wrong. In many cases, SQL Server’s error message already contains everything needed to fix the issue. Instead of failing immediately, the system can attempt a controlled retry.
7.1.1 Triggering Auto-Correction
When validation or execution fails, the pipeline captures the error and decides whether it is recoverable. If it is, the original SQL and the SQL Server error are fed back into the generator with a focused repair prompt:
repair_prompt = f"""
The following SQL failed:
{sql_generated}
Error from SQL Server:
{error_message}
Produce a corrected SELECT statement that avoids the error.
"""
The generator runs again with temperature still set to zero. This keeps the correction deterministic and prevents creative rewrites.
7.1.2 Limiting Correction Attempts
Auto-correction must be bounded. Unlimited retries hide problems and complicate debugging. In practice, one or two attempts are enough:
MAX_RETRIES = 2
If the query still fails after the final attempt, the system returns a clear error. Enterprise users generally prefer a transparent failure over repeated silent retries.
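The bounded loop itself stays small. A sketch, with fake_generate and fake_execute standing in for the real generator and executor components:

```python
MAX_RETRIES = 2

def run_with_correction(question, generate_sql, execute_sql):
    """Execute generated SQL; on failure, feed the database error back
    into the generator up to MAX_RETRIES times, then fail transparently."""
    sql = generate_sql(question, error=None, previous_sql=None)
    for attempt in range(MAX_RETRIES + 1):
        try:
            return sql, execute_sql(sql)
        except Exception as exc:
            if attempt == MAX_RETRIES:
                raise
            sql = generate_sql(question, error=str(exc), previous_sql=sql)

# Illustrative stubs: the first attempt uses a wrong column name,
# the repair prompt produces a corrected query.
def fake_generate(question, error, previous_sql):
    if error is None:
        return "SELECT SubStart FROM Customers"
    return "SELECT SubscriptionStartDate FROM Customers"

def fake_execute(sql):
    if "SubStart" in sql:
        raise RuntimeError("Invalid column name 'SubStart'")
    return [{"SubscriptionStartDate": "2024-05-01"}]

sql, rows = run_with_correction("subscriptions last month",
                                fake_generate, fake_execute)
```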
7.1.3 Example Correction Flow
A typical correction looks like this:
- A user asks, “list customers with subscription start date last month.”
- The model generates SQL referencing `SubStart` instead of `SubscriptionStartDate`.
- SQL Server returns "Invalid column name 'SubStart'."
- The self-correction step re-prompts the model with the error.
- The model produces corrected SQL using `SubscriptionStartDate`.
- The executor runs successfully and returns results.
These small fixes significantly improve perceived reliability, especially for first-time users.
7.2 Caching Strategy
Text-to-SQL queries repeat far more than teams expect. Weekly revenue summaries, monthly comparisons, and regional breakdowns are asked again and again. Re-generating SQL every time wastes tokens and adds latency. Caching addresses this without changing the external API.
7.2.1 What to Cache
Cache the generated SQL, not the query results. Data changes frequently, and serving cached results risks returning stale information. Generated SQL is cheap to reuse and safe to execute repeatedly.
7.2.2 Redis Integration
A common approach is to hash the natural-language query and use it as a Redis key:
import redis
import hashlib
redis_client = redis.Redis(host="redis", port=6379)
def cache_key(question: str):
return "sqlcache:" + hashlib.sha256(question.encode()).hexdigest()
Before calling the generator:
key = cache_key(user_question)
cached_sql = redis_client.get(key)

if cached_sql:
    sql = cached_sql.decode()
else:
    sql = generated_sql  # the LLM generator runs only on a cache miss
    redis_client.setex(key, 3600, sql)
This keeps caching logic simple and avoids coupling to user identity or schema details.
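Hit rates improve if the question is normalized before hashing, so that trivial variations in casing and whitespace share one cache entry. A small sketch of that variant:

```python
import hashlib
import re

def cache_key(question: str) -> str:
    # Lowercase and collapse whitespace so "Sales in Europe" and
    # "  sales   in europe " map to the same cache entry.
    normalized = re.sub(r"\s+", " ", question.strip().lower())
    return "sqlcache:" + hashlib.sha256(normalized.encode()).hexdigest()
```

More aggressive normalization (stemming, synonym folding) is possible but risks collapsing genuinely different questions; whitespace and case are the safe baseline.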
7.2.3 Cache TTL Guidance
Choose a TTL based on how often your data changes. Highly volatile domains may use 30 minutes. Financial or operational summaries often tolerate several hours. The goal is to reduce repeated generation, not to cache indefinitely.
7.2.4 Impact on .NET Clients
The .NET application does not need to know caching exists. From its perspective, responses are simply faster. This keeps the client code clean and avoids conditional logic around cache behavior.
7.2.5 Measuring Cache Effectiveness
Log cache hits and misses. Over time, these metrics show which questions repeat and where further optimization might help.
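A minimal in-process counter is enough to start; the CacheMetrics name below is illustrative, not an existing class:

```python
class CacheMetrics:
    """Track cache hits and misses to measure how often questions repeat."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

metrics = CacheMetrics()
for hit in (True, True, False):
    metrics.record(hit)
```

In production these counters would typically be exported as gauges to the observability stack described below rather than kept in memory.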
7.3 Evaluation Framework
Optimization only matters if it can be measured. Enterprise teams need objective signals that the system is improving and safe to expand.
7.3.1 Execution Accuracy
Execution accuracy measures how often generated SQL runs successfully on the first attempt:
execution_accuracy = successful_first_attempts / total_queries
After tuning retrieval and validation, a target above 85% is reasonable. With self-correction enabled, mature systems often exceed 95%.
7.3.2 Logical Accuracy
Logical accuracy asks whether the query actually answers the user’s question. Measuring this requires sampling:
- Select representative queries
- Review generated SQL
- Compare results with expected business logic
Initial systems often land around 70–80%. Adding better Golden Queries and synonyms steadily improves this.
7.3.3 Latency Metrics
Track average durations for:
- Schema retrieval
- SQL generation
- Validation
- Execution
- End-to-end request time
These metrics highlight slow SQL patterns and guide caching and indexing decisions.
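One lightweight way to collect these per-stage durations is a timing context manager wrapped around each pipeline step. A sketch (the stage names are illustrative):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage: str):
    """Record the wall-clock duration of a pipeline stage in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

with timed("schema_retrieval"):
    time.sleep(0.01)  # stand-in for the real retrieval step
with timed("sql_generation"):
    time.sleep(0.01)  # stand-in for the LLM call
```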
7.3.4 Observability Integration
Export metrics through standard tooling such as Prometheus, Application Insights, or OpenTelemetry. This allows teams to detect schema drift, performance regressions, and unusual usage patterns early.
7.3.5 Continuous Improvement Loop
Every failure is feedback. Incorrect joins, missing filters, and ambiguous terms should be captured and added back into the DDL store as Golden Queries or synonyms. Over time, reliability improves without retraining models. In practice, most gains come from better context, not more complex prompts or models.
With optimization in place, the system is no longer just functional—it becomes efficient, predictable, and ready for sustained enterprise use.
8 Conclusion and Future Roadmap
The Text-to-SQL sidecar pattern gives .NET teams a practical way to add Generative AI without destabilizing existing systems. By isolating AI-specific concerns inside a Python microservice, teams gain flexibility while keeping data access, security, and ownership where they already belong. The earlier sections walked through the core pipeline, schema grounding, guardrails, and integration details. What remains is turning that architecture into something teams can operate confidently in production and extend over time.
8.1 Production Checklist
A production deployment succeeds or fails on discipline, not model quality. Secrets must be handled correctly from day one. API keys, database credentials, and Redis connection strings should never live in configuration files or container images. Use Azure Key Vault or AWS Secrets Manager and load values at runtime using managed identities where possible.
Observability is equally important. Structured logs and traces should flow from FastAPI, Haystack components, and the SQL execution layer into a centralized system via OpenTelemetry. Each request should carry a correlation ID so .NET application logs and sidecar logs can be viewed together. Logging execution duration, validation failures, and safety flags makes it possible to spot slow queries, schema drift, or unusual usage patterns before they become incidents.
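Within the Python service, contextvars can carry the correlation ID across awaits so every log line in a request shares the ID the caller sent. A sketch; the X-Correlation-ID header name and the helper functions are assumptions:

```python
import contextvars
import uuid

# Holds the current request's correlation ID, preserved across awaits.
correlation_id = contextvars.ContextVar("correlation_id", default="")

def begin_request(incoming_id=None):
    """Adopt the caller's ID (e.g. from an X-Correlation-ID header)
    or mint a fresh one."""
    cid = incoming_id or str(uuid.uuid4())
    correlation_id.set(cid)
    return cid

def log_line(message):
    # Prefix every log entry so .NET and sidecar logs can be joined.
    return f"[{correlation_id.get()}] {message}"

begin_request("abc-123")
```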
Rate limiting protects both the database and the LLM budget. A simple Redis-backed token bucket is often enough to prevent accidental overload. For example:
async def rate_limit(user_id: int):
    key = f"ratelimit:{user_id}"
    count = redis.incr(key)
    if count == 1:
        redis.expire(key, 60)
    if count > 30:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
This ensures that a misconfigured client or runaway loop cannot exhaust resources or generate unexpected costs.
8.2 The “Human in the Loop”
Early in adoption, it is reasonable to slow things down intentionally. Some questions touch financial reporting, regulatory data, or multi-year trends where correctness matters more than speed. In these cases, an approval step adds confidence without blocking progress.
An “Approval Mode” can return the generated SQL without executing it:
{
  "requires_approval": true,
  "sql_generated": "SELECT ...",
  "data": []
}
The .NET application can present that SQL to an analyst or reviewer, who approves or rejects it. Approval triggers a second call that executes the query. Over time, as execution accuracy and logical accuracy improve, fewer queries need review. The system earns trust gradually instead of demanding it upfront.
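The two-call flow can be sketched as a small pending-approval store keyed by an opaque token. The token field and the in-memory store below are illustrative, not part of the API contract defined earlier:

```python
import uuid

# Pending SQL awaiting human review, keyed by an opaque token.
# A real deployment would use Redis or a database, not process memory.
pending = {}

def submit_for_approval(sql: str) -> dict:
    token = str(uuid.uuid4())
    pending[token] = sql
    return {"requires_approval": True, "approval_token": token,
            "sql_generated": sql, "data": []}

def approve_and_execute(token: str, execute_sql) -> list:
    """Second call: run only SQL a reviewer explicitly approved.
    Popping the token makes each approval single-use."""
    sql = pending.pop(token)  # raises KeyError for unknown or reused tokens
    return execute_sql(sql)

resp = submit_for_approval("SELECT ...")
rows = approve_and_execute(resp["approval_token"],
                           lambda sql: [{"executed": sql}])
```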
8.3 Next Steps
Once the read-only Text-to-SQL flow is stable, the same foundation supports more advanced workflows. Query results can become inputs to downstream actions rather than just rows on a screen. Reports can be scheduled, summaries sent by email, or insights posted to Teams or Slack.
For example, a .NET application might invoke a workflow agent like this:
await reportAgent.SendSummaryAsync(
    userId: 1032,
    query: "monthly revenue trends",
    destination: "finance@company.com"
);
At that point, BI stops being passive. It becomes part of how work gets done. The same safety principles still apply: read-only data access, strict validation, bounded execution, and full audit trails. By following this roadmap, enterprises move from static dashboards toward conversational, self-service analytics that fit naturally into daily decision-making.