SQL Query Engine — Natural Language to PostgreSQL via a Self-Healing LLM Pipeline

SQL Query Engine
Point it at a database. Ask questions in plain English. Get back SQL results.
The Problem
You have a PostgreSQL database full of data, and stakeholders who need answers — fast. They don’t write SQL. They shouldn’t have to. But every text-to-SQL tool you’ve tried either chokes on schema complexity, generates broken queries with no recovery, or needs you to adopt someone else’s entire platform. SQL Query Engine closes that gap. It’s a self-hosted FastAPI service that translates natural language into executed PostgreSQL queries through a two-stage LLM pipeline — powered by any OpenAI-compatible model (Ollama, vLLM, OpenAI, LiteLLM). It also ships with full OpenWebUI integration out of the box.
How It Works
At its core, SQL Query Engine is a two-stage inference pipeline sitting between your users and PostgreSQL. Here’s the full picture:
flowchart LR
A([🧑 User<br>Natural Language]) -->|HTTP / OpenAI API| B[🖥️ SQL Query Engine<br>FastAPI :5181]
B -->|Stage 1 · NL → SQL| C[🤖 LLM<br>OpenAI-compatible]
C --> B
B -->|Stage 2 · Execute + Repair| D[(🐘 PostgreSQL)]
D --> B
B -->|Cache + Pub/Sub| E[(📦 Redis)]
B -->|Results| A
The user asks a question. The engine asks the LLM to write a SQL query. The query is executed against PostgreSQL. If it fails, the engine captures the error and asks the LLM to fix it — automatically, up to N times. That’s it.
Two-Stage Pipeline
The engine splits every request into two stages, each handled by a dedicated module:
sequenceDiagram
participant Client
participant API as FastAPI :5181
participant GEN as QueryGenerator<br>Stage 1 · NL → SQL
participant PG as PostgreSQL
participant Redis
participant LLM as LLM<br>OpenAI-compatible
participant EVAL as QueryEvaluator<br>Stage 2 · Execute + Repair
Client->>API: POST /inference/sqlQueryEngine/{chatID}
API->>GEN: forward(chatID, prompt)
GEN->>PG: introspectSchema() [first request only]
PG-->>GEN: DDL + sample rows
GEN->>Redis: cacheContext(chatID)
GEN->>LLM: generateSQL(schema, prompt)
LLM-->>GEN: sqlQuery
GEN->>EVAL: evaluate(sqlQuery)
EVAL->>PG: execute(sqlQuery)
PG-->>EVAL: rows | SQLException
alt Query fails
EVAL->>LLM: repairSQL(query, error)
LLM-->>EVAL: repairedQuery
EVAL->>PG: execute(repairedQuery)
PG-->>EVAL: rows
end
EVAL->>Redis: publish progressEvent(chatID)
EVAL-->>API: QueryResult
API-->>Client: 200 JSON
- Stage 1 — SQL Generation — On the first request for a session, the engine introspects the database schema from PostgreSQL, samples rows from every table, and asks the LLM to produce a detailed schema description. This context is cached in Redis per
chatID, so subsequent questions skip introspection entirely. The LLM then generates a SQL query, which is extracted via a multi-strategy response parser — a 5-strategy cascade (JSON → embedded JSON → code blocks → regex → raw text) that works with any model, no structured output or function calling required. - Stage 2 — SQL Evaluation & Repair — The generated query is executed against PostgreSQL. If it succeeds and returns rows, the result is accepted immediately — no LLM re-evaluation, no risk of regression. If it fails — syntax errors, schema mismatches, empty results — the engine captures the full error output (SQLSTATE codes, PostgreSQL diagnostics, tracebacks) and feeds it back to the LLM along with the schema context. The LLM returns a fix, extracted via the same multi-strategy parser. This loop repeats until the query succeeds or the retry limit is reached. The engine tracks the best result seen across all attempts, so if retries exhaust without a perfect fix, it returns the best partial result rather than nothing. Every step publishes real-time progress to a Redis Pub/Sub channel.
The Self-Healing Loop
This is the key differentiator. When a query goes wrong, the engine doesn’t give up — it enters an iterative repair loop:
flowchart TD
A[Execute SQL Query] --> B{Success?}
B -->|Yes + Rows| C[✅ Early-Accept · Return Results]
B -->|No or Empty| D[Track Best Result So Far]
D --> E[Capture Error Details<br>SQLSTATE · Diagnostics · Traceback]
E --> F[Feed Error + Schema to LLM]
F --> G[LLM Returns Fix<br>observation · fixedQuery]
G --> H{Retry Limit?}
H -->|No| A
H -->|Yes| I[⚠️ Return Best Result Seen]
Two design choices make this loop safe. First, early-accept: if a query executes successfully and returns rows, it’s accepted immediately — the LLM is never asked to second-guess a working result, which prevents regressions where the model “fixes” a correct query into a broken one. Second, best-result tracking: the engine records the best result seen across all retry iterations, so even if the loop exhausts without a perfect fix, the user gets the closest successful result rather than an empty failure.
The evaluator captures PostgreSQL errors at the psycopg level — including SQLSTATE codes, diagnostic messages, and hints — and feeds the full context back to the LLM. The LLM’s isValid self-assessment is ignored by design; the engine only trusts execution outcomes. If the query returns empty results, the loop continues — because an empty result on a reasonable question usually means the query is wrong, not the data. The database connection is set to read-only mode, so the engine can never accidentally modify data.
OpenAI-Compatible API
The engine exposes /v1/chat/completions, /v1/completions, and /v1/models endpoints — making it a drop-in replacement for any OpenAI-compatible client.
flowchart TB
subgraph Clients
OW[OpenWebUI]
CU[curl / http]
PY[Python SDK]
ANY[Any OpenAI Client]
end
subgraph Engine["SQL Query Engine (Docker)"]
direction LR
V1["/v1/chat/completions"]
NATIVE["/inference/sqlQueryEngine"]
AUTH[Bearer Token Auth]
V1 --> AUTH
NATIVE --> AUTH
end
OW -->|OpenAI Protocol| V1
CU -->|REST| NATIVE
PY -->|OpenAI Protocol| V1
ANY -->|OpenAI Protocol| V1
The streaming implementation subscribes to Redis Pub/Sub before launching the engine in a thread pool, then forwards progress messages as SSE chunks wrapped in <think> tags — so reasoning-capable clients like OpenWebUI display them as chain-of-thought while the query is being built and validated. The final result (SQL + markdown table) renders cleanly below.
Bearer token authentication is supported via OPENAI_API_KEY. Comma-separate multiple keys for multi-user setups. Leave empty to disable auth entirely.
Quick Start
Clone, configure, deploy:
git clone https://github.com/codeadeel/sqlqueryengine.git
cd sqlqueryengineEdit docker-compose.yml with your LLM and PostgreSQL details:
# LLM endpoint — Ollama, vLLM, OpenAI, LiteLLM, etc.
- LLM_BASE_URL=http://host.docker.internal:11434/v1
- LLM_MODEL=qwen2.5-coder:7b
- LLM_API_KEY=ollama
# Your PostgreSQL instance
- POSTGRES_HOST=host.docker.internal
- POSTGRES_PORT=5432
- POSTGRES_DB=mydb
- POSTGRES_USER=myuser
- POSTGRES_PASSWORD=mypasswordStart the stack:
docker compose up --buildDone. The engine, Redis, and OpenWebUI are all running.
| Service | URL |
|---|---|
| SQL Query Engine API | http://localhost:5181 |
| Swagger UI | http://localhost:5181/docs |
| OpenWebUI | http://localhost:5182 |
Architecture at a Glance
flowchart TB
subgraph Clients
OW[OpenWebUI]
SW[Swagger UI]
CU[curl / SDK]
end
subgraph Engine["SQL Query Engine (Docker)"]
direction LR
MAIN[FastAPI · main.py]
COMPAT[OpenAI Compat · openaiCompat.py]
ENG[Engine · engine.py]
GEN[QueryGenerator · Stage 1]
EVAL[QueryEvaluator · Stage 2]
DB[DBHandler · dbHandler.py]
SM[SessionManager · sessionManager.py]
MAIN --> ENG
COMPAT --> ENG
ENG --> GEN
ENG --> EVAL
GEN --> DB
EVAL --> DB
GEN --> SM
EVAL --> SM
end
subgraph External
PG[(PostgreSQL)]
RD[(Redis<br>Cache + Pub/Sub)]
LLM[LLM<br>OpenAI-compatible]
end
OW -->|OpenAI Protocol| COMPAT
SW -->|REST| MAIN
CU -->|REST / OpenAI| MAIN
DB --> PG
SM --> RD
GEN --> LLM
EVAL --> LLM
Each module has a single responsibility. dbHandler.py handles PostgreSQL introspection and safe query execution (read-only connections). sessionManager.py manages per-user context in Redis hashes. promptTemplates.py holds the LLM prompt templates. connConfig.py reads environment variables and provides a shared FastAPI dependency for connection parameters.
Features Summary
| Feature | Details |
|---|---|
| Pipeline | Two-stage: NL → SQL generation + execution with self-healing repair loop |
| LLM Support | Any OpenAI-compatible endpoint (Ollama, vLLM, OpenAI, LiteLLM) |
| Database | PostgreSQL (read-only connections enforced) |
| Caching | Redis — schema context cached per session, skips re-introspection |
| Streaming | Redis Pub/Sub → SSE with <think> tag support for chain-of-thought |
| API | REST + OpenAI-compatible /v1/chat/completions |
| Auth | Bearer token (comma-separate multiple keys) |
| Chat UI | OpenWebUI pre-configured in Docker Compose |
| Deployment | Docker Compose, single command |
| Module Mode | Import SQLQueryEngine directly in Python, no HTTP layer needed |
| Response Parsing | Multi-strategy cascade (JSON → embedded JSON → code blocks → regex → raw text) — works with any model |
| Reasoning Models | <think> tag stripping for models like Qwen3 and DeepSeek-R1 |
Evaluation & Benchmarks
The repository includes two independent evaluation pipelines that quantify the impact of the self-healing loop. Both use a 3-configuration ablation study tested across 5 LLM backends.
Three configurations isolate each pipeline stage:
| Config | retryCount | What It Tests |
|---|---|---|
| Config C | 0 | Generation only — no evaluation, no repair |
| Config B | 1 | Single evaluation + one repair attempt |
| Config A | 5 | Full pipeline — up to 5 repair iterations |
Synthetic Evaluation (Controlled Environment)
The synthetic pipeline seeds 3 PostgreSQL databases (e-commerce, hospital, university) with reproducible Faker-generated data and runs 75 gold-standard questions (25 per database) across 4 difficulty tiers.
| Model | Config C (gen only) | Config A (full pipeline) | Self-Healing Delta | Regressions |
|---|---|---|---|---|
| llama-4-scout-17b | 48.0% | 57.3% | +9.3pp | 0 |
| gpt-oss-20b | 45.3% | 53.3% | +8.0pp | 0 |
| llama-3.3-70b | 53.3% | 54.7% | +1.4pp | 2 |
| gpt-oss-120b | 50.7% | 48.0% | −2.7pp | 4 |
| qwen3-32b | 48.0% | 46.7% | −1.3pp | 6 |
Key finding: Llama 4 Scout (17B MoE) achieves the highest accuracy (57.3%), the largest self-healing delta (+9.3 percentage points), and zero regressions — meaning the repair loop never degraded a previously correct query. The early-accept and best-result-tracking design choices directly enable the zero-regression property.
Where does self-healing help most? On hard-difficulty queries. Scout 17B jumps from 5.6% to 44.4% on hard questions (+38.8pp), while easy questions stay at 100% across all configs. The loop adds the most value exactly where raw generation struggles.
BIRD Benchmark (Real-World Databases)
To validate these findings beyond synthetic data, we also evaluated against BIRD — one of the most widely cited NL-to-SQL benchmarks in the research community. BIRD’s mini-dev set contains 500 questions across 11 real-world databases spanning finance, healthcare, sports, education, and more.
Since SQL Query Engine runs against PostgreSQL and BIRD ships its databases as SQLite files, the evaluation pipeline includes an automatic SQLite-to-PostgreSQL migration layer. It converts schemas, transforms gold SQL queries across 14 dialect rules (backticks → double quotes, IIF → CASE WHEN, GROUP_CONCAT → STRING_AGG, LIKE → ILIKE, and more), and bulk-inserts data at runtime. About 12.6% of questions (63 out of 500) are excluded due to gold SQL that doesn’t convert cleanly between dialects — these are reported separately. Results below are on the remaining 437 evaluated questions.
| Model | Config C (gen only) | Config A (full pipeline) | Self-Healing Delta | Regressions |
|---|---|---|---|---|
| gpt-oss-120b | 44.4% | 49.0% | +4.6pp | 19 |
| llama-3.3-70b | 43.7% | 46.5% | +2.8pp | 23 |
| llama-4-scout-17b | 37.1% | 40.5% | +3.4pp | 20 |
| gpt-oss-20b | 43.5% | 43.2% | −0.3pp | 24 |
| qwen3-32b | 40.7% | 39.4% | −1.3pp | 17 |
For context, published zero-shot baselines on BIRD include ChatGPT at 40.08%, GPT-4 at 46.35%, and GPT-4 with evidence at 54.89%. Our best result (49.0% with GPT-OSS 120B) is competitive with these baselines while using a lightweight, single-LLM pipeline.
What changes between synthetic and BIRD? The self-healing loop still provides gains (up to +4.6pp), but regressions increase substantially. On synthetic databases with clean, controlled schemas, the repair loop operates safely with zero regressions for the best models. On BIRD’s messier real-world schemas — ambiguous column names, inconsistent data types, complex joins — the LLM occasionally over-corrects working queries. This highlights an important practical insight: the self-healing loop’s value scales with schema clarity, and the early-accept mechanism is critical for limiting regression damage.
Another pattern: model rankings shift between benchmarks. Scout 17B dominates synthetic data (57.3%) but drops to 40.5% on BIRD, while GPT-OSS 120B does the reverse (48.0% synthetic → 49.0% BIRD). Larger models handle the ambiguity of real-world schemas better, while smaller models excel on clean, well-defined structures.
Running the Evaluations
# Synthetic evaluation
docker compose -f docker-compose-synthetic-evaluation.yml build
docker compose -f docker-compose-synthetic-evaluation.yml up -d
docker logs eval-runner --tail 10 -f
# Results in evaluation/synthetic/results/
# BIRD benchmark (download dataset first — see wiki)
docker compose -f docker-compose-bird-evaluation.yml build
docker compose -f docker-compose-bird-evaluation.yml up -d
docker logs bird-runner --tail 10 -f
# Results in evaluation/bird/bird_results/Links
- Paper: arXiv:2604.16511 — SQL Query Engine: A Self-Healing LLM Pipeline for Natural Language to PostgreSQL Translation
- Dataset: codeadeel/sql-query-engine-synthetic on Hugging Face — 75 gold-standard questions, evaluation results across 5 models
- GitHub: codeadeel/sqlqueryengine
- Wiki: Full documentation
- Wiki — REST API & Python Module: usage examples & endpoints
- Wiki — OpenWebUI Setup: connection & configuration
- Wiki — Environment Variables: full reference
- Wiki — Synthetic Evaluation: benchmark methodology & results
- Wiki — BIRD Benchmark: real-world evaluation results
- License: MIT