Mini SQL Query Engine — Natural Language Analytics with Local LLM
Overview
This project is a locally-run natural language analytics system that lets non-technical users query business data in plain language. No SQL knowledge required, no cloud dependencies — everything runs on a quantized Llama 3.1 8B model.
Key Design Principle
The LLM’s job is intentionally limited: it only extracts intent, it does not execute anything.
Given a natural language question, the LLM returns a small JSON object describing what the user wants — which filters to apply, which columns to group by, which metric to aggregate. The program then executes this plan deterministically using plain Python and pandas.
Question
↓
LLM → {"filters": {"invoice_year": 2025, "related_webshop": "Clearvis.io"},
"metric_col": "revenues_wdisc_in_base_currency",
"group_cols": ["invoice_month", "product_group"],
"agg": "sum"}
↓
run_plan() — deterministic Python
↓
Result table
This separation is the core of the design. The LLM is good at understanding language; it is not reliable as a code generator. By constraining it to JSON extraction, every failure is explicit and catchable before any data is touched.
Full Implementation
import json
import pandas as pd
from llama_cpp import Llama
CSV_PATH = r"C:\path\to\data.csv"
MODEL_PATH = r"C:\path\to\llama-3.1-8b-instruct-q4_k_m.gguf"
df = pd.read_csv(
CSV_PATH,
encoding="cp1250",
encoding_errors="ignore",
on_bad_lines="skip",
low_memory=False,
)
llm = Llama(model_path=MODEL_PATH, n_ctx=4096, n_threads=8, verbose=False)
SYSTEM_PROMPT = """You are a query planner. Extract query parameters from the user's question.
Return ONLY a JSON object with these fields:
{
"filters": {column: value, ...},
"metric_col": string or null,
"group_cols": [string, ...] or null,
"agg": "sum" | "mean" | "count" | "nunique"
}
Column mappings (USE THESE EXACT COLUMN NAMES):
- webshop / shop -> related_webshop (values: "eOptika.hu", "LentileContact.ro", "Clearvis.io")
- net revenue -> revenues_wdisc_in_base_currency
- gross revenue -> gross_revenues_wdisc_in_base_currency
- year -> invoice_year (integer)
- month / havi bontás -> invoice_month (integer, 1-12)
- product group -> product_group
- user count -> agg="nunique", metric_col="user_id"
Rules:
- filters: equality conditions only (column -> value)
- group_cols: always a list; include invoice_month if monthly breakdown is asked
- agg: "sum" for revenue, "count" for rows, "nunique" for unique users
- Return ONLY the JSON, no explanation
Example:
Q: What was the total net revenue in 2025 by month in the Clearvis.io webshop by product group?
A: {"filters": {"invoice_year": 2025, "related_webshop": "Clearvis.io"}, "metric_col": "revenues_wdisc_in_base_currency", "group_cols": ["invoice_month", "product_group"], "agg": "sum"}
"""
def question_to_plan(question: str) -> dict:
resp = llm.create_chat_completion(
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": question},
],
temperature=0,
max_tokens=256,
)
raw = resp["choices"][0]["message"]["content"]
start = raw.find("{")
end = raw.rfind("}") + 1
return json.loads(raw[start:end])
def run_plan(df: pd.DataFrame, plan: dict) -> pd.DataFrame:
# filters — case-insensitive
mask = pd.Series(True, index=df.index)
for col, val in (plan.get("filters") or {}).items():
if val is None or col not in df.columns:
continue
s_norm = df[col].astype(str).str.strip().str.casefold()
v_norm = str(val).strip().casefold()
mask &= (s_norm == v_norm)
df_f = df.loc[mask]
metric = plan.get("metric_col")
group_cols = plan.get("group_cols") or []
agg = plan.get("agg", "sum")
if not group_cols:
if agg == "count":
return pd.DataFrame({"result": [len(df_f)]})
s = pd.to_numeric(df_f[metric], errors="coerce")
return pd.DataFrame({"result": [s.sum() if agg == "sum" else s.mean()]})
if agg == "count":
result = df_f.groupby(group_cols).size().reset_index(name="count")
elif agg == "nunique":
result = df_f.groupby(group_cols)[metric].nunique().reset_index()
else:
df_f = df_f.copy()
df_f[metric] = pd.to_numeric(df_f[metric], errors="coerce")
result = df_f.groupby(group_cols)[metric].sum().reset_index() if agg == "sum" \
else df_f.groupby(group_cols)[metric].mean().reset_index()
# pivot: invoice_month + 1 other dimension → wide table
if "invoice_month" in group_cols and len(group_cols) == 2:
other = [c for c in group_cols if c != "invoice_month"][0]
result = result.pivot_table(
index=other,
columns="invoice_month",
values=metric,
aggfunc="sum",
fill_value=0,
).reset_index()
month_cols = sorted(
[c for c in result.columns if c != other], key=lambda x: int(x)
)
result = result[[other] + month_cols]
return result
def ask(question: str):
print("\nQUESTION:", question)
plan = question_to_plan(question)
print("\nPLAN:")
print(json.dumps(plan, ensure_ascii=False, indent=2))
result = run_plan(df, plan)
print("\nRESULT:")
print(result.to_string(index=False))
return result
ask("What was the total net revenue in 2025 by month in the Clearvis.io webshop by product group?")
Example Output
PLAN:
{
"filters": {"invoice_year": 2025, "related_webshop": "Clearvis.io"},
"metric_col": "revenues_wdisc_in_base_currency",
"group_cols": ["invoice_month", "product_group"],
"agg": "sum"
}
Rows after filters: 10772
RESULT:
product_group 10 11 12
Accessories 170,293 125,353 125,872
Contact lens cleaners 1,332,999 1,016,438 1,390,611
Contact lenses 7,702,316 5,530,241 7,131,140
Eye drops 792,053 735,024 901,451
Eye tests 1,408,159 1,670,070 1,563,534
Frames 8,515,660 7,761,395 6,036,395
Lenses for spectacles 15,972,972 16,536,509 11,811,043
Sunglasses 1,984,686 1,018,771 2,161,891
Why It Works
The system prompt does the heavy lifting: it maps business language to exact column names, provides example values, and gives the LLM a single concrete example to follow. At temperature=0 the output is deterministic.
Filters use case-insensitive matching, so slight casing differences between user input and data values are handled automatically. The pivot logic kicks in automatically when a monthly breakdown is combined with a second grouping dimension.
The entire system runs locally — no API calls, no data leaving the machine.