ConnectSphere

Why your LLM needs a glossary

The schema in your database has names like LIFNR, MATNR, KUNDE_NR, ACCT_TYPE_CD, PROC_CD. These names made sense to whoever wrote the source system in 1998, 2003, or 2011. They make no sense to anyone else, and they especially make no sense to an LLM trying to answer a natural-language question against the data.

The model reads the schema literally. There is no cultural memory that translates LIFNR to supplier number; there is no engineer the model can ask. Whatever the column is called is what the model has to work with.

Most enterprises know this gap exists and route around it: they build dashboards with hand-written labels, train new analysts to memorize the field codes, document the mapping in a wiki that goes stale within a year. The semantic dictionary is the structural alternative.

What an LLM actually sees when it reads your schema

Modern LLMs are reasonably good at translating natural language to SQL when given a well-described schema. They are not magic. When the schema looks like this:

CREATE TABLE T_001 (
  LIFNR VARCHAR(10),
  MATNR VARCHAR(18),
  EBELN VARCHAR(10),
  WAERS VARCHAR(3),
  NETPR DECIMAL(11,2),
  ...
);

the model can guess that NETPR might mean net price because the prefix is loosely meaningful and the type is decimal. It has no way to guess what LIFNR means — the abbreviation comes from German (Lieferantennummer, supplier number), the table name is a sequential identifier, and the only signal the model has is the column type, which doesn't disambiguate it from the dozen other 10-character VARCHARs in the schema.

The model will produce SQL anyway. It will pick a column and join on it and return rows. The query will look correct. Whether it is correct depends on whether the model guessed right.

The same problem appears in every legacy enterprise system. Banking has ACCT_TYPE_CD with values S, C, M, T that mean savings, checking, money market, time deposit — or, in another bank, mean something different. Healthcare has procedure codes that look like opaque numerics until someone explains the AMA evaluation-and-management hierarchy. Insurance has policy-status codes whose meanings have drifted three times since the system was deployed. Every legacy schema looks like this from the outside.

This is the gap a semantic dictionary fills.

The artifact

A semantic dictionary is a versioned, code-reviewed catalog of every column an analytical model exposes, with each entry carrying:

  • Canonical name. What the column is called in the analytical model. supplier_id instead of LIFNR.
  • Definition. What the column actually means, in business terms. The unique identifier for a supplier in the procurement system, assigned at supplier onboarding and never reused.
  • Type and constraints. Datatype, nullability, allowed values, units. VARCHAR(10), not null, leading zeros preserved, sourced from the procurement master record.
  • Provenance. Where the column came from in the source schema. Originally T_001.LIFNR in the legacy ERP, joined via the supplier-master extract.
  • Stewardship. Who owns the definition and approves changes to it.
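To make the shape of an entry concrete, here is a minimal Python sketch of one dictionary record. The `DictionaryEntry` type and its field names are illustrative assumptions, not a prescribed format; the example values are taken from the supplier_id entry described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DictionaryEntry:
    """One column in the semantic dictionary (illustrative structure)."""
    canonical_name: str         # name exposed by the analytical model
    definition: str             # business meaning, in plain language
    data_type: str              # datatype as declared in the model
    nullable: bool              # constraint recorded by the dictionary
    allowed_values: tuple = ()  # empty tuple means unconstrained
    provenance: str = ""        # source column in the legacy schema
    steward: str = ""           # who owns the definition (hypothetical contact)

supplier_id = DictionaryEntry(
    canonical_name="supplier_id",
    definition=("Unique identifier for a supplier in the procurement "
                "system, assigned at onboarding and never reused."),
    data_type="VARCHAR(10)",
    nullable=False,
    provenance="legacy ERP T_001.LIFNR",
    steward="procurement-data@example.com",  # placeholder address
)
```

Because the entry is an ordinary, versioned data structure in the repository, a change to `definition` or `allowed_values` is a reviewable diff rather than a wiki edit.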

The dictionary is committed to the repository alongside the code that builds the analytical model. When the model is rebuilt, the dictionary metadata flows into the model's documentation — so the LLM, the BI tool, and the human analyst all read the same definitions.

This is nothing exotic. The data-modeling community has been advocating for some form of this for decades; data-catalog products have offered partial versions for years. What changed is that LLMs now make it the difference between reliable grounding and confident hallucination.

Why this is policy, not metadata

The entries in a semantic dictionary look like documentation. They are not. They are policy decisions about what the data means.

Consider a column called customer_status. The source system stores values like A, I, H, T, D. What do they mean? Active, Inactive, Held, Terminated, Deceased? Or Active, Inactive, Hot Lead, Trial, Disabled? The source code might not say. The original developer might have left. The values might have drifted — maybe A used to mean active but now also includes records that should have been held. A semantic dictionary entry has to commit to one meaning, with its own list of allowed values, and that commitment is a policy.

Once that policy is in the dictionary:

  • Code that produces the column has to conform. If a transformation produces Q (which isn't in the allowed values), the schema test fails the build.
  • Code that consumes the column can rely on it. If a downstream query treats A as active without further checks, it can do so safely because the upstream policy guarantees what A means.
  • The LLM has the meaning available when it generates SQL. Asked how many active customers do we have, the model translates active into the dictionary's canonical value and runs a deterministic count.
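The producer and consumer sides of that contract can be sketched in a few lines of Python. `ALLOWED_STATUS` assumes the first reading of the customer_status codes above, and both function names are illustrative, not part of any particular tool.

```python
# Hypothetical policy for customer_status, committed in the dictionary.
# One code, one meaning; the mapping itself is the policy decision.
ALLOWED_STATUS = {"A": "active", "I": "inactive", "H": "held",
                  "T": "terminated", "D": "deceased"}

def check_produced_values(values):
    """Producer side: fail the build if any value falls outside the policy."""
    bad = sorted(set(values) - ALLOWED_STATUS.keys())
    if bad:
        raise ValueError(f"customer_status outside allowed values: {bad}")

def canonical_code(term):
    """Consumer side: translate a business term to the stored code."""
    for code, meaning in ALLOWED_STATUS.items():
        if meaning == term.lower():
            return code
    raise KeyError(f"no canonical code for {term!r}")

check_produced_values(["A", "I", "T"])    # conforms, build passes
assert canonical_code("Active") == "A"    # "active customers" -> WHERE customer_status = 'A'
```

If a transformation emits `Q`, `check_produced_values` raises and the build stops; if the LLM is asked about active customers, `canonical_code` gives it a deterministic value to filter on instead of a guess.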

The dictionary's policy decisions propagate through the entire stack. A change to a definition isn't documentation work; it's a versioned change to a shared contract, with downstream consequences that have to be reviewed and tested before the change ships.

What changes when the dictionary exists

Four things become structural when the dictionary is in place. Without it, all four are continuous low-grade pain.

Grounding becomes deterministic. The LLM stops guessing what LIFNR means and translates against supplier_id. RAG over schema documentation retrieves coherent definitions instead of contradictory comments scraped from old wiki pages. Skills procedures point at named, defined columns. None of these require special tooling — they require a dictionary.

Governance becomes structural. Every column has a steward, a definition, and a versioned history of changes. When an auditor asks who decided this customer is active, there is an answer in the repository. Compliance reviews stop being archaeology and become reading.

Accessibility becomes real. Business users can browse the dictionary instead of asking IT what each column means. New analysts ramp faster because the dictionary is the onboarding documentation. The dependency on tribal knowledge — ask Helga, she's been here 25 years — is reduced to the genuinely subtle decisions, not to what KUNDE_NR means.

Drift becomes visible. When an upstream source system changes the meaning of a column, the dictionary entry has to change too. The change shows up in version control, gets reviewed, and propagates to every consumer. Without a dictionary, the same drift happens silently and surfaces as inexplicable hallucinations months later.
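One simple way to surface value-level drift is to compare what the data actually contains against what the dictionary allows. This is a sketch under the assumption that allowed values are stored per entry, as above; the function name is illustrative.

```python
def detect_drift(observed_values, allowed_values):
    """Return values present in the data that no dictionary entry explains."""
    return sorted(set(observed_values) - set(allowed_values))

# Dictionary policy: customer_status is one of A, I, H, T, D.
drift = detect_drift(["A", "I", "Q", "A", "X"], {"A", "I", "H", "T", "D"})
# drift is ["Q", "X"]: two codes the upstream system started emitting
# that the dictionary does not define, flagged before they reach consumers.
```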

How ConnectSphere produces it

The semantic dictionary isn't a separate product or a tool ConnectSphere sells. It is the deliverable of Phase 2 of the methodology — drafted during cardinality analysis, refined during the relationship-architecture phase, and committed as code alongside the dbt models that build the 3NF substrate.

The dictionary then drives:

  • The column names in the dbt models (legacy LIFNR becomes supplier_id).
  • The schema tests that enforce allowed values, types, and uniqueness.
  • The lineage documentation generated by the build.
  • The vocabulary the LLM reads when grounding queries against the model.
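The last of those can be sketched directly: rendering dictionary entries into the schema description an LLM reads when grounding a query. The entry layout and the sample rows are illustrative assumptions, not ConnectSphere's actual format.

```python
# Two hypothetical dictionary entries, reduced to the fields a grounding
# prompt needs. The source column is kept for lineage, not for the model.
entries = [
    {"canonical_name": "supplier_id", "source": "T_001.LIFNR",
     "type": "VARCHAR(10)",
     "definition": "Unique supplier identifier, assigned at onboarding."},
    {"canonical_name": "net_price", "source": "T_001.NETPR",
     "type": "DECIMAL(11,2)",
     "definition": "Net price per unit in document currency."},
]

def render_grounding_context(entries):
    """Render dictionary entries as the column vocabulary shown to the LLM."""
    lines = ["Columns available (canonical name, type, meaning):"]
    for e in entries:
        lines.append(f"- {e['canonical_name']} ({e['type']}): {e['definition']}")
    return "\n".join(lines)

print(render_grounding_context(entries))
```

The model never sees LIFNR; it sees supplier_id with a definition attached, so its SQL is grounded in the same vocabulary the BI tool and the analyst use.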

This is why the dictionary is the smallest unit of governance in the platform: it is where policy gets recorded, and every other layer derives from it. A platform without a dictionary has the data but not the meaning. A platform with one has the substrate that grounding strategies can actually rely on.

The schema is for the database. The dictionary is for everyone else.

The schema tells your database how to store the data. The dictionary tells everyone else — humans, LLMs, BI tools, downstream applications, auditors — what the data means.

You can run a database without a dictionary. You can't run an enterprise on it.

The schema is for the database. The dictionary is for everyone else.

Ready to Map Your Fragmented Landscape — and See the Path to One Logical Truth?

In a 30-minute diagnostic call, we:

  • Review your current data landscape for redundancy hotspots and contradictions
  • Show a high-level redundancy map tailored to your systems
  • Outline your exact 6-month POC timeline and expected outcomes

No slides. No sales pitch. Just honest architecture insight to decide if this keystone makes sense for your environment.

Prefer email first? hello@connect-sphere.ai

Or message us on LinkedIn

We typically respond within 24 hours and work with enterprises ready for architectural change.