How a normalized data foundation actually gets built
Most data-platform projects fail in the gap between "we bought a tool" and "we have a usable analytical foundation." The methodology below describes what fills that gap when the goal is a normalized 3NF substrate that AI grounding, BI tooling, and downstream applications can all share. There are six phases, each with a deliverable. Each phase is engineering — not configuration, not workshops, not domain modeling sessions.
The worked example uses dbt as the transformation engine, since that's the most common shape this project takes today. The methodology itself is technology-agnostic — other tools (SQLMesh, Dataform, native PostgreSQL pipelines) implement the same six phases.
The six phases at a glance
- Ingestion — source files become typed staging tables.
- Cardinality and the Semantic Dictionary — the structural shape of the data is discovered, and legacy field names are mapped to business terms.
- Relationship architecture — primary keys, foreign keys, and the referential-integrity violations that have to be resolved.
- Build the model — the 3NF schema is implemented and tested in dbt.
- Validation through NLQ — business questions are translated to SQL against the new model; queries that feel heavy reveal design flaws.
- Incremental processing — the one-off transformation becomes a living ledger that absorbs new source data without manual rework.
Each phase compounds on the last. Each one also has a checkpoint: you can stop after any phase and have a usable artifact, even if you don't go on to the next. Most failed projects stop somewhere between Phase 1 and Phase 2 — they have data loaded but never do the structural analysis that makes it queryable.
Phase 1: Ingestion
The first phase turns raw source-system exports — flat files, ABAP/CSV dumps from ERP systems, mainframe extracts, fixed-width logs from operational systems — into typed staging tables in the working database. This sounds like setup work, but the decisions made here propagate forward: the column types you assign here are the column types Phase 4's models will inherit, and the parsing rules you apply here determine which Phase 5 questions are even askable.
Three pieces of work matter:
Format detection. Source exports rarely arrive in a uniform format. CSVs come with quoting variations; fixed-width files come with column boundaries that are documented for legacy programs but not for current consumers; mainframe extracts use packed numeric formats and accounting conventions (100.50- for negative numbers, EBCDIC-derived text encodings). Each source needs its parsing rules documented as code, not in a "how to load this file" runbook that lives in someone's head.
DDL generation. Once the format is understood, the staging table's CREATE TABLE statement is generated programmatically — increasingly with LLM assistance, since a frontier model reading the file header and a sample of rows can produce a reasonable initial DDL faster than a human can type it. The DDL is then code-reviewed and committed alongside the rest of the project.
Load transformations. The final piece is the COPY (or equivalent bulk-load) command that moves data from file into staging table, applying the parsing rules: trim whitespace, convert accounting-format negatives to standard form, normalize date formats, handle nulls consistently. The rule of thumb is that nothing about the data should be ambiguous after Phase 1; if a downstream phase has to interpret a staging table, Phase 1 wasn't finished.
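A minimal sketch of the land-then-type pattern in PostgreSQL, assuming a hypothetical supplier-balances extract with SAP-style field names, trailing-sign negatives, and YYYYMMDD dates (the table names, file path, delimiter, and date format are all illustrative):

```sql
-- 1. Land the file as raw text so every parsing decision is explicit and repeatable.
create table staging.supplier_balances_raw (
    lifnr        text,
    erdat        text,
    net_balance  text
);

copy staging.supplier_balances_raw
from '/data/inbound/supplier_balances.csv'
with (format csv, header true, delimiter ';', null '');

-- 2. Type the columns while applying the parsing rules: trim whitespace,
--    convert trailing-sign negatives (100.50-), normalize dates, keep nulls explicit.
create table staging.supplier_balances as
select
    trim(lifnr)                                   as lifnr,
    to_date(nullif(trim(erdat), ''), 'YYYYMMDD')  as erdat,
    case
        when trim(net_balance) like '%-'
            then -1 * replace(trim(net_balance), '-', '')::numeric
        else nullif(trim(net_balance), '')::numeric
    end                                           as net_balance
from staging.supplier_balances_raw;
```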
The deliverable is a raw staging schema: the original data, but typed, parsed, and loaded into PostgreSQL (or whatever the project's working database is). Same rows, different shape — now queryable in SQL.
Phase 2: Cardinality and the Semantic Dictionary
This is the phase where the methodology's intellectual core lives. It's also the phase that most projects either skip or rush, and skipping it is the most common failure mode in the field.
The work has three intertwined threads:
Cardinality analysis. For every staging table, profile the data: how many rows are there, how many distinct values per column, what are the value ranges, what is the grain of a row. The grain question — what does one row represent? — sounds trivial but isn't. A "customer events" table might have one row per event, one row per customer per day, or one row per customer-product pair, and the joins that work for one grain don't work for the others. Cardinality profiling is what answers this question deterministically. The deeper case for cardinality as the structural driver — versus domain modeling by interview — is in the cardinality-driven normalization deep-dive.
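A minimal profiling query against a hypothetical customer-events staging table (names illustrative) is enough to settle the grain question empirically:

```sql
-- Row count, per-column distinct counts, value ranges, and a grain check.
-- If row_count equals distinct_customer_event_pairs, the grain is one row per
-- customer per event timestamp; if not, that hypothesis is wrong and the joins
-- built on it will fan out.
select
    count(*)                                 as row_count,
    count(distinct customer_id)              as distinct_customers,
    count(distinct event_type)               as distinct_event_types,
    min(event_ts)                            as earliest_event,
    max(event_ts)                            as latest_event,
    count(distinct (customer_id, event_ts))  as distinct_customer_event_pairs
from staging.customer_events;
```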
Fact vs. dimension assignment. Once the grain of each table is clear, classify each one as either a fact (a record of an event or measurement, with a timestamp and observable quantities) or a dimension (a description of a stable entity). Fact tables grow; dimension tables churn. They need different handling in subsequent phases, so getting this assignment right early saves rework later.
The Semantic Dictionary. This is the named artifact this phase produces — a versioned mapping from cryptic legacy field names to business terms, with each entry carrying a definition, type, allowed-values constraints, and the provenance of where the field came from. A legacy ERP field called LIFNR becomes the entry supplier_id with a definition and a source-system reference. A sensor tag like Sensor_T7_Mean becomes temperature_zone_7_mean with units and physical-range bounds. A free-text status column gets explicit allowed values pinned down.
The Semantic Dictionary is what makes the rest of the methodology readable to humans. It is also what an LLM reads when translating a natural-language question into SQL: not the cryptic source schema, but the dictionary's business-term vocabulary. The dictionary is generated as documentation alongside the dbt models in Phase 4 and consumed by Phase 5's NLQ work, but Phase 2 is where the entries get drafted.
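As a sketch of what an entry carries, here is the dictionary kept as a versioned table in the working database; it can equally live as YAML next to the dbt models, and the format matters less than the content. The field names and values below are illustrative except where the text above names them.

```sql
-- Hypothetical dictionary entries: one row per legacy field, carrying the mapping,
-- definition, type, allowed-values constraint, and provenance.
create table semantic_dictionary (
    legacy_name     text primary key,
    business_term   text not null,
    definition      text not null,
    data_type       text not null,
    allowed_values  text,
    provenance      text not null
);

insert into semantic_dictionary values
  ('LIFNR', 'supplier_id',
   'Unique supplier identifier', 'varchar(10)', null,
   'ERP vendor master export'),
  ('Sensor_T7_Mean', 'temperature_zone_7_mean',
   'Mean temperature in zone 7, degrees C, valid range -40 to 150', 'numeric', null,
   'plant historian export'),
  ('ORD_STAT', 'order_status',
   'Order fulfillment status', 'text',
   'open, picked, shipped, cancelled',
   'ERP order header export');
```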
The deliverable is the Semantic Dictionary — a complete, human-readable mapping between every legacy field name and its canonical business term, with cardinality and grain notes attached.
Phase 3: Relationship architecture
Once each table's grain is known and each field has a business-term name, the next question is how the tables relate. This is where primary keys and foreign keys are formally identified, where composite keys are recognized, and where the referential-integrity violations that always exist in legacy data finally surface.
The work:
Primary-key identification. For each table, find the field or combination of fields that uniquely identifies a row. Sometimes this is a single column (a customer table's customer_id); sometimes it's a composite (an order-line table's (order_id, line_number)). Confirm uniqueness empirically: count rows, count distinct values of the proposed key, ensure they match. If they don't, the proposed key is wrong, or the data has duplicates that need investigation.
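The check itself is a short query; here it is for a hypothetical order-line table with a proposed composite key:

```sql
-- Empirical uniqueness check. If total_rows and distinct_keys differ, the proposed
-- key is wrong or the data contains duplicates that need investigation.
select
    count(*)                                 as total_rows,
    count(distinct (order_id, line_number))  as distinct_keys
from staging.order_lines;
```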
Foreign-key identification. For each non-key column, ask whether it is or should be a foreign key into another table. The customer table's region_code should reference the region dimension; the order-line table's product_id should reference the product dimension. Confirm referential validity: every foreign-key value should exist in the referenced table. This is rarely true in legacy data on the first pass.
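The corresponding orphan check, again with illustrative names:

```sql
-- Order lines that reference products the product table doesn't contain.
-- Any rows returned are referential-integrity violations to resolve in this phase.
select ol.product_id, count(*) as orphaned_rows
from staging.order_lines ol
left join staging.products p
       on p.product_id = ol.product_id
where p.product_id is null
group by ol.product_id
order by orphaned_rows desc;
```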
Conflict resolution. When referential integrity is violated — orders pointing at products that don't exist, customers in regions the region table doesn't list — the question is what to do. Options: backfill the dimension, remap the offending rows, treat them as quarantined data. This is where the legacy system's accumulated drift gets surfaced and decided. Documenting why each violation was resolved a particular way is part of the deliverable; the decisions become Phase 4 transformation logic, and the rationale becomes audit material.
The deliverable is the Entity-Relationship Blueprint — a documented graph of every table, its primary key, its foreign keys, and a register of resolved referential issues with the rationale for each resolution.
Phase 4: Build the model
This is where prior-phase analysis becomes executable code. The Semantic Dictionary becomes column names; the Entity-Relationship Blueprint becomes a graph of dbt models; the cardinality and primary-key decisions become tested invariants.
Concretely:
Project initialization. A dbt project (or equivalent) is initialized with a directory structure that mirrors the analytical layers: staging/ for the Phase 1 artifacts, intermediate/ for the structural transforms, marts/ for the consumer-facing models. Source-system tables are declared in source YAML files with the descriptions drafted in the Semantic Dictionary.
Model scaffolding. Each entity from Phase 3's blueprint becomes a dbt model: a SQL file that selects from upstream models or sources, joins where necessary, and produces the entity's canonical form. The transformations encode the resolution decisions made in Phase 3 — backfills, remappings, quarantines.
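A minimal sketch of one such model, with hypothetical names, showing the dictionary-term columns and one encoded Phase 3 decision:

```sql
-- models/intermediate/int_order_lines.sql (hypothetical): one scaffolded entity.
-- Column names come from the Semantic Dictionary; the inner join encodes a Phase 3
-- decision to keep only rows whose product reference resolves (orphans are routed
-- to a separate quarantine model).
select
    ol.order_id,
    ol.line_number,
    ol.product_id,
    ol.quantity,
    ol.net_amount
from {{ ref('stg_order_lines') }} as ol
inner join {{ ref('dim_product') }} as p
    on p.product_id = ol.product_id
```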
Schema tests. Every primary key gets a unique and not_null test. Every foreign key gets a relationships test. Every column with allowed values gets an accepted_values test. These are the executable form of the invariants that came out of Phase 2's cardinality analysis and Phase 3's referential work. When a test fails on a future build, it surfaces drift in the source data — exactly the silent failures that humans were previously catching by hand.
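In dbt, most of these invariants are declared as generic tests in the models' YAML, but the assertion behind them is plain SQL. A hedged sketch, written as a dbt singular test (any rows it returns fail the build), with hypothetical names:

```sql
-- tests/assert_order_line_key_unique.sql (hypothetical): the uniqueness invariant
-- for the order-line composite key, expressed directly as the violating-rows query.
select
    order_id,
    line_number,
    count(*) as duplicate_rows
from {{ ref('int_order_lines') }}
group by order_id, line_number
having count(*) > 1
```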
Documentation. dbt's documentation framework consumes the Semantic Dictionary entries and produces a navigable lineage graph. Every column's definition, type, source, and downstream consumers are visible; the graph is the audit artifact for compliance reviews.
The deliverable is a functioning dbt project — runnable, tested, documented, with every model producing a 3NF entity that satisfies the schema tests. At this point the analytical foundation exists; Phase 5 validates that it's actually usable.
Phase 5: Validation through NLQ
The hardest validation question for any data model is: can the model answer the questions the business actually has?
Schema tests catch invariant violations. Lineage docs catch missing fields. Neither catches the failure mode where the model is technically correct but operationally awkward — where every business question requires four joins, a pivot, and a window function, and an LLM trying to translate the question into SQL gets lost partway through.
The validation method is to take 5–10 representative business questions — drawn from the dashboards, reports, and ad-hoc queries the organization is currently running by hand — and run them through an LLM in NLQ (natural-language-querying) mode. The LLM reads the Semantic Dictionary and the dbt model docs, then translates the natural-language question into SQL against the new 3NF model.
The diagnostic signal: if a question takes the LLM more than a couple of joins to answer, or if the LLM produces SQL that's correct but obviously belabored, the model has structural problems that schema tests can't detect. The friction is the symptom; the cause is usually a missed entity, a wrong grain, or a denormalization that should have been broken up in Phase 3 but was carried into Phase 4.
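For orientation, this is the shape a validated model should allow; the question, table, and column names below are illustrative, not drawn from a real engagement:

```sql
-- "Net order value by supplier country for the last full quarter" should come out
-- as a short, obvious query against the marts.
select
    s.country_code,
    sum(f.net_amount) as net_order_value
from marts.fct_order_lines f
join marts.dim_supplier s
  on s.supplier_id = f.supplier_id
where f.order_date >= date_trunc('quarter', current_date) - interval '3 months'
  and f.order_date <  date_trunc('quarter', current_date)
group by s.country_code
order by net_order_value desc;
```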
When the friction shows up, the response is to go back to Phase 2 or Phase 3, not to add another join layer. The 3NF foundation gets re-architected; the dbt models get rebuilt; the validation loop runs again. This iteration is the point — Phase 5 is a feedback mechanism that catches what the earlier phases missed.
NLQ is one of several grounding strategies the validated foundation will eventually serve, but it's particularly well-suited to validation because the friction surface is where structural problems show themselves. A model that NLQ can query cleanly is a model that downstream Skills, RAG-on-schema, and tool-using agents will all consume reliably.
The deliverable is a verified analytical model: a 3NF foundation that has been demonstrated to answer real business questions with reasonable, deterministic queries — not a design that schema tests pass but humans can't actually use.
Phase 6: Incremental processing
Up to Phase 5, the model has been built and validated against a snapshot. Phase 6 turns the snapshot into a living ledger that absorbs new source data continuously without manual rework.
The work:
Watermark definition. For each fact source, identify a high-water-mark field — typically a timestamp or a monotonically increasing sequence ID — that lets the pipeline know which rows have already been processed. New rows are detected by comparing the source's max watermark to the destination's last-processed watermark.
Incremental materialization. dbt's incremental model type lets a model update only the rows that have changed since the last run, rather than rebuilding the entire entity from scratch. For large fact tables this is the difference between a 30-second update and a 30-minute one. The unique-key configuration determines what counts as the same row between runs.
Idempotency via hashes. Composite keys that come from multiple source columns are typically materialized as hashes (SHA-256 of the concatenated source columns). The hash is the row's identity for the purpose of incremental updates, and it is stable across runs even if the source rows arrive in a different order. Hashing the key this way makes re-running a load idempotent: the run is safe to repeat and produces the same result.
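Putting the three pieces together, a hedged sketch of an incremental dbt model (hypothetical names; PostgreSQL's sha256() stands in for whatever hashing convention the project adopts):

```sql
-- models/marts/fct_sensor_readings.sql (hypothetical): incremental materialization
-- with a high-water mark and a hashed surrogate key as the row's stable identity.
{{ config(
    materialized = 'incremental',
    unique_key   = 'reading_key'
) }}

select
    -- SHA-256 of the concatenated natural-key columns; the dbt_utils
    -- generate_surrogate_key macro is the common shortcut (it defaults to MD5)
    encode(sha256(convert_to(sensor_id::text || '|' || reading_ts::text, 'UTF8')), 'hex')
        as reading_key,
    sensor_id,
    reading_ts,
    temperature_zone_7_mean
from {{ ref('stg_sensor_readings') }}

{% if is_incremental() %}
  -- watermark: only rows newer than anything the target table already holds
  where reading_ts > (select max(reading_ts) from {{ this }})
{% endif %}
```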
Freshness monitoring. dbt's source freshness tests fire when a source hasn't been updated within an expected window. The architect gets alerted when an upstream extract job fails before the symptoms show up downstream. Freshness is its own kind of invariant — it doesn't catch bad data, but it catches missing data, which the schema tests can't.
The deliverable is a living 3NF ledger: the foundation now updates itself, with monitoring that flags when the source side breaks. The methodology's output stops being an artifact and starts being a service.
What you have at the end
After six phases, the organization has a self-documenting, fully tested, incrementally maintained 3NF foundation. The Semantic Dictionary makes every field readable to humans; the schema tests guarantee the invariants; the lineage graph shows where every column came from; the incremental pipeline keeps it current; the freshness monitoring catches upstream breakage.
This foundation is what every grounding strategy described elsewhere in the resources library — RAG over schema docs, Skills procedures, structured queries, fine-tuning corpora, agent tool calls — actually consumes when it works reliably. The strategies don't fail because the LLM is wrong; they fail because the foundation is. With this foundation in place, the strategies start working as advertised, and the choice between them becomes an engineering decision about latency, cost, and update frequency rather than a Hail Mary against unreliability.
The same foundation also serves the downstream work that has nothing to do with AI: BI dashboards run against unambiguous joins, integration consumers get a stable schema, audit reviews get lineage they can follow without archaeology. The 3NF foundation is the organization's shared substrate, not just the AI substrate.
Methodology, not software
The reason this is presented as a methodology rather than as a tool is that the work is the engineering. There is no version of this where you license a SaaS product and skip the phases. The phases are the work; tools (dbt, SQLMesh, Dataform) make the work tractable but don't shortcut it.
What ConnectSphere provides is the methodology, the tooling around it, and the elite team that runs the methodology in a defined window — the 6-month production POC compresses what most organizations stretch over years of false starts into a single delivery cycle. The output, after the POC, is the same artifact described in Phase 6: a living 3NF ledger that downstream strategies can finally rely on.
Architecture is engineering. The phases are how the engineering gets done.