We are still confusing the semantic layer for context. It's not.

At the Amsterdam dbt meetup last week we did three things – we talked through how to use AI in migrating a legacy warehouse, 2,400 undocumented stored procedures, from Data Vault to a Kimball model in weeks rather than months. Then, Artem from Jetbrains (who heads up the team that builds text-to-SQL) explained why scoring 90% on a benchmark told them almost nothing about production. Dumky de Wilde talked through how MotherDuck sees context layers as critical. All come down to how to govern Agentic Analytics properly.

Meanwhile, Anthropic published a detailed account of how its own data team runs analytics on Claude. All of which is to say that the model is the easy part – making the answers (and resulting insights) accurate, novel, and robust is much harder.

This is a governance problem which I think breaks into two layers, and the industry keeps confusing them. The semantic layer is the dictionary: a governed definition of what each metric is, so “active users” means one thing across the company. The context layer is the reasoning: why the metric is defined that way, when it stops being valid, and what the business means when it asks. Most teams build the first, call it done, and skip the second. The second is the one AI most needs.

Update: Yali Sassoon, who’s building Snowplow’s customer context layer, made a similar argument shortly after this piece went up: a semantic layer tells you what data means, but a context layer is the different, harder job of giving an agent what it needs to act well in the moment — and most of that has to be created, not just organised. Same conclusion we’re drawing here, from a different angle: mistake one for the other and the agent ends up confidently wrong.

How most teams are doing it

The standard approach looks sensible. You point an agent at the warehouse and let it write SQL. You feed it your documentation, your dbt repo, your query history, on the assumption that more context means better answers. Perhaps you buy a tool that promises to chat with your data and expect it to work against your schema. And before any of that, someone says the responsible thing: fix the data first. So a data-quality project begins, with no clear definition of done.

These are reasonable instincts. They are also where a lot of money goes to die. S&P Global’s 2025 enterprise survey found the share of companies abandoning most of their AI initiatives before production rose from 17% to 42% in a year. When a pilot stalls, the natural move is to blame the model. The evidence points elsewhere.

Why that fails

Anthropic ran the experiment most teams assume they do not need. They gave their analytics agent direct access to every SQL query the company had ever written, then checked the transcripts to confirm it read them before answering. Accuracy moved by less than a point. They looked at the misses: the correct answer was already sitting in that pile of queries about 80% of the time, and the agent had seen it. It still did not use it. The information was there; what the agent lacked was a way to connect a question to the right definition. That mapping is the real work.

The same post gives the cleanest figure I have seen on this. Strip out the structured context and Claude answered analytics questions correctly 21% of the time. Add it back and the same model passed 95%, reaching 99% in some domains. Those two numbers came from the same model. The difference was the structured context around it.

They also tried the shortcut everyone reaches for: have the model generate the metric definitions itself, from the tables and the logs. It produced definitions that looked plausible and encoded the exact ambiguities they were trying to remove, and it scored worse than a smaller layer a human had curated by hand. The lesson they drew is a good one. Let the model write the documentation, but keep a human owning the definition.

dbt meetup 16th edition - Tasman Analytics

The benchmark numbers tell the same story once you read them properly. Models top 90% on the original Spider benchmark, which uses small, clean, well-named databases. Point them at Spider 2.0, built from real enterprise databases with thousands of columns and several SQL dialects, and the best agents solve roughly a third of the tasks. The team at MotherDuck found the other half of the problem when they ran 500 benchmark questions through frontier models with only the schema. Strict accuracy sat at 58% to 64%. Add a review step, human or model, and it climbed to 94% to 95%. A benchmark hands the agent a clean schema and a well-formed question. Your business gives it neither.

There is a worse problem than a query that breaks. It is the answer that looks right, gets used, and turns out wrong. The text-to-SQL team put it as a question to the room: if your agent scored 90%, would you let your CFO run the board numbers through it unsupervised? Nobody raised a hand. An agent cannot be accountable for a number, and the analyst who used to eyeball a result before it reached the business has quietly left the loop. Anthropic is candid that they have no reliable fix for this yet.

The same gap runs through all of it. Every one of these failures comes back to missing context: the governed meaning that tells the agent which definition to use and when it holds.

What works instead

Three ways to put an analytics agent on your data. The naive stack runs the agent straight on the warehouse, the typical startup adds a semantic model and stops there, and only the third builds the context layer that holds why a metric exists and when it still applies. Most teams build the typical stack and call it done, which is exactly where a trustworthy answer goes missing.

Naive, typical startup, and what to build analytics stack - Tasman Analytics

Reading the stack bottom to top also tells you how to build it. The order matters more than the tooling. Start with the conceptual layer, before any model. Work out what the entities are, how they relate, and what the business actually means by each one. The migration talk at the dbt meetup made this its method: ask the model for a map of reality and its gaps before you ask it for a schema, and make that map reach into where the business is heading, not only the system it is leaving. This is where we tend to start engagements at Tasman, and it shapes how we work on every build (see our post about Domain Modelling as well).

Underneath sits the unglamorous foundation, one trusted and centralised source of data. On top of it, the definitions are owned by a person rather than generated and forgotten, and grain decisions in particular stay human. The agent is built to refuse, or to hand off to a person, when it is unsure, instead of producing a confident guess. That rebuilds the checkpoint the analyst used to provide.

Anthropic watched its own accuracy drift from 95% to 65% in a month once the definitions were left untended. The layer is not a one-off build. The business moves, so it has to move with it, which is why they now ship a definition update alongside most of their model changes.

AI earns its place on top of that foundation. It read those 2,400 stored procedures in minutes, work that used to mean weeks of archaeology. It can draft the documentation, generate the edge cases for a stakeholder session, and review SQL against the logic you have already agreed. It speeds the work up. The judgement underneath stays yours.

The semantic layer, at least, has good tooling now. dbt’s Semantic Layer and Cube let you define metrics as version-controlled code that compiles to warehouse SQL, and Snowflake and Databricks now build those definitions into the warehouse itself. Worth doing. But this is the what. The why is the part no tool gives you: the reasoning behind a definition, the caveats, the memory of when it stops applying. You still have to write that down and keep it current.

One honest caveat. For a small, clean, well-named schema, a formal semantic layer earns its keep slowly, since good data modelling already does most of the job. The context layer pays off as the business grows complex and the definitions get contested, which is where most scaling companies already are.

If you are trying to think about the best architecture for your stack, then the implication is quite practical. It changes what you automate and what you keep human: traversal, drafting and review can go to the model, while definitions and grain stay with people. It changes how you scope, since you need trusted definitions for the questions that matter rather than a cleanup of every table you own. And it changes how you judge a tool, because one that ships a semantic layer and stops there will still hand you confident wrong answers, only faster!

The work was always the same

None of this is new, which is the part worth ending on. Context layer, semantic layer, conceptual layer, knowledge graph: the field keeps minting names for the same old discipline, knowing what your numbers mean and why, and writing it down so it can be trusted and owned. The need was always there – AI just removed the slack that let teams ignore it. The analyst quietly sanity-checking a figure before it reached the board was friction you really should be trying to avboid. Now a machine acts on those definitions at scale, so they have to be explicit, and they have to be maintained.

The model is now a commodity, and the semantic layer is becoming one. The judgement that decides what your numbers mean, and why, is the part you still have to build and own. So the real question for any team wiring AI into its data is whether it has done that work: the meaning, and the reasoning behind it, written down well enough to trust a machine with. Mistake the semantic layer for the context layer and the machine will be confidently, fluently wrong.

News & Views

We are still confusing the semantic layer for context. It’s not.

How most teams are doing it

Why that fails

What works instead

The work was always the same

More news & views

Contact form