Summary
What this post covers: A practical, end-to-end tour of dbt (data build tool) as the transformation layer of the modern ELT stack, including project structure, materializations, testing, macros, CI/CD, and a complete e-commerce pipeline blueprint you can adapt.
Key insights:
- dbt is a compile-time SQL templating and orchestration tool, not a runtime engine, so all execution and scaling happens inside your warehouse (Snowflake, BigQuery, Redshift, Databricks) and dbt itself never moves or stores data.
- Cheap decoupled storage, columnar MPP compute, and commodity EL tools (Fivetran, Airbyte, Debezium) killed the middle-tier transformation server and made the ELT pattern that dbt formalizes the default.
- The staging → intermediate → marts layering, combined with generic and singular tests on every model, is what turns ad-hoc SQL scripts into a maintainable codebase the business can trust.
- Incremental materializations, sources with freshness checks, snapshots for slowly changing dimensions, and macros with Jinja are the features that pay back the learning curve at scale.
- dbt Core covers most teams; dbt Cloud is justified when you need hosted scheduling, a managed IDE, and SOC 2 compliance without running your own orchestrator.
Main topics: The 3,000-Line SQL Script from Hell, Why Transformation Belongs in the Warehouse, What dbt Actually Is (and What It Isn’t), Core Concepts: Models, Sources, Seeds, Snapshots, Writing Your First Model, Materializations: View, Table, Incremental, Ephemeral, Incremental Models in Depth, Sources and Freshness Checks, Testing: The Feature That Wins Skeptics, Macros and Jinja Templating, Auto-Generated Documentation, Project Structure: Staging, Intermediate, Marts, Full Example: E-Commerce Data Pipeline, dbt Cloud vs dbt Core, CI/CD with dbt and Slim CI, Integrating with Airflow, Dagster, and Prefect, Common Pitfalls and How to Avoid Them, FAQ, Wrapping Up, References.
The 3,000-Line SQL Script from Hell
Most data practitioners will recognise the artefact. A single reporting.sql file lives on a shared drive rather than in Git, because the BI team “does not use Git.” It runs to 3,247 lines, opens with sixteen CTEs, pivots through three temporary tables, joins seven source systems, and at approximately line 1,900 contains a hardcoded filter for customer_id = 47382 accompanied by a comment that reads only “– ask Brian why.” Brian left the company in 2022.
The script runs nightly. When it breaks, no one knows whose metric is incorrect. When a column is renamed upstream, the script silently produces zeros. There are no tests. The only documentation is a Confluence page last updated in 2020 that describes a schema no longer in use. When finance asks why net revenue disagrees with the general ledger by $184,000, the answer requires a week of detective work.
This is the problem that dbt was built to solve. It does not introduce a new language (the result is still SQL) and does not replace the warehouse (it runs inside the warehouse). Instead, it applies two decades of software engineering discipline—version control, modularity, testing, documentation, and CI/CD—to the analytical SQL layer that sits between raw data and business decisions.
This guide examines dbt from the ground up: what it is, how it came to dominate the modern data stack, how to structure a real project, how to write models, tests, and macros, how to deploy to production with CI/CD, and how to integrate with orchestrators such as Apache Airflow. The guide concludes with a complete e-commerce pipeline blueprint suitable for adaptation in production.
Why Transformation Belongs in the Warehouse
For most of the 2000s and early 2010s, the canonical data pipeline was ETL: extract data from a source, transform it on a middle-tier server (Informatica, Talend, SSIS, or bespoke Python), then load the cleaned result into a data warehouse that was too expensive and too slow to perform heavy computation itself. Storage cost hundreds of dollars per gigabyte-month. Compute was fixed. Raw clickstream was not loaded into Teradata directly; it was first aggregated into daily rollups.
Three developments disrupted that model.
First, cloud warehouses decoupled storage from compute. Snowflake introduced the architecture in 2014, and BigQuery, Redshift, and Databricks followed. Storage cost dropped to roughly $23/TB/month. Compute became elastic: a warehouse can be started, a query run, and the warehouse stopped. Idle capacity is no longer billed.
Second, columnar storage combined with massively parallel processing made aggregation over billions of rows feasible. A query that would require four hours on a row-oriented OLTP database completes in eleven seconds on a suitably sized Snowflake warehouse.
Third, managed EL tools such as Fivetran, Airbyte, Stitch, and Debezium commoditised the data ingestion problem. A few clicks suffice to connect a Postgres replica or a Stripe account, after which raw tables appear automatically. No engineering effort is required.
The consequence is that the middle-tier transformation server became unnecessary. There is no reason to move gigabytes of data out of the warehouse, transform them on a smaller machine, and load them back; transformation can occur in the warehouse where the data already resides. The resulting pattern is ELT, that is, extract, load, transform, and dbt owns the final T.
What dbt Actually Is (and What It Is Not)
The single most important point in this guide is the following: dbt is a compile-time SQL tool, not a runtime engine. It does not execute queries. It does not store data. It does not move data between systems. dbt is a templating and orchestration layer that reads .sql files, resolves Jinja references, compiles plain SQL, and submits the result to the warehouse through that warehouse’s native adapter.
When dbt run is executed, dbt walks the dependency graph and, for each model, executes a statement of the following form:
CREATE OR REPLACE TABLE analytics.fct_orders AS (
-- your compiled model SQL
);
That is the entire mechanism. Every capability—testing, incremental logic, documentation, and snapshots—ultimately reduces to SQL statements that dbt generates and the warehouse executes. The implications are as follows:
- All compute occurs where the data resides, so no network egress is incurred.
- Scaling is achieved by scaling the warehouse, not by scaling dbt.
- Every query dbt runs can be inspected in
target/compiled/. - dbt has no opinion about data volume; if the warehouse can handle a workload, dbt can orchestrate it.
The capabilities that dbt adds on top of SQL include:
- The
ref()function: model-to-model references that build a DAG automatically. - Materialisations: a
SELECTis written, and dbt wraps it in the appropriate DDL (view, table, or incremental merge). - Tests: declarative data quality assertions that compile to
SELECTstatements expected to return zero rows. - Macros: reusable SQL via Jinja, eliminating repeated patterns such as a 40-line date spine.
- Documentation: a generated static site describing every model and column, with lineage graphs.
- Version control: the entire analytics logic is stored as files in Git.
Core Concepts: Models, Sources, Seeds, Snapshots
Before any code is written, the five primitives should be understood:
| Primitive | File Location | What It Represents |
|---|---|---|
| Model | models/*.sql |
A SELECT that becomes a view or table. |
| Source | models/*.yml |
Raw tables loaded by your EL tool; declared, not created. |
| Seed | seeds/*.csv |
Small static CSV loaded as a table (country codes, tax rates). |
| Snapshot | snapshots/*.sql |
Slowly-changing dimension (SCD Type 2) tracking. |
| Test | models/*.yml or tests/*.sql |
A SQL assertion that should return zero rows on pass. |
| Macro | macros/*.sql |
Reusable Jinja function producing SQL. |
Writing the First Model
A minimal model can be written immediately. A dbt model is no more than a file ending in .sql that contains a single SELECT. The file models/staging/stg_customers.sql is created as follows:
{{ config(
materialized='view',
schema='staging'
) }}
with source as (
select * from {{ source('raw_app', 'customers') }}
),
renamed as (
select
id as customer_id,
email as customer_email,
lower(trim(first_name)) as first_name,
lower(trim(last_name)) as last_name,
created_at as signup_at,
updated_at as updated_at
from source
where deleted_at is null
)
select * from renamed
Three points merit attention:
{{ config(...) }}is a Jinja expression that informs dbt how to materialise this model—in this case as a view in thestagingschema.{{ source('raw_app', 'customers') }}is a reference to a raw source table declared in a YAML file; dbt replaces it at compile time with the fully qualifiedraw.app.customers.- There is no
CREATE TABLEorDROP IF EXISTSstatement. dbt wraps theSELECTin the appropriate DDL automatically.
When a sibling model that references this one is added:
-- models/marts/dim_customers.sql
{{ config(materialized='table') }}
select
customer_id,
customer_email,
first_name || ' ' || last_name as full_name,
signup_at
from {{ ref('stg_customers') }}
The expression {{ ref('stg_customers') }} conveys two pieces of information: first, that this model depends on stg_customers and must be built after it; and second, that the reference should be replaced at compile time with the correct fully qualified table name, regardless of the schema in which it resides. This single feature is responsible for much of dbt’s apparent simplicity.
Materialisations: View, Table, Incremental, Ephemeral
A materialisation is dbt’s strategy for persisting a model. One is selected per model on the basis of size, latency, and cost trade-offs.
| Materialization | How It Builds | When to Use |
|---|---|---|
| view | CREATE OR REPLACE VIEW |
Default for staging. Fresh, cheap to build, slower to query. |
| table | CREATE OR REPLACE TABLE ... AS SELECT |
Marts queried frequently by BI. Faster reads, full rebuild each run. |
| incremental | MERGE or INSERT only new rows |
Large event/fact tables (>100M rows) where a full rebuild is too slow. |
| ephemeral | Inlined as a CTE | Shared logic that doesn’t need its own table. Rare; use sparingly. |
table on the assumption that tables are faster. A model queried twice a day from BI imposes negligible cost as a view, whereas one queried 40 times per minute by a dashboard merits a table. The default should be a view, with promotion to a table only when reads dominate.
Incremental Models in Depth
Incremental models are where dbt pays for itself in warehouse credits. Consider an fct_orders table containing 900 million rows. A full refresh takes 45 minutes and costs $40 in Snowflake credits. An incremental run that processes only yesterday’s 400,000 new rows takes 90 seconds and costs a few cents.
The pattern uses the is_incremental() Jinja macro:
{{ config(
materialized='incremental',
unique_key='order_id',
on_schema_change='append_new_columns',
incremental_strategy='merge'
) }}
with source as (
select * from {{ ref('stg_orders') }}
{% if is_incremental() %}
-- On incremental runs, only pull rows newer than what we already have.
-- The subquery reads from {{ this }} — the model's own materialized table.
where updated_at > (select coalesce(max(updated_at), '1900-01-01') from {{ this }})
{% endif %}
)
select
order_id,
customer_id,
order_status,
order_total_usd,
placed_at,
updated_at
from source
Three configuration options merit explanation:
unique_key: the column or columns that dbt uses to identify a row forMERGE. If an incomingorder_idalready exists, it is updated; otherwise it is inserted.incremental_strategy: on Snowflake and BigQuery,mergeis standard. On Redshift,delete+insertis used. On Databricks,mergeis also standard.on_schema_change: the behaviour when a column is added.append_new_columnsis the safe and sensible default.
The first run uses dbt run --full-refresh --select fct_orders to build the entire table; subsequent runs collect the delta automatically.
Sources and Freshness Checks
Sources are the mechanism by which dbt is informed about raw tables that it did not create. They are declared in YAML and are never written to by dbt. The benefits include lineage (any mart column can be traced back to a source), source() references that fail builds if the raw table disappears, and freshness checks that fail the pipeline if the EL tool falls behind.
# models/staging/sources.yml
version: 2
sources:
- name: raw_app
database: raw
schema: app_public
loaded_at_field: _fivetran_synced
freshness:
warn_after: {count: 6, period: hour}
error_after: {count: 24, period: hour}
tables:
- name: customers
description: "One row per registered customer."
columns:
- name: id
description: "Primary key."
tests:
- unique
- not_null
- name: email
tests:
- not_null
- name: orders
loaded_at_field: updated_at
freshness:
warn_after: {count: 1, period: hour}
error_after: {count: 6, period: hour}
- name: order_items
- name: raw_stripe
database: raw
schema: stripe
tables:
- name: charges
- name: refunds
Running dbt source freshness causes dbt to query each source’s loaded_at_field to determine whether the latest row is sufficiently recent. The mechanism converts “the Fivetran Salesforce connector broke three days ago and no one noticed” into a CI failure.
Testing: The Feature That Wins Sceptics
If a single feature converts SQL analysts into dbt advocates, it is testing. Data quality defects are the worst kind of defects: silent, slow to surface, and frequently identified by the executive before the data team. dbt tests allow invariants to be asserted declaratively and violations to be caught in CI rather than in a Tuesday morning finance meeting.
dbt ships with four generic tests: unique, not_null, accepted_values, and relationships. They are declared in YAML alongside the models:
# models/marts/_marts.yml
version: 2
models:
- name: fct_orders
description: "Order fact table, grain: one row per order."
columns:
- name: order_id
description: "Primary key."
tests:
- unique
- not_null
- name: customer_id
tests:
- not_null
- relationships:
to: ref('dim_customers')
field: customer_id
- name: order_status
tests:
- accepted_values:
values: ['placed', 'shipped', 'completed', 'refunded', 'cancelled']
- name: order_total_usd
tests:
- dbt_utils.expression_is_true:
expression: ">= 0"
Every test compiles to a SELECT statement that should return zero rows. The unique test for order_id compiles to approximately the following:
select order_id
from analytics.fct_orders
where order_id is not null
group by order_id
having count(*) > 1
If that statement returns any rows, the test fails. All tests can be executed with dbt test, or a single model can be tested with dbt test --select fct_orders. In CI, a failing test blocks the merge. Data quality thereby becomes a pre-deployment check rather than a customer-reported defect.
For assertions that do not fit a generic test, a singular test may be written: a one-off .sql file placed in tests/:
-- tests/assert_refunds_never_exceed_charges.sql
select
c.charge_id,
c.amount_usd as charge_amount,
sum(r.amount_usd) as total_refunded
from {{ ref('stg_stripe_charges') }} c
left join {{ ref('stg_stripe_refunds') }} r
on c.charge_id = r.charge_id
group by 1, 2
having sum(r.amount_usd) > c.amount_usd
If a refund ever exceeds its original charge, this test fails and identifies the offending charge. For broader coverage, the dbt-utils and dbt-expectations packages should be installed; they provide dozens of tests, including expect_column_values_to_match_regex, expect_row_values_to_have_recent_data, and mutually_exclusive_ranges.
unique and not_null on the primary key, and a relationships test on each foreign key. This combination catches approximately 80% of the duplicate-row defects that plague raw SQL.
Macros and Jinja Templating
A macro is a reusable piece of SQL powered by Jinja. When the same CASE expression appears in ten models, it should be converted to a macro. The file macros/cents_to_dollars.sql is created as follows:
{% macro cents_to_dollars(column_name, scale=2) %}
round(({{ column_name }} / 100.0)::numeric, {{ scale }})
{% endmacro %}
It can then be used in any model:
select
charge_id,
{{ cents_to_dollars('amount_cents') }} as amount_usd,
{{ cents_to_dollars('fee_cents', 4) }} as fee_usd
from {{ ref('stg_stripe_charges') }}
Macros are especially valuable for database-specific SQL dialects. The following macro generates a date spine compatible with Snowflake, BigQuery, and Postgres:
{% macro date_spine(start_date, end_date) %}
{%- if target.type == 'snowflake' -%}
select dateadd('day', seq4(), '{{ start_date }}')::date as date_day
from table(generator(rowcount => datediff('day', '{{ start_date }}', '{{ end_date }}') + 1))
{%- elif target.type == 'bigquery' -%}
select day as date_day
from unnest(generate_date_array('{{ start_date }}', '{{ end_date }}')) as day
{%- else -%}
select generate_series('{{ start_date }}'::date, '{{ end_date }}'::date, '1 day'::interval)::date as date_day
{%- endif -%}
{% endmacro %}
The same model can now run across three warehouses without any manual modification.
Auto-Generated Documentation
Running dbt docs generate && dbt docs serve starts a local web server with a complete catalogue: every model, every column, every test, every source, and an interactive DAG visualisation that shows how data flows from sources to marts. Descriptions are read from the YAML files. A doc() block can also be used for longer Markdown documentation:
# models/marts/_marts.yml
version: 2
models:
- name: fct_orders
description: "{{ doc('fct_orders_overview') }}"
columns:
- name: order_total_usd
description: "Gross merchandise value in USD, excluding tax and shipping. Computed as sum(line_item.quantity * line_item.unit_price_usd)."
The block resides in models/marts/docs.md:
{% docs fct_orders_overview %}
# Orders Fact Table
Grain: one row per customer order.
## Business Rules
- Orders with status = 'cancelled' are retained for analytics but excluded from the GMV metric.
- Refunds are tracked in `fct_refunds`, not here.
- This table is incrementally built on `updated_at`.
## Known Limitations
- Historical order status changes prior to 2023-01-01 were not captured; use `dim_order_snapshots` for SCD history.
{% enddocs %}
When the documentation site is deployed to S3 or dbt Cloud, the analytics catalogue becomes self-service. Finance no longer has to ask “what does net_revenue actually mean?” because the definition is available for inspection.
Project Structure: Staging, Intermediate, Marts
dbt does not enforce a directory structure, although the community has converged on a three-layer model. The convention should be followed, since departures from it without good reason create maintenance difficulty.
Staging (models/staging/): one staging model per source table. Columns are renamed to a consistent convention (snake_case, _id suffixes, _at for timestamps). Types are cast. Soft-deleted rows are dropped. No other operations occur. Staging models are materialised as views and are the only models permitted to call source().
Intermediate (models/intermediate/): composition logic that is not itself a final mart. For example, stg_orders may be joined with stg_order_items to compute line-item-aware order totals. Intermediate models reference only staging or other intermediate models.
Marts (models/marts/): the final deliverables—fact and dimension tables that BI queries. They are organised by business domain (marts/finance/, marts/marketing/) and materialised as tables (or as incremental for large fact tables).
Full Example: E-Commerce Data Pipeline
A complete pipeline can be wired up end-to-end as follows. Assume that Fivetran is loading the Postgres tables customers, orders, order_items, and products into a raw.app_public schema. The project layout is shown below:
jaffle_shop_dbt/
├── dbt_project.yml
├── packages.yml
├── profiles.yml # (usually in ~/.dbt/)
├── models/
│ ├── staging/
│ │ ├── _sources.yml
│ │ ├── _stg_models.yml
│ │ ├── stg_customers.sql
│ │ ├── stg_orders.sql
│ │ ├── stg_order_items.sql
│ │ └── stg_products.sql
│ ├── intermediate/
│ │ └── int_order_items_priced.sql
│ └── marts/
│ ├── _marts.yml
│ ├── dim_customers.sql
│ ├── dim_products.sql
│ ├── fct_orders.sql
│ └── fct_order_items.sql
├── macros/
│ └── cents_to_dollars.sql
├── tests/
│ └── assert_fct_orders_positive_totals.sql
└── seeds/
└── country_codes.csv
dbt_project.yml
name: 'jaffle_shop_dbt'
version: '1.0.0'
config-version: 2
profile: 'jaffle_shop'
model-paths: ["models"]
seed-paths: ["seeds"]
test-paths: ["tests"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target"
clean-targets:
- "target"
- "dbt_packages"
models:
jaffle_shop_dbt:
staging:
+materialized: view
+schema: staging
intermediate:
+materialized: ephemeral
+schema: intermediate
marts:
+materialized: table
+schema: analytics
seeds:
jaffle_shop_dbt:
+schema: seeds
vars:
active_order_statuses: ['placed', 'shipped', 'completed']
packages.yml
packages:
- package: dbt-labs/dbt_utils
version: 1.1.1
- package: calogica/dbt_expectations
version: 0.10.3
- package: dbt-labs/codegen
version: 0.12.1
Packages are installed with dbt deps.
Sources
# models/staging/_sources.yml
version: 2
sources:
- name: raw_app
database: raw
schema: app_public
loaded_at_field: _fivetran_synced
freshness:
warn_after: {count: 2, period: hour}
error_after: {count: 12, period: hour}
tables:
- name: customers
columns:
- name: id
tests: [unique, not_null]
- name: orders
columns:
- name: id
tests: [unique, not_null]
- name: customer_id
tests:
- not_null
- relationships:
to: source('raw_app', 'customers')
field: id
- name: order_items
columns:
- name: id
tests: [unique, not_null]
- name: products
columns:
- name: id
tests: [unique, not_null]
Staging Models
-- models/staging/stg_customers.sql
with source as (
select * from {{ source('raw_app', 'customers') }}
)
select
id as customer_id,
lower(trim(email)) as email,
lower(trim(first_name)) as first_name,
lower(trim(last_name)) as last_name,
country_code,
created_at as signup_at,
updated_at
from source
where deleted_at is null
-- models/staging/stg_orders.sql
with source as (
select * from {{ source('raw_app', 'orders') }}
)
select
id as order_id,
customer_id,
status as order_status,
placed_at,
shipped_at,
updated_at
from source
-- models/staging/stg_order_items.sql
with source as (
select * from {{ source('raw_app', 'order_items') }}
)
select
id as order_item_id,
order_id,
product_id,
quantity,
{{ cents_to_dollars('unit_price_cents') }} as unit_price_usd,
{{ cents_to_dollars('discount_cents') }} as discount_usd
from source
-- models/staging/stg_products.sql
with source as (
select * from {{ source('raw_app', 'products') }}
)
select
id as product_id,
sku,
name as product_name,
category,
{{ cents_to_dollars('price_cents') }} as list_price_usd,
is_active
from source
Intermediate Model
-- models/intermediate/int_order_items_priced.sql
with items as (
select * from {{ ref('stg_order_items') }}
),
products as (
select * from {{ ref('stg_products') }}
)
select
i.order_item_id,
i.order_id,
i.product_id,
p.product_name,
p.category,
i.quantity,
i.unit_price_usd,
i.discount_usd,
(i.quantity * i.unit_price_usd) - i.discount_usd as line_total_usd
from items i
left join products p using (product_id)
Marts Models
-- models/marts/dim_customers.sql
{{ config(materialized='table') }}
with customers as (
select * from {{ ref('stg_customers') }}
),
orders as (
select
customer_id,
min(placed_at) as first_order_at,
max(placed_at) as most_recent_order_at,
count(*) as lifetime_orders
from {{ ref('stg_orders') }}
where order_status in ('placed', 'shipped', 'completed')
group by customer_id
)
select
c.customer_id,
c.email,
c.first_name || ' ' || c.last_name as full_name,
c.country_code,
c.signup_at,
o.first_order_at,
o.most_recent_order_at,
coalesce(o.lifetime_orders, 0) as lifetime_orders,
case when o.lifetime_orders is null then 'prospect'
when o.lifetime_orders = 1 then 'one_time'
when o.lifetime_orders < 5 then 'returning'
else 'loyal'
end as customer_segment
from customers c
left join orders o using (customer_id)
-- models/marts/dim_products.sql
{{ config(materialized='table') }}
select
product_id,
sku,
product_name,
category,
list_price_usd,
is_active
from {{ ref('stg_products') }}
-- models/marts/fct_orders.sql
{{ config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge',
on_schema_change='append_new_columns'
) }}
with orders as (
select * from {{ ref('stg_orders') }}
{% if is_incremental() %}
where updated_at > (select coalesce(max(updated_at), '1900-01-01') from {{ this }})
{% endif %}
),
items as (
select
order_id,
sum(line_total_usd) as order_total_usd,
count(*) as item_count
from {{ ref('int_order_items_priced') }}
group by order_id
)
select
o.order_id,
o.customer_id,
o.order_status,
o.placed_at,
o.shipped_at,
o.updated_at,
coalesce(i.order_total_usd, 0) as order_total_usd,
coalesce(i.item_count, 0) as item_count,
case when o.order_status in {{ "('" ~ var('active_order_statuses') | join("','") ~ "')" }}
then true else false end as is_active_order
from orders o
left join items i using (order_id)
-- models/marts/fct_order_items.sql
{{ config(materialized='table') }}
select
line.order_item_id,
line.order_id,
line.product_id,
o.customer_id,
line.quantity,
line.unit_price_usd,
line.discount_usd,
line.line_total_usd,
o.placed_at
from {{ ref('int_order_items_priced') }} line
left join {{ ref('stg_orders') }} o using (order_id)
Tests and Descriptions
# models/marts/_marts.yml
version: 2
models:
- name: dim_customers
description: "One row per customer with lifetime metrics."
columns:
- name: customer_id
tests: [unique, not_null]
- name: email
tests: [not_null]
- name: customer_segment
tests:
- accepted_values:
values: ['prospect', 'one_time', 'returning', 'loyal']
- name: fct_orders
description: "Orders fact table, one row per order."
columns:
- name: order_id
tests: [unique, not_null]
- name: customer_id
tests:
- not_null
- relationships:
to: ref('dim_customers')
field: customer_id
- name: order_total_usd
tests:
- dbt_utils.expression_is_true:
expression: ">= 0"
- name: order_status
tests:
- accepted_values:
values: ['placed', 'shipped', 'completed', 'refunded', 'cancelled']
The complete pipeline can now be executed:
# Install packages, seeds, and run everything
dbt deps
dbt seed
dbt run
dbt test
# Or chain with dbt build (run + test + seed + snapshot in dependency order)
dbt build
# Run only staging models
dbt run --select staging
# Run fct_orders and everything it depends on
dbt run --select +fct_orders
# Run fct_orders and everything downstream of it
dbt run --select fct_orders+
# Full-refresh the incremental
dbt run --select fct_orders --full-refresh
The structure above is complete and production-ready. With discipline around staging rename conventions and a test on every primary key, the same layout scales from 10 models to 2,000.
dbt Cloud and dbt Core
dbt is offered in two forms. dbt Core is the free, open-source Python package (pip install dbt-snowflake or another adapter) that can be run from a laptop, a CI server, or an orchestrator. dbt Cloud is the hosted commercial product, providing a browser IDE, a managed scheduler, alerting, a Semantic Layer, a metadata API, and SSO. Both execute the same underlying project.
| Concern | dbt Core | dbt Cloud |
|---|---|---|
| Cost | Free | Paid per developer seat + job runs |
| IDE | Your editor (VS Code + dbt Power User) | Browser IDE with live compile |
| Scheduling | Bring your own (Airflow, cron, GitHub Actions) | Built-in with cron + event triggers |
| CI | GitHub Actions / CircleCI (manual setup) | First-class Slim CI via PR integration |
| Docs hosting | Deploy yourself (S3, Netlify) | Hosted |
| Alerting | DIY via logs + your monitoring | Slack / PagerDuty / Email built-in |
| Best for | Teams with strong DevOps; multi-orchestrator setups | Teams who want the fastest path to production |
Core is the appropriate choice for teams that already run Airflow or Dagster and want dbt to be one task among many. Cloud is the appropriate choice when analytics engineers, rather than data platform engineers, need to ship quickly and the shortest time to value is desired. Many teams begin on Cloud and migrate to Core as platform maturity increases.
CI/CD with dbt and Slim CI
Treating SQL as application code requires running CI on every pull request. A well-designed dbt CI pipeline performs three actions:
- Lint with
sqlfluffto enforce style. - Build only changed models together with their downstream dependencies (Slim CI).
- Test the built models.
Slim CI is the central optimisation. A naive CI job runs dbt build, which rebuilds every model and is slow and expensive on a large project. Slim CI compares the PR’s manifest against the production manifest and builds only what changed:
# .github/workflows/dbt_ci.yml
name: dbt CI
on:
pull_request:
branches: [main]
jobs:
dbt-build:
runs-on: ubuntu-latest
env:
DBT_PROFILES_DIR: ./.dbt
SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_CI_USER }}
SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_CI_PASSWORD }}
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install dbt
run: pip install dbt-snowflake==1.8.* sqlfluff-templater-dbt
- name: Install packages
run: dbt deps
- name: Lint SQL
run: sqlfluff lint models/
# Pull production manifest (stored in S3 or an artifact)
- name: Download prod manifest
run: |
aws s3 cp s3://dbt-artifacts/prod/manifest.json ./prod-manifest.json
- name: Build changed models (Slim CI)
run: |
dbt build \
--select state:modified+ \
--defer --state ./ \
--target ci
The flag --select state:modified+ instructs dbt to build modified models and everything downstream. --defer --state ./ instructs dbt that any unmodified upstream model should be read from production rather than rebuilt in the CI schema. A 400-model project whose PR changes three models runs CI in 90 seconds rather than 45 minutes.
For broader coverage of Git workflows that complement this approach, see the guide on Git and GitHub best practices. For SQL style, the principles outlined in clean code principles apply to SQL more than is commonly acknowledged.
Integrating with Airflow, Dagster, and Prefect
dbt is a transformation tool. It is unaware of upstream EL jobs, downstream ML pipelines, or Kafka consumers. Awareness of those is the orchestrator’s responsibility. Two standard patterns apply:
Pattern 1: dbt as one task. An Airflow DAG runs the Fivetran sync, then dbt build, then a reverse-ETL push. The arrangement is simple and reliable:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.fivetran.operators.fivetran import FivetranOperator
from datetime import datetime, timedelta
default_args = {'owner': 'data', 'retries': 1, 'retry_delay': timedelta(minutes=5)}
with DAG(
'analytics_pipeline',
default_args=default_args,
schedule_interval='0 6 * * *',
start_date=datetime(2026, 1, 1),
catchup=False,
) as dag:
sync_app = FivetranOperator(
task_id='sync_app_db',
connector_id='app_postgres_connector',
)
dbt_build = BashOperator(
task_id='dbt_build',
bash_command=(
'cd /opt/dbt/jaffle_shop_dbt && '
'dbt deps && '
'dbt build --target prod'
),
)
sync_app >> dbt_build
Pattern 2: Asset-level orchestration with Dagster or Cosmos. Rather than a single monolithic dbt build task, the dbt manifest is parsed and one Airflow/Dagster task per model is created. The arrangement provides per-model retries, per-model SLAs, and cross-pipeline dependencies (for example, an ML feature task can depend on fct_orders directly rather than on the whole dbt job). The astronomer-cosmos library performs this transformation automatically for Airflow.
For streaming sources that feed dbt, see the guides on Debezium CDC and the full data pipeline architecture article.
Common Pitfalls and How to Avoid Them
Pitfall 1: Circular references. dbt forbids circular references, but the warning is easy to overlook. If model_a references model_b and model_b references model_a, the DAG is invalid. The remedy is to factor shared logic into an intermediate model on which both depend.
Pitfall 2: Over-materialising as tables. Beginners often materialise everything as a table on the assumption that tables are faster. The nightly dbt run then takes three hours and costs $200 because 400 tables are being rebuilt that are queried twice a week. The default should be view. Promotion to table should occur only when read-heavy access is measured. Promotion to incremental should occur only when full refresh of the table is too slow.
Pitfall 3: Ignoring test failures. Teams add tests, the tests begin to fail, and the fix is deferred to the next sprint. Within three months the tests are ignored entirely. The remedy is to make tests blocking in CI and in production. An on-call engineer should be paged when a not-null test fails in production. If a test is known to be noisy, it should be either fixed or removed; “yellow” should not be normalised as “green.”
Pitfall 4: Excessively large models. A single 900-line model that joins eight sources, pivots three times, and computes forty aggregations is essentially the 3,000-line script of the opening section in a dbt costume. Such models should be broken into intermediate models. Models that fit on a single screen are preferable.
Pitfall 5: Skipping the staging layer. The rationale “we do not need a staging layer; we will join raw directly in the mart” leads to difficulty when the source system renames a column. The staging layer is a contract: it is the single location where column name changes must be addressed, and every downstream model uses the renamed version. Skipping the staging layer means that the blast radius of a raw column change is the entire project.
Pitfall 6: Not using dbt build. dbt run runs models. dbt test runs tests. dbt build performs both in topological order and, crucially, does not run fct_orders if stg_orders tests fail. build should be used in production because it prevents the propagation of bad data.
For related operational discipline on containerisation and deployment, see the guides on Docker containers and the broader database comparison for analytics workloads.
FAQ
Should I use dbt Core or dbt Cloud?
Start with dbt Core if your team already runs an orchestrator like Airflow or Dagster and has DevOps capacity—Core is free and integrates cleanly into existing CI/CD. Choose dbt Cloud if your team is primarily analysts or analytics engineers who need a browser IDE, managed scheduling, Slim CI, and alerting without standing up infrastructure. Cloud’s per-seat pricing is worth it when the alternative is hiring a platform engineer.
How is dbt different from stored procedures?
Stored procedures are imperative code living inside the database, typically without version control, testing frameworks, or dependency graphs. dbt models are declarative SELECT statements under Git, with automatic DAG resolution from ref(), built-in tests, auto-generated documentation, and materializations that adapt between view/table/incremental without rewriting logic. Stored procedures also tightly couple you to a specific database dialect; dbt abstracts dialect differences through adapters and macros.
When should I use incremental materialization vs a table?
Use table by default for marts. Switch to incremental when full-refresh becomes too slow or expensive—typically when the underlying table exceeds 100 million rows or when a rebuild takes more than a few minutes. Incremental models add complexity (unique_key logic, handling late-arriving data, full-refresh semantics), so don’t adopt them prematurely. A good heuristic: if dbt run --full-refresh --select my_model takes over 5 minutes and costs more than you’re willing to pay nightly, go incremental.
Does dbt work with any database?
dbt works with any warehouse that has an official or community adapter. First-class adapters exist for Snowflake, BigQuery, Redshift, Databricks, Postgres, DuckDB, SQL Server, Trino, and Spark. Adapters handle dialect differences (merge syntax, type casting, date functions). You can run dbt against a classic OLTP database like MySQL or Postgres, but the value is higher on analytical warehouses because that’s where columnar storage and MPP make transformation fast. If your database has a dbt-<name> pip package, you’re covered.
How does dbt integrate with Airflow?
Three common patterns: (1) Simple, run dbt build as a single BashOperator or DockerOperator task after your EL tasks finish; easy to set up, but all models are one task. (2) Asset-level via astronomer-cosmos—Cosmos parses the dbt manifest and automatically creates one Airflow task per dbt model, giving per-model retries, SLAs, and cross-DAG dependencies. (3) Custom—use Airflow’s KubernetesPodOperator to run dbt in an isolated pod per model group. Pattern 2 is the current best practice for production and is covered in more depth in our Airflow pipeline guide.
Wrapping Up
Fifteen years ago, a data warehouse team typically shipped reports by passing SQL scripts via email, occasionally running them by hand, and hoping the numbers matched. The work was skilled and the tools were inadequate. dbt did not invent a new kind of analytics; it applied the software engineering norms that application developers had enjoyed since the early 2000s—version control, modularity, testing, documentation, and CI/CD—to the analytical SQL layer that had been left behind.
The result is a new category of role, the analytics engineer, who owns the transformation layer end-to-end with tools that work. A project with a staging layer, tested primary keys, and CI on every PR is not glamorous, but it is the difference between a data team that ships metrics finance trusts and one that fights fires indefinitely.
The recommended next steps are as follows. Clone the dbt-labs/jaffle_shop example project. Run it against DuckDB locally; no cloud warehouse is required. Extend it with one incremental model and one generic test. Deploy it behind a GitHub Actions CI workflow. Then replicate the pattern against one real data source. Within a week, the foundation of a maintainable analytics codebase will be in place.
The official dbt documentation provides the reference material. The dbt Best Practices guide contains opinionated patterns. Ralph Kimball’s dimensional modelling techniques describe the underlying fact-and-dimension theory that marts layers codify. For the broader ecosystem, the Analytics Engineering Guide from dbt Labs serves as the canonical field manual.
The 3,000-line SQL script is not a fact of nature. It is technical debt that has been accepted by default. dbt is the mechanism by which acceptance can end.
References
- dbt official documentation
- dbt-labs/jaffle_shop example project
- dbt Best Practices Guide
- Analytics Engineering Guide
- Kimball Dimensional Modeling Techniques
Disclaimer: This article is for informational and educational purposes only and does not constitute professional consulting advice. Validate all architecture decisions against your own data volumes, security requirements, and cost constraints before putting them into production.
Leave a Reply