
dbt for Data Transformation Pipelines: From Raw to Analytics-Ready

The 3,000-Line SQL Script from Hell

You’ve seen it. Maybe you wrote it. A single reporting.sql file checked into a shared drive (not Git, because the BI team “doesn’t use Git”), weighing in at 3,247 lines. It starts with sixteen CTEs, pivots through three temporary tables, joins seven source systems, and somewhere around line 1,900 there’s a hardcoded filter for customer_id = 47382 with a comment that just says “-- ask Brian why.” Brian left the company in 2022.

The script runs nightly. When it breaks, nobody knows whose metric is wrong. When a column is renamed upstream, the whole thing silently produces zeros. There are no tests. There is no documentation outside a Confluence page last updated in 2020 that describes a schema that no longer exists. When finance asks “why does net_revenue disagree with the GL by $184,000?” the answer is a week of detective work.

This is the problem dbt was built to solve. Not by inventing a new language (it’s still SQL), not by replacing your warehouse (it runs inside your warehouse), but by applying twenty years of software engineering discipline—version control, modularity, testing, documentation, CI/CD—to the analytical SQL layer that sits between raw data and business decisions.

In this guide, we’ll unpack dbt from the ground up: what it is, why it took over the modern data stack, how to structure a real project, how to write models, tests, and macros, how to deploy to production with CI/CD, and how to integrate with orchestrators like Apache Airflow. By the end you’ll have a full e-commerce pipeline blueprint you can lift into your own warehouse.

[Figure: dbt in the modern data stack. EL tools (Fivetran, Airbyte, Debezium) load sources (Postgres app database, Stripe payments API, Salesforce CRM, event logs via Kafka/Kinesis) into the warehouse's raw schema (Snowflake, BigQuery, Redshift, Databricks); dbt transforms staging -> intermediate -> marts inside the warehouse; BI consumers (Tableau dashboards, Looker semantic layer) read from the marts. EL pushes raw data in; dbt transforms inside the warehouse; BI reads from marts.]

Why Transformation Belongs in the Warehouse

For most of the 2000s and early 2010s, the canonical data pipeline was ETL: Extract data from a source, Transform it on a middle-tier server (Informatica, Talend, SSIS, bespoke Python), then Load the cleaned result into a data warehouse that was too expensive and too slow to do the heavy lifting itself. Storage cost hundreds of dollars per gigabyte-month. Compute was fixed. You did not load raw clickstream into Teradata; you aggregated it to daily rollups first.

Three things broke that model.

First, cloud warehouses decoupled storage from compute. Snowflake introduced the architecture in 2014, and BigQuery, Redshift, and Databricks followed. Storage dropped to roughly $23/TB/month. Compute became elastic—spin up a warehouse, run a query, spin it down. You no longer pay for idle capacity.

Second, columnar storage plus massively parallel processing made aggregation over billions of rows feasible. A query that would take four hours on a row-oriented OLTP database finishes in eleven seconds on a properly sized Snowflake warehouse.

Third, managed EL tools (Fivetran, Airbyte, Stitch, Debezium) commoditized the “get data in” problem. You click a button, point at your Postgres replica or Stripe account, and raw tables start showing up. There’s nothing for your engineering team to write.

The consequence: the middle-tier transformation server became unnecessary. Why move gigabytes of data out of the warehouse, transform it on a smaller machine, and load it back? Just transform it where it already lives. This is ELT—Extract, Load, Transform—and dbt is the tool that owns the final T.

Key Takeaway: dbt exists because modern warehouses are fast and cheap enough to do all transformation work themselves. Your pipeline becomes: EL tool loads raw -> dbt transforms -> BI consumes. No middle-tier server required.

What dbt Actually Is (and What It Isn’t)

Here is the single most important sentence in this entire post: dbt is a compile-time SQL tool, not a runtime engine. It does not execute queries itself. It does not store data. It does not move data between systems. dbt is a templating and orchestration layer that reads your .sql files, resolves Jinja references, compiles them down to plain SQL, and submits that SQL to your warehouse via the warehouse’s native adapter.

When you run dbt run, dbt walks your dependency graph and for each model executes something like:

CREATE OR REPLACE TABLE analytics.fct_orders AS (
  -- your compiled model SQL
);

That’s it. Every capability—testing, incremental logic, documentation, snapshots—ultimately reduces to SQL statements that dbt generates and the warehouse executes. This matters because:

  • All your compute happens where your data lives (no network egress).
  • You scale by scaling your warehouse, not by scaling dbt.
  • You can inspect every query dbt runs in target/compiled/.
  • dbt has no opinion about your data volume; if your warehouse can handle it, dbt can orchestrate it.

What dbt adds on top of SQL:

  • The ref() function — model-to-model references that build a DAG automatically.
  • Materializations — you write a SELECT and dbt wraps it in the right DDL (view, table, incremental merge).
  • Tests — declarative data quality assertions that compile to SELECT statements expected to return zero rows.
  • Macros — reusable SQL via Jinja, so you stop copy-pasting that 40-line date spine.
  • Documentation — a generated static site describing every model and column, with lineage graphs.
  • Version control — your whole analytics logic is just files in Git.
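Those ref() edges are all dbt needs to schedule a build. Below is a toy Python sketch of the idea (a dependency map derived from ref() calls, topologically sorted); the model names are hypothetical, and dbt's real scheduler does far more (threads, retries, state):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each model maps to the set of models it ref()s (hypothetical mini-project).
deps = {
    "stg_customers": set(),
    "stg_orders": set(),
    "dim_customers": {"stg_customers", "stg_orders"},
    "fct_orders": {"stg_orders", "dim_customers"},
}

# dbt run builds in an order where every dependency is built first.
build_order = list(TopologicalSorter(deps).static_order())
print(build_order)  # staging models first, then dim_customers, then fct_orders
```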

Core Concepts: Models, Sources, Seeds, Snapshots

Before we write code, internalize the six primitives:

| Primitive | File Location | What It Represents |
| --- | --- | --- |
| Model | models/*.sql | A SELECT that becomes a view or table. |
| Source | models/*.yml | Raw tables loaded by your EL tool; declared, not created. |
| Seed | seeds/*.csv | Small static CSV loaded as a table (country codes, tax rates). |
| Snapshot | snapshots/*.sql | Slowly-changing dimension (SCD Type 2) tracking. |
| Test | models/*.yml or tests/*.sql | A SQL assertion that should return zero rows on pass. |
| Macro | macros/*.sql | Reusable Jinja function producing SQL. |

 

Writing Your First Model

Let’s write a minimal model. A dbt model is nothing more than a file ending in .sql that contains a single SELECT. Create models/staging/stg_customers.sql:

{{ config(
    materialized='view',
    schema='staging'
) }}

with source as (
    select * from {{ source('raw_app', 'customers') }}
),

renamed as (
    select
        id                      as customer_id,
        email                   as customer_email,
        lower(trim(first_name)) as first_name,
        lower(trim(last_name))  as last_name,
        created_at              as signup_at,
        updated_at              as updated_at
    from source
    where deleted_at is null
)

select * from renamed

Three things to notice:

  1. {{ config(...) }} is a Jinja expression that tells dbt how to materialize this model (here, as a view in the staging schema).
  2. {{ source('raw_app', 'customers') }} is a reference to a raw source table declared in a YAML file—dbt will replace it at compile time with the fully-qualified raw.app_public.customers.
  3. No CREATE TABLE, no DROP IF EXISTS. dbt wraps your SELECT in the appropriate DDL.

Once you add a sibling model that references this one:

-- models/marts/dim_customers.sql
{{ config(materialized='table') }}

select
    customer_id,
    customer_email,
    first_name || ' ' || last_name as full_name,
    signup_at
from {{ ref('stg_customers') }}

…the {{ ref('stg_customers') }} tells dbt two things: (1) this model depends on stg_customers, so build that first; (2) replace this at compile time with the correct fully-qualified table name, whatever schema it ended up in. This is the single feature that makes dbt feel magical.
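Mechanically, compilation is just substitution against a lookup table of relation names. A deliberately tiny Python illustration (dbt really uses Jinja plus its manifest, and the schema name here is made up, but the observable effect is this):

```python
import re

# Hypothetical mapping from model name to its materialized location.
relations = {"stg_customers": "analytics_staging.stg_customers"}

def compile_refs(sql: str) -> str:
    """Replace {{ ref('model') }} tokens with fully-qualified relation names."""
    pattern = r"\{\{\s*ref\('([A-Za-z0-9_]+)'\)\s*\}\}"
    return re.sub(pattern, lambda m: relations[m.group(1)], sql)

compiled = compile_refs("select customer_id from {{ ref('stg_customers') }}")
print(compiled)  # select customer_id from analytics_staging.stg_customers
```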

Materializations: View, Table, Incremental, Ephemeral

A materialization is dbt’s strategy for persisting your model. You pick one per model based on size, latency, and cost tradeoffs.

| Materialization | How It Builds | When to Use |
| --- | --- | --- |
| view | CREATE OR REPLACE VIEW | Default for staging. Fresh, cheap to build, slower to query. |
| table | CREATE OR REPLACE TABLE ... AS SELECT | Marts queried frequently by BI. Faster reads, full rebuild each run. |
| incremental | MERGE or INSERT of only new rows | Large event/fact tables (>100M rows) where a full rebuild is too slow. |
| ephemeral | Inlined as a CTE | Shared logic that doesn’t need its own table. Rare; use sparingly. |

 

Caution: A common mistake is materializing every model as table because “tables are faster.” If your model is queried twice a day from BI, a view costs you nothing. If it’s queried 40 times a minute by a dashboard, materialize as a table. Default to view; promote to table only when reads dominate.

Incremental Models in Depth

Incremental models are where dbt pays for itself in warehouse credits. Imagine fct_orders with 900 million rows. Full-refresh takes 45 minutes and costs $40 in Snowflake credits. Incremental—only processing yesterday’s 400k new rows—takes 90 seconds and costs pennies.

The pattern uses the is_incremental() Jinja macro:

{{ config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='append_new_columns',
    incremental_strategy='merge'
) }}

with source as (
    select * from {{ ref('stg_orders') }}

    {% if is_incremental() %}
      -- On incremental runs, only pull rows newer than what we already have.
      -- The subquery reads from {{ this }} — the model's own materialized table.
      where updated_at > (select coalesce(max(updated_at), '1900-01-01') from {{ this }})
    {% endif %}
)

select
    order_id,
    customer_id,
    order_status,
    order_total_usd,
    placed_at,
    updated_at
from source

Three configuration options to understand:

  • unique_key — the column(s) dbt uses to identify a row for MERGE. If an incoming order_id already exists, it’s updated; otherwise inserted.
  • incremental_strategy — on Snowflake/BigQuery, merge is standard. On Redshift, delete+insert. On Databricks, merge also.
  • on_schema_change — what to do when you add a column. append_new_columns is safe and sensible.

Run it the first time with dbt run --full-refresh --select fct_orders to build the whole thing; subsequent runs pick up the delta automatically.

[Figure: incremental model execution flow. dbt run on fct_orders checks is_incremental(). False: full refresh via CREATE OR REPLACE TABLE, selecting all 900M rows with no WHERE filter (~45 minutes, ~$40 in credits). True: MERGE INTO fct_orders with WHERE updated_at > max(this), touching only new/changed rows (~90 seconds, ~$0.30 in credits). Either path ends with fct_orders ready.]
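If the MERGE semantics feel abstract, a toy Python model of the table as a dict keyed by unique_key captures them (illustration only, with made-up rows; in reality the warehouse executes the MERGE):

```python
# Existing rows keyed by unique_key (order_id); values are hypothetical.
table = {
    1: {"order_id": 1, "order_status": "placed", "updated_at": "2026-01-01"},
    2: {"order_id": 2, "order_status": "shipped", "updated_at": "2026-01-02"},
}

# Incoming batch: order 2 changed status, order 3 is brand new.
new_rows = [
    {"order_id": 2, "order_status": "completed", "updated_at": "2026-01-03"},
    {"order_id": 3, "order_status": "placed", "updated_at": "2026-01-03"},
]

# is_incremental() filter: only rows newer than what the table already has.
high_watermark = max(r["updated_at"] for r in table.values())
for row in new_rows:
    if row["updated_at"] > high_watermark:
        # MERGE on unique_key: update when order_id exists, insert when it doesn't.
        table[row["order_id"]] = row

print(sorted(table))  # [1, 2, 3]
```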

Sources and Freshness Checks

Sources are how you tell dbt about raw tables that it did not create. You declare them in YAML; dbt never writes to them. The payoff: lineage (you can trace any mart column back to a source), source() references that break builds if the raw table disappears, and freshness checks that fail your pipeline if the EL tool is stale.

# models/staging/sources.yml
version: 2

sources:
  - name: raw_app
    database: raw
    schema: app_public
    loaded_at_field: _fivetran_synced
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: customers
        description: "One row per registered customer."
        columns:
          - name: id
            description: "Primary key."
            tests:
              - unique
              - not_null
          - name: email
            tests:
              - not_null
      - name: orders
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 1, period: hour}
          error_after: {count: 6, period: hour}
      - name: order_items

  - name: raw_stripe
    database: raw
    schema: stripe
    tables:
      - name: charges
      - name: refunds

Run dbt source freshness, and dbt queries each source’s loaded_at_field to check whether the latest row is recent enough. This turns “the Fivetran Salesforce connector broke three days ago and nobody noticed” into a CI failure.
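The check itself is simple arithmetic over the newest loaded_at_field value. A simplified Python sketch of the pass/warn/error decision (my approximation, not dbt's source code):

```python
from datetime import datetime, timedelta

def freshness_status(max_loaded_at: datetime, now: datetime,
                     warn_after: timedelta, error_after: timedelta) -> str:
    """Classify a source as pass/warn/error based on how stale its newest row is."""
    age = now - max_loaded_at
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"

now = datetime(2026, 1, 10, 12, 0)
# Newest _fivetran_synced row is 8 hours old: past warn_after (6h), inside error_after (24h).
status = freshness_status(datetime(2026, 1, 10, 4, 0), now,
                          warn_after=timedelta(hours=6),
                          error_after=timedelta(hours=24))
print(status)  # warn
```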

Testing: The Feature That Wins Skeptics

If there’s one feature that converts SQL analysts to dbt true believers, it’s testing. Data quality bugs are the worst kind of bugs—silent, slow to surface, and the executive sees the bad number before you do. dbt tests let you assert invariants declaratively and catch violations in CI, not in a Tuesday morning finance meeting.

There are four generic tests shipped with dbt: unique, not_null, accepted_values, and relationships. You declare them in YAML next to your models:

# models/marts/_marts.yml
version: 2

models:
  - name: fct_orders
    description: "Order fact table, grain: one row per order."
    columns:
      - name: order_id
        description: "Primary key."
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: order_status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'refunded', 'cancelled']
      - name: order_total_usd
        tests:
          - dbt_utils.expression_is_true:
              expression: ">= 0"

Every test compiles to a SELECT that should return zero rows. The unique test for order_id becomes roughly:

select order_id
from analytics.fct_orders
where order_id is not null
group by order_id
having count(*) > 1

If that returns any rows, the test fails. Run all tests with dbt test, or test one model with dbt test --select fct_orders. In CI, a failing test blocks the merge. Data quality becomes a pre-deployment check, not a customer-reported bug.

For assertions that don’t fit a generic test, write a singular test—a one-off .sql file in tests/:

-- tests/assert_refunds_never_exceed_charges.sql
select
    c.charge_id,
    c.amount_usd as charge_amount,
    sum(r.amount_usd) as total_refunded
from {{ ref('stg_stripe_charges') }} c
left join {{ ref('stg_stripe_refunds') }} r
  on c.charge_id = r.charge_id
group by 1, 2
having sum(r.amount_usd) > c.amount_usd

If a refund ever exceeds its original charge, this test fails and tells you which charge. For even more leverage, install the dbt-utils and dbt-expectations packages—they ship dozens of tests like expect_column_values_to_match_regex, expect_row_values_to_have_recent_data, and mutually_exclusive_ranges.

Tip: Start every new model with at least three tests: unique and not_null on the primary key, and a relationships test on each foreign key. This catches 80% of the gnarly joins-producing-duplicates bugs that plague raw SQL.

Macros and Jinja Templating

A macro is a reusable piece of SQL powered by Jinja. If you find yourself writing the same CASE expression in ten models, turn it into a macro. Create macros/cents_to_dollars.sql:

{% macro cents_to_dollars(column_name, scale=2) %}
    round(({{ column_name }} / 100.0)::numeric, {{ scale }})
{% endmacro %}

Use it in any model:

select
    charge_id,
    {{ cents_to_dollars('amount_cents') }} as amount_usd,
    {{ cents_to_dollars('fee_cents', 4) }} as fee_usd
from {{ ref('stg_stripe_charges') }}
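Because a macro is compile-time text generation, you can think of cents_to_dollars as a function from column name to SQL fragment. A toy Python equivalent (not dbt's Jinja engine) makes the default-argument behavior concrete:

```python
def cents_to_dollars(column_name: str, scale: int = 2) -> str:
    """Return the SQL fragment the Jinja macro would expand to."""
    return f"round(({column_name} / 100.0)::numeric, {scale})"

print(cents_to_dollars("amount_cents"))
# round((amount_cents / 100.0)::numeric, 2)
print(cents_to_dollars("fee_cents", 4))
# round((fee_cents / 100.0)::numeric, 4)
```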

Macros shine for database-specific SQL dialects. Here’s one that generates a date spine compatible across Snowflake, BigQuery, and Postgres:

{% macro date_spine(start_date, end_date) %}
    {%- if target.type == 'snowflake' -%}
        select dateadd('day', seq4(), '{{ start_date }}')::date as date_day
        from table(generator(rowcount => datediff('day', '{{ start_date }}', '{{ end_date }}') + 1))
    {%- elif target.type == 'bigquery' -%}
        select day as date_day
        from unnest(generate_date_array('{{ start_date }}', '{{ end_date }}')) as day
    {%- else -%}
        select generate_series('{{ start_date }}'::date, '{{ end_date }}'::date, '1 day'::interval)::date as date_day
    {%- endif -%}
{% endmacro %}

Now the same model works across three warehouses without a single hand-edit.

Auto-Generated Documentation

Run dbt docs generate && dbt docs serve and dbt starts a local web server with a full catalog: every model, every column, every test, every source, and an interactive DAG visualization showing how data flows from sources to marts. It reads descriptions from your YAML files. You can also use a doc() block for longer markdown docs:

# models/marts/_marts.yml
version: 2

models:
  - name: fct_orders
    description: "{{ doc('fct_orders_overview') }}"
    columns:
      - name: order_total_usd
        description: "Gross merchandise value in USD, excluding tax and shipping. Computed as sum(line_item.quantity * line_item.unit_price_usd)."

Then in models/marts/docs.md:

{% docs fct_orders_overview %}

# Orders Fact Table

Grain: one row per customer order.

## Business Rules

- Orders with status = 'cancelled' are retained for analytics but excluded from the GMV metric.
- Refunds are tracked in `fct_refunds`, not here.
- This table is incrementally built on `updated_at`.

## Known Limitations

- Historical order status changes prior to 2023-01-01 were not captured; use `dim_order_snapshots` for SCD history.

{% enddocs %}

Deploy the docs site to S3 or dbt Cloud and your analytics catalog becomes self-serve. Finance stops asking “what is net_revenue actually?” because they can read it themselves.

Project Structure: Staging, Intermediate, Marts

dbt doesn’t enforce a directory structure, but the community has converged on a three-layer model. Use it. Deviating without reason causes pain.

[Figure: dbt model layering. Raw sources (raw.app.customers, raw.app.orders, raw.app.order_items, raw.stripe.charges, raw.app.products) feed staging views (stg_customers, stg_orders, stg_order_items, stg_stripe_charges, stg_products), which feed intermediate models (int_orders_joined, int_order_totals, int_customer_ltv), which feed mart tables: facts (fct_orders, fct_order_items) and dimensions (dim_customers, dim_products), consumed by BI (Tableau, Looker). Arrows are ref() calls; each layer can only reference the layer(s) before it.]

Staging (models/staging/): one staging model per source table. Rename columns to a consistent convention (snake_case, _id suffixes, _at for timestamps). Cast types. Drop soft-deleted rows. Do nothing else. Materialized as views. Staging models are the only ones that may call source().

Intermediate (models/intermediate/): composition logic that isn’t a final mart. Join stg_orders with stg_order_items to compute line-item-aware order totals. Intermediate models reference only staging or other intermediate.

Marts (models/marts/): the final deliverables—fact and dimension tables that BI queries. Organized by business domain (marts/finance/, marts/marketing/). Materialized as tables (or incremental for big facts).

Full Example: E-Commerce Data Pipeline

Let’s wire up a real pipeline end-to-end. Assume Fivetran is loading Postgres tables customers, orders, order_items, and products into a raw.app_public schema. Our project layout:

jaffle_shop_dbt/
├── dbt_project.yml
├── packages.yml
├── profiles.yml              # (usually in ~/.dbt/)
├── models/
│   ├── staging/
│   │   ├── _sources.yml
│   │   ├── _stg_models.yml
│   │   ├── stg_customers.sql
│   │   ├── stg_orders.sql
│   │   ├── stg_order_items.sql
│   │   └── stg_products.sql
│   ├── intermediate/
│   │   └── int_order_items_priced.sql
│   └── marts/
│       ├── _marts.yml
│       ├── dim_customers.sql
│       ├── dim_products.sql
│       ├── fct_orders.sql
│       └── fct_order_items.sql
├── macros/
│   └── cents_to_dollars.sql
├── tests/
│   └── assert_fct_orders_positive_totals.sql
└── seeds/
    └── country_codes.csv

dbt_project.yml

name: 'jaffle_shop_dbt'
version: '1.0.0'
config-version: 2

profile: 'jaffle_shop'

model-paths: ["models"]
seed-paths: ["seeds"]
test-paths: ["tests"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"

models:
  jaffle_shop_dbt:
    staging:
      +materialized: view
      +schema: staging
    intermediate:
      +materialized: ephemeral
      +schema: intermediate
    marts:
      +materialized: table
      +schema: analytics

seeds:
  jaffle_shop_dbt:
    +schema: seeds

vars:
  active_order_statuses: ['placed', 'shipped', 'completed']

packages.yml

packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_expectations
    version: 0.10.3
  - package: dbt-labs/codegen
    version: 0.12.1

Install with dbt deps.

Sources

# models/staging/_sources.yml
version: 2

sources:
  - name: raw_app
    database: raw
    schema: app_public
    loaded_at_field: _fivetran_synced
    freshness:
      warn_after: {count: 2, period: hour}
      error_after: {count: 12, period: hour}
    tables:
      - name: customers
        columns:
          - name: id
            tests: [unique, not_null]
      - name: orders
        columns:
          - name: id
            tests: [unique, not_null]
          - name: customer_id
            tests:
              - not_null
              - relationships:
                  to: source('raw_app', 'customers')
                  field: id
      - name: order_items
        columns:
          - name: id
            tests: [unique, not_null]
      - name: products
        columns:
          - name: id
            tests: [unique, not_null]

Staging Models

-- models/staging/stg_customers.sql
with source as (
    select * from {{ source('raw_app', 'customers') }}
)

select
    id                          as customer_id,
    lower(trim(email))          as email,
    lower(trim(first_name))     as first_name,
    lower(trim(last_name))      as last_name,
    country_code,
    created_at                  as signup_at,
    updated_at
from source
where deleted_at is null

-- models/staging/stg_orders.sql
with source as (
    select * from {{ source('raw_app', 'orders') }}
)

select
    id              as order_id,
    customer_id,
    status          as order_status,
    placed_at,
    shipped_at,
    updated_at
from source

-- models/staging/stg_order_items.sql
with source as (
    select * from {{ source('raw_app', 'order_items') }}
)

select
    id                                          as order_item_id,
    order_id,
    product_id,
    quantity,
    {{ cents_to_dollars('unit_price_cents') }}  as unit_price_usd,
    {{ cents_to_dollars('discount_cents') }}    as discount_usd
from source

-- models/staging/stg_products.sql
with source as (
    select * from {{ source('raw_app', 'products') }}
)

select
    id                              as product_id,
    sku,
    name                            as product_name,
    category,
    {{ cents_to_dollars('price_cents') }} as list_price_usd,
    is_active
from source

Intermediate Model

-- models/intermediate/int_order_items_priced.sql
with items as (
    select * from {{ ref('stg_order_items') }}
),

products as (
    select * from {{ ref('stg_products') }}
)

select
    i.order_item_id,
    i.order_id,
    i.product_id,
    p.product_name,
    p.category,
    i.quantity,
    i.unit_price_usd,
    i.discount_usd,
    (i.quantity * i.unit_price_usd) - i.discount_usd as line_total_usd
from items i
left join products p using (product_id)

Marts Models

-- models/marts/dim_customers.sql
{{ config(materialized='table') }}

with customers as (
    select * from {{ ref('stg_customers') }}
),

orders as (
    select
        customer_id,
        min(placed_at) as first_order_at,
        max(placed_at) as most_recent_order_at,
        count(*)       as lifetime_orders
    from {{ ref('stg_orders') }}
    where order_status in ('placed', 'shipped', 'completed')
    group by customer_id
)

select
    c.customer_id,
    c.email,
    c.first_name || ' ' || c.last_name as full_name,
    c.country_code,
    c.signup_at,
    o.first_order_at,
    o.most_recent_order_at,
    coalesce(o.lifetime_orders, 0) as lifetime_orders,
    case when o.lifetime_orders is null then 'prospect'
         when o.lifetime_orders = 1    then 'one_time'
         when o.lifetime_orders < 5    then 'returning'
         else 'loyal'
    end as customer_segment
from customers c
left join orders o using (customer_id)

-- models/marts/dim_products.sql
{{ config(materialized='table') }}

select
    product_id,
    sku,
    product_name,
    category,
    list_price_usd,
    is_active
from {{ ref('stg_products') }}

-- models/marts/fct_orders.sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge',
    on_schema_change='append_new_columns'
) }}

with orders as (
    select * from {{ ref('stg_orders') }}

    {% if is_incremental() %}
      where updated_at > (select coalesce(max(updated_at), '1900-01-01') from {{ this }})
    {% endif %}
),

items as (
    select
        order_id,
        sum(line_total_usd) as order_total_usd,
        count(*)            as item_count
    from {{ ref('int_order_items_priced') }}
    group by order_id
)

select
    o.order_id,
    o.customer_id,
    o.order_status,
    o.placed_at,
    o.shipped_at,
    o.updated_at,
    coalesce(i.order_total_usd, 0) as order_total_usd,
    coalesce(i.item_count, 0)      as item_count,
    case when o.order_status in {{ "('" ~ var('active_order_statuses') | join("','") ~ "')" }}
         then true else false end  as is_active_order
from orders o
left join items i using (order_id)

-- models/marts/fct_order_items.sql
{{ config(materialized='table') }}

select
    line.order_item_id,
    line.order_id,
    line.product_id,
    o.customer_id,
    line.quantity,
    line.unit_price_usd,
    line.discount_usd,
    line.line_total_usd,
    o.placed_at
from {{ ref('int_order_items_priced') }} line
left join {{ ref('stg_orders') }} o using (order_id)
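The is_active_order line in fct_orders above concatenates the active_order_statuses var into a SQL IN-list at compile time. The string that Jinja expression produces can be checked in plain Python:

```python
active_order_statuses = ["placed", "shipped", "completed"]  # the var() value

# Mirrors the Jinja: "('" ~ var('active_order_statuses') | join("','") ~ "')"
in_list = "('" + "','".join(active_order_statuses) + "')"
print(in_list)  # ('placed','shipped','completed')
```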

Tests and Descriptions

# models/marts/_marts.yml
version: 2

models:
  - name: dim_customers
    description: "One row per customer with lifetime metrics."
    columns:
      - name: customer_id
        tests: [unique, not_null]
      - name: email
        tests: [not_null]
      - name: customer_segment
        tests:
          - accepted_values:
              values: ['prospect', 'one_time', 'returning', 'loyal']

  - name: fct_orders
    description: "Orders fact table, one row per order."
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: order_total_usd
        tests:
          - dbt_utils.expression_is_true:
              expression: ">= 0"
      - name: order_status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'refunded', 'cancelled']

Now run the whole pipeline:

# Install packages, seeds, and run everything
dbt deps
dbt seed
dbt run
dbt test

# Or chain with dbt build (run + test + seed + snapshot in dependency order)
dbt build

# Run only staging models
dbt run --select staging

# Run fct_orders and everything it depends on
dbt run --select +fct_orders

# Run fct_orders and everything downstream of it
dbt run --select fct_orders+

# Full-refresh the incremental
dbt run --select fct_orders --full-refresh
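The + selectors are graph traversals over the same DAG that ref() builds: +model walks parents, model+ walks children. A toy sketch with a few of the models above (not dbt's actual selector code):

```python
# parents[model] = the models it ref()s (a slice of the project above).
parents = {
    "stg_orders": [],
    "stg_order_items": [],
    "int_order_items_priced": ["stg_order_items"],
    "fct_orders": ["stg_orders", "int_order_items_priced"],
    "fct_order_items": ["int_order_items_priced", "stg_orders"],
}

# Invert the edges to get children (downstream models).
children = {m: [] for m in parents}
for model, ps in parents.items():
    for p in ps:
        children[p].append(model)

def closure(node, edges):
    """Node plus everything reachable through the given edge map."""
    seen, stack = {node}, list(edges[node])
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(edges[n])
    return seen

selected_up = closure("fct_orders", parents)     # dbt run --select +fct_orders
selected_down = closure("stg_orders", children)  # dbt run --select stg_orders+
```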

This is a complete, production-ready structure. With discipline around staging rename conventions and a test on every primary key, you can scale this layout from 10 models to 2,000.

dbt Cloud vs dbt Core

dbt comes in two flavors. dbt Core is the free, open-source Python package (pip install dbt-snowflake or whichever adapter) you run from your laptop, CI server, or orchestrator. dbt Cloud is the hosted commercial product with a browser IDE, a managed scheduler, alerting, a Semantic Layer, a metadata API, and SSO. They execute the same underlying project.

| Concern | dbt Core | dbt Cloud |
| --- | --- | --- |
| Cost | Free | Paid per developer seat + job runs |
| IDE | Your editor (VS Code + dbt Power User) | Browser IDE with live compile |
| Scheduling | Bring your own (Airflow, cron, GitHub Actions) | Built-in with cron + event triggers |
| CI | GitHub Actions / CircleCI (manual setup) | First-class Slim CI via PR integration |
| Docs hosting | Deploy yourself (S3, Netlify) | Hosted |
| Alerting | DIY via logs + your monitoring | Slack / PagerDuty / Email built-in |
| Best for | Teams with strong DevOps; multi-orchestrator setups | Teams who want the fastest path to production |

 

Choose Core if you already run Airflow or Dagster and want dbt to be one task among many. Choose Cloud if analytics engineers—not data platform engineers—need to ship and you want the shortest time-to-value. Many teams start on Cloud and migrate to Core as platform maturity grows.

CI/CD with dbt and Slim CI

Treating SQL like application code means running CI on every pull request. A proper dbt CI pipeline does three things:

  1. Lint with sqlfluff to enforce style.
  2. Build only changed models plus their downstream dependencies (Slim CI).
  3. Test the built models.

Slim CI is the magic piece. A naive CI job runs dbt build, which rebuilds every model—slow and expensive on a big project. Slim CI instead compares your PR’s manifest against production’s manifest and only builds what changed:

# .github/workflows/dbt_ci.yml
name: dbt CI

on:
  pull_request:
    branches: [main]

jobs:
  dbt-build:
    runs-on: ubuntu-latest
    env:
      DBT_PROFILES_DIR: ./.dbt
      SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
      SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_CI_USER }}
      SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_CI_PASSWORD }}
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dbt
        run: pip install dbt-snowflake==1.8.* sqlfluff-templater-dbt

      - name: Install packages
        run: dbt deps

      - name: Lint SQL
        run: sqlfluff lint models/

      # Pull production manifest (stored in S3 or an artifact)
      - name: Download prod manifest
        run: |
          mkdir -p prod-artifacts
          aws s3 cp s3://dbt-artifacts/prod/manifest.json ./prod-artifacts/manifest.json

      - name: Build changed models (Slim CI)
        run: |
          dbt build \
            --select state:modified+ \
            --defer --state ./prod-artifacts \
            --target ci

The flag --select state:modified+ tells dbt to build modified models and everything downstream. --defer --state ./prod-artifacts points dbt at a directory containing the production manifest.json, so any unmodified upstream model is read from production rather than rebuilt in the CI schema. A 400-model project whose PR changes three models runs CI in 90 seconds instead of 45 minutes.

For deeper coverage of Git workflows that pair well with this, see our guide on Git and GitHub best practices, and for SQL style, clean code principles apply to SQL more than people admit.

Integrating with Airflow, Dagster, and Prefect

dbt is a transformation tool; it doesn’t know about your upstream EL jobs, your downstream ML pipelines, or your Kafka consumers. That’s the orchestrator’s job. The two standard patterns:

Pattern 1: dbt as one task. Your Airflow DAG runs Fivetran sync, then dbt build, then a reverse-ETL push. Simple and bulletproof:

from airflow import DAG
from airflow.operators.bash import BashOperator
# FivetranOperator ships in the airflow-provider-fivetran package, not core Airflow
from fivetran_provider.operators.fivetran import FivetranOperator
from datetime import datetime, timedelta

default_args = {'owner': 'data', 'retries': 1, 'retry_delay': timedelta(minutes=5)}

with DAG(
    'analytics_pipeline',
    default_args=default_args,
    schedule_interval='0 6 * * *',
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:

    sync_app = FivetranOperator(
        task_id='sync_app_db',
        connector_id='app_postgres_connector',
    )

    dbt_build = BashOperator(
        task_id='dbt_build',
        bash_command=(
            'cd /opt/dbt/jaffle_shop_dbt && '
            'dbt deps && '
            'dbt build --target prod'
        ),
    )

    sync_app >> dbt_build

Pattern 2: Asset-level orchestration with Dagster or Cosmos. Instead of one monolithic dbt build task, parse the dbt manifest and create one Airflow/Dagster task per model. This gives per-model retries, per-model SLAs, and cross-pipeline dependencies (an ML feature task can depend on fct_orders directly, not on “the whole dbt job”). The astronomer-cosmos library does this automatically for Airflow.
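The information Cosmos works from is just dbt's own manifest.json. A minimal stdlib sketch (function name hypothetical) of the parsing step — mapping each model to its upstream model dependencies, which is exactly what an asset-level orchestrator needs to build one task per model:

```python
import json


def model_dependencies(manifest_path: str) -> dict[str, list[str]]:
    """Map each dbt model to the models it depends on, from manifest.json.

    This is the dependency information tools like Cosmos use to fan a
    dbt project out into one orchestrator task per model.
    """
    with open(manifest_path) as f:
        manifest = json.load(f)

    deps: dict[str, list[str]] = {}
    for unique_id, node in manifest["nodes"].items():
        if node["resource_type"] != "model":
            continue  # skip tests, seeds, snapshots
        deps[unique_id] = [
            parent
            for parent in node["depends_on"]["nodes"]
            if parent.startswith("model.")  # keep model-to-model edges only
        ]
    return deps
```

From this mapping, creating one Airflow task per model and wiring `parent >> child` edges is a straightforward loop — which is precisely what astronomer-cosmos automates.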

For streaming sources that feed dbt, see our guides on Debezium CDC and the full data pipeline architecture article.

Common Pitfalls and How to Avoid Them

Caution: These six mistakes account for most failed dbt adoptions I’ve seen. Avoid them deliberately.

Pitfall 1: Circular references. dbt forbids them, but the warning is easy to miss. If model_a refs model_b and model_b refs model_a, your DAG is invalid. Fix: factor shared logic into an intermediate model both depend on.

Pitfall 2: Over-materializing as tables. Beginners materialize everything as table because “tables are faster.” Then the nightly dbt run takes 3 hours and costs $200 because you’re rebuilding 400 tables that are queried twice a week. Default to view. Promote to table only when you measure read-heavy access. Promote to incremental only when full-refresh of the table is too slow.

Pitfall 3: Ignoring test failures. Teams add tests, tests start failing, and the fix gets deferred to “next sprint.” Within three months tests are ignored entirely. Fix: make tests blocking in CI and in production. Page someone when a not-null test fails in prod. If a test is known-noisy, either fix the test or delete it—do not normalize “yellow is the new green.”

Pitfall 4: Fat models. A single 900-line model that joins eight sources, pivots three times, and computes forty aggregations. It’s the 3,000-line script from the opening, just wearing a dbt hat. Break it into intermediate models. Aim for models that fit on one screen.

Pitfall 5: Skipping staging. “We don’t need a staging layer, let’s just join raw directly in our mart.” You’ll regret this the first time the source system renames a column. The staging layer is a contract—it’s the one place where column name changes have to be addressed, and every downstream model uses the renamed version. Skip it and the blast radius of a raw column change is your entire project.
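The contract a staging model enforces can be tiny. A sketch, with hypothetical source and column names:

```sql
-- models/staging/stg_orders.sql (hypothetical source and column names)
select
    id          as order_id,
    user_id     as customer_id,
    status      as order_status,
    created_at  as ordered_at
from {{ source('app', 'orders') }}
```

If the app team renames user_id tomorrow, this is the only file you touch; every downstream model keeps reading customer_id.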

Pitfall 6: Not using dbt build. dbt run runs models. dbt test runs tests. dbt build does both in topological order and—crucially—won’t run fct_orders if stg_orders tests fail. Use build in production; it stops bad data from propagating.

For related ops discipline on containerization and deployment, see our guides on Docker containers and the broader database comparison for analytics workloads.

FAQ

Should I use dbt Core or dbt Cloud?

Start with dbt Core if your team already runs an orchestrator like Airflow or Dagster and has DevOps capacity—Core is free and integrates cleanly into existing CI/CD. Choose dbt Cloud if your team is primarily analysts or analytics engineers who need a browser IDE, managed scheduling, Slim CI, and alerting without standing up infrastructure. Cloud’s per-seat pricing is worth it when the alternative is hiring a platform engineer.

How is dbt different from stored procedures?

Stored procedures are imperative code living inside the database, typically without version control, testing frameworks, or dependency graphs. dbt models are declarative SELECT statements under Git, with automatic DAG resolution from ref(), built-in tests, auto-generated documentation, and materializations that adapt between view/table/incremental without rewriting logic. Stored procedures also tightly couple you to a specific database dialect; dbt abstracts dialect differences through adapters and macros.

When should I use incremental materialization vs a table?

Use table by default for marts. Switch to incremental when full-refresh becomes too slow or expensive—typically when the underlying table exceeds 100 million rows or when a rebuild takes more than a few minutes. Incremental models add complexity (unique_key logic, handling late-arriving data, full-refresh semantics), so don’t adopt them prematurely. A good heuristic: if dbt run --full-refresh --select my_model takes over 5 minutes and costs more than you’re willing to pay nightly, go incremental.
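When you do cross that threshold, the switch is a config change plus a filter, not a rewrite. A minimal sketch of an incremental mart, with hypothetical model and column names:

```sql
-- models/marts/fct_orders.sql (hypothetical names)
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    customer_id,
    order_total,
    updated_at
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- On incremental runs, only scan rows newer than what the
  -- target table already holds; merge on unique_key handles updates.
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

The is_incremental() branch is skipped on the first run and on dbt run --full-refresh, which rebuild the table from scratch.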

Does dbt work with any database?

dbt works with any warehouse that has an official or community adapter. First-class adapters exist for Snowflake, BigQuery, Redshift, Databricks, Postgres, DuckDB, SQL Server, Trino, and Spark. Adapters handle dialect differences (merge syntax, type casting, date functions). You can run dbt against a classic OLTP database like MySQL or Postgres, but the value is higher on analytical warehouses because that’s where columnar storage and MPP make transformation fast. If your database has a dbt-<name> pip package, you’re covered.

How does dbt integrate with Airflow?

Three common patterns: (1) Simple—run dbt build as a single BashOperator or DockerOperator task after your EL tasks finish; easy to set up, but all models are one task. (2) Asset-level via astronomer-cosmos—Cosmos parses the dbt manifest and automatically creates one Airflow task per dbt model, giving per-model retries, SLAs, and cross-DAG dependencies. (3) Custom—use Airflow’s KubernetesPodOperator to run dbt in an isolated pod per model group. Pattern 2 is the current best practice for production and is covered in more depth in our Airflow pipeline guide.

Conclusion

Fifteen years ago, a data warehouse team shipped reports by passing SQL scripts over email, occasionally running them by hand, and hoping the numbers matched. The work was skilled and the tools were bad. dbt did not invent a new kind of analytics; it applied the software engineering norms that application developers had enjoyed since the early 2000s—version control, modularity, testing, documentation, CI/CD—to the analytical SQL layer that had somehow been left behind.

The result is a category of role, the analytics engineer, who owns the transformation layer end-to-end with tools that actually work. A project with a staging layer, tested primary keys, and CI on every PR is not glamorous, but it is the difference between a data team that ships metrics finance trusts and a data team that fights fires forever.

Here’s what to do next. Clone the dbt-labs/jaffle_shop example project. Run it against DuckDB locally—no cloud warehouse required. Extend it with one incremental model and one generic test. Deploy it behind a GitHub Actions CI workflow. Then replicate the pattern against one of your own real data sources. Within a week you will have the start of a real, maintainable analytics codebase.

Read the official dbt documentation for reference material, the dbt Best Practices guide for opinionated patterns, and Ralph Kimball’s dimensional modeling techniques for the underlying fact-and-dimension theory that marts layers codify. For the broader ecosystem, the Analytics Engineering Guide from dbt Labs is the canonical field manual.

Your 3,000-line SQL script is not a fact of nature. It’s a tech debt you’ve been taught to accept. dbt is how you stop accepting it.

Disclaimer: This article is for informational and educational purposes only and does not constitute professional consulting advice. Validate all architecture decisions against your own data volumes, security requirements, and cost constraints before putting them into production.
