Asset Granularity Demo

This demo project showcases three different approaches to modeling ETL operations as Dagster assets, each with different trade-offs for visibility, control, and simplicity.

Quick Start

# Activate virtual environment
source .venv/bin/activate

# Validate definitions
uv run dg check defs

# Start development server
uv run dg dev

Then open http://localhost:3000 to view the Dagster UI.

The Three Approaches

1. Single Asset (Coarse-Grained)

Asset: customer_etl_single Group: single_asset_approach

All ETL phases (Extract, Load, Transform) are combined into ONE asset.

┌────────────────────────────────────┐
│        customer_etl_single         │
│ ┌─────────┐ ┌──────┐ ┌───────────┐ │
│ │ Extract │→│ Load │→│ Transform │ │
│ └─────────┘ └──────┘ └───────────┘ │
└────────────────────────────────────┘

When to use:

Simple pipelines with tightly-coupled operations
Quick prototypes and proof-of-concepts
When you don't need to retry individual phases
When intermediate visibility isn't important

Pros:

✅ Simple to understand and reason about
✅ Single materialization = single execution
✅ Minimal orchestration overhead
✅ Good for operations that always run together

Cons:

❌ No visibility into individual phases in the UI
❌ Cannot independently retry extract vs transform
❌ Cannot selectively run only the transform step
❌ If extract succeeds but transform fails, must re-run entire pipeline

Code Example:

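import dagster as dg

@dg.asset(key="customer_etl_single")
def etl_asset(context: dg.AssetExecutionContext):
    # All phases run inside one function, so Dagster sees a single step.
    # extract_from_source, load_to_staging, and transform_data are
    # placeholder helpers that this snippet does not define.
    data = extract_from_source()
    loaded = load_to_staging(data)
    transformed = transform_data(loaded)
    return transformed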
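
The last drawback can be made concrete without Dagster at all. The sketch below is plain Python with hypothetical stand-in phases (`extract`, `load`, and `transform` are not the demo's real helpers, and `run_with_retry` is an illustrative retry loop). The transform fails once, and because every phase lives in one function, the retry repeats the extract and load as well.

```python
from collections import Counter

calls = Counter()  # how many times each phase has run


def extract():
    calls["extract"] += 1
    return [1, 2, 3]


def load(rows):
    calls["load"] += 1
    return list(rows)


def transform(rows):
    calls["transform"] += 1
    if calls["transform"] == 1:
        # Simulate a transient failure on the first attempt only.
        raise RuntimeError("transform failed")
    return [r * 10 for r in rows]


def single_asset_etl():
    # Coarse-grained: one unit of work, so one unit of retry.
    return transform(load(extract()))


def run_with_retry(fn, attempts=2):
    # Re-run the whole function until it succeeds or attempts run out.
    for attempt in range(attempts):
        try:
            return fn()
        except RuntimeError:
            if attempt == attempts - 1:
                raise


result = run_with_retry(single_asset_etl)
```

After the run, `calls` shows that every phase executed twice even though only the transform failed.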

2. Three Separate Assets (Fine-Grained)

Assets: orders_extract → orders_load → orders_transform Group: three_asset_approach

Extract, Load, and Transform are THREE separate assets with explicit dependencies.

┌───────────────┐     ┌─────────────┐     ┌─────────────────┐
│orders_extract │ ──→ │ orders_load │ ──→ │orders_transform │
└───────────────┘     └─────────────┘     └─────────────────┘

When to use:

Complex pipelines requiring visibility into each phase
When you need to debug and monitor individual steps
When phases may fail independently and need separate retries
When downstream consumers need access to intermediate results
Production pipelines where observability is critical

Pros:

✅ Full visibility into each ETL phase in the UI
✅ Can independently retry failed phases
✅ Can selectively run only certain phases (e.g., just transform)
✅ Intermediate results visible in Dagster UI
✅ Better for debugging and monitoring
✅ Can set different retry policies per phase

Cons:

❌ More assets to manage
❌ Requires explicit dependency management
❌ May have I/O overhead between phases
❌ More complex job definitions

Code Example:

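import dagster as dg

@dg.asset(key="orders_extract")
def extract_asset(context: dg.AssetExecutionContext):
    return extract_from_source()

# deps declares ordering and lineage only; no value is passed in, so each
# placeholder helper is expected to read the previous phase's result from
# external storage.
@dg.asset(key="orders_load", deps=["orders_extract"])
def load_asset(context: dg.AssetExecutionContext):
    return load_to_staging()

@dg.asset(key="orders_transform", deps=["orders_load"])
def transform_asset(context: dg.AssetExecutionContext):
    return transform_data()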

3. Multi-Asset (Grouped)

Assets: inventory_raw, inventory_staged, inventory_final (produced atomically) Group: multi_asset_approach

A single function produces THREE assets atomically using @multi_asset.

┌──────────────────────────────────────────────────────────┐
│                  etl_multi_asset()                       │
│  ┌───────────────┐  ┌─────────────────┐  ┌────────────┐  │
│  │ inventory_raw │  │ inventory_staged│  │inventory_  │  │
│  │   (output)    │  │    (output)     │  │final (out) │  │
│  └───────────────┘  └─────────────────┘  └────────────┘  │
└──────────────────────────────────────────────────────────┘

When to use:

Operations are tightly coupled in execution
You need separate assets for downstream consumption
You want to pass data between phases without I/O overhead
You need the "best of both worlds": simple execution plus granular visibility

Pros:

✅ Atomic execution (all-or-nothing)
✅ Separate assets for downstream consumers to depend on
✅ Can pass data between outputs without intermediate I/O
✅ Reduced I/O overhead compared to separate assets
✅ All three assets visible separately in the Dagster UI
✅ Best balance of simplicity and observability

Cons:

❌ Cannot independently retry individual outputs
❌ All outputs must succeed or all fail
❌ Slightly more complex to understand than single asset
❌ Cannot selectively materialize just one output

Code Example:

import dagster as dg

@dg.multi_asset(
    outs={
        "raw": dg.AssetOut(key="inventory_raw"),
        "staged": dg.AssetOut(key="inventory_staged"),
        "final": dg.AssetOut(key="inventory_final"),
    },
    can_subset=False,  # All outputs produced together
)
def etl_multi_asset(context: dg.AssetExecutionContext):
    # Data moves between phases as local variables, with no intermediate I/O.
    raw_data = extract_from_source()
    staged_data = load_to_staging(raw_data)
    final_data = transform_data(staged_data)
    # The returned tuple follows the order of the keys in outs.
    return raw_data, staged_data, final_data

Comparison Table

| Feature | Single Asset | Three Assets | Multi-Asset |
| --- | --- | --- | --- |
| Assets Created | 1 | 3 | 3 |
| Execution Model | Single function | 3 functions with deps | Single function, 3 outputs |
| Independent Retry | No | Yes | No |
| Selective Execution | No | Yes | No |
| UI Visibility | 1 asset | 3 assets with lineage | 3 assets (grouped) |
| I/O Between Phases | In-memory | External storage | In-memory |
| Downstream Dependencies | 1 target | 3 targets | 3 targets |
| Complexity | Low | Medium | Medium |
| Best For | Prototypes, simple ETL | Production, debugging | Coupled ops, multiple consumers |

Schedules

Each approach has a corresponding schedule demonstrating different scheduling patterns:

| Schedule | Approach | Cron | Description |
| --- | --- | --- | --- |
| `daily_single_asset_etl` | Single Asset | `0 6 * * *` | Daily at 6 AM UTC |
| `hourly_three_asset_etl` | Three Assets | `0 * * * *` | Every hour |
| `twice_daily_multi_asset_etl` | Multi-Asset | `0 6,18 * * *` | 6 AM and 6 PM UTC |

Schedules use asset selection by group:

asset_selection: "group:single_asset_approach"

Project Structure

asset-granularity-demo/
├── src/asset_granularity_demo/
│   ├── components/
│   │   ├── single_asset_etl.py      # Approach 1: Single Asset
│   │   ├── three_asset_etl.py       # Approach 2: Three Assets
│   │   ├── multi_asset_etl.py       # Approach 3: Multi-Asset
│   │   └── scheduled_job_component.py
│   ├── defs/
│   │   └── asset_granularity_pipeline/
│   │       └── defs.yaml            # All component instances
│   └── definitions.py
├── pyproject.toml
└── README.md

Decision Guide: Which Approach Should I Use?

START: Do phases need independent retry?
  │
  ├── YES → Use Three Separate Assets
  │
  └── NO → Do downstream consumers need
           access to intermediate results?
           │
           ├── YES → Use Multi-Asset
           │
           └── NO → Is this a simple/prototype pipeline?
                    │
                    ├── YES → Use Single Asset
                    │
                    └── NO → Consider Three Separate Assets
                             for better observability
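
The same decision tree can be written as a small function, which makes the order of the questions explicit. This is only a sketch: the function name and its three boolean parameters are hypothetical names for the questions above.

```python
def choose_approach(
    needs_independent_retry: bool,
    needs_intermediate_results: bool,
    is_simple_or_prototype: bool,
) -> str:
    """Walk the decision guide from top to bottom and return a recommendation."""
    if needs_independent_retry:
        return "Three Separate Assets"
    if needs_intermediate_results:
        return "Multi-Asset"
    if is_simple_or_prototype:
        return "Single Asset"
    # Not a prototype and no special needs: favor observability.
    return "Three Separate Assets"


recommendation = choose_approach(
    needs_independent_retry=False,
    needs_intermediate_results=True,
    is_simple_or_prototype=False,
)
```

Note that the retry question is asked first, so a pipeline that needs independent retries gets Three Separate Assets regardless of the other answers.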

Running the Demo

Materialize All Assets

# Run all assets
uv run dagster asset materialize --select "*"

# Run only the single-asset approach
uv run dagster asset materialize --select "group:single_asset_approach"

# Run only the three-asset approach
uv run dagster asset materialize --select "group:three_asset_approach"

# Run only the multi-asset approach
uv run dagster asset materialize --select "group:multi_asset_approach"

Check Asset Dependencies

# List all assets with dependencies
uv run dg list defs

Demo Mode

All components in this demo run with demo_mode: true, which:

Simulates data operations without real database/API connections
Returns sample data for demonstration purposes
Logs operations to show what would happen in production

To run with real connections, set demo_mode: false in the YAML and implement the actual extraction/loading logic.
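
As a rough illustration of the pattern, a demo-mode switch can look like the sketch below. The function `extract_customers` and its sample rows are hypothetical and are not the demo's actual component code.

```python
def extract_customers(demo_mode: bool = True) -> list[dict]:
    """Return customer rows, simulated when demo_mode is on."""
    if demo_mode:
        # Log what would happen in production and return sample data.
        print("[demo] would query the customer source here")
        return [
            {"id": 1, "name": "Ada"},
            {"id": 2, "name": "Grace"},
        ]
    # With demo_mode off, the real extraction has to be supplied.
    raise NotImplementedError("implement the real extraction logic")


rows = extract_customers()
```

Failing loudly when `demo_mode` is off avoids silently returning sample data in a real run.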

Key Takeaways

Start simple: Use Single Asset for prototypes, then migrate to more granular approaches as needed.
Visibility matters: In production, the Three Assets approach gives you the best debugging and monitoring capabilities.
Multi-Asset is a sweet spot: When you need atomic execution but want downstream consumers to reference specific outputs.
Think about failure modes: Consider what happens when each phase fails and whether you need independent retry capability.
Consider downstream consumers: If other assets or systems need to depend on intermediate results, use Three Assets or Multi-Asset.
