@cnolanminich
Created February 4, 2026 19:35
how granular should assets be?

Asset Granularity Demo

This demo project showcases three different approaches to modeling ETL operations as Dagster assets, each with different trade-offs for visibility, control, and simplicity.

Quick Start

# Activate virtual environment
source .venv/bin/activate

# Validate definitions
uv run dg check defs

# Start development server
uv run dg dev

Then open http://localhost:3000 to view the Dagster UI.

The Three Approaches

1. Single Asset (Coarse-Grained)

Asset: customer_etl_single Group: single_asset_approach

All ETL phases (Extract, Load, Transform) are combined into ONE asset.

┌──────────────────────────────────────┐
│         customer_etl_single          │
│  ┌─────────┐ ┌──────┐ ┌───────────┐  │
│  │ Extract │→│ Load │→│ Transform │  │
│  └─────────┘ └──────┘ └───────────┘  │
└──────────────────────────────────────┘

When to use:

  • Simple pipelines with tightly-coupled operations
  • Quick prototypes and proof-of-concepts
  • When you don't need to retry individual phases
  • When intermediate visibility isn't important

Pros:

  • ✅ Simple to understand and reason about
  • ✅ Single materialization = single execution
  • ✅ Minimal orchestration overhead
  • ✅ Good for operations that always run together

Cons:

  • ❌ No visibility into individual phases in the UI
  • ❌ Cannot independently retry extract vs transform
  • ❌ Cannot selectively run only the transform step
  • ❌ If extract succeeds but transform fails, must re-run entire pipeline

Code Example:

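import dagster as dg

@dg.asset(key="customer_etl_single")
def etl_asset(context: dg.AssetExecutionContext):
    # All phases run in one function, so Dagster sees a single step.
    # extract_from_source, load_to_staging, and transform_data are
    # placeholders for the phase logic.
    data = extract_from_source()
    loaded = load_to_staging(data)
    transformed = transform_data(loaded)
    return transformed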

2. Three Separate Assets (Fine-Grained)

Assets: orders_extract, orders_load, orders_transform Group: three_asset_approach

Extract, Load, and Transform are THREE separate assets with explicit dependencies.

┌───────────────┐     ┌─────────────┐     ┌─────────────────┐
│orders_extract │ ──→ │ orders_load │ ──→ │orders_transform │
└───────────────┘     └─────────────┘     └─────────────────┘

When to use:

  • Complex pipelines requiring visibility into each phase
  • When you need to debug and monitor individual steps
  • When phases may fail independently and need separate retries
  • When downstream consumers need access to intermediate results
  • Production pipelines where observability is critical

Pros:

  • ✅ Full visibility into each ETL phase in the UI
  • ✅ Can independently retry failed phases
  • ✅ Can selectively run only certain phases (e.g., just transform)
  • ✅ Intermediate results visible in Dagster UI
  • ✅ Better for debugging and monitoring
  • ✅ Can set different retry policies per phase

Cons:

  • ❌ More assets to manage
  • ❌ Requires explicit dependency management
  • ❌ May have I/O overhead between phases
  • ❌ More complex job definitions

Code Example:

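import dagster as dg

# deps= only declares ordering and lineage; no data is passed between
# functions, so each phase reads its input from external storage itself.

@dg.asset(key="orders_extract")
def extract_asset(context: dg.AssetExecutionContext):
    return extract_from_source()

@dg.asset(key="orders_load", deps=["orders_extract"])
def load_asset(context: dg.AssetExecutionContext):
    return load_to_staging()

@dg.asset(key="orders_transform", deps=["orders_load"])
def transform_asset(context: dg.AssetExecutionContext):
    return transform_data()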

3. Multi-Asset (Grouped)

Assets: inventory_raw, inventory_staged, inventory_final (produced atomically) Group: multi_asset_approach

A single function produces THREE assets atomically using @multi_asset.

┌──────────────────────────────────────────────────────────┐
│                  etl_multi_asset()                       │
│  ┌───────────────┐  ┌─────────────────┐  ┌────────────┐  │
│  │ inventory_raw │  │ inventory_staged│  │inventory_  │  │
│  │   (output)    │  │    (output)     │  │final (out) │  │
│  └───────────────┘  └─────────────────┘  └────────────┘  │
└──────────────────────────────────────────────────────────┘

When to use:

  • Operations are tightly coupled in execution
  • You need separate assets for downstream consumption
  • Want to pass data between phases without I/O overhead
  • Need "best of both worlds": simple execution + granular visibility

Pros:

  • ✅ Atomic execution (all-or-nothing)
  • ✅ Separate assets for downstream consumers to depend on
  • ✅ Data passes between phases in memory, avoiding the storage round trip that separate assets incur
  • ✅ All three assets visible separately in the Dagster UI
  • ✅ Best balance of simplicity and observability

Cons:

  • ❌ Cannot independently retry individual outputs
  • ❌ All outputs must succeed or all fail
  • ❌ Slightly more complex to understand than single asset
  • ❌ Cannot selectively materialize just one output (with can_subset=False)

Code Example:

import dagster as dg

@dg.multi_asset(
    outs={
        "raw": dg.AssetOut(key="inventory_raw"),
        "staged": dg.AssetOut(key="inventory_staged"),
        "final": dg.AssetOut(key="inventory_final"),
    },
    can_subset=False,  # All outputs produced together
)
def etl_multi_asset(context: dg.AssetExecutionContext):
    # Data flows between phases as ordinary Python variables.
    raw_data = extract_from_source()
    staged_data = load_to_staging(raw_data)
    final_data = transform_data(staged_data)
    # The tuple order must match the order of the keys in outs.
    return raw_data, staged_data, final_data

Comparison Table

| Feature                 | Single Asset           | Three Assets          | Multi-Asset                    |
|-------------------------|------------------------|-----------------------|--------------------------------|
| Assets Created          | 1                      | 3                     | 3                              |
| Execution Model         | Single function        | 3 functions with deps | Single function, 3 outputs     |
| Independent Retry       | No                     | Yes                   | No                             |
| Selective Execution     | No                     | Yes                   | No                             |
| UI Visibility           | 1 asset                | 3 assets with lineage | 3 assets (grouped)             |
| I/O Between Phases      | In-memory              | External storage      | In-memory                      |
| Downstream Dependencies | 1 target               | 3 targets             | 3 targets                      |
| Complexity              | Low                    | Medium                | Medium                         |
| Best For                | Prototypes, simple ETL | Production, debugging | Coupled ops, multiple consumers |

Schedules

Each approach has a corresponding schedule demonstrating different scheduling patterns:

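| Schedule                    | Approach     | Cron         | Description       |
|-----------------------------|--------------|--------------|-------------------|
| daily_single_asset_etl      | Single Asset | 0 6 * * *    | Daily at 6 AM UTC |
| hourly_three_asset_etl      | Three Assets | 0 * * * *    | Every hour        |
| twice_daily_multi_asset_etl | Multi-Asset  | 0 6,18 * * * | 6 AM and 6 PM UTC |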

Schedules use asset selection by group:

asset_selection: "group:single_asset_approach"

Project Structure

asset-granularity-demo/
├── src/asset_granularity_demo/
│   ├── components/
│   │   ├── single_asset_etl.py      # Approach 1: Single Asset
│   │   ├── three_asset_etl.py       # Approach 2: Three Assets
│   │   ├── multi_asset_etl.py       # Approach 3: Multi-Asset
│   │   └── scheduled_job_component.py
│   ├── defs/
│   │   └── asset_granularity_pipeline/
│   │       └── defs.yaml            # All component instances
│   └── definitions.py
├── pyproject.toml
└── README.md

Decision Guide: Which Approach Should I Use?

START: Do phases need independent retry?
  │
  ├── YES → Use Three Separate Assets
  │
  └── NO → Do downstream consumers need
           access to intermediate results?
           │
           ├── YES → Use Multi-Asset
           │
           └── NO → Is this a simple/prototype pipeline?
                    │
                    ├── YES → Use Single Asset
                    │
                    └── NO → Consider Three Separate Assets
                             for better observability
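
The same decision tree can be written as a small function. This is only an illustrative encoding of the guide above; the function and parameter names are invented for the sketch.

```python
def recommend_approach(
    needs_independent_retry: bool,
    needs_intermediate_results: bool,
    is_simple_or_prototype: bool,
) -> str:
    """Walk the decision guide from top to bottom and return the suggested approach."""
    if needs_independent_retry:
        return "Three Separate Assets"
    if needs_intermediate_results:
        return "Multi-Asset"
    if is_simple_or_prototype:
        return "Single Asset"
    # Not a prototype and no special requirements: favor observability.
    return "Three Separate Assets"


if __name__ == "__main__":
    print(recommend_approach(False, True, False))
```

Note that the questions are ordered by priority: a need for independent retry decides the outcome before the other two questions are even asked.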

Running the Demo

Materialize All Assets

# Run all assets
uv run dagster asset materialize --select "*"

# Run only the single-asset approach
uv run dagster asset materialize --select "group:single_asset_approach"

# Run only the three-asset approach
uv run dagster asset materialize --select "group:three_asset_approach"

# Run only the multi-asset approach
uv run dagster asset materialize --select "group:multi_asset_approach"

Check Asset Dependencies

# List all assets with dependencies
uv run dg list defs

Demo Mode

All components in this demo run in demo_mode: true, which:

  • Simulates data operations without real database/API connections
  • Returns sample data for demonstration purposes
  • Logs operations to show what would happen in production

To run with real connections, set demo_mode: false in the YAML and implement the actual extraction/loading logic.
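
How a `demo_mode` switch might look inside one of the phase functions is sketched below. The demo's actual component code is not shown in this README, so the function name, signature, logger name, and sample rows here are all assumptions.

```python
import logging

logger = logging.getLogger("asset_granularity_demo")

# Made-up sample rows returned while in demo mode.
SAMPLE_ROWS = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]


def extract_from_source(demo_mode: bool = True) -> list:
    """Return sample rows in demo mode; otherwise defer to real extraction logic."""
    if demo_mode:
        # Log what production would do, then return canned data.
        logger.info("demo_mode: would query the source system; returning sample rows")
        return [dict(row) for row in SAMPLE_ROWS]
    # Real connections are intentionally left unimplemented in this sketch.
    raise NotImplementedError("Implement the real extraction before setting demo_mode: false")


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(extract_from_source())
```

Raising `NotImplementedError` on the non-demo path makes a misconfigured `demo_mode: false` fail loudly instead of silently returning sample data.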


Key Takeaways

  1. Start simple: Use Single Asset for prototypes, then migrate to more granular approaches as needed.

  2. Visibility matters: In production, the Three Assets approach gives you the best debugging and monitoring capabilities.

  3. Multi-Asset is a sweet spot: Use it when you need atomic execution but want downstream consumers to reference specific outputs.

  4. Think about failure modes: Consider what happens when each phase fails and whether you need independent retry capability.

  5. Consider downstream consumers: If other assets or systems need to depend on intermediate results, use Three Assets or Multi-Asset.


Learn More
