This demo project showcases three different approaches to modeling ETL operations as Dagster assets, each with different trade-offs for visibility, control, and simplicity.
```bash
# Activate virtual environment
source .venv/bin/activate

# Validate definitions
uv run dg check defs

# Start development server
uv run dg dev
```

Then open http://localhost:3000 to view the Dagster UI.
Asset: customer_etl_single
Group: single_asset_approach
All ETL phases (Extract, Load, Transform) are combined into ONE asset.
```
┌──────────────────────────────────────┐
│         customer_etl_single          │
│ ┌─────────┐  ┌──────┐  ┌───────────┐ │
│ │ Extract │─→│ Load │─→│ Transform │ │
│ └─────────┘  └──────┘  └───────────┘ │
└──────────────────────────────────────┘
```
When to use:
- Simple pipelines with tightly-coupled operations
- Quick prototypes and proof-of-concepts
- When you don't need to retry individual phases
- When intermediate visibility isn't important
Pros:
- ✅ Simple to understand and reason about
- ✅ Single materialization = single execution
- ✅ Minimal orchestration overhead
- ✅ Good for operations that always run together
Cons:
- ❌ No visibility into individual phases in the UI
- ❌ Cannot independently retry extract vs transform
- ❌ Cannot selectively run only the transform step
- ❌ If extract succeeds but transform fails, must re-run entire pipeline
Code Example:

```python
@dg.asset(key="customer_etl_single")
def etl_asset(context):
    # All phases in one function
    data = extract_from_source()
    loaded = load_to_staging(data)
    transformed = transform_data(loaded)
    return transformed
```

Assets: orders_extract → orders_load → orders_transform
Group: three_asset_approach
Extract, Load, and Transform are THREE separate assets with explicit dependencies.
```
┌───────────────┐     ┌─────────────┐     ┌─────────────────┐
│orders_extract │ ──→ │ orders_load │ ──→ │orders_transform │
└───────────────┘     └─────────────┘     └─────────────────┘
```
When to use:
- Complex pipelines requiring visibility into each phase
- When you need to debug and monitor individual steps
- When phases may fail independently and need separate retries
- When downstream consumers need access to intermediate results
- Production pipelines where observability is critical
Pros:
- ✅ Full visibility into each ETL phase in the UI
- ✅ Can independently retry failed phases
- ✅ Can selectively run only certain phases (e.g., just transform)
- ✅ Intermediate results visible in Dagster UI
- ✅ Better for debugging and monitoring
- ✅ Can set different retry policies per phase
Cons:
- ❌ More assets to manage
- ❌ Requires explicit dependency management
- ❌ May have I/O overhead between phases
- ❌ More complex job definitions
Code Example:

```python
@dg.asset(key="orders_extract")
def extract_asset(context):
    return extract_from_source()


@dg.asset(key="orders_load", deps=["orders_extract"])
def load_asset(context):
    # Data is handed off between phases via external storage,
    # not in memory, so each asset re-reads its upstream output.
    return load_to_staging()


@dg.asset(key="orders_transform", deps=["orders_load"])
def transform_asset(context):
    return transform_data()
```

Assets: inventory_raw, inventory_staged, inventory_final (produced atomically)
Group: multi_asset_approach
A single function produces THREE assets atomically using @multi_asset.
```
┌──────────────────────────────────────────────────────────────┐
│                      etl_multi_asset()                       │
│ ┌───────────────┐ ┌──────────────────┐ ┌─────────────────┐   │
│ │ inventory_raw │ │ inventory_staged │ │ inventory_final │   │
│ │   (output)    │ │     (output)     │ │    (output)     │   │
│ └───────────────┘ └──────────────────┘ └─────────────────┘   │
└──────────────────────────────────────────────────────────────┘
```
When to use:
- Operations are tightly coupled in execution
- You need separate assets for downstream consumption
- Want to pass data between phases without I/O overhead
- Need "best of both worlds": simple execution + granular visibility
Pros:
- ✅ Atomic execution (all-or-nothing)
- ✅ Separate assets for downstream consumers to depend on
- ✅ Can pass data between outputs without intermediate I/O
- ✅ Reduced I/O overhead compared to separate assets
- ✅ All three assets visible separately in the Dagster UI
- ✅ Best balance of simplicity and observability
Cons:
- ❌ Cannot independently retry individual outputs
- ❌ All outputs must succeed or all fail
- ❌ Slightly more complex to understand than single asset
- ❌ Cannot selectively materialize just one output
Code Example:

```python
@dg.multi_asset(
    outs={
        "raw": dg.AssetOut(key="inventory_raw"),
        "staged": dg.AssetOut(key="inventory_staged"),
        "final": dg.AssetOut(key="inventory_final"),
    },
    can_subset=False,  # All outputs produced together
)
def etl_multi_asset(context):
    raw_data = extract_from_source()
    staged_data = load_to_staging(raw_data)
    final_data = transform_data(staged_data)
    return raw_data, staged_data, final_data
```

| Feature | Single Asset | Three Assets | Multi-Asset |
|---|---|---|---|
| Assets Created | 1 | 3 | 3 |
| Execution Model | Single function | 3 functions with deps | Single function, 3 outputs |
| Independent Retry | No | Yes | No |
| Selective Execution | No | Yes | No |
| UI Visibility | 1 asset | 3 assets with lineage | 3 assets (grouped) |
| I/O Between Phases | In-memory | External storage | In-memory |
| Downstream Dependencies | 1 target | 3 targets | 3 targets |
| Complexity | Low | Medium | Medium |
| Best For | Prototypes, simple ETL | Production, debugging | Coupled ops, multiple consumers |
Each approach has a corresponding schedule demonstrating different scheduling patterns:
| Schedule | Approach | Cron | Description |
|---|---|---|---|
| `daily_single_asset_etl` | Single Asset | `0 6 * * *` | Daily at 6 AM UTC |
| `hourly_three_asset_etl` | Three Assets | `0 * * * *` | Every hour |
| `twice_daily_multi_asset_etl` | Multi-Asset | `0 6,18 * * *` | 6 AM and 6 PM UTC |
Schedules use asset selection by group:
```yaml
asset_selection: "group:single_asset_approach"
```

```
asset-granularity-demo/
├── src/asset_granularity_demo/
│   ├── components/
│   │   ├── single_asset_etl.py           # Approach 1: Single Asset
│   │   ├── three_asset_etl.py            # Approach 2: Three Assets
│   │   ├── multi_asset_etl.py            # Approach 3: Multi-Asset
│   │   └── scheduled_job_component.py
│   ├── defs/
│   │   └── asset_granularity_pipeline/
│   │       └── defs.yaml                 # All component instances
│   └── definitions.py
├── pyproject.toml
└── README.md
```
```
START: Do phases need independent retry?
│
├── YES → Use Three Separate Assets
│
└── NO → Do downstream consumers need
         access to intermediate results?
         │
         ├── YES → Use Multi-Asset
         │
         └── NO → Is this a simple/prototype pipeline?
                  │
                  ├── YES → Use Single Asset
                  │
                  └── NO → Consider Three Separate Assets
                           for better observability
```
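Purely as an illustration (not part of the demo code), the decision tree can be folded into a small helper:

```python
def choose_granularity(needs_independent_retry: bool,
                       downstream_needs_intermediates: bool,
                       is_prototype: bool) -> str:
    """Walk the decision tree top to bottom."""
    if needs_independent_retry:
        return "three_assets"
    if downstream_needs_intermediates:
        return "multi_asset"
    if is_prototype:
        return "single_asset"
    # Default to the observable option for production pipelines.
    return "three_assets"
```

For example, `choose_granularity(False, True, False)` returns `"multi_asset"`: no independent retry is needed, but downstream consumers want intermediates.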
```bash
# Run all assets
uv run dagster asset materialize --select "*"

# Run only the single-asset approach
uv run dagster asset materialize --select "group:single_asset_approach"

# Run only the three-asset approach
uv run dagster asset materialize --select "group:three_asset_approach"

# Run only the multi-asset approach
uv run dagster asset materialize --select "group:multi_asset_approach"

# List all assets with dependencies
uv run dg list defs
```

All components in this demo run with `demo_mode: true`, which:
- Simulates data operations without real database/API connections
- Returns sample data for demonstration purposes
- Logs operations to show what would happen in production
To run with real connections, set `demo_mode: false` in the YAML and implement the actual extraction/loading logic.
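A minimal sketch of what such a toggle can look like inside an extract helper (hypothetical code; the sample rows are invented, and the real components live under src/asset_granularity_demo/components/):

```python
def extract_from_source(demo_mode: bool = True) -> list[dict]:
    """Return simulated rows in demo mode; require real wiring otherwise."""
    if demo_mode:
        # Simulated payload -- no database/API connection needed.
        return [{"id": 1, "name": "sample"}, {"id": 2, "name": "data"}]
    # Real extraction (SQL query, API call, ...) would go here.
    raise NotImplementedError("implement real extraction before disabling demo_mode")
```

Raising `NotImplementedError` when the toggle is flipped without real wiring fails fast instead of silently shipping sample data to production.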
- **Start simple:** Use Single Asset for prototypes, then migrate to more granular approaches as needed.
- **Visibility matters:** In production, the Three Assets approach gives you the best debugging and monitoring capabilities.
- **Multi-Asset is a sweet spot:** Use it when you need atomic execution but want downstream consumers to reference specific outputs.
- **Think about failure modes:** Consider what happens when each phase fails and whether you need independent retry capability.
- **Consider downstream consumers:** If other assets or systems need to depend on intermediate results, use Three Assets or Multi-Asset.