Author: Albert Bentov
Date: 2026-02-11
Status: Design Proposal
This document proposes a two-tiered approach for intelligent model selection in contextual ad generation:
- Simple Approach (immediate): Rule-based routing for current production based on expected click value (eCPM) and domain reputation
- Advanced Approach (post-online learning): Learned routing integrated with quality predictor (Ĉ) and performance predictor (P̂)
Expected Impact:
- 50-70% cost reduction on low-value impressions
- Maintained quality on high-value impressions
- 2-5x faster response times for most requests
- Profitability threshold enforcement per impression
Contents:
- Problem Statement
- API Cost Analysis (2026)
- Simple Approach: Rule-Based Routing
- Advanced Approach: Learned Routing
- Fast Wins: Input Token Optimization
- ROI Analysis & Break-Even Scenarios
- Implementation Roadmap
- Risk Mitigation
Our production system (ControlledAd.py) serves contextual ads with human-in-the-loop approval:
- Fetch article (title + body)
- Generate embeddings (256d)
- Find anchor ad via similarity search
- Exploration trigger: When predefined categories or approved candidates fail similarity threshold
- Brand safety check (LLM call on title + content)
- Generate candidate variants (LLM with mega-prompt: brand + styling + strategies + few-shot + safety instructions)
- Generate image
- Human approval → serve winning ad
Problem: We use expensive models uniformly regardless of:
- Expected click value (advertiser's willingness to pay)
- Domain quality/reputation (premium publishers vs low-traffic blogs)
- Content complexity (simple product ads vs nuanced brand campaigns)
Result: Unprofitable on low-eCPM impressions, over-engineered for simple contexts.
Objective: Dynamically select the model tier (LLM size, image generation quality) based on expected click value (eCPM), domain quality/reputation, and content complexity.
Constraint: Maintain quality standards while maximizing profit margin per impression.
| Model | Provider | Input Cost ($/1M tokens) | Output Cost ($/1M tokens) | Context | Speed | Use Case |
|---|---|---|---|---|---|---|
| GPT-5.2 | OpenAI | $1.75 | $14.00 | 400K | Fast | Premium tier |
| Gemini 3 Pro | Google | $2.00 | $12.00 | 200K | Fast | Premium tier |
| Gemini 3 Flash | Google | $0.50 | $3.00 | 1M | Very Fast | Balanced tier |
| Gemini 2.0 Flash | Google | $0.10 | $0.40 | 1M | Very Fast | Budget tier |
| Llama 3.1 8B | Groq | $0.05 | $0.08 | 128K | Ultra Fast | Ultra-budget |
| Mixtral 8x7B | Groq | $0.27 | $0.27 | 32K | Very Fast | Budget alternative |
| Claude Haiku | Anthropic | $1.00 | $5.00 | 200K | Fast | Budget fallback |
Key Observations:
- Gemini 2.0 Flash is 20x cheaper than Gemini 3 Pro on input and ~18x cheaper than GPT-5.2 on input (35x on output)
- Groq inference is 35-40x cheaper than premium models with acceptable quality trade-offs
- Context caching available on Gemini (75% savings on repeated prompts, cache reads at 10% of the input price); see the cost sketch after this list
- GPT-5.2 generates internal "thinking" tokens billed as output ($14/1M)
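To make the caching observation above concrete, here is a rough arithmetic sketch of the input-cost impact of caching the ~1,350 static mega-prompt tokens (see the prompt breakdown later in this document) on Gemini 3 Flash, assuming cache reads at 10% of the input price and ignoring cache storage fees:

```python
# Rough input-cost sketch for Gemini context caching on the tagline call.
# Assumption: cache reads billed at 10% of the input price; storage/TTL fees ignored.
INPUT_PRICE_PER_M = 0.50      # Gemini 3 Flash input price, $/1M tokens (table above)
STATIC_TOKENS = 1_350         # brand + styling + strategies + few-shot + safety
ARTICLE_TOKENS = 2_500        # full article

uncached = (STATIC_TOKENS + ARTICLE_TOKENS) * INPUT_PRICE_PER_M / 1e6
cached = (STATIC_TOKENS * 0.10 + ARTICLE_TOKENS) * INPUT_PRICE_PER_M / 1e6
print(f"uncached input ${uncached:.4f} vs cached input ${cached:.4f}")
# -> roughly $0.0019 vs $0.0013 per call: ~32% off input cost for this prompt shape.
```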
Sources: see the LLM pricing references in the Appendix.
| Model | Provider | Resolution | Cost per Image | Speed | Use Case |
|---|---|---|---|---|---|
| Imagen 3 | Google | 1024×1024 | $0.030 | ~8s | Premium tier |
| FLUX.1 [pro] | Replicate | 1024×1024 | $0.055 | ~10s | High quality |
| FLUX.1 [dev] | Replicate | 1024×1024 | $0.030 | ~6s | Balanced tier |
| FLUX.1 [schnell] | Replicate | 1024×1024 | $0.003 | ~2s | Budget tier |
Key Observations:
- Flux schnell is 10x cheaper than Imagen 3 with acceptable quality
- 2-3 second generation time enables real-time workflows
- Flux dev offers good balance (same price as Imagen 3, faster)
Sources: see the image model pricing references in the Appendix.
Production mega-prompt structure:
- Brand description: ~200 tokens
- Styling instructions: ~300 tokens
- Strategy guidelines: ~200 tokens
- Few-shot examples (3-5 examples): ~500 tokens
- Safety instructions: ~150 tokens
- Article content (full): ~2500 tokens
- Total input: ~3,850 tokens
Brand safety call:
- System prompt: ~200 tokens
- Article (title + content): ~2500 tokens
- Total safety check: ~2,700 tokens
Scenario: Generate 1 contextual ad with exploration triggered
| Component | Tokens/Params | Model | Cost |
|---|---|---|---|
| Brand safety check | 2,700 input + 50 output | GPT-5.2 (current prod) | $0.00543 |
| Article embedding | 2,500 input | text-embedding-3-small | $0.00005 |
| Tagline generation | 3,850 input + 150 output | GPT-5.2 or Gemini 3 Pro | $0.00884 |
| Image generation | 1 image | Imagen 3 | $0.03000 |
| Total (Premium) | | | $0.0443 |
Alternative (Budget):
| Component | Tokens/Params | Model | Cost |
|---|---|---|---|
| Brand safety check | 2,700 input + 50 output | GPT-5.2 (unchanged) | $0.00543 |
| Article embedding | 800 input (title + para1) | text-embedding-3-small | $0.00002 |
| Tagline generation (compact) | 1,200 input + 150 output | Gemini 2.0 Flash | $0.00018 |
| Image generation | 1 image | Flux schnell | $0.00300 |
| Total (Budget) | | | $0.0086 |
Savings: ~81% cost reduction per generation vs Premium (≈91% excluding the unchanged brand safety check)
Ultra-budget (Groq):
| Component | Tokens/Params | Model | Cost |
|---|---|---|---|
| Brand safety check | 2,700 input + 50 output | Llama 3.1 8B (Groq) | $0.00014 |
| Tagline generation (compact) | 1,200 input + 150 output | Llama 3.1 8B (Groq) | $0.00007 |
| Image generation | 1 image | Flux schnell | $0.00300 |
| Total (Ultra-budget) | | | $0.0032 |
Savings: 92% cost reduction, 5x faster
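The tier totals above can be reproduced with a small cost model. The sketch below uses only the token counts and prices from this document; the $0.02/1M embedding price is implied by the $0.00005 figure for 2,500 tokens:

```python
# Illustrative per-generation cost model; prices are $ per 1M tokens, images are flat per-image.
PRICES = {
    'gpt-5.2':                {'in': 1.75, 'out': 14.00},
    'gemini-2-flash':         {'in': 0.10, 'out': 0.40},
    'llama-3.1-8b-groq':      {'in': 0.05, 'out': 0.08},
    'text-embedding-3-small': {'in': 0.02, 'out': 0.00},
}
IMAGES = {'imagen-3': 0.030, 'flux-schnell': 0.003}

def llm_cost(model: str, tokens_in: int, tokens_out: int = 0) -> float:
    p = PRICES[model]
    return (tokens_in * p['in'] + tokens_out * p['out']) / 1e6

premium = (llm_cost('gpt-5.2', 2_700, 50)               # brand safety check
           + llm_cost('text-embedding-3-small', 2_500)  # article embedding
           + llm_cost('gpt-5.2', 3_850, 150)            # tagline generation
           + IMAGES['imagen-3'])                        # ≈ 0.0443

budget = (llm_cost('gpt-5.2', 2_700, 50)                # safety unchanged
          + llm_cost('text-embedding-3-small', 800)
          + llm_cost('gemini-2-flash', 1_200, 150)
          + IMAGES['flux-schnell'])                     # ≈ 0.0086

ultra = (llm_cost('llama-3.1-8b-groq', 2_700, 50)
         + llm_cost('llama-3.1-8b-groq', 1_200, 150)
         + IMAGES['flux-schnell'])                      # ≈ 0.0032

print(f"premium={premium:.4f} budget={budget:.4f} ultra={ultra:.4f}")
```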
def select_model_tier(
ecpm: float, # Expected CPM ($/1000 impressions)
domain_quality: str, # 'premium' | 'standard' | 'low'
content_length: int, # Article word count
campaign_type: str # 'brand_awareness' | 'performance'
) -> dict:
"""
Simple rule-based model selection.
Returns:
{
'llm': str,
'llm_tier': 'premium' | 'balanced' | 'budget',
'image': str,
'image_tier': 'premium' | 'balanced' | 'budget',
'input_mode': 'full_article' | 'title_plus_para1',
'max_cost': float
}
"""
    # eCPM routing thresholds (see ROI Analysis & Break-Even Scenarios for margin implications)
    MIN_ECPM_PREMIUM = 10.0   # $10 eCPM = $0.010 revenue per impression
    MIN_ECPM_BALANCED = 3.0   # $3 eCPM = $0.003 revenue per impression
# Decision tree
if ecpm >= MIN_ECPM_PREMIUM and domain_quality == 'premium':
# High-value, premium publishers → best quality
return {
'llm': 'gpt-5.2', # or 'gemini-3-pro'
'llm_tier': 'premium',
'image': 'imagen-3',
'image_tier': 'premium',
'input_mode': 'full_article',
'safety_model': 'gpt-5.2', # Current production
'max_cost': 0.0443
}
elif ecpm >= MIN_ECPM_BALANCED and domain_quality in ['premium', 'standard']:
# Mid-value, good publishers → balanced
return {
'llm': 'gemini-3-flash',
'llm_tier': 'balanced',
'image': 'flux-dev',
'image_tier': 'balanced',
'input_mode': 'title_plus_para1',
'safety_model': 'gpt-5.2', # Current production
'max_cost': 0.0117
}
elif campaign_type == 'brand_awareness':
# Brand campaigns → prioritize quality over cost
return {
'llm': 'gpt-5.2', # or 'gemini-3-pro'
'llm_tier': 'premium',
'image': 'flux-dev', # Balanced image sufficient
'image_tier': 'balanced',
'input_mode': 'title_plus_para1',
'safety_model': 'gpt-5.2', # Current production
'max_cost': 0.0159
}
else:
# Low-value or unproven domains → budget
return {
'llm': 'gemini-2-flash',
'llm_tier': 'budget',
'image': 'flux-schnell',
'image_tier': 'budget',
'input_mode': 'title_plus_para1',
'safety_model': 'gpt-5.2', # Current production
'max_cost': 0.0086
        }

Data Sources (existing in production):
- Impression count (from the `impressions` table)
- CTR history (clicks / impressions per domain)
- Human approval rate (from `controlled_ads`, type=2 vs type=-1)
- Publisher whitelist/blacklist
Simple Heuristic:
def classify_domain_quality(domain: str) -> str:
"""Classify domain based on historical stats."""
stats = get_domain_stats(domain)
if domain in PREMIUM_WHITELIST:
return 'premium'
if stats['impression_count'] > 10000 and stats['ctr'] > 0.02:
return 'premium'
if stats['impression_count'] > 1000 and stats['ctr'] > 0.01:
return 'standard'
    return 'low'

Modification point before exploration trigger:
def _trigger_exploration_async(self, selected_ad: Dict | None) -> None:
"""Trigger exploration with dynamic model selection."""
# NEW: Select model tier before generation
model_config = select_model_tier(
ecpm=self.calculate_ecpm(),
domain_quality=self.classify_domain(),
content_length=len(self.article_text.split()),
campaign_type=self.campaign_type
)
# Store config for exploration method to use
self.model_config = model_config
# Existing exploration logic...
if self.cache.get_from_cache(self.key_lock_exploration):
return
self.cache.update_cache(
self.key_lock_exploration,
{'exploration_in_progress': 1},
EXPIRATION_60_SEC
)
if self.exploration_method:
            async_call(self._execute_exploration_on_copy, selected_ad)

Traffic Distribution (estimated):
| Tier | % Traffic | Avg eCPM | Current Cost | New Cost | Savings |
|---|---|---|---|---|---|
| Premium | 15% | $12.00 | $0.0443 | $0.0443 | $0 |
| Balanced | 35% | $5.00 | $0.0443 | $0.0117 | $0.0326 |
| Budget | 50% | $1.50 | $0.0443 | $0.0086 | $0.0357 |
Total Savings: (0.35 × $0.0326) + (0.50 × $0.0357) = $0.0293 per impression (66% reduction)
Annual Impact (1M impressions/month):
- Current: $44,300/month
- New: $15,055/month
- Savings: $29,245/month ($350,940/year)
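A quick check of the blended figures, using the traffic split and per-tier costs above (small differences versus the quoted monthly numbers are rounding):

```python
# Blended per-impression cost under the 15/35/50 traffic split.
MIX = {'premium': 0.15, 'balanced': 0.35, 'budget': 0.50}
NEW_COST = {'premium': 0.0443, 'balanced': 0.0117, 'budget': 0.0086}
CURRENT_COST = 0.0443

blended_new = sum(MIX[t] * NEW_COST[t] for t in MIX)    # ~0.0150 per impression
savings_per_impression = CURRENT_COST - blended_new     # ~0.0293 (~66% reduction)
monthly_savings = savings_per_impression * 1_000_000    # ~$29K/month at 1M impressions
```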
Once the self-learning framework (Ĉ, P̂, DSPy) is operational, upgrade routing to use learned signals:
def select_model_tier_learned(
context: dict, # Brand, article, domain
C_hat_threshold: float = 0.7, # Quality predictor threshold
P_hat_threshold: float = 0.02, # Performance predictor threshold
ecpm: float = None
) -> dict:
"""
Learned model selection using quality and performance predictors.
Key insight: If we predict high approval (Ĉ) and high CTR (P̂),
it's worth investing in premium models. Otherwise, use budget.
"""
# Quick quality pre-check using Ĉ on anchor ad
anchor_quality = C_hat(context['brand'], context['article'], context['anchor'])
# Predicted performance using P̂ on anchor
predicted_ctr = P_hat(context['article'], context['anchor'])
# Calculate expected value of premium vs budget generation
premium_value = (
predicted_ctr * 1.2 * # Assume 20% CTR lift from premium models
ecpm / 1000 - # Revenue per impression
        0.0443  # Premium tier cost per generation (cost analysis above)
)
budget_value = (
predicted_ctr * # No CTR lift assumption
ecpm / 1000 - # Revenue per impression
        0.0086  # Budget tier cost per generation (cost analysis above)
)
# Decision: use premium only if EV is higher
if premium_value > budget_value and anchor_quality > C_hat_threshold:
return {
'llm': 'gpt-5.2', # or 'gemini-3-pro'
'llm_tier': 'premium',
'image': 'imagen-3',
'image_tier': 'premium',
'input_mode': 'full_article',
'expected_value': premium_value,
'reason': f'High quality ({anchor_quality:.2f}) + high CTR ({predicted_ctr:.3f})'
}
else:
return {
'llm': 'gemini-2-flash',
'llm_tier': 'budget',
'image': 'flux-schnell',
'image_tier': 'budget',
'input_mode': 'title_plus_para1',
'expected_value': budget_value,
'reason': f'Budget sufficient (quality={anchor_quality:.2f}, CTR={predicted_ctr:.3f})'
        }

Treat model tier selection as a contextual bandit problem:
- Context: (brand_id, domain_tier, content_category, article_length)
- Actions: (premium, balanced, budget)
- Reward: (revenue − cost) per impression
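For illustration, a minimal sketch of how the ModelTierBandit class defined below could be wired into serving. get_request_context, run_generation_for_tier, log_decision, and load_decision are hypothetical stand-ins for existing production hooks; in practice the reward (revenue − cost) arrives later via click/revenue attribution:

```python
bandit = ModelTierBandit()

def route_impression(request):
    """Pick a tier for this impression and remember the decision for later credit assignment."""
    context = get_request_context(request)          # hypothetical: brand_id, domain_tier, category, length
    tier = bandit.select_tier(context)              # 'premium' | 'balanced' | 'budget'
    cost = run_generation_for_tier(request, tier)   # hypothetical: runs the pipeline, returns $ spent
    log_decision(request.id, context, tier, cost)   # hypothetical: persisted for delayed feedback
    return tier

def on_revenue_attributed(request_id, revenue):
    """Delayed feedback job: update the bandit once revenue for the impression is known."""
    context, tier, cost = load_decision(request_id)  # hypothetical: read back the logged decision
    bandit.update(context, tier, reward=revenue - cost)
```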
import random
from collections import defaultdict

class ModelTierBandit:
    """Contextual bandit for model tier selection."""
    def __init__(self):
        # EpsilonGreedy and embed_context are assumed to be provided by the self-learning framework
        self.policy = EpsilonGreedy(epsilon=0.1)
        self.context_encoder = embed_context
        self.Q_table = defaultdict(lambda: {'premium': 0.0, 'balanced': 0.0, 'budget': 0.0})
def select_tier(self, context: dict) -> str:
"""Select model tier using ε-greedy policy."""
context_key = self.context_encoder(context)
if random.random() < self.policy.epsilon:
return random.choice(['premium', 'balanced', 'budget'])
else:
return max(self.Q_table[context_key], key=self.Q_table[context_key].get)
def update(self, context: dict, tier: str, reward: float):
"""Update Q-value after observing reward."""
context_key = self.context_encoder(context)
alpha = 0.1 # Learning rate
old_Q = self.Q_table[context_key][tier]
        self.Q_table[context_key][tier] = old_Q + alpha * (reward - old_Q)

Typical production prompt:
- Mega-prompt components: ~1,350 tokens
- Brand description: 200
- Styling instructions: 300
- Strategy guidelines: 200
- Few-shot examples: 500
- Safety instructions: 150
- Article (full): ~2,500 tokens
- Total input: ~3,850 tokens
Brand safety call:
- Article (title + content): ~2,500 tokens
- Safety prompt: ~200 tokens
- Total: ~2,700 tokens
Reduced input:
- Mega-prompt components: ~1,350 tokens (same)
- Article (title + para1): ~400 tokens
- Total input: ~1,750 tokens
Brand safety call (unchanged):
- Still uses full article for safety: ~2,700 tokens
Savings: 54% input token reduction on generation (safety unchanged for quality)
| Model | Full Article Input Cost | Compact Input Cost | Input Savings |
|---|---|---|---|
| GPT-5.2 | $0.00674 | $0.00306 | $0.00368 (55%) |
| Gemini 3 Pro | $0.00770 | $0.00350 | $0.00420 (55%) |
| Gemini 2.0 Flash | $0.00039 | $0.00018 | $0.00021 (54%) |
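The build_compact_prompt method shown next calls a self.extract_first_paragraph helper that is not included in this document; below is a minimal standalone sketch of the assumed behavior (paragraphs separated by blank lines, capped at max_words):

```python
def extract_first_paragraph(body: str, max_words: int = 150) -> str:
    """Return the first non-empty paragraph of an article body, capped at max_words words."""
    paragraphs = [p.strip() for p in body.split('\n\n') if p.strip()]
    first = paragraphs[0] if paragraphs else body
    words = first.split()
    return ' '.join(words[:max_words])
```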
def build_compact_prompt(self, context: dict) -> str:
"""Build prompt using only title + first paragraph."""
article_title = context['article']['title']
article_body = context['article']['body']
# Extract first paragraph (split by \n\n or first 150 words)
first_paragraph = self.extract_first_paragraph(article_body, max_words=150)
# Mega-prompt components (unchanged)
mega_prompt = self.build_mega_prompt_base(context['brand'])
prompt = f"""{mega_prompt}
Article title: {article_title}
Article excerpt: {first_paragraph}
Anchor tagline: {context['anchor']['tagline']}
Generate contextual tagline variant following brand guidelines above.
Tagline:
"""
    return prompt

Profit per impression:

profit = (eCPM / 1000) − generation_cost

Or for CPC campaigns:

profit = (CTR × CPC) − generation_cost
| Model Tier | Cost per Generation | Break-even eCPM (2× margin) | Break-even CPC (1% CTR) |
|---|---|---|---|
| Premium (GPT-5.2 + Imagen) | $0.0443 | $88.60 | $4.43 |
| Balanced (Gemini 3 Flash + Flux) | $0.0117 | $23.40 | $1.17 |
| Budget (Gemini 2 Flash + Flux) | $0.0086 | $17.20 | $0.86 |
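These break-even columns follow directly from the profit formulas above; a small sanity-check helper (the 2× margin and 1% CTR defaults mirror the table headers):

```python
def break_even(cost_per_generation: float, margin: float = 2.0, ctr: float = 0.01) -> tuple[float, float]:
    """Return (break-even eCPM at the given margin, break-even CPC at the given CTR)."""
    ecpm = margin * cost_per_generation * 1000   # revenue per 1000 impressions needed
    cpc = cost_per_generation / ctr              # CPC needed so one click at this CTR covers cost
    return ecpm, cpc

print(break_even(0.0443))   # ≈ (88.6, 4.43)  -> premium tier
print(break_even(0.0117))   # ≈ (23.4, 1.17)  -> balanced tier
print(break_even(0.0086))   # ≈ (17.2, 0.86)  -> budget tier
```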
Interpretation:
- Premium tier requires ~$89 eCPM to clear a 2× margin
- Budget tier breaks even at ~$8.60 eCPM (~$17 for a 2× margin)
- Ultra-budget (Groq + Flux schnell) breaks even at ~$3.20 eCPM (~$6.40 for a 2× margin)
Scenario 1: Mid-value campaign (eCPM = $5, CTR = 1.5%)
| Tier | Cost | Revenue | Profit | ROI |
|---|---|---|---|---|
| Premium | $0.0443 | $0.0050 | -$0.0393 | -88.7% |
| Balanced | $0.0117 | $0.0050 | -$0.0067 | -57.3% |
| Budget | $0.0086 | $0.0050 | -$0.0036 | -41.9% |
Conclusion: At $5 eCPM, no tier is profitable on a per-impression basis; the budget tier minimizes the loss.
Scenario 2: Premium publisher (CPC = $8, CTR = 2.5%)
| Tier | Cost | Revenue (CTR × CPC) | Profit | ROI |
|---|---|---|---|---|
| Premium | $0.0443 | $0.20 | $0.1557 | +351.5% |
| Balanced | $0.0117 | $0.20 | $0.1883 | +1609.4% |
| Budget | $0.0086 | $0.20 × 0.95 | $0.1814 | +2109.3% |
Insight: Even with -5% quality penalty, budget tier delivers highest ROI. Premium justified only for brand-sensitive campaigns.
Deliverables:
- `select_model_tier()` function with eCPM + domain quality routing
- Domain quality classifier (premium/standard/low)
- Integration into `ControlledAd._trigger_exploration_async()`
- Logging: model_tier, generation_cost, decision_reason (see the example record below)
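A possible shape for the per-decision log record; request_id, timestamp, llm, and image are assumed fields beyond the three named above:

```python
import time

def build_routing_log(request_id: str, config: dict, reason: str) -> dict:
    """Assemble the logging payload for one routing decision (sketch)."""
    return {
        'request_id': request_id,            # assumed join key against impression data
        'timestamp': time.time(),
        'model_tier': config['llm_tier'],
        'llm': config['llm'],
        'image': config['image'],
        'generation_cost': config['max_cost'],
        'decision_reason': reason,
    }
```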
Success criteria:
- 50% of traffic routed to budget tier
- No drop in approval rate
- Cost savings confirmed
Deliverables:
- `build_compact_prompt()` using title + para1
- A/B test framework (50/50 split)
- Quality monitoring dashboard
Success criteria:
- <5% approval rate drop
- <3% CTR drop
- 54% input token savings confirmed
Deliverables:
- Historical analysis: profit vs tier by campaign
- Per-campaign threshold learning
- Threshold update automation
Success criteria:
- 10% additional profit vs fixed thresholds
- Thresholds stable (not oscillating)
Deliverables:
- `select_model_tier_learned()` using Ĉ and P̂
- Expected value calculation framework
- Bandit policy for exploration
Prerequisites:
- Ĉ (quality predictor) trained and deployed
- P̂ (performance predictor) trained and deployed
- Propensity logging operational
Success criteria:
- 15% profit improvement vs rule-based
- Bandit policy converges
Risk: Budget models produce lower quality, reducing approval rate and CTR.
Mitigation:
- Start with conservative thresholds (only low-value traffic to budget)
- Monitor approval rate daily, alert if <80%
- Circuit breaker: auto-revert to premium if approval drops >10%
- Human review sample: 100 budget-generated ads for manual QA
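A minimal sketch of the approval-rate alert and circuit breaker described above. get_recent_approval_rate, alert, and router.force_tier are hypothetical hooks, and the 90% baseline is an assumption to be replaced with the measured premium-tier approval rate:

```python
BASELINE_APPROVAL = 0.90        # assumption: replace with measured premium-tier approval rate
APPROVAL_ALERT_FLOOR = 0.80     # alert if approval rate falls below 80% (as above)
MAX_RELATIVE_DROP = 0.10        # auto-revert if approval drops >10% vs baseline (as above)

def run_quality_circuit_breaker(router) -> None:
    """Daily job: alert on low approval and revert budget traffic to premium if it degrades."""
    rate = get_recent_approval_rate(tier='budget')            # hypothetical metrics helper
    if rate < APPROVAL_ALERT_FLOOR:
        alert(f'Budget-tier approval rate {rate:.1%} below {APPROVAL_ALERT_FLOOR:.0%} floor')
    if rate < BASELINE_APPROVAL * (1 - MAX_RELATIVE_DROP):
        router.force_tier('premium')                          # hypothetical override on the router
```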
Risk: eCPM thresholds miscalibrated, losing money on expensive generations.
Mitigation:
- Default to budget tier unless eCPM exceeds 2× generation cost
- Continuous profit analysis per tier
- Threshold adjustment automation
Risk: Primary model down or rate-limited, fallback needed.
Mitigation:
- Fallback chain: Gemini 2.0 Flash → Groq Llama → Claude Haiku
- Cache model availability status (Redis, 1min TTL)
- Alert if fallback rate >5%
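A sketch of the fallback chain with cached availability status. call_model, is_available, and mark_unavailable are hypothetical wrappers around the existing provider clients and the Redis status cache (60-second TTL, as stated above):

```python
FALLBACK_CHAIN = ['gemini-2-flash', 'llama-3.1-8b-groq', 'claude-haiku']

def generate_with_fallback(prompt: str) -> str:
    """Try each model in order, skipping any recently marked unavailable."""
    for model in FALLBACK_CHAIN:
        if not is_available(model):              # hypothetical: Redis-backed status check
            continue
        try:
            return call_model(model, prompt)     # hypothetical provider-agnostic client call
        except Exception:                        # rate limit / outage from any provider client
            mark_unavailable(model, ttl_seconds=60)  # hypothetical: flag in Redis for 60s
    raise RuntimeError('All models in the fallback chain are unavailable')
```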
| Model | Provider | Input | Output | Speed | Context |
|---|---|---|---|---|---|
| GPT-5.2 | OpenAI | $1.75 | $14.00 | Fast | 400K |
| Gemini 3 Pro | Google | $2.00 | $12.00 | Fast | 200K |
| Gemini 3 Flash | Google | $0.50 | $3.00 | Very Fast | 1M |
| Gemini 2.0 Flash | Google | $0.10 | $0.40 | Very Fast | 1M |
| Llama 3.1 8B | Groq | $0.05 | $0.08 | Ultra Fast | 128K |
| Mixtral 8x7B | Groq | $0.27 | $0.27 | Very Fast | 32K |
| Claude Haiku | Anthropic | $1.00 | $5.00 | Fast | 200K |
| Model | Provider | Resolution | Cost | Speed |
|---|---|---|---|---|
| Imagen 3 | Google | 1024×1024 | $0.030 | ~8s |
| FLUX.1 [pro] | Replicate | 1024×1024 | $0.055 | ~10s |
| FLUX.1 [dev] | Replicate | 1024×1024 | $0.030 | ~6s |
| FLUX.1 [schnell] | Replicate | 1024×1024 | $0.003 | ~2s |
- GPT-5.2 API Pricing
- GPT-5.2 Pricing Calculator
- Gemini API Pricing
- Gemini 3 Pricing Guide
- Groq Pricing
- Replicate Pricing
- Claude API Pricing
- AI Image Model Pricing
End of Document