Target Audience: Dev Engineers
Goal: Understand the dataset creation pipeline and training process
Context: Part of the LoudEcho online learning bandit system
- Big Picture: Where This Fits
- The Problem We're Solving
- Dataset Creation Pipeline
- Training Process
- Key Code Snippets
- Results & Performance
- Integration with Bandit System
The memorability scorer is one component in a multi-armed bandit system for ad creative optimization:
┌─────────────────────────────────────────────────────────────┐
│ ONLINE LEARNING BANDIT │
│ │
│ ┌────────────┐ ┌─────────────┐ ┌────────────────┐ │
│ │ DSPy │───▶│ Candidate │───▶│ Memorability │ │
│ │ Generator │ │ Ad Creatives│ │ Scorer │ │
│ └────────────┘ └─────────────┘ └────────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────┐ │
│ │ Scoring Function: │ │
│ │ │ │
│ │ Score = w₁·Quality(C) │ │
│ │ + w₂·Performance(P) │ │
│ │ + w₃·Novelty │ │
│ │ - penalties │ │
│ └──────────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────┐ │
│ │ Ad Selection & │ │
│ │ Serving Decision │ │
│ └────────┬───────────────┘ │
│ │ │
└─────────────────────────────┼───────────────────────────────┘
│
▼
┌───────────────────┐
│ Real Engagement │
│ Data (CTR, etc.) │
└─────────┬─────────┘
│
▼
┌───────────────────┐
│ Feedback Loop: │
│ Update Weights │
│ & Retrain Models │
└───────────────────┘
The Memorability Scorer predicts Quality(C) - how good an ad creative is in a given context.
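For orientation, here is a minimal sketch of how Quality(C) plugs into the composite bandit score above. The weights, the [0, 1] rescaling, and the penalty arguments are illustrative placeholders, not the production values:

```python
def composite_score(quality_1_10: float, predicted_ctr: float, novelty: float,
                    duplicate_penalty: float = 0.0, brand_safety_penalty: float = 0.0,
                    w1: float = 0.5, w2: float = 0.3, w3: float = 0.2) -> float:
    """Score = w1*Quality(C) + w2*Performance(P) + w3*Novelty - penalties.

    quality_1_10 is the memorability scorer's 1-10 output, rescaled to [0, 1]
    so it is comparable with CTR and novelty; the weights are retuned from
    real engagement data via the feedback loop.
    """
    quality = (quality_1_10 - 1.0) / 9.0  # 1-10 → [0, 1]
    return (w1 * quality
            + w2 * predicted_ctr
            + w3 * novelty
            - duplicate_penalty
            - brand_safety_penalty)
```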
Challenge: Given an ad creative and an article context, predict how memorable and effective the ad will be.
Why it matters:
- Manual review doesn't scale (thousands of ad variants)
- Post-hoc metrics (CTR) come too late for real-time decisions
- Need to rank candidates BEFORE serving to users
Solution approach:
- Train a "teacher model" (GPT-5.2 Vision) to score thousands of ad-article pairs
- Use those scores to train a fast, cheap XGBoost model for production
- Deploy the XGBoost model for real-time scoring
Dataset creation was 90% of the effort and cost (several hundred USD in API calls). Here's the full pipeline:
┌──────────────────────────────────────────────────────────────────┐
│ DATASET CREATION PIPELINE │
│ │
│ Step 1: Image Deduplication │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Pinterest + Twitter ad images (10K+) │ │
│ │ ↓ │ │
│ │ Perceptual hashing (aHash: 32×32 grayscale) │ │
│ │ ↓ │ │
│ │ 7,258 unique ads │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Step 2: Synthetic Article Generation │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ For each ad image: │ │
│ │ - GPT-5.2 Vision analyzes the ad │ │
│ │ - Generates contextually relevant article │ │
│ │ - 1:1 mapping (ad_00001 → article_00001) │ │
│ │ ↓ │ │
│ │ 7,258 synthetic articles │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Step 3: Ad Feature Extraction │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Extract M features (15 dimensions): │ │
│ │ - Visual: faces, clarity, clutter, contrast │ │
│ │ - Text: OCR, copy quality, concreteness │ │
│ │ - Creative: twist present, resolves fast │ │
│ │ - CLIP embeddings (512-dim) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Step 4: Article Feature Extraction │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Extract article features (4 dimensions): │ │
│ │ - Topic category │ │
│ │ - Named entities │ │
│ │ - Sentiment valence │ │
│ │ - Emotional arousal │ │
│ │ - Text embeddings (512-dim) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Step 5: Generate Negative Samples │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ For each ad: │ │
│ │ ✓ 1 positive pair (original synthetic match) │ │
│ │ ✗ 2 random negatives (any mismatched article) │ │
│ │ ✗ 1 safe contrast (different topic, similar tone) │ │
│ │ ↓ │ │
│ │ 29,032 total pairs │ │
│ │ - 7,258 positive (25%) │ │
│ │ - 14,516 random negatives (50%) │ │
│ │ - 7,258 safe contrast (25%) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Step 6: Compute Pair Features │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Combine ad + article → F features (6 dimensions): │ │
│ │ - sim_adtext_article (cosine similarity) │ │
│ │ - sim_adimage_article (CLIP similarity) │ │
│ │ - entity_overlap_rate (Jaccard) │ │
│ │ - sentiment_alignment (distance) │ │
│ │ - topic_match (binary) │ │
│ │ - contrast (inverse similarity) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Step 7: Teacher Scoring │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ For each of 29,032 pairs: │ │
│ │ - GPT-5.2 Vision evaluates: │ │
│ │ • Ad memorability & originality │ │
│ │ • Message clarity (< 2 sec?) │ │
│ │ • Emotional engagement │ │
│ │ • Contextual relevance │ │
│ │ - Outputs: score (1-10) + reasoning │ │
│ │ ↓ │ │
│ │ 29,032 labeled training examples │ │
│ │ Cost: ~$0.02/pair × 29K = $580+ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Step 8: Consolidate Dataset │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Merge all data into single file: │ │
│ │ - Images (as bytes) │ │
│ │ - Ad text + article text │ │
│ │ - 21 features (15 M + 6 F) │ │
│ │ - Teacher scores + reasoning │ │
│ │ ↓ │ │
│ │ memorability_dataset_consolidated.parquet (8.1 GB) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────┘
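A minimal sketch of the Step 6 pair features, assuming the ad and article features/embeddings from Steps 3-4 are already available as dicts. The exact formulas, field names, and the embedding used for the image-to-article similarity are assumptions, not the production code:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def compute_pair_features(ad: dict, article: dict) -> dict:
    """Combine one ad and one article into the six F features (sketch)."""
    sim_text = cosine_sim(ad["text_embedding"], article["text_embedding"])
    # assumes the article also has a CLIP-space text embedding to compare against the ad image
    sim_image = cosine_sim(ad["clip_image_embedding"], article["clip_text_embedding"])
    ad_ents, art_ents = set(ad["entities"]), set(article["entities"])
    union = ad_ents | art_ents
    return {
        "sim_adtext_article": sim_text,
        "sim_adimage_article": sim_image,
        "entity_overlap_rate": len(ad_ents & art_ents) / len(union) if union else 0.0,  # Jaccard
        "sentiment_alignment": abs(ad["sentiment_valence"] - article["sentiment_valence"]),  # distance
        "topic_match": int(ad["topic_category"] == article["topic_category"]),
        "contrast": 1.0 - sim_text,  # inverse similarity
    }
```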
Why synthetic articles instead of real ones?
- Real ad-article pairs have selection bias (ads are already chosen to fit)
- Synthetic articles let us control the distribution of positive/negative pairs
- We can generate "safe contrast" pairs (same sentiment, different topic) to teach the model nuance
Why negative samples?
- A model trained only on positive pairs can't discriminate
- Random negatives teach "this ad doesn't fit this article"
- Safe contrast negatives teach subtler distinctions (not just topic matching)
Why teacher-student approach?
- GPT-5.2 Vision is expensive ($0.02/pair) and slow (120s timeout)
- XGBoost is cheap ($0.0001/pair) and fast (< 1ms)
- Trade-off: upfront labeling cost → cheap inference forever
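For reference, a single teacher-scoring call (Step 7) looks roughly like the sketch below, using the OpenAI chat completions API with the ad image attached. The model name, prompt wording, and JSON response schema are illustrative, not the exact production prompt:

```python
import base64
import json

from openai import OpenAI

client = OpenAI()

def teacher_score(image_path: str, ad_text: str, article_text: str) -> dict:
    """Ask the vision teacher for a 1-10 score plus reasoning for one ad-article pair."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    prompt = (
        "Rate this ad creative from 1-10 for memorability/originality, message clarity "
        "within 2 seconds, emotional engagement, and relevance to the article below. "
        'Reply as JSON: {"score": <1-10>, "reasoning": "..."}.\n\n'
        f"Ad copy: {ad_text}\n\nArticle: {article_text}"
    )
    response = client.chat.completions.create(
        model="gpt-5.2",  # placeholder for the actual teacher model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        timeout=120,  # teacher calls are slow; the pipeline uses a 120s timeout
    )
    return json.loads(response.choices[0].message.content)
```

In the real pipeline these calls run with incremental saves and resume logic, so a timeout never loses completed work.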
Once we have the labeled dataset, training is straightforward:
┌────────────────────────────────────────────────────────┐
│ XGBOOST TRAINING PIPELINE │
│ │
│ 1. Load consolidated dataset │
│ ↓ │
│ 2. Split by ad_id (prevent leakage) │
│ - Train: 80% of unique ads │
│ - Test: 20% of unique ads │
│ ↓ │
│ 3. Encode categorical features │
│ - face_emotion: LabelEncoder │
│ ↓ │
│ 4. Train XGBoost regressor │
│ - Target: teacher_score_1_10 │
│ - Features: 21 (15 M + 6 F) │
│ - Hyperparameters: │
│ • n_estimators: 300 (early stopped at 215) │
│ • max_depth: 6 │
│ • learning_rate: 0.05 │
│ • L1 regularization: 1 │
│ • L2 regularization: 10 │
│ ↓ │
│ 5. Evaluate on test set │
│ - MAE: 0.605 (6% error on 10-point scale) │
│ - R²: 0.785 (explains 78.5% of variance) │
│ ↓ │
│ 6. Save model + encoders │
│ - baseline_xgboost_model.pkl │
│ │
└────────────────────────────────────────────────────────┘
Considered alternatives:
- Neural network: Requires more data, harder to interpret, overkill for 21 features
- Linear regression: Too simple, can't capture feature interactions
- Random Forest: Similar to XGBoost but slower and less accurate
Why XGBoost won:
- ✅ Handles tabular data with mixed types (categorical + numeric) out of the box
- ✅ Built-in regularization (L1, L2) prevents overfitting
- ✅ Feature importance scores for interpretability
- ✅ Fast training (< 5 min on CPU)
- ✅ Blazing fast inference (< 1ms per prediction)
- ✅ Industry standard for Kaggle/production regression tasks
Key design decisions:
- GroupShuffleSplit by ad_id: Prevents data leakage (same ad in train and test)
- Early stopping: Prevents overfitting by stopping at 215 trees when validation MAE plateaus
- Regression, not ranking: We want absolute scores, not just relative ordering (for now)
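Putting the split, hyperparameters, and evaluation together, a condensed training sketch. Column names follow this document and may not match the actual script; reusing the test set for early stopping is a shortcut to keep the sketch small:

```python
import pickle

import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import LabelEncoder

df = pd.read_parquet("memorability_dataset_consolidated.parquet")

FEATURES = ["clarity", "contrast", "sim_adtext_article", "topic_match", "is_ad",
            "twist_resolves_fast", "face_emotion",
            # ...plus the remaining M and F columns (21 in total)
            ]
TARGET = "teacher_score_1_10"

# Encode the categorical feature
le = LabelEncoder()
df["face_emotion"] = le.fit_transform(df["face_emotion"].astype(str))

# Split by ad_id so the same ad never appears in both train and test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["ad_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

model = xgb.XGBRegressor(
    n_estimators=300, max_depth=6, learning_rate=0.05,
    reg_alpha=1, reg_lambda=10,        # L1 / L2 regularization
    early_stopping_rounds=20,          # stopped around 215 trees in practice
    eval_metric="mae",
)
model.fit(train[FEATURES], train[TARGET],
          eval_set=[(test[FEATURES], test[TARGET])], verbose=False)

preds = model.predict(test[FEATURES])
print("MAE:", mean_absolute_error(test[TARGET], preds))
print("R2 :", r2_score(test[TARGET], preds))

with open("baseline_xgboost_model.pkl", "wb") as f:
    pickle.dump({"model": model, "face_emotion_encoder": le}, f)
```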
File: ImageDeduplicator.py:24-37
This was critical to avoid wasting money labeling duplicate ads.
# ImageDeduplicator.py module-level imports: import hashlib, cv2
@staticmethod
def compute_image_hash(image_path: str) -> str | None:
    """
    Compute aHash for an image.
    Algorithm: 32x32 grayscale → compare to mean → binary hash → MD5
    """
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if image is None:  # unreadable or corrupt file
        return None
    resized_image = cv2.resize(image, (32, 32))
    avg_pixel_value = resized_image.mean()
    # One bit per pixel: brighter than the mean → '1', darker → '0'
    hash_str = ''.join('1' if pixel > avg_pixel_value else '0'
                       for pixel in resized_image.flatten())
    return hashlib.md5(hash_str.encode()).hexdigest()

Why this works:
- Resizing to 32×32 normalizes for size variations
- Comparing to mean creates a perceptual fingerprint
- Near-identical images (different compression, minor edits) get same hash
- Result: Reduced 10K+ images → 7,258 unique ads
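A sketch of how the hash gets used to deduplicate the raw image set, assuming compute_image_hash lives on an ImageDeduplicator class (as the file name suggests); the keep-first policy and directory layout are assumptions:

```python
from pathlib import Path

from ImageDeduplicator import ImageDeduplicator  # assumed module/class layout

def deduplicate_images(image_dir: str) -> list[str]:
    """Keep the first image seen per perceptual hash, drop the rest."""
    seen: dict[str, str] = {}
    for path in sorted(Path(image_dir).glob("*.jpg")):
        img_hash = ImageDeduplicator.compute_image_hash(str(path))
        if img_hash is not None and img_hash not in seen:
            seen[img_hash] = str(path)
    return sorted(seen.values())

unique_paths = deduplicate_images("data/raw_ads")  # e.g. 10K+ raw images → 7,258 unique ads
```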
File: GenerateNegativeSamples.py:70-153
Teaching the model to discriminate requires careful negative sampling.
def generate_pairs(self):
    """Generate positive and negative pairs."""
    # requires: import random; from tqdm import tqdm
    # all_article_ids / articles_by_id are built earlier in this method (omitted here)
    pairs = []
    for _, ad_row in tqdm(self.df_ads.iterrows(), total=len(self.df_ads)):
        ad_id = ad_row['ad_id']
        ad_topic = ad_row.get('ad_topic', 'other')
        ad_sentiment = ad_row.get('copy_emotion_valence', 0.0)

        # 1. Positive pair (original 1:1 synthetic pairing)
        positive_article_id = ad_id  # ad_00001 → article_00001
        pairs.append({
            'ad_id': ad_id,
            'article_id': positive_article_id,
            'pair_type': 'positive'
        })

        # 2. Random negatives (any mismatched article)
        other_article_ids = [a_id for a_id in all_article_ids
                             if a_id != positive_article_id]
        random_article_ids = random.sample(other_article_ids,
                                           self.random_negatives_per_ad)
        for article_id in random_article_ids:
            pairs.append({
                'ad_id': ad_id,
                'article_id': article_id,
                'pair_type': 'random_negative'
            })

        # 3. Safe contrast negatives (different topic, similar sentiment)
        contrast_candidates = [
            a_id for a_id in other_article_ids
            if articles_by_id[a_id].get('topic_category') != ad_topic   # different topic
            and abs(articles_by_id[a_id].get('sentiment_valence', 0.0)
                    - ad_sentiment) < 0.5                               # similar sentiment
        ]
        contrast_article_ids = random.sample(
            contrast_candidates,
            min(self.contrast_negatives_per_ad, len(contrast_candidates))
        )
        for article_id in contrast_article_ids:
            pairs.append({
                'ad_id': ad_id,
                'article_id': article_id,
                'pair_type': 'safe_contrast'
            })

Why three types of negatives?
- Random negatives: Easy examples (car ad × cooking article = bad)
- Safe contrast: Hard examples (car ad × travel article = maybe okay?)
- Teaches nuance: Model learns contextual fit, not just topic matching
File: ConsolidateDataset.py:61-198
With 8.1 GB of data including images, we need batch processing to avoid OOM.
import glob
import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from tqdm import tqdm

# DATA_DIR, OUTPUT_FILE, and load_image_bytes() are defined at module level in ConsolidateDataset.py

def consolidate():
    """Merge all data into a single consolidated file (memory-efficient batching)."""
    BATCH_SIZE = 1000  # process 1000 rows at a time
    TEMP_DIR = f"{DATA_DIR}/temp_batches"
    os.makedirs(TEMP_DIR, exist_ok=True)

    # Load metadata (no images yet)
    df_pairs = pd.read_csv(f"{DATA_DIR}/pairs.csv")
    df_features = pd.read_csv(f"{DATA_DIR}/features_full.csv")
    df_scores = pd.read_csv(f"{DATA_DIR}/teacher_scores.csv")

    # Merge all metadata
    df = df_pairs.merge(df_features, on=['ad_id', 'article_id'])
    df = df.merge(df_scores, on=['ad_id', 'article_id'])

    # Process in batches (load images only for the current batch)
    num_batches = (len(df) + BATCH_SIZE - 1) // BATCH_SIZE
    for batch_idx in tqdm(range(num_batches), desc="Processing batches"):
        start_idx = batch_idx * BATCH_SIZE
        end_idx = min((batch_idx + 1) * BATCH_SIZE, len(df))
        df_batch = df.iloc[start_idx:end_idx].copy()

        # Load images for this batch only
        image_bytes_list = []
        for image_path in df_batch['image_path']:
            image_bytes_list.append(load_image_bytes(image_path))
        df_batch['image_bytes'] = image_bytes_list

        # Save batch
        df_batch.to_parquet(f"{TEMP_DIR}/batch_{batch_idx:04d}.parquet",
                            compression='snappy')

    # Merge batches efficiently using PyArrow
    batch_files = sorted(glob.glob(f"{TEMP_DIR}/batch_*.parquet"))
    tables = [pq.read_table(f) for f in batch_files]
    combined_table = pa.concat_tables(tables)
    pq.write_table(combined_table, OUTPUT_FILE, compression='snappy')

Why batch processing?
- Loading 29K images (8.1 GB) into memory at once = OOM crash
- Batch size of 1000 = manageable memory (~ 280 MB per batch)
- PyArrow for efficient Parquet I/O (3-5× faster than pandas alone)
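The same batch files double as checkpoints, which is what makes the resume behaviour mentioned later possible. A minimal sketch of that pattern (not the exact production logic):

```python
import os

def process_batches_with_resume(df, batch_size, temp_dir, process_batch):
    """Process df in fixed-size batches, skipping batches whose output file already exists."""
    os.makedirs(temp_dir, exist_ok=True)
    num_batches = (len(df) + batch_size - 1) // batch_size
    for batch_idx in range(num_batches):
        out_path = f"{temp_dir}/batch_{batch_idx:04d}.parquet"
        if os.path.exists(out_path):
            continue  # finished in a previous run; resume skips it
        start = batch_idx * batch_size
        end = min(start + batch_size, len(df))
        process_batch(df.iloc[start:end].copy()).to_parquet(out_path, compression="snappy")
```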
Test-set performance:
| Metric | Value | Interpretation |
|---|---|---|
| MAE | 0.605 | Predictions within ±0.6 points on 10-point scale (6% error) |
| RMSE | 0.842 | Slightly higher due to outliers, still strong |
| R² | 0.785 | Model explains 78.5% of score variance |
| Pair Type | MAE | Mean True Score | Mean Predicted |
|---|---|---|---|
| Positive | 0.682 | 4.93 | 4.67 |
| Random Negative | 0.587 | 3.09 | 3.14 |
| Safe Contrast | 0.562 | 3.05 | 3.10 |
Key insight: Model correctly separates positive pairs from negatives by ~1.6 points, which is what matters for ranking in production.
Top 5 feature importances:
- contrast (32.9%) - Inverse text similarity; ads benefit from differentiation
- sim_adtext_article (24.0%) - Text-article semantic alignment
- is_ad (15.3%) - Ad classification confidence
- clarity (10.4%) - Message clarity at a glance
- twist_resolves_fast (3.2%) - Creative twist resolution speed
Insight: Contextual fit features (contrast + similarity + entities) dominate at 58.7% combined importance.
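These rankings come straight from the trained booster; a quick way to reproduce them from the saved model, assuming the pkl layout from the training sketch above:

```python
import pickle

import pandas as pd

with open("baseline_xgboost_model.pkl", "rb") as f:
    artifact = pickle.load(f)
model = artifact["model"] if isinstance(artifact, dict) else artifact

names = model.get_booster().feature_names  # populated when training on a DataFrame
importances = pd.Series(model.feature_importances_, index=names)
print(importances.sort_values(ascending=False).head(5))
```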
┌─────────────────────────────────────────────────────────────┐
│ BANDIT SYSTEM: AD SELECTION FLOW │
│ │
│ 1. Generate candidates │
│ ┌──────────────────────────────────────┐ │
│ │ DSPy + GPT-5.2 → 10-20 ad variants │ │
│ └────────────────┬─────────────────────┘ │
│ │ │
│ 2. Extract features for each candidate │
│ ┌──────────────────────────────────────┐ │
│ │ M features (15): Visual + Copy │ │
│ │ F features (6): Context fit │ │
│ └────────────────┬─────────────────────┘ │
│ │ │
│ 3. Score with memorability model │
│ ┌──────────────────────────────────────┐ │
│ │ XGBoost prediction: score_1_10 │ │
│ │ Latency: < 1ms per candidate │ │
│ └────────────────┬─────────────────────┘ │
│ │ │
│ 4. Compute overall score │
│ ┌──────────────────────────────────────┐ │
│ │ Final = w₁·memorability │ │
│ │ + w₂·predicted_CTR │ │
│ │ + w₃·novelty │ │
│ │ - duplicate_penalty │ │
│ │ - brand_safety_penalty │ │
│ └────────────────┬─────────────────────┘ │
│ │ │
│ 5. Select & serve best candidate │
│ ┌──────────────────────────────────────┐ │
│ │ Thompson sampling or UCB policy │ │
│ └────────────────┬─────────────────────┘ │
│ │ │
│ 6. Collect feedback │
│ ┌──────────────────────────────────────┐ │
│ │ Actual CTR, dwell time, conversions │ │
│ └────────────────┬─────────────────────┘ │
│ │ │
│ 7. Update weights (monthly) │
│ ┌──────────────────────────────────────┐ │
│ │ Optimize w₁, w₂, w₃ based on │ │
│ │ real performance data │ │
│ └──────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Production Inference:
┌──────────────────┐
│ Ad Candidate │
│ + Article Text │
└────────┬─────────┘
│
▼
┌─────────────────────────────┐
│ Feature Extraction (cached)│
│ - M features: GPT-5.2 cache│
│ - F features: compute live │
│ Latency: ~50ms if cached │
└────────┬────────────────────┘
│
▼
┌─────────────────────────────┐
│ XGBoost Model (.pkl) │
│ - 21 features → score │
│ Latency: < 1ms │
│ Cost: $0.0001 per call │
└────────┬────────────────────┘
│
▼
┌─────────────────────────────┐
│ Score: 1-10 │
│ (feeds into bandit scorer) │
└─────────────────────────────┘
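On the serving side, scoring one candidate is just a feature-dict → predict call. A sketch, assuming the pkl bundles the model and the label encoder as in the training sketch above:

```python
import pickle

import pandas as pd

with open("baseline_xgboost_model.pkl", "rb") as f:
    artifact = pickle.load(f)
model, face_encoder = artifact["model"], artifact["face_emotion_encoder"]
feature_order = model.get_booster().feature_names

def score_candidate(features: dict) -> float:
    """Predict memorability (1-10) for one ad-article candidate; runs in <1ms on CPU."""
    row = pd.DataFrame([features])
    row["face_emotion"] = face_encoder.transform(row["face_emotion"].astype(str))
    return float(model.predict(row[feature_order])[0])

# Usage: merge the 15 cached M features with the 6 freshly computed F features
# score = score_candidate({**m_features, **f_features})
```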
Cost comparison:
- Before: GPT-5.2 scoring = $0.02/ad + 120s latency → not feasible for real-time
- After: XGBoost scoring = $0.0001/ad + < 1ms latency → production ready
- Perceptual hashing saved $200+ by deduplicating before API calls
Key takeaways:
- Synthetic article generation let us control the training distribution
- Teacher-student paradigm traded upfront cost for fast inference
- Batch processing + resume capability made the pipeline robust to failures
- XGBoost was the right choice for tabular data with 21 features
What was hard:
- Dataset creation took 90% of time and money
  - 29K API calls to GPT-5.2 Vision ($580+)
  - Incremental saves + resume logic to handle timeouts
  - Memory-efficient batching to avoid OOM
- Preventing data leakage
  - Must split by `ad_id`, not row-level (same ad appears in multiple pairs)
  - Careful validation of positive pair alignment
- Balancing negative sample types
  - Too many random negatives = model learns trivial patterns
  - Safe contrast negatives = harder to generate but critical for nuance
Future work:
- Replace teacher scores with real CTR data once we have enough traffic
- Add brand safety dimension (currently missing from M×F framework)
- Experiment with learning-to-rank instead of regression
- Fine-tune embeddings instead of using off-the-shelf CLIP/OpenAI
Summary stats:
| Metric | Value |
|---|---|
| Unique Ads | 7,258 |
| Synthetic Articles | 7,258 |
| Total Pairs | 29,032 |
| Features | 21 (15 M + 6 F) |
| Teacher Scores | 29,032 |
| Dataset Size | 8.1 GB |
| Training Time | < 5 minutes |
| Total Cost | ~$600 (mostly GPT-5.2 API) |
- Dataset: `albertbn/ad-memorability-scorer-v0` (HuggingFace, private)
- Model: `baseline_xgboost_model.pkl` (8 MB)
- Code: `/Labs/memorability/` (8 Python modules + utilities)
- Framework: Based on LoudEcho Creative Quality Framework (6 dimensions)
- Gist: Context Ad Learning Research
Questions? Ping the team or check the code in /Labs/memorability/
Generated for internal dev team walkthrough • February 2026