Ad Memorability Scorer: Project Walkthrough

Target Audience: Dev Engineers
Goal: Understand the dataset creation pipeline and training process
Context: Part of the LoudEcho online learning bandit system


Table of Contents

  1. Big Picture: Where This Fits
  2. The Problem We're Solving
  3. Dataset Creation Pipeline
  4. Training Process
  5. Key Code Snippets
  6. Results & Performance
  7. Integration with Bandit System

Big Picture: Where This Fits

The memorability scorer is one component in a multi-armed bandit system for ad creative optimization:

┌─────────────────────────────────────────────────────────────┐
│                   ONLINE LEARNING BANDIT                    │
│                                                             │
│  ┌────────────┐    ┌─────────────┐    ┌────────────────┐    │
│  │   DSPy     │───▶│  Candidate  │───▶│   Memorability │    │
│  │ Generator  │    │ Ad Creatives│    │     Scorer     │    │
│  └────────────┘    └─────────────┘    └────────┬───────┘    │
│                                                │            │
│                                                ▼            │
│                    ┌──────────────────────────────────┐     │
│                    │    Scoring Function:             │     │
│                    │                                  │     │
│                    │  Score = w₁·Quality(C)           │     │
│                    │        + w₂·Performance(P)       │     │
│                    │        + w₃·Novelty              │     │
│                    │        - penalties               │     │
│                    └──────────────┬───────────────────┘     │
│                                   │                         │
│                                   ▼                         │
│                    ┌────────────────────────┐               │
│                    │   Ad Selection &       │               │
│                    │   Serving Decision     │               │
│                    └────────┬───────────────┘               │
│                             │                               │
└─────────────────────────────┼───────────────────────────────┘
                              │
                              ▼
                    ┌───────────────────┐
                    │  Real Engagement  │
                    │  Data (CTR, etc.) │
                    └─────────┬─────────┘
                              │
                              ▼
                    ┌───────────────────┐
                    │  Feedback Loop:   │
                    │  Update Weights   │
                    │  & Retrain Models │
                    └───────────────────┘

The Memorability Scorer predicts Quality(C) - how good an ad creative is in a given context.
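
For orientation, the sketch below shows how such a combined score could be assembled in code. The weights, field names, and penalty terms are illustrative placeholders, not the production configuration.

# Minimal sketch of the bandit's combined scoring function.
# Weights, field names and penalties are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class CandidateSignals:
    memorability: float       # Quality(C), predicted by the memorability scorer
    predicted_ctr: float      # Performance(P), from the CTR model
    novelty: float            # novelty vs. recently served creatives
    duplicate_penalty: float = 0.0
    brand_safety_penalty: float = 0.0

def combined_score(s: CandidateSignals,
                   w1: float = 0.5, w2: float = 0.3, w3: float = 0.2) -> float:
    """Score = w1*Quality(C) + w2*Performance(P) + w3*Novelty - penalties"""
    return (w1 * s.memorability
            + w2 * s.predicted_ctr
            + w3 * s.novelty
            - s.duplicate_penalty
            - s.brand_safety_penalty)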


The Problem We're Solving

Challenge: Given an ad creative and an article context, predict how memorable and effective the ad will be.

Why it matters:

  • Manual review doesn't scale (thousands of ad variants)
  • Post-hoc metrics (CTR) come too late for real-time decisions
  • Need to rank candidates BEFORE serving to users

Solution approach:

  • Use a "teacher model" (GPT-5.2 Vision) to score thousands of ad-article pairs
  • Use those scores to train a fast, cheap XGBoost model for production
  • Deploy the XGBoost model for real-time scoring

Dataset Creation Pipeline

This was 90% of the effort and cost (several hundred USD in API calls). Here's the full pipeline:

┌──────────────────────────────────────────────────────────────────┐
│                    DATASET CREATION PIPELINE                     │
│                                                                  │
│  Step 1: Image Deduplication                                     │
│  ┌──────────────────────────────────────────────────────┐        │
│  │  Pinterest + Twitter ad images (10K+)                │        │
│  │              ↓                                       │        │
│  │  Perceptual hashing (aHash: 32×32 grayscale)         │        │
│  │              ↓                                       │        │
│  │  7,258 unique ads                                    │        │
│  └──────────────────────────────────────────────────────┘        │
│                                                                  │
│  Step 2: Synthetic Article Generation                            │
│  ┌──────────────────────────────────────────────────────┐        │
│  │  For each ad image:                                  │        │
│  │    - GPT-5.2 Vision analyzes the ad                  │        │
│  │    - Generates contextually relevant article         │        │
│  │    - 1:1 mapping (ad_00001 → article_00001)          │        │
│  │              ↓                                       │        │
│  │  7,258 synthetic articles                            │        │
│  └──────────────────────────────────────────────────────┘        │
│                                                                  │
│  Step 3: Ad Feature Extraction                                   │
│  ┌──────────────────────────────────────────────────────┐        │
│  │  Extract M features (15 dimensions):                 │        │
│  │    - Visual: faces, clarity, clutter, contrast       │        │
│  │    - Text: OCR, copy quality, concreteness           │        │
│  │    - Creative: twist present, resolves fast          │        │
│  │    - CLIP embeddings (512-dim)                       │        │
│  └──────────────────────────────────────────────────────┘        │
│                                                                  │
│  Step 4: Article Feature Extraction                              │
│  ┌──────────────────────────────────────────────────────┐        │
│  │  Extract article features (4 dimensions):            │        │
│  │    - Topic category                                  │        │
│  │    - Named entities                                  │        │
│  │    - Sentiment valence                               │        │
│  │    - Emotional arousal                               │        │
│  │    - Text embeddings (512-dim)                       │        │
│  └──────────────────────────────────────────────────────┘        │
│                                                                  │
│  Step 5: Generate Negative Samples                               │
│  ┌──────────────────────────────────────────────────────┐        │
│  │  For each ad:                                        │        │
│  │    ✓ 1 positive pair (original synthetic match)      │        │
│  │    ✗ 2 random negatives (any mismatched article)     │        │
│  │    ✗ 1 safe contrast (different topic, similar tone) │        │
│  │              ↓                                       │        │
│  │  29,032 total pairs                                  │        │
│  │    - 7,258 positive (25%)                            │        │
│  │    - 14,516 random negatives (50%)                   │        │
│  │    - 7,258 safe contrast (25%)                       │        │
│  └──────────────────────────────────────────────────────┘        │
│                                                                  │
│  Step 6: Compute Pair Features                                   │
│  ┌──────────────────────────────────────────────────────┐        │
│  │  Combine ad + article → F features (6 dimensions):   │        │
│  │    - sim_adtext_article (cosine similarity)          │        │
│  │    - sim_adimage_article (CLIP similarity)           │        │
│  │    - entity_overlap_rate (Jaccard)                   │        │
│  │    - sentiment_alignment (distance)                  │        │
│  │    - topic_match (binary)                            │        │
│  │    - contrast (inverse similarity)                   │        │
│  └──────────────────────────────────────────────────────┘        │
│                                                                  │
│  Step 7: Teacher Scoring                                         │
│  ┌──────────────────────────────────────────────────────┐        │
│  │  For each of 29,032 pairs:                           │        │
│  │    - GPT-5.2 Vision evaluates:                       │        │
│  │      • Ad memorability & originality                 │        │
│  │      • Message clarity (< 2 sec?)                    │        │
│  │      • Emotional engagement                          │        │
│  │      • Contextual relevance                          │        │
│  │    - Outputs: score (1-10) + reasoning               │        │
│  │              ↓                                       │        │
│  │  29,032 labeled training examples                    │        │
│  │  Cost: ~$0.02/pair × 29K = $580+                     │        │
│  └──────────────────────────────────────────────────────┘        │
│                                                                  │
│  Step 8: Consolidate Dataset                                     │
│  ┌──────────────────────────────────────────────────────┐        │
│  │  Merge all data into single file:                    │        │
│  │    - Images (as bytes)                               │        │
│  │    - Ad text + article text                          │        │
│  │    - 21 features (15 M + 6 F)                        │        │
│  │    - Teacher scores + reasoning                      │        │
│  │              ↓                                       │        │
│  │  memorability_dataset_consolidated.parquet (8.1 GB)  │        │
│  └──────────────────────────────────────────────────────┘        │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
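
To make Step 6 concrete, here is a minimal sketch of how the six F (pair) features could be computed from embeddings and metadata extracted in Steps 3-4. Field and function names are assumptions for illustration; the actual feature-extraction module may differ.

# Sketch of Step 6 (pair features). Embedding and field names are
# illustrative assumptions, not the real module's API.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def compute_pair_features(ad: dict, article: dict) -> dict:
    sim_text = cosine(ad["text_embedding"], article["text_embedding"])
    # assumes the article is also encoded with the CLIP text encoder
    sim_image = cosine(ad["clip_embedding"], article["clip_text_embedding"])
    ad_ents, art_ents = set(ad["entities"]), set(article["entities"])
    union = ad_ents | art_ents
    return {
        "sim_adtext_article": sim_text,
        "sim_adimage_article": sim_image,
        "entity_overlap_rate": len(ad_ents & art_ents) / len(union) if union else 0.0,
        "sentiment_alignment": 1.0 - abs(ad["copy_emotion_valence"] - article["sentiment_valence"]),
        "topic_match": int(ad["ad_topic"] == article["topic_category"]),
        "contrast": 1.0 - sim_text,   # inverse similarity
    }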

Why This Approach?

Why synthetic articles instead of real ones?

  • Real ad-article pairs have selection bias (ads are already chosen to fit)
  • Synthetic articles let us control the distribution of positive/negative pairs
  • We can generate "safe contrast" pairs (same sentiment, different topic) to teach the model nuance

Why negative samples?

  • A model trained only on positive pairs can't discriminate
  • Random negatives teach "this ad doesn't fit this article"
  • Safe contrast negatives teach subtler distinctions (not just topic matching)

Why teacher-student approach?

  • GPT-5.2 Vision is expensive ($0.02/pair) and slow (120s timeout)
  • XGBoost is cheap ($0.0001/pair) and fast (< 1ms)
  • Trade-off: upfront labeling cost → cheap inference forever
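
The teacher call itself (Step 7) is conceptually a single vision request per pair. The sketch below shows one way to do it with the OpenAI Python SDK; the model name, prompt, and JSON output schema are assumptions for illustration, not the production prompt.

# Sketch of one teacher-scoring call (Step 7). Model name, prompt and
# output schema are illustrative assumptions.
import base64
import json
from openai import OpenAI

client = OpenAI(timeout=120.0)      # matches the 120s timeout mentioned above
TEACHER_MODEL = "gpt-5.2"           # placeholder for the vision teacher model

def score_pair(image_path: str, article_text: str) -> dict:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    prompt = (
        "Score this ad for memorability/originality, message clarity (<2s), "
        "emotional engagement and relevance to the article below. "
        'Reply as JSON: {"score_1_10": <int>, "reasoning": "<short text>"}\n\n'
        f"ARTICLE:\n{article_text}"
    )
    response = client.chat.completions.create(
        model=TEACHER_MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)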

Training Process

Once we have the labeled dataset, training is straightforward:

┌────────────────────────────────────────────────────────┐
│               XGBOOST TRAINING PIPELINE                │
│                                                        │
│  1. Load consolidated dataset                          │
│     ↓                                                  │
│  2. Split by ad_id (prevent leakage)                   │
│     - Train: 80% of unique ads                         │
│     - Test: 20% of unique ads                          │
│     ↓                                                  │
│  3. Encode categorical features                        │
│     - face_emotion: LabelEncoder                       │
│     ↓                                                  │
│  4. Train XGBoost regressor                            │
│     - Target: teacher_score_1_10                       │
│     - Features: 21 (15 M + 6 F)                        │
│     - Hyperparameters:                                 │
│       • n_estimators: 300 (early stopped at 215)       │
│       • max_depth: 6                                   │
│       • learning_rate: 0.05                            │
│       • L1 regularization: 1                           │
│       • L2 regularization: 10                          │
│     ↓                                                  │
│  5. Evaluate on test set                               │
│     - MAE: 0.605 (6% error on 10-point scale)          │
│     - R²: 0.785 (explains 78.5% of variance)           │
│     ↓                                                  │
│  6. Save model + encoders                              │
│     - baseline_xgboost_model.pkl                       │
│                                                        │
└────────────────────────────────────────────────────────┘
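
A condensed version of that pipeline might look like the sketch below. Hyperparameters follow the diagram; the column names, module layout, and artifact format are simplified assumptions.

# Condensed training sketch. Column names and artifact layout are
# simplified assumptions; hyperparameters match the diagram above.
import joblib
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBRegressor

# In practice, pass columns=... to skip image_bytes (see the batching notes below)
df = pd.read_parquet("memorability_dataset_consolidated.parquet")

le = LabelEncoder()
df["face_emotion"] = le.fit_transform(df["face_emotion"].astype(str))

non_feature_cols = {"ad_id", "article_id", "pair_type", "image_path", "image_bytes",
                    "ad_text", "article_text", "teacher_score_1_10", "teacher_reasoning"}
feature_cols = [c for c in df.columns if c not in non_feature_cols]

# Split by ad_id so the same ad never lands in both train and test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["ad_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

model = XGBRegressor(
    n_estimators=300, max_depth=6, learning_rate=0.05,
    reg_alpha=1, reg_lambda=10,                 # L1 / L2 regularization
    early_stopping_rounds=20, eval_metric="mae",
)
model.fit(train[feature_cols], train["teacher_score_1_10"],
          eval_set=[(test[feature_cols], test["teacher_score_1_10"])])

joblib.dump({"model": model, "face_emotion_encoder": le, "feature_cols": feature_cols},
            "baseline_xgboost_model.pkl")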

Why XGBoost?

Considered alternatives:

  • Neural network: Requires more data, harder to interpret, overkill for 21 features
  • Linear regression: Too simple, can't capture feature interactions
  • Random Forest: Similar to XGBoost but slower and less accurate

Why XGBoost won:

  • ✅ Handles tabular data with mixed types (categorical + numeric) out of the box
  • ✅ Built-in regularization (L1, L2) prevents overfitting
  • ✅ Feature importance scores for interpretability
  • ✅ Fast training (< 5 min on CPU)
  • ✅ Blazing fast inference (< 1ms per prediction)
  • ✅ Industry standard for Kaggle/production regression tasks

Key design decisions:

  • GroupShuffleSplit by ad_id: Prevents data leakage (the same ad never appears in both train and test)
  • Early stopping: Prevents overfitting by stopping at 215 trees when validation MAE plateaus
  • Regression, not ranking: We want absolute scores, not just relative ordering (for now)

Key Code Snippets

1. Image Deduplication (Perceptual Hashing)

File: ImageDeduplicator.py:24-37

This was critical to avoid wasting money labeling duplicate ads.

import hashlib
import cv2

@staticmethod
def compute_image_hash(image_path: str) -> str | None:
    """
    Compute aHash for image
    Algorithm: 32x32 grayscale → compare to mean → binary hash → MD5
    """
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        return None

    resized_image = cv2.resize(image, (32, 32))
    avg_pixel_value = resized_image.mean()
    hash_str = ''.join('1' if pixel > avg_pixel_value else '0'
                      for pixel in resized_image.flatten())
    return hashlib.md5(hash_str.encode()).hexdigest()

Why this works:

  • Resizing to 32×32 normalizes for size variations
  • Comparing to mean creates a perceptual fingerprint
  • Near-identical images (different compression, minor edits) usually produce the same 32×32 pattern, and therefore the same hash
  • Result: Reduced 10K+ images → 7,258 unique ads
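
A minimal usage sketch of the hash for deduplication (the directory layout and the ImageDeduplicator import path are illustrative):

# Keep the first image seen per aHash; paths and import are illustrative.
from pathlib import Path
from ImageDeduplicator import ImageDeduplicator

seen_hashes = {}
unique_images = []
for image_path in sorted(Path("data/raw_ads").glob("*.jpg")):
    img_hash = ImageDeduplicator.compute_image_hash(str(image_path))
    if img_hash is None or img_hash in seen_hashes:
        continue    # unreadable file or perceptual duplicate
    seen_hashes[img_hash] = str(image_path)
    unique_images.append(str(image_path))

print(f"{len(unique_images)} unique ads kept")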

2. Negative Sample Generation Strategy

File: GenerateNegativeSamples.py:70-153

Teaching the model to discriminate requires careful negative sampling.

def generate_pairs(self):
    """Generate positive and negative pairs"""
    pairs = []

    # Lookup structures over the synthetic articles
    # (assumes self.df_articles holds the article metadata loaded elsewhere)
    all_article_ids = self.df_articles['article_id'].tolist()
    articles_by_id = self.df_articles.set_index('article_id').to_dict('index')

    for _, ad_row in tqdm(self.df_ads.iterrows(), total=len(self.df_ads)):
        ad_id = ad_row['ad_id']
        ad_topic = ad_row.get('ad_topic', 'other')
        ad_sentiment = ad_row.get('copy_emotion_valence', 0.0)

        # 1. Positive pair (original 1:1 synthetic pairing)
        positive_article_id = ad_id  # ad_00001 → article_00001
        pairs.append({
            'ad_id': ad_id,
            'article_id': positive_article_id,
            'pair_type': 'positive'
        })

        # 2. Random negatives (any mismatched article)
        other_article_ids = [a_id for a_id in all_article_ids
                            if a_id != positive_article_id]
        random_article_ids = random.sample(other_article_ids,
                                          self.random_negatives_per_ad)
        for article_id in random_article_ids:
            pairs.append({
                'ad_id': ad_id,
                'article_id': article_id,
                'pair_type': 'random_negative'
            })

        # 3. Safe contrast negatives (different topic, similar sentiment)
        contrast_candidates = [
            a_id for a_id in other_article_ids
            if articles_by_id[a_id].get('topic_category') != ad_topic  # Different topic
            and abs(articles_by_id[a_id].get('sentiment_valence', 0.0)
                   - ad_sentiment) < 0.5  # Similar sentiment
        ]

        contrast_article_ids = random.sample(
            contrast_candidates,
            min(self.contrast_negatives_per_ad, len(contrast_candidates))
        )
        for article_id in contrast_article_ids:
            pairs.append({
                'ad_id': ad_id,
                'article_id': article_id,
                'pair_type': 'safe_contrast'
            })

    return pairs

Why three types of negatives?

  • Random negatives: Easy examples (car ad × cooking article = bad)
  • Safe contrast: Hard examples (car ad × travel article = maybe okay?)
  • Teaches nuance: Model learns contextual fit, not just topic matching

3. Dataset Consolidation (Memory-Efficient)

File: ConsolidateDataset.py:61-198

With 8.1 GB of data including images, we need batch processing to avoid OOM.

def consolidate():
    """Merge all data into single consolidated file (memory-efficient batching)"""

    BATCH_SIZE = 1000  # Process 1000 rows at a time
    TEMP_DIR = f"{DATA_DIR}/temp_batches"

    # Load metadata (no images yet)
    df_pairs = pd.read_csv(f"{DATA_DIR}/pairs.csv")
    df_features = pd.read_csv(f"{DATA_DIR}/features_full.csv")
    df_scores = pd.read_csv(f"{DATA_DIR}/teacher_scores.csv")

    # Merge all metadata
    df = df_pairs.merge(df_features, on=['ad_id', 'article_id'])
    df = df.merge(df_scores, on=['ad_id', 'article_id'])

    # Process in batches (load images only for current batch)
    num_batches = (len(df) + BATCH_SIZE - 1) // BATCH_SIZE

    for batch_idx in tqdm(range(num_batches), desc="Processing batches"):
        start_idx = batch_idx * BATCH_SIZE
        end_idx = min((batch_idx + 1) * BATCH_SIZE, len(df))

        df_batch = df.iloc[start_idx:end_idx].copy()

        # Load images for this batch only
        image_bytes_list = []
        for image_path in df_batch['image_path']:
            image_bytes_list.append(load_image_bytes(image_path))

        df_batch['image_bytes'] = image_bytes_list

        # Save batch
        df_batch.to_parquet(f"{TEMP_DIR}/batch_{batch_idx:04d}.parquet",
                          compression='snappy')

    # Merge batches efficiently using PyArrow
    import glob
    import pyarrow as pa
    import pyarrow.parquet as pq

    batch_files = sorted(glob.glob(f"{TEMP_DIR}/batch_*.parquet"))
    tables = [pq.read_table(f) for f in batch_files]
    combined_table = pa.concat_tables(tables)
    pq.write_table(combined_table, OUTPUT_FILE, compression='snappy')

Why batch processing?

  • Loading 29K images (8.1 GB) into memory at once = OOM crash
  • Batch size of 1000 = manageable memory (~ 280 MB per batch)
  • PyArrow for efficient Parquet I/O (3-5× faster than pandas alone)
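
The same concern applies when reading the consolidated file back for training: selecting columns (and skipping image_bytes) keeps memory low. A short sketch:

# Sketch: read only the non-image columns from the consolidated Parquet.
import pyarrow.parquet as pq

pf = pq.ParquetFile("memorability_dataset_consolidated.parquet")
wanted = [name for name in pf.schema_arrow.names if name != "image_bytes"]
df_features = pf.read(columns=wanted).to_pandas()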

Results & Performance

Model Performance

Metric  Value  Interpretation
MAE     0.605  Predictions within ±0.6 points on the 10-point scale (6% error)
RMSE    0.842  Slightly higher due to outliers, still strong
R²      0.785  Model explains 78.5% of score variance

Performance by Pair Type

Pair Type        MAE    Mean True Score  Mean Predicted
Positive         0.682  4.93             4.67
Random Negative  0.587  3.09             3.14
Safe Contrast    0.562  3.05             3.10

Key insight: Model correctly separates positive pairs from negatives by ~1.6 points, which is what matters for ranking in production.
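
Both tables above are straightforward to reproduce on the held-out split; a sketch, continuing the training sketch from the previous section (column names assumed):

# Sketch: overall and per-pair-type metrics on the held-out split.
# Continues the training sketch (model, test, feature_cols).
from sklearn.metrics import mean_absolute_error, r2_score

test = test.assign(
    y_pred=model.predict(test[feature_cols]),
    abs_err=lambda d: (d["teacher_score_1_10"] - d["y_pred"]).abs(),
)
print("MAE:", mean_absolute_error(test["teacher_score_1_10"], test["y_pred"]))
print("R2: ", r2_score(test["teacher_score_1_10"], test["y_pred"]))
print(test.groupby("pair_type").agg(
    MAE=("abs_err", "mean"),
    mean_true=("teacher_score_1_10", "mean"),
    mean_pred=("y_pred", "mean"),
))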

Top 5 Feature Importances

  1. contrast (32.9%) - Inverse text similarity; ads benefit from differentiation
  2. sim_adtext_article (24.0%) - Text-article semantic alignment
  3. is_ad (15.3%) - Ad classification confidence
  4. clarity (10.4%) - Message clarity at a glance
  5. twist_resolves_fast (3.2%) - Creative twist resolution speed

Insight: Contextual fit features (contrast + similarity + entities) dominate at 58.7% combined importance.
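
The importances come straight from the trained model; continuing the training sketch above:

# Sketch: rank features by importance (continues the training sketch).
import pandas as pd

importances = pd.Series(model.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False).head(5))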


Integration with Bandit System

How the Memorability Scorer Fits

┌─────────────────────────────────────────────────────────────┐
│             BANDIT SYSTEM: AD SELECTION FLOW                │
│                                                             │
│  1. Generate candidates                                     │
│     ┌──────────────────────────────────────┐                │
│     │ DSPy + GPT-5.2 → 10-20 ad variants   │                │
│     └────────────────┬─────────────────────┘                │
│                      │                                      │
│  2. Extract features for each candidate                     │
│     ┌──────────────────────────────────────┐                │
│     │ M features (15): Visual + Copy       │                │
│     │ F features (6): Context fit          │                │
│     └────────────────┬─────────────────────┘                │
│                      │                                      │
│  3. Score with memorability model                           │
│     ┌──────────────────────────────────────┐                │
│     │ XGBoost prediction: score_1_10       │                │
│     │ Latency: < 1ms per candidate         │                │
│     └────────────────┬─────────────────────┘                │
│                      │                                      │
│  4. Compute overall score                                   │
│     ┌──────────────────────────────────────┐                │
│     │ Final = w₁·memorability              │                │
│     │       + w₂·predicted_CTR             │                │
│     │       + w₃·novelty                   │                │
│     │       - duplicate_penalty            │                │
│     │       - brand_safety_penalty         │                │
│     └────────────────┬─────────────────────┘                │
│                      │                                      │
│  5. Select & serve best candidate                           │
│     ┌──────────────────────────────────────┐                │
│     │ Thompson sampling or UCB policy      │                │
│     └────────────────┬─────────────────────┘                │
│                      │                                      │
│  6. Collect feedback                                        │
│     ┌──────────────────────────────────────┐                │
│     │ Actual CTR, dwell time, conversions  │                │
│     └────────────────┬─────────────────────┘                │
│                      │                                      │
│  7. Update weights (monthly)                                │
│     ┌──────────────────────────────────────┐                │
│     │ Optimize w₁, w₂, w₃ based on         │                │
│     │ real performance data                │                │
│     └──────────────────────────────────────┘                │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Deployment Architecture

Production Inference:
┌──────────────────┐
│  Ad Candidate    │
│  + Article Text  │
└────────┬─────────┘
         │
         ▼
┌─────────────────────────────┐
│  Feature Extraction (cached)│
│  - M features: GPT-5.2 cache│
│  - F features: compute live │
│  Latency: ~50ms if cached   │
└────────┬────────────────────┘
         │
         ▼
┌─────────────────────────────┐
│  XGBoost Model (.pkl)       │
│  - 21 features → score      │
│  Latency: < 1ms             │
│  Cost: $0.0001 per call     │
└────────┬────────────────────┘
         │
         ▼
┌─────────────────────────────┐
│  Score: 1-10                │
│  (feeds into bandit scorer) │
└─────────────────────────────┘

Cost comparison:

  • Before: GPT-5.2 scoring = $0.02/ad + 120s latency → not feasible for real-time
  • After: XGBoost scoring = $0.0001/ad + < 1ms latency → production ready
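
At serving time the scorer reduces to loading the pickle once and calling predict on a 21-feature row. A sketch, assuming the artifact layout from the training sketch above:

# Sketch of real-time scoring; assumes the artifact layout from the
# training sketch (model + encoder + feature column list in one pickle).
import joblib
import pandas as pd

artifact = joblib.load("baseline_xgboost_model.pkl")
model = artifact["model"]
face_emotion_encoder = artifact["face_emotion_encoder"]
feature_cols = artifact["feature_cols"]

def score_candidate(features: dict) -> float:
    """features: the 21 M + F values for one ad/article candidate."""
    row = pd.DataFrame([features])
    row["face_emotion"] = face_emotion_encoder.transform(row["face_emotion"].astype(str))
    return float(model.predict(row[feature_cols])[0])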

Key Takeaways for Devs

What Worked Well

  1. Perceptual hashing saved $200+ by deduplicating before API calls
  2. Synthetic article generation let us control the training distribution
  3. Teacher-student paradigm traded upfront cost for fast inference
  4. Batch processing + resume capability made the pipeline robust to failures
  5. XGBoost was the right choice for tabular data with 21 features

What Was Hard

  1. Dataset creation took 90% of time and money

    • 29K API calls to GPT-5.2 Vision ($580+)
    • Incremental saves + resume logic to handle timeouts
    • Memory-efficient batching to avoid OOM
  2. Preventing data leakage

    • Must split by ad_id, not at the row level (the same ad appears in multiple pairs)
    • Careful validation of positive pair alignment
  3. Balancing negative sample types

    • Too many random negatives = model learns trivial patterns
    • Safe contrast negatives = harder to generate but critical for nuance

Future Work

  • Replace teacher scores with real CTR data once we have enough traffic
  • Add brand safety dimension (currently missing from M×F framework)
  • Experiment with learning-to-rank instead of regression
  • Fine-tune embeddings instead of using off-the-shelf CLIP/OpenAI

Dataset Statistics

Metric              Value
Unique Ads          7,258
Synthetic Articles  7,258
Total Pairs         29,032
Features            21 (15 M + 6 F)
Teacher Scores      29,032
Dataset Size        8.1 GB
Training Time       < 5 minutes
Total Cost          ~$600 (mostly GPT-5.2 API)

References

  • Dataset: albertbn/ad-memorability-scorer-v0 (HuggingFace, private)
  • Model: baseline_xgboost_model.pkl (8 MB)
  • Code: /Labs/memorability/ (8 Python modules + utilities)
  • Framework: Based on LoudEcho Creative Quality Framework (6 dimensions)
  • Gist: Context Ad Learning Research

Questions? Ping the team or check the code in /Labs/memorability/


Generated for internal dev team walkthrough • February 2026
