@loganlinn
Last active December 12, 2025 23:26
Collision Risk Analysis for Nano ID Custom Alphabet Configuration

Collision Risk Analysis: customAlphabet('0123456789abcdefghijklmnopqrstuvwxyz', 15)

Configuration Summary

  • Alphabet: 0-9a-z (36 characters)
  • Length: 15 characters
  • Bits per character: log₂(36) ≈ 5.17 bits
  • Total entropy: 77.5 bits
  • Total possible IDs: 221,073,919,720,733,357,899,776 (~2.21 × 10²³)
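The figures in this summary can be reproduced in a few lines. This is a minimal sketch; the variable names are illustrative and not part of the Nano ID API.

```javascript
// Reproduce the configuration summary for a 36-character alphabet, length 15.
const alphabet = '0123456789abcdefghijklmnopqrstuvwxyz';
const length = 15;

const bitsPerChar = Math.log2(alphabet.length); // entropy contributed by each character
const totalBits = bitsPerChar * length;         // total entropy of one ID
// BigInt keeps the count exact; a double would round 36^15.
const totalIds = BigInt(alphabet.length) ** BigInt(length);

console.log(bitsPerChar.toFixed(2), totalBits.toFixed(1), totalIds.toString());
```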

Defensible Risk Assessment Framework

1. Volume-Based Thresholds

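The key metric is the total number of IDs that will ever exist over your system's lifetime. Probabilities below use the birthday approximation P ≈ n² / (2 × 36¹⁵) and give the chance of at least one collision at the top of each range:

| Total IDs (lifetime) | Collision Probability | Risk Assessment |
| --- | --- | --- |
| < 10M | < 2.3 × 10⁻¹⁰ | Safe - collision extremely unlikely |
| 10M - 100M | 2.3 × 10⁻¹⁰ - 2.3 × 10⁻⁸ | Acceptable - at worst about 1 in 44 million |
| 100M - 500M | 2.3 × 10⁻⁸ - 5.7 × 10⁻⁷ | Moderate - at worst about 1 in 1.8 million |
| 500M - 1B | 5.7 × 10⁻⁷ - 2.3 × 10⁻⁶ | Elevated - at worst about 1 in 440,000 |
| 1B - 10B | 2.3 × 10⁻⁶ - 2.3 × 10⁻⁴ (0.023%) | High risk - at worst about 1 in 4,400 |
| > 10B | > 2.3 × 10⁻⁴ | Unacceptable - risk keeps growing with the square of the volume |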
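To check the probability for any given volume, the birthday approximation can be evaluated directly. The sketch below assumes IDs are drawn uniformly at random from the full ID space; `collisionProbability` is a hypothetical helper, not a Nano ID function.

```javascript
// Birthday-bound probability that n uniformly random IDs contain at least
// one duplicate: P = 1 - exp(-n^2 / (2 * space)).
function collisionProbability(n, alphabetSize, length) {
  const space = Math.pow(alphabetSize, length); // a double is precise enough here
  // expm1 avoids the rounding error of 1 - Math.exp(-x) for very small x
  return -Math.expm1(-(n * n) / (2 * space));
}

for (const n of [1e7, 1e8, 1e9, 1e10]) {
  console.log(n.toExponential(0), collisionProbability(n, 36, 15).toExponential(2));
}
```

For small probabilities the result is effectively n² / (2 × 36¹⁵), so a tenfold increase in volume raises the risk a hundredfold.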

"Moderate-scale apps" = systems generating 10M-100M total IDs over their lifetime.

Examples:

  • SaaS with 100K users × 100 records each = 10M IDs ✅
  • E-commerce with 10M orders/year × 10 years = 100M IDs ⚠️ (borderline)
  • Social media with 100M posts/day = 100M in 1 day ❌

2. Collision Handling Capability

Does your system have collision detection/handling?

With uniqueness constraints (DB unique index, etc.):

  • Risk tolerance: < 1% collision probability is acceptable
  • Your config (77.5 bits): Stays under 1% up to ~66 billion IDs

Without collision handling (blind inserts):

  • Risk tolerance: < 0.01% collision probability required
  • Your config: Stays under 0.01% up to ~6.6 billion IDs

Why this matters: With DB constraints, collisions cause retries (performance hit). Without constraints, collisions cause data corruption.

3. Generation Rate & Time Window

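Collision probability depends on the total number of IDs, not on how fast they are generated; the rate only determines how soon you reach a given total:

| Scenario | IDs/day | Time to 100M | Risk Assessment |
| --- | --- | --- | --- |
| Small app | 1,000 | 274 years | Safe ✅ |
| Growing startup | 100,000 | 2.7 years | Monitor ⚠️ |
| High-volume API | 1,000,000 | 100 days | Risky ❌ |
| Distributed system | 10M+ | 10 days or less | Dangerous ❌ |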

Defensible criterion: If you'll reach 100M IDs within 5 years, reconsider this config.
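The time-to-threshold figures are simple division, sketched here under the assumption of a constant daily rate (real growth is rarely constant). `yearsToReach` and the scenario names are illustrative.

```javascript
// Years until a system generating `idsPerDay` reaches `targetIds` in total,
// assuming the rate stays constant.
function yearsToReach(targetIds, idsPerDay) {
  return targetIds / idsPerDay / 365;
}

const scenarios = { smallApp: 1_000, growingStartup: 100_000, highVolumeApi: 1_000_000 };
for (const [name, perDay] of Object.entries(scenarios)) {
  console.log(name, yearsToReach(100e6, perDay).toFixed(2), 'years');
}
```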

4. Distributed Generation Risk

Multiple servers generating IDs independently cannot deduplicate against each other before insert, so this framework adds a safety margin on top of the birthday bound:

Centralized Generation

  • Birthday problem applies normally
  • 77.5 bits safe for ~6.6B IDs @ <0.01% collision
  • Single point of coordination

Distributed Generation (N independent nodes)

  • Problem: IDs from different nodes share one ID space and can collide with each other
  • Note: If every node uses a cryptographically secure generator, the birthday bound on the combined total already counts cross-node collisions
  • Conservative adjustment: As margin for the concerns listed below, reduce safe capacity by a factor of √N, OR add log₂(N) bits

Defensible criteria by node count (safe count = centralized count / √N at the top of each range):

| Nodes | Safe ID Count (0.01% risk) | Recommended Length | Total Bits |
| --- | --- | --- | --- |
| 1 (centralized) | 6.6B | 15 chars | 77.5 |
| 2-5 nodes | 3.0B | 16 chars | 82.7 |
| 5-10 nodes | 2.1B | 16-17 chars | 82.7-87.9 |
| 10-50 nodes | 940M | 17-18 chars | 87.9-93.1 |
| 50+ nodes | < 940M | 19+ chars | 98.2+ |

Additional distributed concerns:

  • Uneven load distribution (hotspot nodes)
  • Synchronized bursts (batch jobs)
  • Clock skew in time-based generation
  • No coordination between nodes

Formula for distributed systems:

bits_needed = 77.5 + log₂(N)

For 10 nodes: 77.5 + 3.3 ≈ 81 bits → 16 characters
For 100 nodes: 77.5 + 6.6 ≈ 84 bits → 17 characters
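The formula translates into a length by rounding the extra bits up to whole characters. A minimal sketch; `charsForNodes` is a hypothetical helper, and the defaults mirror the configuration analyzed here.

```javascript
// Characters needed so that N nodes get log2(N) additional bits,
// following bits_needed = base_bits + log2(N).
function charsForNodes(nodes, baseLength = 15, alphabetSize = 36) {
  const extraBits = Math.log2(nodes);
  const bitsPerChar = Math.log2(alphabetSize);
  // A fraction of a character cannot be added, so round up.
  return baseLength + Math.ceil(extraBits / bitsPerChar);
}

console.log(charsForNodes(10));  // 16
console.log(charsForNodes(100)); // 17
```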

5. Consequence Severity

What happens if collision occurs?

Low consequence (collision acceptable):

  • Temporary cart IDs (retry on conflict)
  • Analytics event IDs (duplication tolerable)
  • Cache keys (overwrite acceptable)
  • Risk tolerance: < 1%

Medium consequence (collision causes errors):

  • Order IDs (customer confusion)
  • Invoice numbers (accounting issues)
  • URL slugs (SEO/user experience)
  • Risk tolerance: < 0.01%

High consequence (collision causes corruption):

  • Financial transaction IDs
  • Medical record identifiers
  • Authentication tokens
  • Risk tolerance: < 0.0001% (use 128+ bits)

6. Regulatory/Compliance Requirements

NIST/FIPS standards:

  • Security-sensitive: minimum 128 bits entropy
  • Your 77.5 bits: Not compliant for cryptographic use

PCI-DSS, HIPAA, SOC2:

  • Unpredictable identifiers required
  • Must use the default secure generator (`nanoid`), not the `nanoid/non-secure` variant
  • Minimum 80 bits recommended (you're at 77.5 ⚠️)

7. Mathematical Safety Threshold

Common engineering target: < 10⁻⁶ collision probability

For your config (36¹⁵ possibilities):

Probability per ID pair: p = 1 / 36¹⁵ ≈ 4.5 × 10⁻²⁴

Birthday problem formula:
P(collision) ≈ n² / (2 × 36¹⁵)

Safe ID count where P(any collision) < 10⁻⁶:
n ≈ √(2 × 36¹⁵ × 10⁻⁶) ≈ 665 million IDs

Defensible thresholds:

  • < 10⁻⁶ risk: Stay under ~660M IDs
  • < 10⁻⁵ risk: Stay under ~2.1B IDs
  • < 10⁻⁴ risk: Stay under ~6.6B IDs
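The threshold for any risk level follows from inverting the birthday formula. The sketch below uses the exact form P = 1 − exp(−n² / (2 × space)) rather than the small-probability approximation; `maxIdsForRisk` is a hypothetical helper and assumes uniformly random IDs.

```javascript
// Largest n with P(any collision) <= risk, from inverting
// P = 1 - exp(-n^2 / (2 * space))  =>  n = sqrt(-2 * space * ln(1 - risk)).
function maxIdsForRisk(risk, alphabetSize, length) {
  const space = Math.pow(alphabetSize, length);
  // log1p(-risk) computes ln(1 - risk) accurately for small risk
  return Math.floor(Math.sqrt(-2 * space * Math.log1p(-risk)));
}

for (const risk of [1e-6, 1e-5, 1e-4]) {
  console.log(risk, maxIdsForRisk(risk, 36, 15).toExponential(2));
}
```

Because n scales with the square root of the risk, relaxing the tolerance by a factor of 100 only buys a factor of 10 in volume.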

Concrete Decision Matrix

IF centralized AND total_ids < 20M AND has_unique_constraint:
    ✅ SAFE

ELIF centralized AND total_ids < 50M AND has_unique_constraint:
    ⚠️ ACCEPTABLE (monitor collision rate)

ELIF distributed AND nodes < 10 AND total_ids < 15M AND has_unique_constraint:
    ⚠️ ACCEPTABLE (consider adding characters)

ELIF distributed AND nodes >= 10:
    ❌ INCREASE TO 17+ CHARACTERS

ELIF total_ids < 500M AND has_unique_constraint AND low_consequence:
    ⚠️ RISKY (plan migration path)

ELIF is_security_sensitive OR no_unique_constraint:
    ❌ NOT RECOMMENDED

ELIF total_ids > 1B:
    ❌ UNACCEPTABLE

ELSE:
    ⚠️ EVALUATE (consider cost of collision vs. migration)
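The matrix can be transcribed into a function for use in a design review script. This is a sketch only: the field names (`centralized`, `nodes`, `totalIds`, `hasUniqueConstraint`, `lowConsequence`, `securitySensitive`) are illustrative, and the branch order is copied from the matrix because earlier branches take precedence.

```javascript
// Direct transcription of the decision matrix; the first matching branch wins.
function assessConfig(s) {
  const distributed = !s.centralized;
  if (s.centralized && s.totalIds < 20e6 && s.hasUniqueConstraint) return 'SAFE';
  if (s.centralized && s.totalIds < 50e6 && s.hasUniqueConstraint) {
    return 'ACCEPTABLE (monitor collision rate)';
  }
  if (distributed && s.nodes < 10 && s.totalIds < 15e6 && s.hasUniqueConstraint) {
    return 'ACCEPTABLE (consider adding characters)';
  }
  if (distributed && s.nodes >= 10) return 'INCREASE TO 17+ CHARACTERS';
  if (s.totalIds < 500e6 && s.hasUniqueConstraint && s.lowConsequence) {
    return 'RISKY (plan migration path)';
  }
  if (s.securitySensitive || !s.hasUniqueConstraint) return 'NOT RECOMMENDED';
  if (s.totalIds > 1e9) return 'UNACCEPTABLE';
  return 'EVALUATE (consider cost of collision vs. migration)';
}

console.log(assessConfig({ centralized: true, totalIds: 10e6, hasUniqueConstraint: true }));
```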

Specific Examples

✅ Safe Use Cases

Centralized, low-volume:

  • Startup SaaS with 10K users (10M records max over lifetime)
  • Internal tool with 50K entities/year
  • Blog with 1M posts over 10 years
  • E-commerce with < 5M orders lifetime
  • Mobile app with offline-first sync (single user device)

⚠️ Borderline (Monitor Carefully)

Centralized, growing volume:

  • Growing platform: 1M users → 100M records
  • API serving 10K requests/sec (864M/day - collision risk in months)
  • Multi-year project with uncertain growth trajectory

Distributed, low-volume:

  • 3-5 microservices generating < 20M total IDs
  • Small distributed system with < 10M IDs

❌ Not Recommended

High volume:

  • Twitter-scale (500M tweets/day)
  • Distributed logging system (billions of events)
  • Any system expecting > 100M IDs

Distributed systems:

  • 10+ nodes without coordination
  • Cloud auto-scaling (unknown node count)
  • Multi-region deployments

Compliance/security:

  • Payment processor (PCI-DSS)
  • Session tokens (security requirement: 128+ bits)
  • Any HIPAA/PCI regulated identifier
  • API keys or authentication tokens

No collision handling:

  • Blind inserts without unique constraints
  • Append-only systems without validation
  • Legacy systems that can't handle retry logic

Bottom Line Formulas

Centralized Generation

Maximum safe ID count (for <0.01% collision risk, p = 10⁻⁴):

max_ids ≈ √(2 × p × alphabet_size ^ length)
       ≈ √(2 × 10⁻⁴ × 36¹⁵)
       ≈ 6.6 billion IDs

Distributed Generation

Conservative adjustment (for N nodes):

max_ids_distributed ≈ max_ids_centralized / √N

For 10 nodes: 6.6B / √10 ≈ 2.1B IDs
For 100 nodes: 6.6B / √100 ≈ 660M IDs

Or increase length to compensate:

chars_needed = 15 + (log₂(N) / 5.17), rounded up to a whole character

For 10 nodes: 15 + (3.3 / 5.17) ≈ 15.6 → 16 characters
For 100 nodes: 15 + (6.6 / 5.17) ≈ 16.3 → 17 characters

Safety Checklist

Your config (15 chars, 36-char alphabet) is defensibly safe if ALL of:

  • Generation topology is ONE of:
      ◦ Centralized generation (single server/process)
      ◦ Distributed < 10 nodes AND total IDs < 15M
      ◦ Distributed 10+ nodes, but only after increasing to 17+ chars
  • DB unique constraints exist (retry on collision)
  • Total lifetime IDs < 50M (centralized) OR < 15M (distributed)
  • Not security-sensitive (use secure variant if borderline)
  • Low-medium consequence of collision (not financial/medical)
  • No compliance requirements (PCI/HIPAA/SOC2)

Needs stronger config if ANY of:

  • Growth trajectory unclear or aggressive
  • Will scale to distributed system
  • No unique constraints / can't handle collisions
  • Security-sensitive identifiers
  • Cost of collision > cost of longer IDs
  • Regulatory compliance required

Configuration Alternatives

Option 1: Increase Length (keep alphabet)

For centralized systems:

// 82.7 bits - safe for 100M IDs
customAlphabet('0123456789abcdefghijklmnopqrstuvwxyz', 16)

// 93 bits - safe for 500M IDs
customAlphabet('0123456789abcdefghijklmnopqrstuvwxyz', 18)

// 103 bits - safe for 5B IDs
customAlphabet('0123456789abcdefghijklmnopqrstuvwxyz', 20)

// 129 bits - UUID-level safety
customAlphabet('0123456789abcdefghijklmnopqrstuvwxyz', 25)

For distributed systems (10+ nodes):

// 87.9 bits - safe for 15M IDs across 10 nodes
customAlphabet('0123456789abcdefghijklmnopqrstuvwxyz', 17)

// 93 bits - safe for 50M IDs across 10 nodes
customAlphabet('0123456789abcdefghijklmnopqrstuvwxyz', 18)

// 98.2 bits - safe for 150M IDs across 50 nodes
customAlphabet('0123456789abcdefghijklmnopqrstuvwxyz', 19)

Option 2: Add Uppercase (keep length)

// 89.3 bits - safe for 1B IDs centralized, 100M distributed (10 nodes)
customAlphabet('0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz', 15)

Option 3: Default Nanoid (recommended)

// 126 bits - industry standard, handles billions of IDs
import { nanoid } from 'nanoid'
nanoid() // 21 chars, URL-safe alphabet (A-Za-z0-9_-)

Option 4: Custom with Safety Margin

// Add 3 chars for ~15 bits safety margin
customAlphabet('0123456789abcdefghijklmnopqrstuvwxyz', 18) // 93 bits

// Handles growth: 50M → 500M IDs without migration

Monitoring & Mitigation

Detect Collisions

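// With DB unique constraint: retry a bounded number of times.
// `db`, `logger`, `record`, and the 'UNIQUE_VIOLATION' code are placeholders;
// the actual error code depends on your database driver.
const MAX_ATTEMPTS = 3
for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
  try {
    await db.insert({ id: nanoid(), ...record })
    break
  } catch (error) {
    // Rethrow unrelated errors, and give up after the last attempt
    if (error.code !== 'UNIQUE_VIOLATION' || attempt === MAX_ATTEMPTS) throw error
    // Log collision event for monitoring, then retry with a new ID
    logger.warn('ID collision detected', { attempts: attempt })
  }
}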

Monitor Collision Rate

// Track collision frequency
const collisionRate = collisions / totalGenerated

// Alert thresholds
if (collisionRate > 0.0001) {
  alert('Collision rate exceeds 0.01% - consider longer IDs')
}

Migration Strategy

If approaching limits:

  1. Add new ID column with longer config
  2. Dual-write to both columns temporarily
  3. Backfill old records asynchronously
  4. Switch reads to new column
  5. Drop old column after validation

Summary

| Architecture | Assessment | Recommendation |
| --- | --- | --- |
| Centralized, < 50M IDs | ✅ Current config safe | Monitor growth |
| Centralized, 50-500M IDs | ⚠️ Acceptable with constraints | Consider 18+ chars |
| Distributed < 10 nodes, < 15M IDs | ⚠️ Acceptable | Consider 16+ chars |
| Distributed 10+ nodes | ❌ Insufficient | Use 17-19+ chars |
| Security-sensitive | ❌ Insufficient | Use 21+ chars (126+ bits) |
| Billions of IDs | ❌ Insufficient | Use default nanoid (21 chars) |

Most important factors:

  1. Total lifetime ID count (not just current)
  2. Distributed vs. centralized generation
  3. Unique constraints (collision handling)
  4. Consequence severity (retry cost vs. data corruption)
  5. Growth trajectory (can you migrate later if needed?)

Analysis Date: 2025-12-09
Configuration: customAlphabet('0123456789abcdefghijklmnopqrstuvwxyz', 15)
Verdict: Safe for centralized systems < 50M IDs with DB constraints. Add 1-2 characters per 10x node increase in distributed systems.
