MECE Knowledge Graph Implementation - Pseudocode

Overview

An agentic system that builds MECE-compliant knowledge graphs, using four complementary mechanisms to prevent duplicate entities and keep concepts atomic and mutually exclusive.


Architecture

┌─────────────────────────────────────────────────────────────┐
│                   Graph Query Tool                           │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │   Semantic   │  │   Synonym    │  │   Naming     │      │
│  │    Search    │  │    Mapping   │  │  Convention  │      │
│  │ (Embeddings) │  │   (Runtime)  │  │   (Strict)   │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
│         │                  │                  │              │
│         └──────────────────┴──────────────────┘              │
│                            │                                 │
│                    ┌───────▼────────┐                        │
│                    │  MECE Entity   │                        │
│                    │  Deduplication │                        │
│                    └───────┬────────┘                        │
│                            │                                 │
│                    ┌───────▼────────┐                        │
│                    │ Post-Processing│                        │
│                    │     Merge      │                        │
│                    └────────────────┘                        │
└─────────────────────────────────────────────────────────────┘
                            │
                            │
                    ┌───────▼────────┐
                    │   MCP Server   │
                    │   (13 Tools)   │
                    └───────┬────────┘
                            │
                    ┌───────▼────────┐
                    │  Claude Agent  │
                    │  (Per Scheme)  │
                    └────────────────┘

Data Structures

// Core entities with vector embeddings
interface Entity {
  id: string                    // Format: "type:normalized_name"
  type: string                  // Strict types: scheme, beneficiary_type, etc.
  canonical: string             // Original name
  aliases: string[]             // All known variants
  embedding: number[]           // 768-dim vector from text-embedding-004
  metadata?: any
}

// Relationships between entities
interface Triple {
  subject: string               // Entity ID
  predicate: string             // Relationship (targets, provides, requires_age, etc.)
  object: string                // Entity ID
  confidence?: number           // Optional confidence score
}

// Main graph structure
interface SPOGraph {
  entities: Record<string, Entity>
  triples: Triple[]
  entityIndex: {
    byType: Record<string, string[]>           // Fast type lookup
    byCanonical: Record<string, string>        // Fast canonical lookup
    byAlias: Record<string, string>            // Fast alias lookup
  }
  synonymMap: Record<string, string>           // Dynamic variant → canonical
  processedSchemes: string[]
  lastUpdated: string
}
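
A concrete Entity instance (values hypothetical) makes the id format and alias mechanics tangible:

// Hypothetical example: the "widow" node after several schemes have
// referenced it under different surface forms.
const widowEntity: Entity = {
  id: "beneficiary_type:widow",            // format: "type:normalized_name"
  type: "beneficiary_type",
  canonical: "widow",
  aliases: ["widow", "widows", "vidhwa"],  // variants resolved to this node
  embedding: [0.123, -0.045 /* ...768 values total... */],
}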

MECE Mechanism #1: Vector/Semantic Search

Purpose

Prevent semantic duplicates such as "widow" vs. "widows" vs. "vidhwa" (Hindi for widow)

Implementation

function generateEmbedding(text: string) -> vector:
    """
    Generate 768-dimensional embedding using Google Generative AI
    Model: text-embedding-004
    """
    result = await embeddingModel.embedContent(text)
    return result.embedding.values

function cosineSimilarity(a, b) -> float:
    """
    Calculate similarity between two vectors
    Returns: 0.0 (completely different) to 1.0 (identical)
    """
    dot_product = sum(a[i] * b[i] for i in range(len(a)))
    magnitude_a = sqrt(sum(a[i]^2 for i in range(len(a))))
    magnitude_b = sqrt(sum(b[i]^2 for i in range(len(b))))
    return dot_product / (magnitude_a * magnitude_b)
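
For reference, a minimal runnable TypeScript version of these two helpers, using the @google/generative-ai package listed in the deployment checklist (error handling omitted):

import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GOOGLE_GENERATIVE_AI_API_KEY!);
const embedder = genAI.getGenerativeModel({ model: "text-embedding-004" });

// Returns the 768-dimensional embedding for a piece of text.
async function generateEmbedding(text: string): Promise<number[]> {
  const result = await embedder.embedContent(text);
  return result.embedding.values;
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}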

function findSimilarEntities(query_embedding, type?, threshold=0.85) -> entities:
    """
    Find semantically similar entities using cosine similarity

    Args:
        query_embedding: Vector to compare against
        type: Optional entity type filter
        threshold: Similarity threshold (default 85%)

    Returns:
        List of entities above threshold, sorted by similarity
    """
    results = []

    for entity in graph.entities.values():
        // Filter by type if specified
        if type and entity.type != type:
            continue

        // Skip entities without embeddings
        if not entity.embedding:
            continue

        similarity = cosineSimilarity(query_embedding, entity.embedding)

        if similarity >= threshold:
            results.append({
                id: entity.id,
                canonical: entity.canonical,
                similarity: similarity
            })

    return sort(results, by=similarity, descending=true)

async function searchEntity(type, name) -> result:
    """
    Multi-stage entity search with semantic fallback

    Stage 1: Exact alias match
    Stage 2: Canonical match with type
    Stage 3: Semantic similarity search (if enabled)
    """
    // Stage 1: Exact alias match (fast)
    normalized = normalize(applySynonym(name))
    if normalized in graph.entityIndex.byAlias:
        entity_id = graph.entityIndex.byAlias[normalized]
        entity = graph.entities[entity_id]

        // Check type mismatch
        if type and entity.type != type:
            return {
                found: false,
                suggestion: "Entity exists as {entity.type}, not {type}"
            }

        return {
            found: true,
            entityId: entity_id,
            entity: entity
        }

    // Stage 2: Canonical with type (medium)
    if type:
        key = "{type}:{normalized}"
        if key in graph.entityIndex.byCanonical:
            entity_id = graph.entityIndex.byCanonical[key]
            return {
                found: true,
                entityId: entity_id,
                entity: graph.entities[entity_id]
            }

    // Stage 3: Semantic search (slow but thorough)
    if useSemanticSearch and Object.keys(graph.entities).length > 0:
        query_embedding = await generateEmbedding(name)

        if query_embedding.length > 0:
            similar = await findSimilarEntities(query_embedding, type, 0.85)

            if similar.length > 0:
                best = similar[0]
                return {
                    found: true,
                    entityId: best.id,
                    entity: graph.entities[best.id],
                    suggestion: "Found semantically similar ({best.similarity}% match)",
                    similarEntities: similar
                }

    return { found: false }

Example Flow

User tries to add: "vidhwa" (Hindi for widow)

1. Normalize: "vidhwa" → "vidhwa"
2. Check alias index: Not found
3. Check canonical index: Not found
4. Generate embedding: [0.123, 0.456, ..., 0.789] (768 dims)
5. Compare with existing entities:
   - "widow": similarity = 0.92 (92%)
   - "divorced_woman": similarity = 0.45 (45%)
6. Found match! Return existing "beneficiary_type:widow"
7. Result: NO duplicate created ✓

MECE Mechanism #2: Dynamic Synonym Mapping

Purpose

Map known variants to canonical forms at runtime (learned, not hardcoded)

Implementation

function applySynonym(text: string) -> string:
    """
    Apply synonym mapping before normalization
    """
    normalized = normalize(text)
    return graph.synonymMap[normalized] OR normalized

function addSynonym(variant: string, canonical: string) -> result:
    """
    Runtime synonym mapping

    Example: addSynonym("bpl", "below_poverty_line")
             addSynonym("tn", "tamil_nadu")
             addSynonym("vidhwa", "widow")
    """
    normalized_variant = normalize(variant)
    normalized_canonical = normalize(canonical)

    // Validate canonical entity exists
    canonical_exists = any(
        normalize(entity.canonical) == normalized_canonical
        for entity in graph.entities.values()
    )

    if not canonical_exists:
        return {
            success: false,
            message: "Canonical entity not found. Add it first."
        }

    // Store mapping
    graph.synonymMap[normalized_variant] = normalized_canonical
    saveGraph()

    return {
        success: true,
        message: "{variant} → {canonical}"
    }

Example Usage

// Agent encounters "BPL" in new scheme
search_result = searchEntity("income_criteria", "bpl")

// Not found, but agent or user adds synonym:
addSynonym("bpl", "below_poverty_line")

// Now all future searches for "bpl" automatically resolve to:
// "income_criteria:below_poverty_line"

MECE Mechanism #3: Strict Naming Conventions

Purpose

Enforce consistent entity naming through prompt engineering

Naming Rules

Entity Types (EXACTLY these):
  - scheme: Government schemes (e.g., "ignwps", "dwps")
  - beneficiary_type: Who benefits (e.g., "widow", "senior_citizen")
  - benefit_type: What they receive (e.g., "pension", "financial_assistance")
  - income_criteria: Income requirements (e.g., "below_poverty_line", "apl")
  - age_requirement: Age rules (e.g., "age_min_40", "age_max_60")
  - location: Geography (e.g., "tamil_nadu", "andhra_pradesh", "all_india")
  - area_type: Urban/rural (e.g., "rural", "urban", "all_areas")
  - caste: Caste requirements (e.g., "all_castes", "sc_st", "obc")
  - gender: Gender (e.g., "female", "male", "all_genders")
  - occupation: Jobs (e.g., "farmer", "agricultural_worker")

Naming Patterns (EXACTLY):
  Age:        "age_min_40", "age_min_60", "age_max_65", "no_age_limit"
              NOT: "age_40_minimum", "40_years", "minimum_age_40"

  Income:     "below_poverty_line", "apl", "income_max_100000"
              NOT: "bpl", "poor", "low_income"

  Universal:  "all_areas", "all_castes", "all_genders", "no_age_limit"
              NOT: "both", "all", "any", "no_limit"

  Locations:  "tamil_nadu", "andhra_pradesh", "all_india"
              NOT: "TN", "AP", "India", "Tamil Nadu"

  Multi-word: "senior_citizen", "agricultural_worker"
              Format: snake_case, NOT camelCase or kebab-case

Atomic Entities (ONE concept per node):
  ✓ GOOD: "widow", "below_poverty_line", "pension", "age_min_40"
  ✗ BAD:  "poor_widow", "widow_pension", "bpl_widow", "40_year_widow"
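
The normalize() helper used throughout the pseudocode is never defined above; a plausible sketch that enforces these conventions (the exact rules are an assumption) looks like:

// Hypothetical normalize(): lowercases, trims, and snake_cases a name so
// "Tamil Nadu", "tamil-nadu", and "TAMIL_NADU" all map to "tamil_nadu".
function normalize(text: string): string {
  return text
    .trim()
    .toLowerCase()
    .replace(/[\s\-]+/g, "_")   // spaces and hyphens -> underscores
    .replace(/[^a-z0-9_]/g, "") // drop punctuation
    .replace(/_+/g, "_");       // collapse repeated underscores
}

// Optional guard for the strict entity types listed above.
const ENTITY_TYPES = new Set([
  "scheme", "beneficiary_type", "benefit_type", "income_criteria",
  "age_requirement", "location", "area_type", "caste", "gender", "occupation",
]);

function assertValidType(type: string): void {
  if (!ENTITY_TYPES.has(type)) throw new Error(`Unknown entity type: ${type}`);
}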

Enforcement via Prompt

CRITICAL INSTRUCTIONS:

STEP 2: Extract ATOMIC concepts following NAMING CONVENTIONS

For EACH concept:
a) Call search_entity({type: "EXACT_TYPE", name: "EXACT_NAME"})

   Examples of CORRECT usage:
   - search_entity({type: "age_requirement", name: "age_min_40"})
   - search_entity({type: "income_criteria", name: "below_poverty_line"})
   - search_entity({type: "area_type", name: "all_areas"})

   Examples of INCORRECT usage (DON'T DO THIS):
   - search_entity({type: "age", name: "age_40_minimum"})  ❌ Wrong type & format
   - search_entity({type: "income", name: "bpl"})          ❌ Wrong type & name
   - search_entity({type: "area", name: "both"})           ❌ Wrong type & name

b) If FOUND → use that entityId (REUSE!)
c) If NOT FOUND → call add_entity({type: "EXACT_TYPE", name: "EXACT_NAME"})

MECE Mechanism #4: Post-Processing Merge

Purpose

Find and merge semantic duplicates that slipped through (safety net)

Implementation

async function mergeDuplicates(entityType?, similarityThreshold=0.9) -> result:
    """
    Post-processing duplicate detection and merge

    Args:
        entityType: Optional filter (e.g., "beneficiary_type")
        similarityThreshold: Similarity cutoff (default 90%)

    Returns:
        {
            success: true,
            mergedCount: 3,
            details: [
                {
                    kept: "income_criteria:below_poverty_line",
                    merged: ["income_criteria:bpl", "income_criteria:poor"],
                    reason: "Semantic similarity ≥ 90%"
                }
            ]
        }
    """
    merged_groups = []
    entities_to_remove = Set()

    // Get entities to check
    entity_ids = entityType
        ? graph.entityIndex.byType[entityType]
        : Object.keys(graph.entities)

    // Compare all pairs
    for i in range(len(entity_ids)):
        entity1 = graph.entities[entity_ids[i]]

        if entity1.id in entities_to_remove:
            continue

        if not entity1.embedding:
            continue

        group = []

        for j in range(i + 1, len(entity_ids)):
            entity2 = graph.entities[entity_ids[j]]

            if entity2.id in entities_to_remove:
                continue

            if not entity2.embedding:
                continue

            similarity = cosineSimilarity(entity1.embedding, entity2.embedding)

            if similarity >= similarityThreshold:
                group.append(entity2.id)
                entities_to_remove.add(entity2.id)

                // Update all triples pointing to entity2
                for triple in graph.triples:
                    if triple.subject == entity2.id:
                        triple.subject = entity1.id
                    if triple.object == entity2.id:
                        triple.object = entity1.id

                // Merge aliases
                entity1.aliases.extend(entity2.aliases)

        if group.length > 0:
            merged_groups.append({
                kept: entity1.id,
                merged: group,
                reason: "Semantic similarity ≥ {similarityThreshold * 100}%"
            })

    // Remove merged entities
    for entity_id in entities_to_remove:
        delete graph.entities[entity_id]

    // Rebuild index
    rebuildIndex()
    saveGraph()

    return {
        success: true,
        mergedCount: entities_to_remove.size,
        details: merged_groups
    }

Example

Before merge:
  - income_criteria:below_poverty_line (embedding: [0.1, 0.2, ...])
  - income_criteria:bpl (embedding: [0.11, 0.19, ...])
  - income_criteria:poor (embedding: [0.09, 0.21, ...])

Similarity scores:
  - below_poverty_line ↔ bpl: 0.95 (95%)
  - below_poverty_line ↔ poor: 0.92 (92%)

Action:
  - Keep: below_poverty_line
  - Merge: bpl, poor → aliases of below_poverty_line
  - Update all triples referencing bpl/poor → below_poverty_line

After merge:
  - income_criteria:below_poverty_line
    canonical: "below_poverty_line"
    aliases: ["below_poverty_line", "bpl", "poor"]

Quality Assurance: Isolated Entity Detection

Purpose

Ensure every entity is connected (no orphaned nodes)

Implementation

function findIsolatedEntities() -> result:
    """
    Find entities with no relationships (orphaned nodes)
    """
    connected_entities = Set()

    // Mark all entities appearing in triples
    for triple in graph.triples:
        connected_entities.add(triple.subject)
        connected_entities.add(triple.object)

    // Find entities NOT in any triple
    isolated = []
    for entity in graph.entities.values():
        if entity.id not in connected_entities:
            isolated.append({
                id: entity.id,
                type: entity.type,
                canonical: entity.canonical
            })

    return {
        count: isolated.length,
        isolated: isolated
    }

function cleanupIsolatedEntities() -> result:
    """
    Remove orphaned entities
    """
    isolated = findIsolatedEntities()
    removed = []

    for entity in isolated.isolated:
        delete graph.entities[entity.id]
        removed.append(entity.id)

    if removed.length > 0:
        rebuildIndex()
        saveGraph()

    return {
        success: true,
        removedCount: removed.length,
        removed: removed
    }

function markSchemeProcessed(schemeSlug, autoCleanup=true) -> result:
    """
    Mark scheme complete with automatic quality check
    """
    if schemeSlug in graph.processedSchemes:
        return { success: true, message: "Already processed" }

    // Check for isolated entities
    isolated = findIsolatedEntities()

    cleaned_up = false
    if autoCleanup and isolated.count > 0:
        cleanupIsolatedEntities()
        cleaned_up = true

    graph.processedSchemes.append(schemeSlug)
    saveGraph()

    if isolated.count > 0:
        return {
            success: true,
            message: "{cleaned_up ? 'Removed' : 'Found'} {isolated.count} isolated entities",
            isolatedEntities: isolated.isolated,
            cleanedUp: cleaned_up
        }

    return {
        success: true,
        message: "Marked as processed. No isolated entities!"
    }

Workflow Integration

STEP 4: Quality check and mark complete

a) Call check_isolated_entities()
   Returns: { count: 2, isolated: ["entity1", "entity2"] }

b) If isolated entities found:
   - Review: Did you forget to add triples for them?
   - If yes: Add the missing triples
   - If no (they shouldn't exist): Call cleanup_isolated_entities()

c) Call mark_scheme_processed({schemeSlug})
   - Automatically checks and cleans isolated entities
   - Reports: "Removed 2 isolated entities: entity1, entity2"

Agentic Workflow

Per-Scheme Agent Process

async function extractScheme(scheme) -> result:
    """
    Each scheme gets its own agent with full tool access
    Budget: 50 turns, ~$0.15-0.20 per scheme
    """

    prompt = """
    Build knowledge graph for {scheme.slug} using ATOMIC MECE principles.

    You have 13 graph tools available:
    - 6 READ tools (search, list, find, stats)
    - 3 WRITE tools (add_entity, add_triple, mark_processed)
    - 2 MECE tools (add_synonym, merge_duplicates)
    - 2 QUALITY tools (check_isolated, cleanup_isolated)

    STEP 1: Add scheme entity
    STEP 2: Extract ATOMIC concepts (search first, create if needed)
    STEP 3: Create relationships
    STEP 4: Quality check and mark complete
    """

    conversation = query({
        prompt: prompt,
        model: "claude-sonnet-4-5",
        maxTurns: 50,
        mcpServers: {
            "knowledge-graph": graphMcpServer
        }
    })

    for message in conversation:
        if message.type == "assistant":
            // Agent is using tools autonomously
            continue

        if message.type == "result":
            if message.subtype == "success":
                return {
                    success: true,
                    turns: message.num_turns,
                    cost: message.total_cost_usd
                }
            else:
                return {
                    success: false,
                    error: message.subtype
                }

    return { success: false, error: "Unknown" }

async function main():
    """
    Process all schemes sequentially
    Each scheme = separate agent invocation
    """
    schemes = await fetchSchemes("widow pension")

    for scheme in schemes:
        console.log("🚀 Starting NEW agent for {scheme.slug}")

        result = await extractScheme(scheme)

        if result.success:
            console.log("✅ Completed in {result.turns} turns, ${result.cost}")
        else:
            console.log("❌ Failed: {result.error}")

        console.log("🔚 Agent invocation closed")

Tool Call Example

Agent's internal reasoning (not visible):
"I need to add the widow beneficiary type..."

Agent's tool calls:
1. search_entity({type: "beneficiary_type", name: "widow"})
   → Returns: { found: true, entityId: "beneficiary_type:widow" }

2. Agent: "Great! Entity exists, I'll reuse it."
   (No duplicate created ✓)

3. add_triple({
     subject: "scheme:ignwps",
     predicate: "targets",
     object: "beneficiary_type:widow"
   })
   → Returns: { success: true, message: "Triple added" }

Performance Metrics

Quality Improvements

Before MECE:
❌ Format chaos: age_40_minimum vs age_40_min vs age_min_21
❌ Type inconsistency: bpl vs income_criteria:bpl
❌ Naming chaos: all_castes vs all vs both
❌ Entity reuse: ~20-30%

After MECE:
✅ Consistent naming: age_min_40, age_min_18, age_max_60
✅ Proper types: All use strict type conventions
✅ Atomic entities: widow, divorced_woman (not combined)
✅ Entity reuse: ~50-60%
✅ No isolated entities: 100% connected graph

Graph Compression

11 schemes processed:
  Without entity reuse: ~110 entities expected (10 per scheme)
  With MECE mechanisms: 51 entities actual
  Compression rate: 53.6%
  Avg per scheme: 4.64 entities

Evidence of clustering:
  - "widow" appears in 9 schemes (reused 9 times)
  - "pension" appears in 11 schemes (reused 11 times)
  - "female" appears in 8 schemes (reused 8 times)

Tools Exposed via MCP

MCP Server: "knowledge-graph"
Tools (13):

READ:
  1. search_entity({type?, name})
      Uses semantic search with 85% threshold

  2. list_entity_types()
      Returns all entity types in graph

  3. list_entities_by_type({type})
      Lists all entities of specific type

  4. list_predicates()
      Shows all relationships with examples

  5. find_triples({subject?, predicate?, object?, limit?})
      Pattern matching for relationships

  6. get_stats()
      Entity count, triple count, types, etc.

WRITE:
  7. add_entity({type, name, metadata?})
      Creates entity with embedding
      Auto-checks for duplicates via search

  8. add_triple({subject, predicate, object, confidence?})
      Creates relationship
      Validates both entities exist

  9. mark_scheme_processed({schemeSlug, autoCleanup?})
      Marks complete
      Auto-checks and removes isolated entities

MECE:
  10. add_synonym({variant, canonical})
       Runtime synonym mapping

  11. merge_duplicates({entityType?, similarityThreshold?})
       Post-processing deduplication

QUALITY:
  12. check_isolated_entities()
       Find orphaned entities

  13. cleanup_isolated_entities()
       Remove orphaned entities
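
To make the wiring concrete, here is a sketch of registering one of these tools in-process, assuming the Agent SDK's createSdkMcpServer/tool helpers (signatures may differ between SDK versions):

import { createSdkMcpServer, tool } from "@anthropic-ai/claude-agent-sdk";
import { z } from "zod";

// Sketch: exposing search_entity as an in-process MCP tool.
// searchEntity() is the multi-stage function from Mechanism #1.
const graphMcpServer = createSdkMcpServer({
  name: "knowledge-graph",
  version: "1.0.0",
  tools: [
    tool(
      "search_entity",
      "Search for an existing entity before creating a new one",
      { type: z.string().optional(), name: z.string() },
      async ({ type, name }) => ({
        content: [{ type: "text", text: JSON.stringify(await searchEntity(type, name)) }],
      })
    ),
    // ...the remaining 12 tools are registered the same way
  ],
});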

Key Insights

Why 4 Mechanisms?

Single mechanism is insufficient:

1. Semantic search alone:
   ❌ Misses exact matches with different formats
   ❌ Slow (API calls for every search)
   ❌ Can't handle brand new terms

2. Synonym mapping alone:
   ❌ Requires pre-knowledge of all variants
   ❌ Can't discover new synonyms
   ❌ Doesn't handle semantic similarity

3. Naming conventions alone:
   ❌ Relies on agent compliance
   ❌ Agents make mistakes
   ❌ Can't catch semantically similar terms

4. Post-processing merge alone:
   ❌ Duplicates already created (wasteful)
   ❌ Fixing problems after the fact
   ❌ Requires re-running

Combined approach:
✅ Naming conventions: First line of defense
✅ Synonym mapping: Known variant resolution
✅ Semantic search: Unknown variant detection
✅ Post-processing merge: Safety net
✅ Isolated entity cleanup: Quality assurance

MECE Principle

Mutually Exclusive, Collectively Exhaustive

Mutually Exclusive:
  ✓ Each concept = ONE entity
  ✓ No overlap between entities
  ✓ "widow" and "divorced_woman" are separate
  ✗ NOT "poor_widow" (combines two concepts)

Collectively Exhaustive:
  ✓ All entities needed for scheme are present
  ✓ No missing relationships
  ✓ Every entity is connected (no orphans)
  ✓ Graph completely describes the scheme

Example violations:
  ❌ "bpl_widow" - Combines income + beneficiary
  ❌ "tamil_nadu_pension" - Combines location + benefit
  ❌ "age_40_minimum" vs "age_min_40" - Same concept, different format

Correct MECE structure:
  ✅ Entities: widow, below_poverty_line, pension, tamil_nadu, age_min_40
  ✅ Triples:
     - scheme targets widow
     - scheme requires below_poverty_line
     - scheme provides pension
     - scheme available_in tamil_nadu
     - scheme requires_age age_min_40
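
Expressed against the Triple interface defined earlier, that structure is just five rows (scheme id hypothetical):

const triples: Triple[] = [
  { subject: "scheme:ignwps", predicate: "targets",      object: "beneficiary_type:widow" },
  { subject: "scheme:ignwps", predicate: "requires",     object: "income_criteria:below_poverty_line" },
  { subject: "scheme:ignwps", predicate: "provides",     object: "benefit_type:pension" },
  { subject: "scheme:ignwps", predicate: "available_in", object: "location:tamil_nadu" },
  { subject: "scheme:ignwps", predicate: "requires_age", object: "age_requirement:age_min_40" },
];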

Deployment Checklist

# Prerequisites
✅ PostgreSQL with scheme data
✅ Google Generative AI API key (for embeddings)
✅ Bun runtime
✅ @anthropic-ai/claude-agent-sdk
✅ @google/generative-ai package

# Files needed
✅ graph-query-tool.ts (Core MECE implementation)
✅ spo-graph-agentic.ts (Agent orchestration)
✅ visualize.html (Real-time graph viewer)

# Environment variables
export GOOGLE_GENERATIVE_AI_API_KEY="your-key-here"

# Run extraction
bun run spo-graph-agentic.ts

# Monitor progress
open http://localhost:3030  # Visualization
tail -f output.log          # Extraction logs

# Validate quality
- Check entity reuse rate (target: >50%; see the sketch after this list)
- Verify no isolated entities (target: 0)
- Confirm naming consistency (manual review)
- Run merge_duplicates() as final cleanup
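
One way to compute the reuse rate from the saved graph, as a sketch that assumes scheme triples always use a "scheme:" subject (as in the examples above):

// Sketch: fraction of non-scheme entities referenced by 2+ schemes.
function entityReuseRate(graph: SPOGraph): number {
  const schemesPerEntity = new Map<string, Set<string>>();
  for (const t of graph.triples) {
    if (!t.subject.startsWith("scheme:")) continue;
    if (!schemesPerEntity.has(t.object)) schemesPerEntity.set(t.object, new Set());
    schemesPerEntity.get(t.object)!.add(t.subject);
  }
  const entities = [...schemesPerEntity.values()];
  if (entities.length === 0) return 0;
  const reused = entities.filter((s) => s.size >= 2).length;
  return reused / entities.length; // e.g. 0.55 means 55% of entities are shared
}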

Future Enhancements

  1. Adaptive Thresholds

    • Learn optimal similarity thresholds per entity type
    • Lower threshold for numeric entities (ages, incomes)
    • Higher threshold for semantic entities (beneficiaries)
  2. Multi-language Support

    • Add embeddings for regional languages
    • Synonym mapping for Hindi/Tamil/etc.
    • Language-aware normalization
  3. Active Learning

    • Prompt user when similarity is borderline (0.80-0.90)
    • Build training data for similarity model
    • Improve threshold tuning
  4. Graph Validation Rules

    • Schemes must have at least 1 "targets" triple
    • Schemes must have exactly 1 "provides" triple
    • Age requirements must be numeric
    • Income criteria must be monetary
  5. Performance Optimization

    • Cache embeddings to avoid re-generation (see the sketch after this list)
    • Batch embedding generation
    • Pre-compute similarity matrix
    • Use vector database (Pinecone/Weaviate)
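
As an illustration of the first point, a minimal in-memory memoization wrapper around generateEmbedding (a persistent store would be needed to survive across runs):

// Sketch: memoize embeddings by normalized text to avoid repeat API calls.
const embeddingCache = new Map<string, number[]>();

async function cachedEmbedding(text: string): Promise<number[]> {
  const key = normalize(text);
  const hit = embeddingCache.get(key);
  if (hit) return hit;
  const embedding = await generateEmbedding(text);
  embeddingCache.set(key, embedding);
  return embedding;
}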

License

MIT

Author

Claude Code with MECE Knowledge Graph Implementation
Version: 1.0.0
Date: 2025
