An agentic system that builds MECE-compliant knowledge graphs using 4 complementary mechanisms to prevent duplicate entities and ensure atomic, mutually exclusive concepts.
┌─────────────────────────────────────────────────────────────┐
│ Graph Query Tool │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Semantic │ │ Synonym │ │ Naming │ │
│ │ Search │ │ Mapping │ │ Convention │ │
│ │ (Embeddings) │ │ (Runtime) │ │ (Strict) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ └──────────────────┴──────────────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ MECE Entity │ │
│ │ Deduplication │ │
│ └───────┬────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ Post-Processing│ │
│ │ Merge │ │
│ └────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
│
┌───────▼────────┐
│ MCP Server │
│ (13 Tools) │
└───────┬────────┘
│
┌───────▼────────┐
│ Claude Agent │
│ (Per Scheme) │
└────────────────┘
// Core entities with vector embeddings
interface Entity {
id: string // Format: "type:normalized_name"
type: string // Strict types: scheme, beneficiary_type, etc.
canonical: string // Original name
aliases: string[] // All known variants
embedding: number[] // 768-dim vector from text-embedding-004
metadata?: any
}
// Relationships between entities
interface Triple {
subject: string // Entity ID
predicate: string // Relationship (targets, provides, requires_age, etc.)
object: string // Entity ID
confidence?: number // Optional confidence score
}
// Main graph structure
interface SPOGraph {
entities: Record<string, Entity>
triples: Triple[]
entityIndex: {
byType: Record<string, string[]> // Fast type lookup
byCanonical: Record<string, string> // Fast canonical lookup
byAlias: Record<string, string> // Fast alias lookup
}
synonymMap: Record<string, string> // Dynamic variant → canonical
processedSchemes: string[]
lastUpdated: string
}

Prevent semantic duplicates like "widow" vs "widows" vs "vidhwa" (Hindi)
function generateEmbedding(text: string) -> vector:
"""
Generate 768-dimensional embedding using Google Generative AI
Model: text-embedding-004
"""
result = await embeddingModel.embedContent(text)
return result.embedding.values
function cosineSimilarity(vector_a, vector_b) -> float:
"""
Calculate similarity between two vectors
Returns: 0.0 (completely different) to 1.0 (identical)
"""
dot_product = sum(vector_a[i] * vector_b[i] for i in range(len(vector_a)))
magnitude_a = sqrt(sum(x^2 for x in vector_a))
magnitude_b = sqrt(sum(x^2 for x in vector_b))
return dot_product / (magnitude_a * magnitude_b)
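As a concrete reference, the same computation in TypeScript — a sketch mirroring the pseudocode above, with an added guard for zero-magnitude vectors:

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  // Dimension mismatch or empty input: treat as completely dissimilar.
  if (a.length !== b.length || a.length === 0) return 0
  let dot = 0
  let magA = 0
  let magB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    magA += a[i] * a[i]
    magB += b[i] * b[i]
  }
  const denom = Math.sqrt(magA) * Math.sqrt(magB)
  // Zero vectors have no direction, so similarity is defined as 0 here.
  return denom === 0 ? 0 : dot / denom
}

console.log(cosineSimilarity([1, 0], [1, 0])) // → 1 (identical direction)
console.log(cosineSimilarity([1, 0], [0, 1])) // → 0 (orthogonal)
```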
function findSimilarEntities(query_embedding, type?, threshold=0.85) -> entities:
"""
Find semantically similar entities using cosine similarity
Args:
query_embedding: Vector to compare against
type: Optional entity type filter
threshold: Similarity threshold (default 85%)
Returns:
List of entities above threshold, sorted by similarity
"""
results = []
for entity in graph.entities.values():
// Filter by type if specified
if type and entity.type != type:
continue
// Skip entities without embeddings
if not entity.embedding:
continue
similarity = cosineSimilarity(query_embedding, entity.embedding)
if similarity >= threshold:
results.append({
id: entity.id,
canonical: entity.canonical,
similarity: similarity
})
return sort(results, by=similarity, descending=true)
async function searchEntity(type, name) -> result:
"""
Multi-stage entity search with semantic fallback
Stage 1: Exact alias match
Stage 2: Canonical match with type
Stage 3: Semantic similarity search (if enabled)
"""
// Stage 1: Exact alias match (fast)
normalized = normalize(applySynonym(name))
if normalized in graph.entityIndex.byAlias:
entity_id = graph.entityIndex.byAlias[normalized]
entity = graph.entities[entity_id]
// Check type mismatch
if type and entity.type != type:
return {
found: false,
suggestion: "Entity exists as {entity.type}, not {type}"
}
return {
found: true,
entityId: entity_id,
entity: entity
}
// Stage 2: Canonical with type (medium)
if type:
key = "{type}:{normalized}"
if key in graph.entityIndex.byCanonical:
entity_id = graph.entityIndex.byCanonical[key]
return {
found: true,
entityId: entity_id,
entity: graph.entities[entity_id]
}
// Stage 3: Semantic search (slow but thorough)
if useSemanticSearch and Object.keys(graph.entities).length > 0:
query_embedding = await generateEmbedding(name)
if query_embedding.length > 0:
similar = await findSimilarEntities(query_embedding, type, 0.85)
if similar.length > 0:
best = similar[0]
return {
found: true,
entityId: best.id,
entity: graph.entities[best.id],
suggestion: "Found semantically similar ({round(best.similarity * 100)}% match)",
similarEntities: similar
}
return { found: false }

User tries to add: "vidhwa" (Hindi for widow)
1. Normalize: "vidhwa" → "vidhwa"
2. Check alias index: Not found
3. Check canonical index: Not found
4. Generate embedding: [0.123, 0.456, ..., 0.789] (768 dims)
5. Compare with existing entities:
- "widow": similarity = 0.92 (92%)
- "divorced_woman": similarity = 0.45 (45%)
6. Found match! Return existing "beneficiary_type:widow"
7. Result: NO duplicate created ✓
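The lookup stages in this walkthrough can be sketched with in-memory maps standing in for the real indexes (the names here are illustrative, not the actual implementation):

```typescript
// Illustrative stand-ins for graph.synonymMap and graph.entityIndex.byAlias.
const synonymMap = new Map<string, string>([["bpl", "below_poverty_line"]])
const aliasIndex = new Map<string, string>([
  ["widow", "beneficiary_type:widow"],
  ["below_poverty_line", "income_criteria:below_poverty_line"],
])

function normalize(name: string): string {
  return name.trim().toLowerCase().replace(/[\s-]+/g, "_")
}

function lookup(name: string): { found: boolean; entityId?: string; needsSemantic?: boolean } {
  // Stage 0/1: apply any learned synonym, then try the exact alias index.
  const normalized = normalize(synonymMap.get(normalize(name)) ?? name)
  const hit = aliasIndex.get(normalized)
  if (hit) return { found: true, entityId: hit }
  // Miss: the real system would now embed the name and run findSimilarEntities.
  return { found: false, needsSemantic: true }
}

console.log(lookup("BPL"))    // resolves via the synonym map
console.log(lookup("vidhwa")) // unknown term: falls through to semantic search
```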
Map known variants to canonical forms at runtime (learned, not hardcoded)
function applySynonym(text: string) -> string:
"""
Normalize, then map known variants to their canonical form
"""
normalized = normalize(text)
return graph.synonymMap[normalized] OR normalized
function addSynonym(variant: string, canonical: string) -> result:
"""
Runtime synonym mapping
Example: addSynonym("bpl", "below_poverty_line")
addSynonym("tn", "tamil_nadu")
addSynonym("vidhwa", "widow")
"""
normalized_variant = normalize(variant)
normalized_canonical = normalize(canonical)
// Validate canonical entity exists
canonical_exists = any(
entity for entity in graph.entities.values()
if normalize(entity.canonical) == normalized_canonical
)
if not canonical_exists:
return {
success: false,
message: "Canonical entity not found. Add it first."
}
// Store mapping
graph.synonymMap[normalized_variant] = normalized_canonical
saveGraph()
return {
success: true,
message: "{variant} → {canonical}"
}

// Agent encounters "BPL" in new scheme
search_result = searchEntity("income_criteria", "bpl")
// Not found, but agent or user adds synonym:
addSynonym("bpl", "below_poverty_line")
// Now all future searches for "bpl" automatically resolve to:
// "income_criteria:below_poverty_line"

Enforce consistent entity naming through prompt engineering
Entity Types (EXACTLY these):
- scheme: Government schemes (e.g., "ignwps", "dwps")
- beneficiary_type: Who benefits (e.g., "widow", "senior_citizen")
- benefit_type: What they receive (e.g., "pension", "financial_assistance")
- income_criteria: Income requirements (e.g., "below_poverty_line", "apl")
- age_requirement: Age rules (e.g., "age_min_40", "age_max_60")
- location: Geography (e.g., "tamil_nadu", "andhra_pradesh", "all_india")
- area_type: Urban/rural (e.g., "rural", "urban", "all_areas")
- caste: Caste requirements (e.g., "all_castes", "sc_st", "obc")
- gender: Gender (e.g., "female", "male", "all_genders")
- occupation: Jobs (e.g., "farmer", "agricultural_worker")
Naming Patterns (EXACTLY):
Age: "age_min_40", "age_min_60", "age_max_65", "no_age_limit"
NOT: "age_40_minimum", "40_years", "minimum_age_40"
Income: "below_poverty_line", "apl", "income_max_100000"
NOT: "bpl", "poor", "low_income"
Universal: "all_areas", "all_castes", "all_genders", "no_age_limit"
NOT: "both", "all", "any", "no_limit"
Locations: "tamil_nadu", "andhra_pradesh", "all_india"
NOT: "TN", "AP", "India", "Tamil Nadu"
Multi-word: "senior_citizen", "agricultural_worker"
Format: snake_case, NOT camelCase or kebab-case
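These rules are mechanical enough to lint automatically. A hypothetical validator sketch (the regexes and function are illustrative, not part of the actual tool):

```typescript
// Illustrative patterns for the conventions above.
const SNAKE_CASE = /^[a-z][a-z0-9]*(_[a-z0-9]+)*$/
const AGE_PATTERN = /^(age_min_\d+|age_max_\d+|no_age_limit)$/

function isValidName(type: string, name: string): boolean {
  if (!SNAKE_CASE.test(name)) return false
  if (type === "age_requirement") return AGE_PATTERN.test(name)
  return true
}

console.log(isValidName("age_requirement", "age_min_40")) // true
console.log(isValidName("age_requirement", "40_years"))   // false: leading digit, wrong pattern
console.log(isValidName("location", "Tamil Nadu"))        // false: not snake_case
```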
Atomic Entities (ONE concept per node):
✓ GOOD: "widow", "below_poverty_line", "pension", "age_min_40"
✗ BAD: "poor_widow", "widow_pension", "bpl_widow", "40_year_widow"

CRITICAL INSTRUCTIONS:
STEP 2: Extract ATOMIC concepts following NAMING CONVENTIONS
For EACH concept:
a) Call search_entity({type: "EXACT_TYPE", name: "EXACT_NAME"})
Examples of CORRECT usage:
- search_entity({type: "age_requirement", name: "age_min_40"})
- search_entity({type: "income_criteria", name: "below_poverty_line"})
- search_entity({type: "area_type", name: "all_areas"})
Examples of INCORRECT usage (DON'T DO THIS):
- search_entity({type: "age", name: "age_40_minimum"}) ❌ Wrong type & format
- search_entity({type: "income", name: "bpl"}) ❌ Wrong type & name
- search_entity({type: "area", name: "both"}) ❌ Wrong type & name
b) If FOUND → use that entityId (REUSE!)
c) If NOT FOUND → call add_entity({type: "EXACT_TYPE", name: "EXACT_NAME"})

Find and merge semantic duplicates that slipped through (safety net)
async function mergeDuplicates(entityType?, similarityThreshold=0.9) -> result:
"""
Post-processing duplicate detection and merge
Args:
entityType: Optional filter (e.g., "beneficiary_type")
similarityThreshold: Similarity cutoff (default 90%)
Returns:
{
success: true,
mergedCount: 3,
details: [
{
kept: "income_criteria:below_poverty_line",
merged: ["income_criteria:bpl", "income_criteria:poor"],
reason: "Semantic similarity ≥ 90%"
}
]
}
"""
merged_groups = []
entities_to_remove = Set()
// Get entities to check
entity_ids = entityType
? graph.entityIndex.byType[entityType]
: Object.keys(graph.entities)
// Compare all pairs
for i in range(len(entity_ids)):
entity1 = graph.entities[entity_ids[i]]
if entity1.id in entities_to_remove:
continue
if not entity1.embedding:
continue
group = []
for j in range(i + 1, len(entity_ids)):
entity2 = graph.entities[entity_ids[j]]
if entity2.id in entities_to_remove:
continue
if not entity2.embedding:
continue
similarity = cosineSimilarity(entity1.embedding, entity2.embedding)
if similarity >= similarityThreshold:
group.append(entity2.id)
entities_to_remove.add(entity2.id)
// Update all triples pointing to entity2
for triple in graph.triples:
if triple.subject == entity2.id:
triple.subject = entity1.id
if triple.object == entity2.id:
triple.object = entity1.id
// Merge aliases
entity1.aliases.extend(alias for alias in entity2.aliases if alias not in entity1.aliases)
if group.length > 0:
merged_groups.append({
kept: entity1.id,
merged: group,
reason: "Semantic similarity ≥ {similarityThreshold * 100}%"
})
// Remove merged entities
for entity_id in entities_to_remove:
delete graph.entities[entity_id]
// Rebuild index
rebuildIndex()
saveGraph()
return {
success: true,
mergedCount: entities_to_remove.size,
details: merged_groups
}

Before merge:
- income_criteria:below_poverty_line (embedding: [0.1, 0.2, ...])
- income_criteria:bpl (embedding: [0.11, 0.19, ...])
- income_criteria:poor (embedding: [0.09, 0.21, ...])
Similarity scores:
- below_poverty_line ↔ bpl: 0.95 (95%)
- below_poverty_line ↔ poor: 0.92 (92%)
Action:
- Keep: below_poverty_line
- Merge: bpl, poor → aliases of below_poverty_line
- Update all triples referencing bpl/poor → below_poverty_line
After merge:
- income_criteria:below_poverty_line
canonical: "below_poverty_line"
aliases: ["below_poverty_line", "bpl", "poor"]
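The triple-rewriting step of this merge can be sketched as follows (names are illustrative):

```typescript
interface Triple { subject: string; predicate: string; object: string }

// Redirect every triple that references the merged entity to the kept one.
function rewriteTriples(triples: Triple[], mergedId: string, keptId: string): Triple[] {
  return triples.map(t => ({
    ...t,
    subject: t.subject === mergedId ? keptId : t.subject,
    object: t.object === mergedId ? keptId : t.object,
  }))
}

const before: Triple[] = [
  { subject: "scheme:ignwps", predicate: "requires", object: "income_criteria:bpl" },
]
const after = rewriteTriples(before, "income_criteria:bpl", "income_criteria:below_poverty_line")
console.log(after[0].object) // → "income_criteria:below_poverty_line"
```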
Ensure every entity is connected (no orphaned nodes)
function findIsolatedEntities() -> result:
"""
Find entities with no relationships (orphaned nodes)
"""
connected_entities = Set()
// Mark all entities appearing in triples
for triple in graph.triples:
connected_entities.add(triple.subject)
connected_entities.add(triple.object)
// Find entities NOT in any triple
isolated = []
for entity in graph.entities.values():
if entity.id not in connected_entities:
isolated.append({
id: entity.id,
type: entity.type,
canonical: entity.canonical
})
return {
count: isolated.length,
isolated: isolated
}
function cleanupIsolatedEntities() -> result:
"""
Remove orphaned entities
"""
isolated = findIsolatedEntities()
removed = []
for entity in isolated.isolated:
delete graph.entities[entity.id]
removed.append(entity.id)
if removed.length > 0:
rebuildIndex()
saveGraph()
return {
success: true,
removedCount: removed.length,
removed: removed
}
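A compact TypeScript rendering of the isolation check (a sketch over plain arrays rather than the full graph structure):

```typescript
interface Triple { subject: string; predicate: string; object: string }

// An entity is isolated if it appears in no triple, neither as subject nor object.
function findIsolatedIds(entityIds: string[], triples: Triple[]): string[] {
  const connected = new Set<string>()
  for (const t of triples) {
    connected.add(t.subject)
    connected.add(t.object)
  }
  return entityIds.filter(id => !connected.has(id))
}

const ids = ["scheme:ignwps", "beneficiary_type:widow", "benefit_type:pension"]
const triples: Triple[] = [
  { subject: "scheme:ignwps", predicate: "targets", object: "beneficiary_type:widow" },
]
console.log(findIsolatedIds(ids, triples)) // only "benefit_type:pension" is isolated
```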
function markSchemeProcessed(schemeSlug, autoCleanup=true) -> result:
"""
Mark scheme complete with automatic quality check
"""
if schemeSlug in graph.processedSchemes:
return { success: true, message: "Already processed" }
// Check for isolated entities
isolated = findIsolatedEntities()
cleaned_up = false
if autoCleanup and isolated.count > 0:
cleanupIsolatedEntities()
cleaned_up = true
graph.processedSchemes.append(schemeSlug)
saveGraph()
if isolated.count > 0:
return {
success: true,
message: "{cleaned_up ? 'Removed' : 'Found'} {isolated.count} isolated entities",
isolatedEntities: isolated.isolated,
cleanedUp: cleaned_up
}
return {
success: true,
message: "Marked as processed. No isolated entities!"
}

STEP 4: Quality check and mark complete
a) Call check_isolated_entities()
Returns: { count: 2, isolated: ["entity1", "entity2"] }
b) If isolated entities found:
- Review: Did you forget to add triples for them?
- If yes: Add the missing triples
- If no (they shouldn't exist): Call cleanup_isolated_entities()
c) Call mark_scheme_processed({schemeSlug})
- Automatically checks and cleans isolated entities
- Reports: "Removed 2 isolated entities: entity1, entity2"

async function extractScheme(scheme) -> result:
"""
Each scheme gets its own agent with full tool access
Budget: 50 turns, ~$0.15-0.20 per scheme
"""
prompt = """
Build knowledge graph for {scheme.slug} using ATOMIC MECE principles.
You have 13 graph tools available:
- 6 READ tools (search, list, find, stats)
- 3 WRITE tools (add_entity, add_triple, mark_processed)
- 2 MECE tools (add_synonym, merge_duplicates)
- 2 QUALITY tools (check_isolated, cleanup_isolated)
STEP 1: Add scheme entity
STEP 2: Extract ATOMIC concepts (search first, create if needed)
STEP 3: Create relationships
STEP 4: Quality check and mark complete
"""
conversation = query({
prompt: prompt,
model: "claude-sonnet-4-5",
maxTurns: 50,
mcpServers: {
"knowledge-graph": graphMcpServer
}
})
for message in conversation:
if message.type == "assistant":
// Agent is using tools autonomously
continue
if message.type == "result":
if message.subtype == "success":
return {
success: true,
turns: message.num_turns,
cost: message.total_cost_usd
}
else:
return {
success: false,
error: message.subtype
}
return { success: false, error: "Unknown" }
async function main():
"""
Process all schemes sequentially
Each scheme = separate agent invocation
"""
schemes = await fetchSchemes("widow pension")
for scheme in schemes:
console.log("🚀 Starting NEW agent for {scheme.slug}")
result = await extractScheme(scheme)
if result.success:
console.log("✅ Completed in {result.turns} turns, ${result.cost}")
else:
console.log("❌ Failed: {result.error}")
console.log("🔚 Agent invocation closed")

Agent's internal reasoning (not visible):
"I need to add the widow beneficiary type..."
Agent's tool calls:
1. search_entity({type: "beneficiary_type", name: "widow"})
→ Returns: { found: true, entityId: "beneficiary_type:widow" }
2. Agent: "Great! Entity exists, I'll reuse it."
(No duplicate created ✓)
3. add_triple({
subject: "scheme:ignwps",
predicate: "targets",
object: "beneficiary_type:widow"
})
→ Returns: { success: true, message: "Triple added" }
Before MECE:
❌ Format chaos: age_40_minimum vs age_40_min vs age_min_21
❌ Type inconsistency: bpl vs income_criteria:bpl
❌ Naming chaos: all_castes vs all vs both
❌ Entity reuse: ~20-30%
After MECE:
✅ Consistent naming: age_min_40, age_min_18, age_max_60
✅ Proper types: All use strict type conventions
✅ Atomic entities: widow, divorced_woman (not combined)
✅ Entity reuse: ~50-60%
✅ No isolated entities: 100% connected graph
11 schemes processed:
Without entity reuse: ~110 entities expected (10 per scheme)
With MECE mechanisms: 51 entities actual
Compression rate: 53.6%
Avg per scheme: 4.64 entities
Evidence of clustering:
- "widow" appears in 9 schemes (reused 9 times)
- "pension" appears in 11 schemes (reused 11 times)
- "female" appears in 8 schemes (reused 8 times)
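These figures are straightforward to reproduce from the counts above:

```typescript
const expectedEntities = 110 // ~10 entities per scheme × 11 schemes, with no reuse
const actualEntities = 51    // observed after MECE deduplication

const compressionRate = (expectedEntities - actualEntities) / expectedEntities
console.log((compressionRate * 100).toFixed(1) + "%") // → "53.6%"
console.log((actualEntities / 11).toFixed(2))         // → "4.64" entities per scheme
```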
MCP Server: "knowledge-graph"
Tools (13):
READ:
1. search_entity({type?, name})
→ Uses semantic search with 85% threshold
2. list_entity_types()
→ Returns all entity types in graph
3. list_entities_by_type({type})
→ Lists all entities of specific type
4. list_predicates()
→ Shows all relationships with examples
5. find_triples({subject?, predicate?, object?, limit?})
→ Pattern matching for relationships
6. get_stats()
→ Entity count, triple count, types, etc.
WRITE:
7. add_entity({type, name, metadata?})
→ Creates entity with embedding
→ Auto-checks for duplicates via search
8. add_triple({subject, predicate, object, confidence?})
→ Creates relationship
→ Validates both entities exist
9. mark_scheme_processed({schemeSlug, autoCleanup?})
→ Marks complete
→ Auto-checks and removes isolated entities
MECE:
10. add_synonym({variant, canonical})
→ Runtime synonym mapping
11. merge_duplicates({entityType?, similarityThreshold?})
→ Post-processing deduplication
QUALITY:
12. check_isolated_entities()
→ Find orphaned entities
13. cleanup_isolated_entities()
→ Remove orphaned entities

Single mechanism is insufficient:
1. Semantic search alone:
❌ Misses exact matches with different formats
❌ Slow (API calls for every search)
❌ Can't handle brand new terms
2. Synonym mapping alone:
❌ Requires pre-knowledge of all variants
❌ Can't discover new synonyms
❌ Doesn't handle semantic similarity
3. Naming conventions alone:
❌ Relies on agent compliance
❌ Agents make mistakes
❌ Can't catch semantically similar terms
4. Post-processing merge alone:
❌ Duplicates already created (wasteful)
❌ Fixing problems after the fact
❌ Requires re-running
Combined approach:
✅ Naming conventions: First line of defense
✅ Synonym mapping: Known variant resolution
✅ Semantic search: Unknown variant detection
✅ Post-processing merge: Safety net
✅ Isolated entity cleanup: Quality assurance
Mutually Exclusive, Collectively Exhaustive
Mutually Exclusive:
✓ Each concept = ONE entity
✓ No overlap between entities
✓ "widow" and "divorced_woman" are separate
✗ NOT "poor_widow" (combines two concepts)
Collectively Exhaustive:
✓ All entities needed for scheme are present
✓ No missing relationships
✓ Every entity is connected (no orphans)
✓ Graph completely describes the scheme
Example violations:
❌ "bpl_widow" - Combines income + beneficiary
❌ "tamil_nadu_pension" - Combines location + benefit
❌ "age_40_minimum" vs "age_min_40" - Same concept, different format
Correct MECE structure:
✅ Entities: widow, below_poverty_line, pension, tamil_nadu, age_min_40
✅ Triples:
- scheme targets widow
- scheme requires below_poverty_line
- scheme provides pension
- scheme available_in tamil_nadu
- scheme requires_age age_min_40
# Prerequisites
✅ PostgreSQL with scheme data
✅ Google Generative AI API key (for embeddings)
✅ Bun runtime
✅ @anthropic-ai/claude-agent-sdk
✅ @google/generative-ai package
# Files needed
✅ graph-query-tool.ts (Core MECE implementation)
✅ spo-graph-agentic.ts (Agent orchestration)
✅ visualize.html (Real-time graph viewer)
# Environment variables
export GOOGLE_GENERATIVE_AI_API_KEY="your-key-here"
# Run extraction
bun run spo-graph-agentic.ts
# Monitor progress
open http://localhost:3030 # Visualization
tail -f output.log # Extraction logs
# Validate quality
- Check entity reuse rate (target: >50%)
- Verify no isolated entities (target: 0)
- Confirm naming consistency (manual review)
- Run merge_duplicates() as final cleanup

1. Adaptive Thresholds
- Learn optimal similarity thresholds per entity type
- Lower threshold for numeric entities (ages, incomes)
- Higher threshold for semantic entities (beneficiaries)
2. Multi-language Support
- Add embeddings for regional languages
- Synonym mapping for Hindi/Tamil/etc.
- Language-aware normalization
3. Active Learning
- Prompt user when similarity is borderline (0.80-0.90)
- Build training data for similarity model
- Improve threshold tuning
4. Graph Validation Rules
- Schemes must have at least 1 "targets" triple
- Schemes must have exactly 1 "provides" triple
- Age requirements must be numeric
- Income criteria must be monetary
5. Performance Optimization
- Cache embeddings to avoid re-generation
- Batch embedding generation
- Pre-compute similarity matrix
- Use a vector database (Pinecone/Weaviate)
MIT
Claude Code with MECE Knowledge Graph Implementation
Version: 1.0.0
Date: 2025