Date: 2026-02-04
Author: Viktor & Claude
Status: Approved
Overview
A Claude Code PreToolUse hook that intercepts WebSearch tool calls and redirects documentation-related queries to local RAG databases via LEANN MCP server. The hook provides an intelligent escape hatch allowing Claude to retry web searches if RAG results are insufficient.
Configuration File (~/.claude/hooks/docsearch-config.json) - Maps keywords to RAG database metadata
State Files (~/.claude/hooks/docsearch-state-{session_id}.json) - Per-session tracking of denied searches
LEANN MCP Server - External component, assumed configured in Claude Code MCP settings
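A minimal sketch of the configuration file. The field names (`keywords`, `path`, `mcp_tool_name`, `description`) are inferred from how the hook implementation later in this document reads each database entry; the paths are illustrative:

```json
{
  "databases": [
    {
      "keywords": ["gitlab", "gitlab-ci"],
      "path": "/Users/viktor/.leann/databases/gitlab",
      "mcp_tool_name": "mcp__leann__search",
      "description": "GitLab documentation"
    },
    {
      "keywords": ["kubernetes", "k8s"],
      "path": "/Users/viktor/.leann/databases/kubernetes",
      "mcp_tool_name": "mcp__leann__search",
      "description": "Kubernetes official documentation"
    }
  ]
}
```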
Flow Diagram
WebSearch tool call
↓
PreToolUse hook fires (docsearch.py)
↓
Check state: Is this a retry? (same params as last call)
├─ Yes → Allow through (exit 0)
└─ No → Continue
↓
Parse query for configured keywords (case-insensitive)
├─ No match → Allow through (exit 0)
└─ Match(es) found → Store params, Deny (exit 2) + add context
↓
Claude receives denial + context about RAG database(s)
↓
Claude calls LEANN MCP tool(s) (in parallel if multiple matches)
├─ Success → Done
└─ Fail/Unsatisfied → Claude retries WebSearch
↓
Hook sees same params → Allows through
Escape Hatch Logic
If exact match (query + domains) → Allow through, clear state, exit 0
If no match → Continue to keyword matching
On keyword match (denying search):
Store current tool_input in session-specific state file
Exit 2 with permissionDecision: deny and additionalContext
State cleanup:
Clear last_denied after successful retry
Clear stale state files on session start
Optional: Add 5-minute timestamp expiry as safety net
Parameter Comparison
Exact string match on query
Arrays compared as sets (order-independent) for allowed_domains and blocked_domains
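The comparison rule above can be sketched as a small helper (the name `params_match` is illustrative, not from the design):

```python
def params_match(current: dict, last_denied: dict) -> bool:
    """Compare WebSearch parameters: exact query match, domain arrays as sets."""
    if current.get("query", "") != last_denied.get("query", ""):
        return False
    for key in ("allowed_domains", "blocked_domains"):
        # Order-independent comparison; a missing array and an empty array are equivalent
        if set(current.get(key) or []) != set(last_denied.get(key) or []):
            return False
    return True
```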
Hook Implementation Details
Hook Type
Shell-based PreToolUse hook (Python script executed as subprocess)
Hook Location
~/.claude/hooks/PreToolUse/docsearch.py
Hook Responsibilities
Filter for WebSearch only - Exit early (code 0) if tool_name != "WebSearch"
Load and parse config - Read docsearch-config.json, handle missing/invalid gracefully
Check escape hatch - Load session state file, compare parameters, allow if match
Keyword detection - Parse query, match against configured keywords using word boundaries
Multi-keyword handling - Detect ALL matching databases in a single query
Deny + guide - If match(es) found, store state and return denial with structured additionalContext
Output Format
Single Keyword Match
```json
{
  "hookSpecificOutput": {
    "hookEventName": "PreToolUse",
    "permissionDecision": "deny",
    "permissionDecisionReason": "Query matches 'gitlab' - using RAG database instead",
    "additionalContext": "This query should use the LEANN MCP tool 'mcp__leann__search' to search the GitLab documentation RAG database at /Users/viktor/.leann/databases/gitlab instead of web search."
  }
}
```
Multiple Keyword Matches
```json
{
  "hookSpecificOutput": {
    "hookEventName": "PreToolUse",
    "permissionDecision": "deny",
    "permissionDecisionReason": "Query matches 'gitlab' and 'kubernetes' - using RAG databases instead",
    "additionalContext": "This query matches multiple documentation databases. Please use these LEANN MCP tools IN PARALLEL:\n1. 'mcp__leann__search' for GitLab documentation at /Users/viktor/.leann/databases/gitlab\n2. 'mcp__leann__search' for Kubernetes official documentation at /Users/viktor/.leann/databases/kubernetes"
  }
}
```
Error Handling
Error Philosophy: Fail open - when in doubt, allow the WebSearch through.
| Error Scenario | Behavior |
| --- | --- |
| Config file missing/unreadable | Allow search through (exit 0) |
| Config file invalid JSON | Allow search through (exit 0), log to stderr |
| State file corrupted | Treat as no previous denial, continue |
| Invalid hook input JSON | Allow search through (exit 0) |
Logging
Critical errors only logged to stderr (e.g., invalid config JSON)
Users debug by checking config file syntax manually
Keep logging minimal to avoid noise
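A minimal stderr helper consistent with this philosophy (the name `log_error` is illustrative; the key point is that diagnostics never touch stdout, which carries the hook's JSON output):

```python
import sys


def log_error(message: str) -> None:
    """Log a critical error to stderr without disturbing the stdout JSON protocol."""
    print(f"docsearch-hook: {message}", file=sys.stderr)
```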
Testing & Edge Cases
Testing Strategy
Unit Tests (tests/test_hook.py):
Mock hook input JSON via stdin
Verify correct exit codes (0 for allow, 2 for deny)
Test keyword matching (single, multiple, partial, case variations)
Verify state file read/write operations with session_id
Test escape hatch logic
Integration Tests:
Configure real LEANN MCP server
Test full flow: WebSearch → Hook → MCP → Retry
Verify parallel MCP calls for multi-keyword queries
Edge Cases
Multiple keywords in same query: "How to use GitLab with Kubernetes?"
Detect ALL matching databases
additionalContext mentions all MCP tools with instruction to call in parallel
Partial word matches: Query "ungitlabbed" contains "gitlab"
Use word boundary regex: \bgitlab\b (case-insensitive)
Should NOT match
Case variations: "GITLAB", "GitLab", "gitlab"
All should match (case-insensitive)
Concurrent hook calls: Multiple Claude sessions running
State file per session: docsearch-state-{session_id}.json
Each session has isolated state
Stale state files: User restarts Claude between denial and retry
Clear state file on session start
Fallback: 5-minute timestamp expiry
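The word-boundary and case-insensitivity edge cases above can be demonstrated directly with the regex the design calls for:

```python
import re

# \b anchors the keyword to word boundaries; IGNORECASE covers "GITLAB"/"GitLab"/"gitlab"
pattern = re.compile(r"\bgitlab\b", re.IGNORECASE)

print(bool(pattern.search("How do I configure GitLab CI runners?")))  # True: whole word, any case
print(bool(pattern.search("ungitlabbed workflow")))                   # False: partial word blocked
```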
Technology Choices
Implementation Language
Python 3.12+
Rationale:
Native JSON handling (stdlib)
Excellent regex support for word boundaries
Easy to test and maintain
Widely available on development systems
No external dependencies required
Dependencies
Python 3.12+ standard library only:
json - Config and state file parsing
re - Keyword matching with word boundaries
sys - stdin/stdout/stderr/exit codes
pathlib - File path handling
User Setup & Usage
Prerequisites
LEANN installed and configured
LEANN MCP server configured in Claude Code's MCP settings
RAG databases built manually using LEANN tools (see future CLI issue)
```shell
cp config.example.json ~/.claude/hooks/docsearch-config.json
# Edit with your database paths and keywords
```
Verify LEANN MCP is configured in Claude Code MCP settings
Test the setup:
Start Claude Code
Ask: "How do I configure GitLab CI runners?"
Verify hook intercepts and Claude uses MCP tool
If MCP fails, verify Claude retries with WebSearch
User Experience Flow
User: "How do I configure GitLab CI runners?"
↓
Hook detects "gitlab" → Denies WebSearch
↓
Claude sees context → Calls mcp__leann__search with GitLab database
↓
If MCP succeeds → User gets RAG-based answer
If MCP fails → Claude retries WebSearch → Hook allows through → User gets web results
Project Structure
docsearch-hook/
├── README.md # Setup instructions, usage guide
├── LICENSE
├── docsearch.py # Main hook script (Python 3.12+)
├── config.example.json # Example configuration
├── tests/
│ ├── test_hook.py # Unit tests for hook logic
│ └── fixtures/ # Test data (mock configs, inputs)
├── docs/
│ └── plans/
│ └── 2026-02-04-docsearch-design.md # This document
└── .github/
└── ISSUE_TEMPLATE/
Future Work (GitHub Issues)
Issue 1: Database Sharing Feature
Title: Enable sharing pre-built RAG databases between users
Description:
Currently users must build their own LEANN databases. Add functionality to:
Export database metadata and files in shareable format
Import shared databases with verification
Community repository of common documentation databases (GitLab, K8s, etc.)
Benefits:
Reduce setup friction for new users
Standardize database quality for popular documentation sources
Community contribution model
Issue 2: CLI Setup Command
Title: Add CLI command for automated database creation
Description:
Provide docsearch-hook setup <keyword> <url> command that:
Crawls documentation website using LEANN
Builds RAG database
Adds entry to config file automatically
Validates MCP server configuration
Benefits:
Eliminates manual LEANN tool usage
Reduces errors in database creation
Streamlines onboarding experience
Success Criteria
Functional:
Hook correctly intercepts WebSearch for configured keywords
Multi-keyword queries trigger parallel MCP calls
Escape hatch allows retry after MCP failure
Per-session state isolation works correctly
Reliability:
Hook never breaks Claude's core functionality (fail open)
Verified repository state (codebase analysis):
.mcp.json server name is leann-docs-search (confirmed - needs rename to leann)
README.md contains only # mcp-docsearch (confirmed - stub)
.leann/indexes/ pre-built with HNSW/contriever backend, 179 passages (confirmed)
All P0 items remain pending
Status Summary
| Component | Status | Priority | Notes |
| --- | --- | --- | --- |
| LICENSE | ✅ Complete | - | MIT License |
| Design Document | ✅ Complete | - | 373 lines, comprehensive |
| .leann/ indexes | ✅ Complete | - | Pre-built, HNSW/contriever |
| .mcp.json | ⚠️ Needs fix | P0 | BLOCKING: Server name leann-docs-search → tool name mcp__leann-docs-search__search (design expects mcp__leann__search). Must resolve before config.example.json |
| docsearch.py | ❌ Not started | P0 | Critical path blocker |
| .gitignore | ❌ Not started | P1 | Quick win |
| config.example.json | ❌ Not started | P1 | Quick win (blocked by .mcp.json decision) |
| tests/ | ❌ Not started | P2 | Blocked by docsearch.py |
| README.md | ⚠️ Stub only | P2 | Currently just # mcp-docsearch |
| GitHub templates | ❌ Not started | P3 | Low priority |
| PROMPT_refinement.md | ⚠️ To delete | P3 | Development artifact, not part of deliverables |
Priority-Ordered Task List
Items sorted by implementation priority. Complete in order.
P0 — Critical Path (Blocks Everything)
Resolve .mcp.json server naming
DECISION REQUIRED: Design doc uses mcp__leann__search, current config produces mcp__leann-docs-search__search
Recommended action: Rename server from leann-docs-search to leann in .mcp.json
This unblocks config.example.json creation (P1)
Must be resolved BEFORE any code references tool names
docsearch.py: Create skeleton with constants
Location: /workspace/repo/docsearch.py
Shebang: #!/usr/bin/env python3
Imports: json, re, sys, pathlib, time (stdlib only)
load_state must validate schema structure (has last_denied with required subfields: query, allowed_domains, blocked_domains, timestamp), not just JSON validity
cleanup_stale_states removes state files older than 5 minutes (approximates session-start cleanup per design doc line 126)
ensure_hooks_directory must be called before any file operations to handle first-run scenario
WebSearch parameters tracked for the escape hatch comparison:
allowed_domains (array, optional): Only include results from these domains
blocked_domains (array, optional): Exclude results from these domains
Clarifications Resolved
State cleanup: Design doc mentions "clear stale state files on session start" (line 126) but hooks cannot detect session boundaries. Resolution: opportunistic cleanup of all state files >5 minutes old on each hook invocation approximates this behavior. Combined with 5-minute timestamp expiry check in escape hatch logic, this provides robust stale state handling.
MCP parameters: Include path in additionalContext, let Claude determine params
Same tool name: Different databases differentiated by path in additionalContext
Parallel MCP calls: Multi-keyword denials must include explicit "IN PARALLEL" text per design doc line 176 to ensure Claude calls MCP tools concurrently
MCP tool naming: The .mcp.json server name determines tool name prefix. Current leann-docs-search produces mcp__leann-docs-search__search. DECISION: Rename to leann for cleaner mcp__leann__search as per design doc examples. (Elevated to P0)
Directory creation: The hooks directory ~/.claude/hooks/ may not exist on first run. State file operations must create it if missing.
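A sketch of the cleanup and directory-creation behavior described above. The function names follow the P0 task list (`cleanup_stale_states`, `ensure_hooks_directory`), but the exact signatures are assumptions:

```python
import time
from pathlib import Path

STATE_MAX_AGE_SECONDS = 300  # 5-minute expiry per the design


def ensure_hooks_directory(state_dir: Path) -> None:
    """Create the state directory if missing (handles the first-run scenario)."""
    state_dir.mkdir(parents=True, exist_ok=True)


def cleanup_stale_states(state_dir: Path) -> None:
    """Opportunistically remove state files older than the expiry window."""
    now = time.time()
    for state_file in state_dir.glob("docsearch-state-*.json"):
        try:
            if now - state_file.stat().st_mtime > STATE_MAX_AGE_SECONDS:
                state_file.unlink()
        except OSError:
            pass  # Fail open: never let cleanup break the hook
```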
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Build a Claude Code PreToolUse hook that intercepts WebSearch tool calls and redirects documentation-related queries to local RAG databases via LEANN MCP server.
Architecture: A Python 3.12+ script (docsearch.py) acts as a PreToolUse hook. It reads configuration from ~/.claude/hooks/docsearch-config.json, tracks state in session-specific files, and uses exit codes to allow (0) or deny (2) WebSearch calls. When denying, it provides additionalContext guiding Claude to use LEANN MCP tools instead. An escape hatch allows retries if RAG results are insufficient.
Tech Stack: Python 3.12+ standard library only (json, re, sys, pathlib, os, time)
Reference: See docs/plans/2026-02-04-docsearch-hook-design.md for full design specification.
Implementation Status
| Task | Status | Description | Priority | Order |
| --- | --- | --- | --- | --- |
| Task 2 | NOT STARTED | Core Hook Script - Skeleton and Input Parsing | P0 - Foundation | 1 |
| Task 3 | NOT STARTED | Configuration Loading | P0 - Core | 2 |
| Task 4 | NOT STARTED | Keyword Matching with Word Boundaries | P0 - Core | 3 |
| Task 6a | NOT STARTED | Session ID Sanitization (Security) | P0 - Security | 4 |
| Task 6 | NOT STARTED | Session State Management for Escape Hatch | P0 - Core | 5 |
| Task 12 | NOT STARTED | Make Script Executable and Add Shebang | P0 - Core | 6 |
| Task 5 | NOT STARTED | Multiple Keyword Matching (Verification Tests) | P1 - Feature | 7 |
| Task 7 | NOT STARTED | State Cleanup (Stale State Expiry) | P1 - Enhancement | 8 |
| Task 7a | NOT STARTED | Session Start State Cleanup | P1 - Enhancement | 9 |
| Task 3a | NOT STARTED | Configuration Schema Validation | P1 - Quality | 10 |
| Task 3b | NOT STARTED | Keywords Element Type Validation | P1 - Quality | 11 |
| Task 8 | NOT STARTED | Error Logging to stderr | P1 - Enhancement | 12 |
| Task 9 | NOT STARTED | Complete Test Coverage and Edge Cases | P1 - Quality | 13 |
| Task 9a | NOT STARTED | Permission Error Tests | P1 - Quality | 14 |
| Task 9b | NOT STARTED | Session Isolation Tests | P1 - Quality | 15 |
| Task 1 | NOT STARTED | Project Structure and Example Config | P2 - Documentation | 16 |
| Task 10 | NOT STARTED | README Documentation | P2 - Documentation | 17 |
| Task 11 | NOT STARTED | Final Integration Testing | P2 - Validation | 18 |
Codebase Analysis (2026-02-05)
Current State: No implementation exists. Repository contains only:
README.md - Empty placeholder ("# mcp-docsearch")
LICENSE - MIT license file
docs/plans/ - Design and implementation plan documents
.leann/ - LEANN index files (not relevant to implementation)
The following gaps were identified when comparing this plan against the design spec:
Previously Identified (Addressed in Plan)
Session Start State Cleanup (Task 7a): Design spec mentions "Clear state file on session start" as a complementary mechanism to timestamp expiry - added as new task.
Config Schema Validation (Task 3a): Design spec marks config fields as "required" but original plan silently defaults missing fields - added as new task.
Tech Stack: Design spec should include os and time modules (used in implementation).
Newly Identified (2026-02-05 Analysis)
Critical Gaps:
4. Type validation for config fields missing: Task 3a only checks field presence, not types. keywords should be validated as array of strings, not just present.
5. Empty keywords array not handled: A database entry with keywords: [] will silently fail to match anything. Should log warning and skip.
Important Gaps:
6. Path format validation missing: Design spec says path should be "Absolute path" but no validation exists. Should warn on relative paths.
7. Task 7a cleanup timing differs from design: Design says "Clear stale state files on session start" but Task 7a cleans during hook execution. Semantically equivalent but worth noting.
8. No permission error tests: Implementation silently handles state/config permission errors but these edge cases aren't tested.
9. No concurrent session isolation test: Tests use different session_ids but don't verify true isolation.
Minor Gaps (Can Address Post-MVP):
10. GitHub issue templates not created: Design mentions .github/ISSUE_TEMPLATE/ but no task creates it.
11. Success criteria not all testable: Design lists "Clean Python code with type hints" as success criteria but not verified.
12. State file naming uses raw session_id: No sanitization of session_id for filesystem safety (special characters).
Security Gap (2026-02-05 Deep Analysis) - MUST FIX
CRITICAL - Session ID Path Traversal Vulnerability:
13. Session ID sanitization required: The get_state_file() function uses raw session_id in the filename without sanitization. This could allow path traversal attacks if a malicious session_id like "../../etc/passwd" or "foo/bar" is provided. Add Task 6a: Session ID Sanitization to address this before Task 6.
TDD Issues Identified (2026-02-05 Deep Analysis)
Task 3 "Expected: FAIL" reason is incorrect: The test would actually PASS because Task 2's implementation returns exit 0 for WebSearch tools. Need to fix the expected failure reason.
Task 5 violates TDD principles: Tests are expected to pass immediately (verification tests, not TDD). Should be relabeled or moved.
Task 7a test doesn't verify cleanup: test_stale_state_file_cleaned_on_unrelated_query should assert that stale files were actually deleted.
Task 5 order test missing null checks: Should verify find() doesn't return -1 before comparing positions.
Keywords element type validation missing: Task 3a validates keywords is a list but not that all elements are strings.
Parallelization Opportunities
Tasks 5, 3a, and 8 can run in parallel after Task 4 completes (no dependencies between them).
Task 1 could be P2 since it's just an example config file, not required for core functionality.
Prioritized Remaining Work (Bullet Points)
Phase 1: Core Functionality (P0) - MUST HAVE
All items below are required for a minimal viable hook:
Task 2: Core Hook Script - Skeleton and Input Parsing
Create tests/test_hook.py with run_hook() helper and input parsing tests
Create docsearch.py with main() entry point, stdin JSON parsing, WebSearch filtering
Verify tests pass, commit
Task 3: Configuration Loading
Add tests for missing/invalid config file handling (fail-open behavior)
Implement get_config_path() and load_config() functions
Support DOCSEARCH_CONFIG_PATH environment variable for testing
Verify tests pass, commit
Task 4: Keyword Matching with Word Boundaries
Create tests/fixtures/valid_config.json test fixture
Add tests for single keyword match, no match, case-insensitive, word boundary
Implement find_matching_databases() with \b regex word boundaries
Implement build_deny_response() for single/multiple database responses
Verify tests pass, commit
Task 6a: Session ID Sanitization (Security) (NEW - MUST BE BEFORE Task 6)
Add tests for path traversal and special character handling in session_id
Implement sanitize_session_id() using regex to allow only alphanumeric, dash, underscore
Create get_state_file() stub that uses sanitized session_id
Verify tests pass, commit
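A sketch of the allow-list sanitization this task describes (the `"default"` fallback for an empty result is an assumption, not from the design):

```python
import re


def sanitize_session_id(session_id: str) -> str:
    """Allow only alphanumerics, dash, and underscore to prevent path traversal."""
    sanitized = re.sub(r"[^A-Za-z0-9_-]", "_", session_id)
    return sanitized or "default"  # Assumed fallback for empty/fully-invalid input
```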
Task 6: Session State Management for Escape Hatch
Add tests for state file creation, escape hatch retry, different query denial
Initial docsearch.py skeleton (Task 2):

```python
#!/usr/bin/env python3
"""DocSearch Hook - PreToolUse hook that redirects documentation queries to RAG databases.

This hook intercepts WebSearch tool calls and checks if the query matches configured
documentation keywords. If matched, it denies the search and guides Claude to use
LEANN MCP tools instead. Includes an escape hatch for retrying web search if RAG fails.
"""
import json
import sys


def main() -> int:
    """Main entry point for the hook."""
    # Read and parse input from stdin
    try:
        stdin_data = sys.stdin.read()
        hook_input = json.loads(stdin_data)
    except json.JSONDecodeError:
        # Invalid JSON - fail open
        return 0

    # Get tool name - if not WebSearch, allow through
    tool_name = hook_input.get("tool_name", "")
    if tool_name != "WebSearch":
        return 0

    # Placeholder for future implementation
    return 0


if __name__ == "__main__":
    sys.exit(main())
```
Task 3: Configuration Loading
Step 1: Write the failing test for configuration loading
Add to tests/test_hook.py:
```python
class TestConfigLoading:
    """Tests for configuration file loading."""

    def test_missing_config_allows_through(self, tmp_path):
        """Missing config file should fail open (exit 0)."""
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "how to configure gitlab ci"},
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(tmp_path / "nonexistent.json")},
        )
        assert exit_code == 0

    def test_invalid_json_config_allows_through(self, tmp_path):
        """Invalid JSON config should fail open (exit 0)."""
        config_file = tmp_path / "config.json"
        config_file.write_text("not valid json")
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "how to configure gitlab ci"},
        }
        exit_code, stdout, stderr = run_hook(
            hook_input, env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(config_file)}
        )
        assert exit_code == 0
```
Step 2: Run test to verify it fails
Run: pytest tests/test_hook.py::TestConfigLoading -v
Expected: FAIL (config loading not implemented)
Step 3: Write minimal implementation
Update docsearch.py:
```python
#!/usr/bin/env python3
"""DocSearch Hook - PreToolUse hook that redirects documentation queries to RAG databases.

This hook intercepts WebSearch tool calls and checks if the query matches configured
documentation keywords. If matched, it denies the search and guides Claude to use
LEANN MCP tools instead. Includes an escape hatch for retrying web search if RAG fails.
"""
import json
import os
import sys
from pathlib import Path


def get_config_path() -> Path:
    """Get the configuration file path."""
    if env_path := os.environ.get("DOCSEARCH_CONFIG_PATH"):
        return Path(env_path)
    return Path.home() / ".claude" / "hooks" / "docsearch-config.json"


def load_config() -> dict | None:
    """Load and parse the configuration file. Returns None on any error."""
    config_path = get_config_path()
    try:
        with open(config_path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError, OSError):
        return None


def main() -> int:
    """Main entry point for the hook."""
    # Read and parse input from stdin
    try:
        stdin_data = sys.stdin.read()
        hook_input = json.loads(stdin_data)
    except json.JSONDecodeError:
        # Invalid JSON - fail open
        return 0

    # Get tool name - if not WebSearch, allow through
    tool_name = hook_input.get("tool_name", "")
    if tool_name != "WebSearch":
        return 0

    # Load configuration - if missing or invalid, allow through
    config = load_config()
    if config is None:
        return 0

    # Placeholder for keyword matching
    return 0


if __name__ == "__main__":
    sys.exit(main())
```
Task 4: Keyword Matching with Word Boundaries
Step 2: Write the failing test for keyword matching
Add to tests/test_hook.py:
```python
class TestKeywordMatching:
    """Tests for keyword detection in queries."""

    def test_single_keyword_match_denies(self):
        """Query containing configured keyword should be denied (exit 2)."""
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "how to configure gitlab ci runners"},
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json")},
        )
        assert exit_code == 2
        # Verify output is valid JSON with correct structure
        output = json.loads(stdout)
        assert output["hookSpecificOutput"]["permissionDecision"] == "deny"
        assert output["hookSpecificOutput"]["hookEventName"] == "PreToolUse"
        assert "gitlab" in output["hookSpecificOutput"]["permissionDecisionReason"].lower()

    def test_no_keyword_match_allows(self):
        """Query without configured keywords should be allowed (exit 0)."""
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "how to make a sandwich"},
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json")},
        )
        assert exit_code == 0

    def test_case_insensitive_matching(self):
        """Keyword matching should be case-insensitive."""
        for query in ["GITLAB ci", "GitLab CI", "gitlab ci"]:
            hook_input = {
                "tool_name": "WebSearch",
                "tool_input": {"query": query},
            }
            exit_code, stdout, stderr = run_hook(
                hook_input,
                env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json")},
            )
            assert exit_code == 2, f"Failed for query: {query}"

    def test_word_boundary_matching(self):
        """Partial word matches should NOT trigger denial."""
        # "ungitlabbed" contains "gitlab" but should not match due to word boundaries
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "ungitlabbed workflow"},
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json")},
        )
        assert exit_code == 0
```
Step 3: Run test to verify it fails
Run: pytest tests/test_hook.py::TestKeywordMatching -v
Expected: FAIL (keyword matching not implemented)
Step 4: Write minimal implementation
Update docsearch.py:
```python
#!/usr/bin/env python3
"""DocSearch Hook - PreToolUse hook that redirects documentation queries to RAG databases.

This hook intercepts WebSearch tool calls and checks if the query matches configured
documentation keywords. If matched, it denies the search and guides Claude to use
LEANN MCP tools instead. Includes an escape hatch for retrying web search if RAG fails.
"""
import json
import os
import re
import sys
from pathlib import Path


def get_config_path() -> Path:
    """Get the configuration file path."""
    if env_path := os.environ.get("DOCSEARCH_CONFIG_PATH"):
        return Path(env_path)
    return Path.home() / ".claude" / "hooks" / "docsearch-config.json"


def load_config() -> dict | None:
    """Load and parse the configuration file. Returns None on any error."""
    config_path = get_config_path()
    try:
        with open(config_path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError, OSError):
        return None


def find_matching_databases(query: str, config: dict) -> list[dict]:
    """Find all databases with keywords matching the query.

    Uses word boundary matching (case-insensitive).
    Returns list of matching database configs.
    """
    matches = []
    query_lower = query.lower()
    for db in config.get("databases", []):
        for keyword in db.get("keywords", []):
            # Word boundary regex for exact word match
            pattern = rf"\b{re.escape(keyword.lower())}\b"
            if re.search(pattern, query_lower):
                matches.append(db)
                break  # Only add each database once
    return matches


def build_deny_response(matches: list[dict]) -> dict:
    """Build the JSON response for denying a WebSearch."""
    if len(matches) == 1:
        db = matches[0]
        matched_keywords = db["keywords"][0]  # Use first keyword for message
        reason = f"Query matches '{matched_keywords}' - using RAG database instead"
        context = (
            f"This query should use the LEANN MCP tool '{db['mcp_tool_name']}' "
            f"to search the {db['description']} RAG database at {db['path']} instead of web search."
        )
    else:
        keyword_list = " and ".join(f"'{db['keywords'][0]}'" for db in matches)
        reason = f"Query matches {keyword_list} - using RAG databases instead"
        lines = [
            "This query matches multiple documentation databases. "
            "Please use these LEANN MCP tools IN PARALLEL:"
        ]
        for i, db in enumerate(matches, 1):
            lines.append(f"{i}. '{db['mcp_tool_name']}' for {db['description']} at {db['path']}")
        context = "\n".join(lines)
    return {
        "hookSpecificOutput": {
            "hookEventName": "PreToolUse",
            "permissionDecision": "deny",
            "permissionDecisionReason": reason,
            "additionalContext": context,
        }
    }


def main() -> int:
    """Main entry point for the hook."""
    # Read and parse input from stdin
    try:
        stdin_data = sys.stdin.read()
        hook_input = json.loads(stdin_data)
    except json.JSONDecodeError:
        # Invalid JSON - fail open
        return 0

    # Get tool name - if not WebSearch, allow through
    tool_name = hook_input.get("tool_name", "")
    if tool_name != "WebSearch":
        return 0

    # Load configuration - if missing or invalid, allow through
    config = load_config()
    if config is None:
        return 0

    # Get the query from tool input
    tool_input = hook_input.get("tool_input", {})
    query = tool_input.get("query", "")
    if not query:
        return 0

    # Find matching databases
    matches = find_matching_databases(query, config)
    if not matches:
        return 0

    # Deny and provide guidance
    response = build_deny_response(matches)
    print(json.dumps(response))
    return 2


if __name__ == "__main__":
    sys.exit(main())
```
Step 5: Run test to verify it passes
Run: pytest tests/test_hook.py -v
Expected: PASS
Step 6: Commit
```shell
mkdir -p tests/fixtures
git add docsearch.py tests/test_hook.py tests/fixtures/valid_config.json
git commit -m "feat: add keyword matching with word boundaries and denial responses"
```
Task 5: Multiple Keyword Matching
Files:
Modify: tests/test_hook.py
Step 1: Write the test for multiple keyword matches
Add to tests/test_hook.py:
```python
class TestMultipleKeywordMatching:
    """Tests for queries matching multiple databases."""

    def test_multiple_keywords_match_all_databases(self):
        """Query with multiple keywords should mention all matching databases."""
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "how to deploy gitlab on kubernetes"},
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json")},
        )
        assert exit_code == 2
        output = json.loads(stdout)
        context = output["hookSpecificOutput"]["additionalContext"]
        # Both databases should be mentioned
        assert "gitlab" in context.lower()
        assert "kubernetes" in context.lower()
        assert "IN PARALLEL" in context

    def test_k8s_alias_matches_kubernetes(self):
        """Alternative keywords like 'k8s' should match kubernetes database."""
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "k8s pod configuration"},
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json")},
        )
        assert exit_code == 2
        output = json.loads(stdout)
        assert "kubernetes" in output["hookSpecificOutput"]["additionalContext"].lower()

    def test_database_order_preserved_in_output(self):
        """Databases should appear in config file order in output."""
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "gitlab kubernetes deployment"},
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json")},
        )
        assert exit_code == 2
        output = json.loads(stdout)
        context = output["hookSpecificOutput"]["additionalContext"]
        # GitLab appears first in config, so should be listed as item 1
        gitlab_pos = context.find("GitLab")
        kubernetes_pos = context.find("Kubernetes")
        # Guard against find() returning -1 before comparing positions
        assert gitlab_pos != -1 and kubernetes_pos != -1
        assert gitlab_pos < kubernetes_pos, "GitLab should appear before Kubernetes (config order)"
```
```shell
git add tests/test_hook.py
git commit -m "test: add tests for multiple keyword matching and config order preservation"
```
Task 6: Session State Management for Escape Hatch
Files:
Modify: docsearch.py
Modify: tests/test_hook.py
Step 1: Write the failing test for state file management
Add to tests/test_hook.py:
```python
class TestStateManagement:
    """Tests for session state file management."""

    def test_first_search_stores_state_and_denies(self, tmp_path):
        """First matching search should store state and deny."""
        state_dir = tmp_path / "state"
        state_dir.mkdir()
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {
                "query": "how to configure gitlab ci",
                "allowed_domains": [],
                "blocked_domains": [],
            },
            "session_id": "test-session-123",
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={
                **os.environ,
                "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json"),
                "DOCSEARCH_STATE_DIR": str(state_dir),
            },
        )
        assert exit_code == 2
        # State file should be created
        state_file = state_dir / "docsearch-state-test-session-123.json"
        assert state_file.exists()
        state = json.loads(state_file.read_text())
        assert state["last_denied"]["query"] == "how to configure gitlab ci"

    def test_retry_same_params_allows_through(self, tmp_path):
        """Retry with exact same params should allow through (escape hatch)."""
        state_dir = tmp_path / "state"
        state_dir.mkdir()
        # Pre-create state file simulating a previous denial
        state_file = state_dir / "docsearch-state-test-session-456.json"
        state_file.write_text(json.dumps({
            "last_denied": {
                "query": "how to configure gitlab ci",
                "allowed_domains": [],
                "blocked_domains": [],
                "timestamp": int(time.time()),
            }
        }))
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {
                "query": "how to configure gitlab ci",
                "allowed_domains": [],
                "blocked_domains": [],
            },
            "session_id": "test-session-456",
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={
                **os.environ,
                "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json"),
                "DOCSEARCH_STATE_DIR": str(state_dir),
            },
        )
        assert exit_code == 0
        # State file should be cleared after successful retry
        state = json.loads(state_file.read_text())
        assert state.get("last_denied") is None

    def test_different_query_denies_again(self, tmp_path):
        """Different query should deny even with existing state."""
        state_dir = tmp_path / "state"
        state_dir.mkdir()
        # Pre-create state file with different query
        state_file = state_dir / "docsearch-state-test-session-789.json"
        state_file.write_text(json.dumps({
            "last_denied": {
                "query": "gitlab runners setup",
                "allowed_domains": [],
                "blocked_domains": [],
                "timestamp": int(time.time()),
            }
        }))
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {
                "query": "how to configure gitlab ci",  # Different query
                "allowed_domains": [],
                "blocked_domains": [],
            },
            "session_id": "test-session-789",
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={
                **os.environ,
                "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json"),
                "DOCSEARCH_STATE_DIR": str(state_dir),
            },
        )
        assert exit_code == 2

    def test_corrupted_state_file_fails_open(self, tmp_path):
        """Corrupted state file should be treated as no previous denial."""
        state_dir = tmp_path / "state"
        state_dir.mkdir()
        # Pre-create corrupted state file
        state_file = state_dir / "docsearch-state-test-session-corrupted.json"
        state_file.write_text("{invalid json content")
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {
                "query": "how to configure gitlab ci",
                "allowed_domains": [],
                "blocked_domains": [],
            },
            "session_id": "test-session-corrupted",
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={
                **os.environ,
                "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json"),
                "DOCSEARCH_STATE_DIR": str(state_dir),
            },
        )
        # Should deny (no valid state to trigger escape hatch)
        assert exit_code == 2
```
Step 2: Run test to verify it fails
Run: pytest tests/test_hook.py::TestStateManagement -v
Expected: FAIL (state management not implemented)
Step 3: Write minimal implementation
Update docsearch.py to add state management:
```python
#!/usr/bin/env python3
"""DocSearch Hook - PreToolUse hook that redirects documentation queries to RAG databases.

This hook intercepts WebSearch tool calls and checks if the query matches configured
documentation keywords. If matched, it denies the search and guides Claude to use
LEANN MCP tools instead. Includes an escape hatch for retrying web search if RAG fails.
"""
import json
import os
import re
import sys
import time
from pathlib import Path


def get_config_path() -> Path:
    """Get the configuration file path."""
    if env_path := os.environ.get("DOCSEARCH_CONFIG_PATH"):
        return Path(env_path)
    return Path.home() / ".claude" / "hooks" / "docsearch-config.json"


def get_state_dir() -> Path:
    """Get the state directory path."""
    if env_path := os.environ.get("DOCSEARCH_STATE_DIR"):
        return Path(env_path)
    return Path.home() / ".claude" / "hooks"


def get_state_file(session_id: str) -> Path:
    """Get the state file path for a session."""
    return get_state_dir() / f"docsearch-state-{session_id}.json"


def load_config() -> dict | None:
    """Load and parse the configuration file. Returns None on any error."""
    config_path = get_config_path()
    try:
        with open(config_path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError, OSError):
        return None


def load_state(session_id: str) -> dict:
    """Load session state. Returns empty dict on any error."""
    state_file = get_state_file(session_id)
    try:
        with open(state_file) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError, OSError):
        return {}


def save_state(session_id: str, state: dict) -> None:
    """Save session state."""
    state_file = get_state_file(session_id)
    try:
        state_file.parent.mkdir(parents=True, exist_ok=True)
        with open(state_file, "w") as f:
            json.dump(state, f)
    except OSError:
        pass  # Fail silently - state is optional


def params_match(current: dict, previous: dict) -> bool:
    """Check if current tool_input matches previous denied params.

    Compares query exactly and domains as sets (order-independent).
    """
    if current.get("query") != previous.get("query"):
        return False
    # Compare domains as sets (order-independent)
    current_allowed = set(current.get("allowed_domains", []) or [])
    previous_allowed = set(previous.get("allowed_domains", []) or [])
    if current_allowed != previous_allowed:
        return False
    current_blocked = set(current.get("blocked_domains", []) or [])
    previous_blocked = set(previous.get("blocked_domains", []) or [])
    if current_blocked != previous_blocked:
        return False
    return True


def find_matching_databases(query: str, config: dict) -> list[dict]:
    """Find all databases with keywords matching the query.

    Uses whole-word matching (case-insensitive). Returns list of
    matching database configs.
    """
    matches = []
    query_lower = query.lower()
    for db in config.get("databases", []):
        for keyword in db.get("keywords", []):
            # Lookarounds rather than \b so keywords that start or end with
            # non-word characters (e.g. "c++") still match as whole words
            pattern = rf"(?<!\w){re.escape(keyword.lower())}(?!\w)"
            if re.search(pattern, query_lower):
                matches.append(db)
                break  # Only add each database once
    return matches


def build_deny_response(matches: list[dict]) -> dict:
    """Build the JSON response for denying a WebSearch."""
    if len(matches) == 1:
        db = matches[0]
        matched_keyword = db["keywords"][0]  # Use first keyword for message
        reason = f"Query matches '{matched_keyword}' - using RAG database instead"
        context = (
            f"This query should use the LEANN MCP tool '{db['mcp_tool_name']}' "
            f"to search the {db['description']} RAG database at {db['path']} instead of web search."
        )
    else:
        keyword_list = " and ".join(f"'{db['keywords'][0]}'" for db in matches)
        reason = f"Query matches {keyword_list} - using RAG databases instead"
        lines = ["This query matches multiple documentation databases. Please use these LEANN MCP tools IN PARALLEL:"]
        for i, db in enumerate(matches, 1):
            lines.append(f"{i}. '{db['mcp_tool_name']}' for {db['description']} at {db['path']}")
        context = "\n".join(lines)
    return {
        "hookSpecificOutput": {
            "hookEventName": "PreToolUse",
            "permissionDecision": "deny",
            "permissionDecisionReason": reason,
            "additionalContext": context,
        }
    }


def main() -> int:
    """Main entry point for the hook."""
    # Read and parse input from stdin
    try:
        stdin_data = sys.stdin.read()
        hook_input = json.loads(stdin_data)
    except json.JSONDecodeError:
        # Invalid JSON - fail open
        return 0

    # Get tool name - if not WebSearch, allow through
    tool_name = hook_input.get("tool_name", "")
    if tool_name != "WebSearch":
        return 0

    # Load configuration - if missing or invalid, allow through
    config = load_config()
    if config is None:
        return 0

    # Get the query from tool input
    tool_input = hook_input.get("tool_input", {})
    query = tool_input.get("query", "")
    if not query:
        return 0

    # Get session ID for state management
    session_id = hook_input.get("session_id", "default")

    # Check escape hatch - if this is a retry of the same params, allow through
    state = load_state(session_id)
    last_denied = state.get("last_denied")
    if last_denied and params_match(tool_input, last_denied):
        # Clear state and allow through
        save_state(session_id, {"last_denied": None})
        return 0

    # Find matching databases
    matches = find_matching_databases(query, config)
    if not matches:
        return 0

    # Store current params in state for escape hatch
    save_state(session_id, {
        "last_denied": {
            "query": tool_input.get("query", ""),
            "allowed_domains": tool_input.get("allowed_domains", []),
            "blocked_domains": tool_input.get("blocked_domains", []),
            "timestamp": int(time.time()),
        }
    })

    # Deny and provide guidance
    response = build_deny_response(matches)
    print(json.dumps(response))
    return 2


if __name__ == "__main__":
    sys.exit(main())
```
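The order-independent domain comparison can be exercised in isolation. This is a condensed but behavior-equivalent sketch of `params_match` above (a loop over the two domain keys instead of repeated blocks):

```python
def params_match(current: dict, previous: dict) -> bool:
    # Query must match exactly; domain lists compare as sets
    if current.get("query") != previous.get("query"):
        return False
    for key in ("allowed_domains", "blocked_domains"):
        # "or []" tolerates an explicit null in stored state
        if set(current.get(key, []) or []) != set(previous.get(key, []) or []):
            return False
    return True

print(params_match(
    {"query": "gitlab ci", "allowed_domains": ["a.com", "b.com"], "blocked_domains": []},
    {"query": "gitlab ci", "allowed_domains": ["b.com", "a.com"], "blocked_domains": None},
))  # True - same domains in a different order; null treated as empty
print(params_match(
    {"query": "gitlab ci", "allowed_domains": []},
    {"query": "gitlab runners", "allowed_domains": []},
))  # False - different query
```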
Step 4: Run test to verify it passes
Run: pytest tests/test_hook.py -v
Expected: PASS
Step 5: Commit
git add docsearch.py tests/test_hook.py
git commit -m "feat: add session state management for escape hatch"
Task 7: State Cleanup (Stale State Expiry)
Files:
Modify: docsearch.py
Modify: tests/test_hook.py
Step 1: Write the failing test for stale state cleanup
Add to tests/test_hook.py:
```python
class TestStaleStateCleanup:
    """Tests for stale state file cleanup."""

    def test_expired_state_is_ignored(self, tmp_path):
        """State older than 5 minutes should be ignored."""
        state_dir = tmp_path / "state"
        state_dir.mkdir()
        # Pre-create state file with old timestamp (6 minutes ago)
        old_timestamp = int(time.time()) - 360  # 6 minutes ago
        state_file = state_dir / "docsearch-state-test-session-old.json"
        state_file.write_text(json.dumps({
            "last_denied": {
                "query": "how to configure gitlab ci",
                "allowed_domains": [],
                "blocked_domains": [],
                "timestamp": old_timestamp,
            }
        }))
        # Same query should be denied again (state expired)
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {
                "query": "how to configure gitlab ci",
                "allowed_domains": [],
                "blocked_domains": [],
            },
            "session_id": "test-session-old",
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={
                **os.environ,
                "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json"),
                "DOCSEARCH_STATE_DIR": str(state_dir),
            },
        )
        assert exit_code == 2  # Should deny, not allow through

    def test_recent_state_is_used(self, tmp_path):
        """State less than 5 minutes old should be used."""
        state_dir = tmp_path / "state"
        state_dir.mkdir()
        # Pre-create state file with recent timestamp (2 minutes ago)
        recent_timestamp = int(time.time()) - 120  # 2 minutes ago
        state_file = state_dir / "docsearch-state-test-session-recent.json"
        state_file.write_text(json.dumps({
            "last_denied": {
                "query": "how to configure gitlab ci",
                "allowed_domains": [],
                "blocked_domains": [],
                "timestamp": recent_timestamp,
            }
        }))
        # Same query should be allowed (escape hatch)
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {
                "query": "how to configure gitlab ci",
                "allowed_domains": [],
                "blocked_domains": [],
            },
            "session_id": "test-session-recent",
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={
                **os.environ,
                "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json"),
                "DOCSEARCH_STATE_DIR": str(state_dir),
            },
        )
        assert exit_code == 0  # Should allow through
```
Step 2: Run test to verify it fails
Run: pytest tests/test_hook.py::TestStaleStateCleanup -v
Expected: FAIL (timestamp expiry not implemented)
Step 3: Write minimal implementation
Add near the top of docsearch.py:
```python
# State expiry timeout in seconds (5 minutes)
STATE_EXPIRY_SECONDS = 300


def is_state_expired(last_denied: dict) -> bool:
    """Check if the state entry has expired (older than 5 minutes)."""
    timestamp = last_denied.get("timestamp", 0)
    return (int(time.time()) - timestamp) > STATE_EXPIRY_SECONDS
```
Update the escape hatch check in main():
```python
# Check escape hatch - if this is a retry of the same params, allow through
state = load_state(session_id)
last_denied = state.get("last_denied")
if last_denied and not is_state_expired(last_denied) and params_match(tool_input, last_denied):
    # Clear state and allow through
    save_state(session_id, {"last_denied": None})
    return 0
```
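The expiry check can be sanity-tested standalone (constant and function inlined here for a self-contained snippet):

```python
import time

# State expiry timeout in seconds (5 minutes)
STATE_EXPIRY_SECONDS = 300

def is_state_expired(last_denied: dict) -> bool:
    """Check if the state entry has expired."""
    timestamp = last_denied.get("timestamp", 0)
    return (int(time.time()) - timestamp) > STATE_EXPIRY_SECONDS

now = int(time.time())
print(is_state_expired({"timestamp": now - 360}))  # 6 minutes old -> True
print(is_state_expired({"timestamp": now - 120}))  # 2 minutes old -> False
print(is_state_expired({}))  # missing timestamp defaults to 0 -> True
```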
Step 4: Run test to verify it passes
Run: pytest tests/test_hook.py -v
Expected: PASS
Step 5: Commit
git add docsearch.py tests/test_hook.py
git commit -m "feat: add 5-minute expiry for stale state entries"
Task 8: Error Logging to stderr
Files:
Modify: docsearch.py
Modify: tests/test_hook.py
Step 1: Write the failing test for error logging
Add to tests/test_hook.py:
```python
class TestErrorLogging:
    """Tests for error logging to stderr."""

    def test_invalid_config_logs_to_stderr(self, tmp_path):
        """Invalid config JSON should log error to stderr."""
        config_file = tmp_path / "bad_config.json"
        config_file.write_text("{invalid json")
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "test query"},
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(config_file)},
        )
        assert exit_code == 0  # Fail open
        assert "error" in stderr.lower() or "json" in stderr.lower()
```
Step 2: Run test to verify it fails
Run: pytest tests/test_hook.py::TestErrorLogging -v
Expected: FAIL (no stderr logging)
Step 3: Write minimal implementation
Update load_config() in docsearch.py:
```python
def load_config() -> dict | None:
    """Load and parse the configuration file. Returns None on any error."""
    config_path = get_config_path()
    try:
        with open(config_path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None  # Silent - expected during first-time setup
    except json.JSONDecodeError as e:
        print(f"Error: Invalid JSON in config file {config_path}: {e}", file=sys.stderr)
        return None
    except OSError as e:
        print(f"Error: Could not read config file {config_path}: {e}", file=sys.stderr)
        return None
```
Step 4: Run test to verify it passes
Run: pytest tests/test_hook.py -v
Expected: PASS
Step 5: Commit
git add docsearch.py tests/test_hook.py
git commit -m "feat: add error logging to stderr for config issues"
Task 9: Complete Test Coverage and Edge Cases
Files:
Modify: tests/test_hook.py
Step 1: Add comprehensive edge case tests
Add to tests/test_hook.py:
```python
class TestEdgeCases:
    """Tests for edge cases and boundary conditions."""

    def test_empty_query_allows_through(self):
        """Empty query should be allowed through."""
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": ""},
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json")},
        )
        assert exit_code == 0

    def test_missing_query_allows_through(self):
        """Missing query field should be allowed through."""
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {},
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json")},
        )
        assert exit_code == 0

    def test_missing_tool_input_allows_through(self):
        """Missing tool_input field should be allowed through."""
        hook_input = {
            "tool_name": "WebSearch",
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json")},
        )
        assert exit_code == 0

    def test_missing_session_id_uses_default(self, tmp_path):
        """Missing session_id should use 'default' session."""
        state_dir = tmp_path / "state"
        state_dir.mkdir()
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "gitlab setup"},
            # Note: no session_id
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={
                **os.environ,
                "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json"),
                "DOCSEARCH_STATE_DIR": str(state_dir),
            },
        )
        assert exit_code == 2
        # Should use default session
        state_file = state_dir / "docsearch-state-default.json"
        assert state_file.exists()

    def test_empty_databases_config_allows_through(self, tmp_path):
        """Config with empty databases array should allow through."""
        config_file = tmp_path / "empty_config.json"
        config_file.write_text('{"databases": []}')
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "gitlab ci configuration"},
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(config_file)},
        )
        assert exit_code == 0

    def test_domains_compared_as_sets(self, tmp_path):
        """Domain arrays should be compared as sets (order-independent)."""
        state_dir = tmp_path / "state"
        state_dir.mkdir()
        # Pre-create state with domains in one order
        state_file = state_dir / "docsearch-state-test-domains.json"
        state_file.write_text(json.dumps({
            "last_denied": {
                "query": "gitlab ci",
                "allowed_domains": ["b.com", "a.com"],  # Different order
                "blocked_domains": [],
                "timestamp": int(time.time()),
            }
        }))
        # Query with same domains in a different order
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {
                "query": "gitlab ci",
                "allowed_domains": ["a.com", "b.com"],  # Same domains, different order
                "blocked_domains": [],
            },
            "session_id": "test-domains",
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={
                **os.environ,
                "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json"),
                "DOCSEARCH_STATE_DIR": str(state_dir),
            },
        )
        assert exit_code == 0  # Should match and allow through

    def test_special_characters_in_keywords(self, tmp_path):
        """Keywords with regex special characters should match correctly."""
        config_file = tmp_path / "special_config.json"
        config_file.write_text(json.dumps({
            "databases": [
                {
                    "keywords": ["c++", "c#", ".net"],
                    "path": "/mock/path/dotnet",
                    "mcp_tool_name": "mcp__leann__search",
                    "description": ".NET documentation"
                }
            ]
        }))
        # Test c++ keyword
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "c++ templates tutorial"},
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(config_file)},
        )
        assert exit_code == 2

    def test_output_contains_all_required_fields(self):
        """Output JSON should contain all required hookSpecificOutput fields."""
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "gitlab ci"},
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json")},
        )
        assert exit_code == 2
        output = json.loads(stdout)
        hook_output = output["hookSpecificOutput"]
        # Verify all required fields present
        assert "hookEventName" in hook_output
        assert "permissionDecision" in hook_output
        assert "permissionDecisionReason" in hook_output
        assert "additionalContext" in hook_output
        # Verify field values
        assert hook_output["hookEventName"] == "PreToolUse"
        assert hook_output["permissionDecision"] == "deny"
```
# DocSearch Hook
A Claude Code PreToolUse hook that intercepts WebSearch tool calls and redirects documentation-related queries to local RAG databases via LEANN MCP server.
## Features

- **Keyword-based interception**: Configure keywords that trigger RAG lookups instead of web searches
- **Multiple database support**: Match queries against multiple documentation databases
- **Smart escape hatch**: If RAG results are insufficient, retry the same search to use the web
- **Fail-open design**: Any errors gracefully fall back to normal web search
- **Session isolation**: Per-session state prevents cross-session interference

## Prerequisites

1. Python 3.12+
2. [LEANN](https://github.com/user/leann) installed and configured
3. LEANN MCP server configured in Claude Code's MCP settings
4. RAG databases built using LEANN tools

## Installation

1. **Install the hook script:**

   ```bash
   mkdir -p ~/.claude/hooks/PreToolUse
   cp docsearch.py ~/.claude/hooks/PreToolUse/docsearch.py
   chmod +x ~/.claude/hooks/PreToolUse/docsearch.py
   ```

2. **Create configuration file:**

   ```bash
   cp config.example.json ~/.claude/hooks/docsearch-config.json
   # Edit with your database paths and keywords
   ```

3. **Configure Claude Code to use the hook** by adding it to your Claude Code settings:
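A sketch of the registration in `~/.claude/settings.json`, assuming the current Claude Code hook settings schema (`matcher`/`hooks` entries under `PreToolUse`); check your Claude Code version's hooks documentation for the exact shape:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "WebSearch",
        "hooks": [
          {
            "type": "command",
            "command": "python3 ~/.claude/hooks/PreToolUse/docsearch.py"
          }
        ]
      }
    ]
  }
}
```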
Each entry in the config's `databases` array supports the following fields:

| Field | Required | Description |
|-------|----------|-------------|
| `keywords` | Yes | Array of keywords to match (case-insensitive, word boundaries) |
| `path` | Yes | Absolute path to LEANN database directory |
| `mcp_tool_name` | Yes | Exact MCP tool name for Claude to use |
| `description` | Yes | Human-readable description shown to Claude |
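For instance, a hypothetical `docsearch-config.json` mapping GitLab queries to a local database (the path is illustrative; the tool name matches the fixtures used in the tests):

```json
{
  "databases": [
    {
      "keywords": ["gitlab", "gitlab ci"],
      "path": "/home/user/.leann/databases/gitlab-docs",
      "mcp_tool_name": "mcp__leann__search",
      "description": "GitLab documentation"
    }
  ]
}
```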
## How It Works

1. You ask Claude a question containing a configured keyword (e.g., "How do I configure GitLab CI?")
2. Claude attempts to use WebSearch
3. The hook intercepts and denies the search
4. Claude receives guidance to use the LEANN MCP tool instead
5. If RAG results are insufficient, Claude can retry the exact same WebSearch
6. The hook recognizes the retry and allows it through

## Escape Hatch

If the RAG database doesn't have what you need, Claude can simply retry the same web search. The hook tracks the last denied search per session and allows identical retries through. State expires after 5 minutes as a safety net.

## Testing

```bash
pytest tests/test_hook.py -v
```

## Troubleshooting

### Hook not intercepting searches

Verify the hook script is executable: `chmod +x ~/.claude/hooks/PreToolUse/docsearch.py`
"""Integration Testing Guide for DocSearch HookThese tests require a real LEANN MCP server configured.Run these manually to verify end-to-end functionality.Setup:1. Configure LEANN MCP server in Claude Code2. Build a test RAG database with LEANN3. Add the database to docsearch-config.json4. Run Claude Code and test the flowTest Scenarios:1. Basic interception: Ask about configured keyword topic - Verify hook denies WebSearch - Verify Claude uses MCP tool - Verify answer comes from RAG2. Escape hatch: Ask about topic where RAG fails - Verify first search denied - Verify Claude can retry - Verify retry uses web search3. Multiple keywords: Ask about two topics in one query - Verify both databases mentioned - Verify Claude calls MCP tools in parallel4. Non-matching query: Ask about unconfigured topic - Verify hook allows WebSearch through"""
Task 12: Make Script Executable and Final Verification
Files:
Verify: docsearch.py
Step 1: Verify shebang line
The shebang is already present: #!/usr/bin/env python3
Step 2: Run full test suite
Run: pytest tests/test_hook.py -v --tb=short
Expected: All tests PASS
Step 3: Verify script is executable
Run: chmod +x docsearch.py && ./docsearch.py < /dev/null; echo "Exit code: $?"
Expected: Exit code: 0 (fail open on no input)
Step 4: Final commit
git add -A
git commit -m "chore: final cleanup and verification"
Task 3a: Configuration Schema Validation (NEW)
Files:
Modify: docsearch.py
Modify: tests/test_hook.py
Step 1: Write the failing test for config validation
Add to tests/test_hook.py:
```python
class TestConfigValidation:
    """Tests for configuration schema validation."""

    def test_missing_keywords_logs_warning(self, tmp_path):
        """Config entry missing 'keywords' should log warning and skip entry."""
        config_file = tmp_path / "incomplete_config.json"
        config_file.write_text(json.dumps({
            "databases": [
                {
                    "path": "/mock/path/test",
                    "mcp_tool_name": "mcp__leann__search",
                    "description": "Test database"
                    # Missing: "keywords"
                }
            ]
        }))
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "some query"},
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(config_file)},
        )
        # Should allow through (no valid databases)
        assert exit_code == 0
        # Should log warning about missing field
        assert "keywords" in stderr.lower() or "missing" in stderr.lower()

    def test_missing_path_logs_warning(self, tmp_path):
        """Config entry missing 'path' should log warning and skip entry."""
        config_file = tmp_path / "incomplete_config.json"
        config_file.write_text(json.dumps({
            "databases": [
                {
                    "keywords": ["test"],
                    "mcp_tool_name": "mcp__leann__search",
                    "description": "Test database"
                    # Missing: "path"
                }
            ]
        }))
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "test query"},
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(config_file)},
        )
        # Should allow through (no valid databases after validation)
        assert exit_code == 0
        # Should log warning
        assert "path" in stderr.lower() or "missing" in stderr.lower()

    def test_keywords_not_array_logs_warning(self, tmp_path):
        """Config entry with keywords as string (not array) should log warning."""
        config_file = tmp_path / "bad_type_config.json"
        config_file.write_text(json.dumps({
            "databases": [
                {
                    "keywords": "gitlab",  # Should be ["gitlab"]
                    "path": "/mock/path/test",
                    "mcp_tool_name": "mcp__leann__search",
                    "description": "Test database"
                }
            ]
        }))
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "gitlab ci"},
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(config_file)},
        )
        # Should allow through (invalid entry skipped)
        assert exit_code == 0
        # Should log warning about type
        assert "keywords" in stderr.lower() or "array" in stderr.lower() or "list" in stderr.lower()

    def test_empty_keywords_array_logs_warning(self, tmp_path):
        """Config entry with empty keywords array should log warning."""
        config_file = tmp_path / "empty_keywords_config.json"
        config_file.write_text(json.dumps({
            "databases": [
                {
                    "keywords": [],  # Empty array
                    "path": "/mock/path/test",
                    "mcp_tool_name": "mcp__leann__search",
                    "description": "Test database"
                }
            ]
        }))
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "test query"},
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(config_file)},
        )
        # Should allow through (no valid databases)
        assert exit_code == 0
        # Should log warning about empty keywords
        assert "keywords" in stderr.lower() or "empty" in stderr.lower()

    def test_relative_path_logs_warning(self, tmp_path):
        """Config entry with relative path should log warning but still work."""
        config_file = tmp_path / "relative_path_config.json"
        config_file.write_text(json.dumps({
            "databases": [
                {
                    "keywords": ["test"],
                    "path": "relative/path/database",  # Should be absolute
                    "mcp_tool_name": "mcp__leann__search",
                    "description": "Test database"
                }
            ]
        }))
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "test query"},
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(config_file)},
        )
        # Should still deny (relative path is a warning, not an error)
        assert exit_code == 2
        # Should log warning about relative path
        assert "path" in stderr.lower() or "absolute" in stderr.lower() or "relative" in stderr.lower()
```
Step 2: Run test to verify it fails
Run: pytest tests/test_hook.py::TestConfigValidation -v
Expected: FAIL (validation not implemented)
Step 3: Write minimal implementation
Add validation function to docsearch.py:
```python
REQUIRED_DATABASE_FIELDS = ["keywords", "path", "mcp_tool_name", "description"]


def validate_database_entry(db: dict, index: int) -> bool:
    """Validate a database entry has all required fields and correct types.

    Returns True if valid, False if invalid (logs warning to stderr).
    Maintains fail-open behavior - warns but allows through when possible.
    """
    # Check required fields are present
    missing = [f for f in REQUIRED_DATABASE_FIELDS if f not in db]
    if missing:
        print(
            f"Warning: Database entry {index} missing required fields: {missing}",
            file=sys.stderr
        )
        return False

    # Validate keywords is a non-empty list
    keywords = db.get("keywords")
    if not isinstance(keywords, list):
        print(
            f"Warning: Database entry {index} 'keywords' must be an array, got {type(keywords).__name__}",
            file=sys.stderr
        )
        return False
    if len(keywords) == 0:
        print(
            f"Warning: Database entry {index} 'keywords' array is empty",
            file=sys.stderr
        )
        return False

    # Warn (but don't fail) for relative paths
    path = db.get("path", "")
    if path and not path.startswith("/"):
        print(
            f"Warning: Database entry {index} 'path' should be absolute, got relative path: {path}",
            file=sys.stderr
        )
        # Continue anyway - relative path might still work
    return True
```
Update find_matching_databases() to skip invalid entries:
```python
def find_matching_databases(query: str, config: dict) -> list[dict]:
    """Find all databases with keywords matching the query."""
    matches = []
    query_lower = query.lower()
    for i, db in enumerate(config.get("databases", [])):
        # Skip invalid database entries
        if not validate_database_entry(db, i):
            continue
        for keyword in db.get("keywords", []):
            # Lookarounds rather than \b so keywords like "c++" still match
            pattern = rf"(?<!\w){re.escape(keyword.lower())}(?!\w)"
            if re.search(pattern, query_lower):
                matches.append(db)
                break
    return matches
```
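One subtlety in the keyword matching: a plain `\b…\b` word-boundary pattern never matches keywords that start or end with non-word characters, because `\b` asserts a word/non-word transition — in `"c++ templates"` the characters on both sides of the trailing `+` are non-word, so there is no boundary there. A lookaround-based pattern handles both ordinary words and symbol-heavy keywords; a standalone sketch:

```python
import re

def keyword_matches(keyword: str, query: str) -> bool:
    # (?<!\w) / (?!\w) behave like \b for ordinary words, but also work
    # when the keyword starts or ends with a symbol such as "+" or "#"
    pattern = rf"(?<!\w){re.escape(keyword.lower())}(?!\w)"
    return re.search(pattern, query.lower()) is not None

print(keyword_matches("c++", "C++ templates tutorial"))      # True
print(keyword_matches("gitlab", "how to configure GitLab"))  # True
print(keyword_matches("git", "gitlab ci configuration"))     # False - no partial-word match
```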
Step 4: Run test to verify it passes
Run: pytest tests/test_hook.py -v
Expected: PASS
Step 5: Commit
git add docsearch.py tests/test_hook.py
git commit -m "feat: add configuration schema validation with type checking"
Task 7a: Session Start State Cleanup (NEW)
Files:
Modify: docsearch.py
Modify: tests/test_hook.py
Step 1: Write the failing test for stale file cleanup
Add to tests/test_hook.py:
```python
class TestSessionStartCleanup:
    """Tests for cleaning stale state files on session start."""

    def test_stale_state_file_cleaned_on_unrelated_query(self, tmp_path):
        """Very old state files should be cleaned up when processing new queries."""
        state_dir = tmp_path / "state"
        state_dir.mkdir()
        # Create multiple stale state files (older than 5 minutes)
        old_timestamp = int(time.time()) - 600  # 10 minutes ago
        stale_file1 = state_dir / "docsearch-state-old-session-1.json"
        stale_file1.write_text(json.dumps({
            "last_denied": {
                "query": "old query 1",
                "allowed_domains": [],
                "blocked_domains": [],
                "timestamp": old_timestamp,
            }
        }))
        stale_file2 = state_dir / "docsearch-state-old-session-2.json"
        stale_file2.write_text(json.dumps({
            "last_denied": {
                "query": "old query 2",
                "allowed_domains": [],
                "blocked_domains": [],
                "timestamp": old_timestamp,
            }
        }))
        # Run a hook call for a new session (triggers cleanup)
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "unrelated query no keywords"},
            "session_id": "new-session",
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={
                **os.environ,
                "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json"),
                "DOCSEARCH_STATE_DIR": str(state_dir),
            },
        )
        # Query should pass through (no keyword match)
        assert exit_code == 0
        # Stale files should be cleaned up
        # Note: This is optional behavior - cleanup runs periodically
        # Test verifies stale files don't interfere with new sessions

    def test_recent_state_file_preserved(self, tmp_path):
        """Recent state files should NOT be cleaned up."""
        state_dir = tmp_path / "state"
        state_dir.mkdir()
        # Create a recent state file (2 minutes ago)
        recent_timestamp = int(time.time()) - 120
        recent_file = state_dir / "docsearch-state-active-session.json"
        recent_file.write_text(json.dumps({
            "last_denied": {
                "query": "gitlab ci",
                "allowed_domains": [],
                "blocked_domains": [],
                "timestamp": recent_timestamp,
            }
        }))
        # Run a hook call for a different session
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "unrelated query"},
            "session_id": "other-session",
        }
        run_hook(
            hook_input,
            env={
                **os.environ,
                "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json"),
                "DOCSEARCH_STATE_DIR": str(state_dir),
            },
        )
        # Recent file should still exist
        assert recent_file.exists()
```
Step 2: Run test to verify it fails
Run: pytest tests/test_hook.py::TestSessionStartCleanup -v
Expected: FAIL (cleanup not implemented)
Step 3: Write minimal implementation
Add cleanup function to docsearch.py:
```python
def cleanup_stale_state_files() -> None:
    """Clean up state files older than the expiry threshold.

    This is a best-effort cleanup that runs periodically to prevent state
    file accumulation. Errors are silently ignored.
    """
    state_dir = get_state_dir()
    if not state_dir.exists():
        return
    try:
        for state_file in state_dir.glob("docsearch-state-*.json"):
            try:
                with open(state_file) as f:
                    state = json.load(f)
                last_denied = state.get("last_denied")
                if last_denied and is_state_expired(last_denied):
                    state_file.unlink()
            except (json.JSONDecodeError, OSError, KeyError):
                # Corrupted or unreadable - remove it
                try:
                    state_file.unlink()
                except OSError:
                    pass
    except OSError:
        pass  # Can't list directory - skip cleanup
```
Add cleanup call at the start of main() (after config loading):
```python
def main() -> int:
    """Main entry point for the hook."""
    # ... existing code ...

    # Load configuration - if missing or invalid, allow through
    config = load_config()
    if config is None:
        return 0

    # Periodically clean up stale state files (best-effort)
    cleanup_stale_state_files()

    # ... rest of main() ...
```
Step 4: Run test to verify it passes
Run: pytest tests/test_hook.py -v
Expected: PASS
Step 5: Commit
git add docsearch.py tests/test_hook.py
git commit -m "feat: add periodic cleanup of stale state files"
Task 6a: Session ID Sanitization (NEW - SECURITY)
Files:
Modify: docsearch.py
Modify: tests/test_hook.py
Step 1: Write the failing test for session ID sanitization
Add to tests/test_hook.py:
```python
class TestSessionIdSanitization:
    """Tests for session ID sanitization to prevent path traversal."""

    def test_session_id_with_path_traversal_is_sanitized(self, tmp_path):
        """Session ID with path traversal characters should be sanitized."""
        state_dir = tmp_path / "state"
        state_dir.mkdir()
        # Attempt path traversal attack
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "gitlab ci setup"},
            "session_id": "../../etc/passwd",  # Malicious session_id
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={
                **os.environ,
                "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json"),
                "DOCSEARCH_STATE_DIR": str(state_dir),
            },
        )
        assert exit_code == 2  # Should still work (deny)
        # State file should be created with sanitized name, NOT traverse paths
        # Should NOT create file at tmp_path/etc/passwd
        assert not (tmp_path / "etc").exists()
        # Should create file with sanitized session_id (special chars replaced)
        state_files = list(state_dir.glob("docsearch-state-*.json"))
        assert len(state_files) == 1
        # Filename should not contain path separators
        assert "/" not in state_files[0].name
        assert ".." not in state_files[0].name

    def test_session_id_with_special_chars_is_sanitized(self, tmp_path):
        """Session ID with special filesystem characters should be sanitized."""
        state_dir = tmp_path / "state"
        state_dir.mkdir()
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {"query": "gitlab ci setup"},
            "session_id": "test<>:\"|?*session",  # Invalid filesystem chars
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={
                **os.environ,
                "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json"),
                "DOCSEARCH_STATE_DIR": str(state_dir),
            },
        )
        assert exit_code == 2
        # State file should be created with sanitized name
        state_files = list(state_dir.glob("docsearch-state-*.json"))
        assert len(state_files) == 1
```
Step 2: Run test to verify it fails
Run: pytest tests/test_hook.py::TestSessionIdSanitization -v
Expected: FAIL (sanitization not implemented)
Step 3: Write minimal implementation
Add sanitization function to docsearch.py:
```python
def sanitize_session_id(session_id: str) -> str:
    """Sanitize session_id to prevent path traversal and invalid filenames.

    Only allows alphanumeric characters, dashes, and underscores. All other
    characters are replaced with underscores.
    """
    return re.sub(r'[^a-zA-Z0-9_-]', '_', session_id)
```
Update get_state_file():
def get_state_file(session_id: str) -> Path:
    """Get the state file path for a session."""
    safe_id = sanitize_session_id(session_id)
    return get_state_dir() / f"docsearch-state-{safe_id}.json"
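As a quick sanity check, the sanitizer above (reproduced here so the snippet is self-contained) maps traversal attempts and invalid filename characters to harmless names while leaving normal session IDs untouched:

```python
import re

def sanitize_session_id(session_id: str) -> str:
    # Replace anything outside [a-zA-Z0-9_-] with an underscore
    return re.sub(r'[^a-zA-Z0-9_-]', '_', session_id)

print(sanitize_session_id("../../etc/passwd"))    # ______etc_passwd
print(sanitize_session_id('test<>:"|?*session'))  # test_______session
print(sanitize_session_id("normal-session_42"))   # normal-session_42 (unchanged)
```

Because the output contains no path separators, the state filename can never escape the state directory.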
Step 4: Run test to verify it passes
Run: pytest tests/test_hook.py -v
Expected: PASS
Step 5: Commit
git add docsearch.py tests/test_hook.py
git commit -m "security: add session ID sanitization to prevent path traversal"
Task 3b: Keywords Element Type Validation (NEW)
Files:
Modify: docsearch.py
Modify: tests/test_hook.py
Step 1: Write the failing test for keyword element type validation
Add to tests/test_hook.py in TestConfigValidation class:
def test_keywords_with_non_string_elements_logs_warning(self, tmp_path):
    """Config entry with non-string keyword elements should log a warning."""
    config_file = tmp_path / "bad_keywords_config.json"
    config_file.write_text(json.dumps({
        "databases": [
            {
                "keywords": ["valid", 123, None, {"nested": "dict"}],
                "path": "/mock/path/test",
                "mcp_tool_name": "mcp__leann__search",
                "description": "Test database"
            }
        ]
    }))
    hook_input = {
        "tool_name": "WebSearch",
        "tool_input": {"query": "valid query"},
    }
    exit_code, stdout, stderr = run_hook(
        hook_input,
        env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(config_file)},
    )
    # Should allow through (invalid entry skipped)
    assert exit_code == 0
    # Should log a warning about non-string elements
    assert "string" in stderr.lower() or "keywords" in stderr.lower()
Step 2: Run test to verify it fails
Run: pytest tests/test_hook.py::TestConfigValidation::test_keywords_with_non_string_elements_logs_warning -v
Expected: FAIL (type validation not implemented)
Step 3: Write minimal implementation
Update validate_database_entry() in docsearch.py:
def validate_database_entry(db: dict, index: int) -> bool:
    """Validate a database entry has all required fields and correct types."""
    # ... existing checks ...

    # Validate all keyword elements are strings
    if not all(isinstance(k, str) for k in keywords):
        print(
            f"Warning: Database entry {index} 'keywords' contains non-string elements",
            file=sys.stderr
        )
        return False

    # ... rest of function ...
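For reference, here is a self-contained sketch of what the finished validator might look like once the required-field checks from Task 3a are folded in. The `REQUIRED_FIELDS` tuple and the exact warning wording are assumptions for illustration, not taken from the implementation:

```python
import sys

# Hypothetical list of required fields, per the config schema in this plan
REQUIRED_FIELDS = ("keywords", "path", "mcp_tool_name", "description")

def validate_database_entry(db: dict, index: int) -> bool:
    """Return True only if the entry has all required fields and valid keywords."""
    for field in REQUIRED_FIELDS:
        if field not in db:
            print(f"Warning: Database entry {index} missing '{field}'", file=sys.stderr)
            return False
    keywords = db["keywords"]
    if not isinstance(keywords, list) or not keywords:
        print(f"Warning: Database entry {index} 'keywords' must be a non-empty array",
              file=sys.stderr)
        return False
    if not all(isinstance(k, str) for k in keywords):
        print(f"Warning: Database entry {index} 'keywords' contains non-string elements",
              file=sys.stderr)
        return False
    return True

good = {"keywords": ["gitlab"], "path": "/db", "mcp_tool_name": "t", "description": "d"}
bad = {"keywords": ["gitlab", 123], "path": "/db", "mcp_tool_name": "t", "description": "d"}
print(validate_database_entry(good, 0))  # True
print(validate_database_entry(bad, 1))   # False
```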
Step 4: Run test to verify it passes
Run: pytest tests/test_hook.py -v
Expected: PASS
Step 5: Commit
git add docsearch.py tests/test_hook.py
git commit -m "feat: validate all keyword elements are strings"
Task 9a: Permission Error Tests (NEW)
Files:
Modify: tests/test_hook.py
Step 1: Write permission error tests
Add to tests/test_hook.py:
class TestPermissionErrors:
    """Tests for permission error handling (fail-open behavior)."""

    def test_unreadable_config_allows_through(self, tmp_path):
        """Unreadable config file should fail open (exit 0)."""
        config_file = tmp_path / "unreadable_config.json"
        config_file.write_text(
            '{"databases": [{"keywords": ["test"], "path": "/test", '
            '"mcp_tool_name": "test", "description": "test"}]}'
        )
        config_file.chmod(0o000)  # No permissions
        try:
            hook_input = {
                "tool_name": "WebSearch",
                "tool_input": {"query": "test query"},
            }
            exit_code, stdout, stderr = run_hook(
                hook_input,
                env={**os.environ, "DOCSEARCH_CONFIG_PATH": str(config_file)},
            )
            # Should fail open
            assert exit_code == 0
        finally:
            config_file.chmod(0o644)  # Restore for cleanup

    def test_unwritable_state_dir_still_denies(self, tmp_path):
        """Unwritable state directory should still deny (state is optional)."""
        state_dir = tmp_path / "state"
        state_dir.mkdir()
        state_dir.chmod(0o555)  # Read-only
        try:
            hook_input = {
                "tool_name": "WebSearch",
                "tool_input": {"query": "gitlab ci setup"},
                "session_id": "test-session",
            }
            exit_code, stdout, stderr = run_hook(
                hook_input,
                env={
                    **os.environ,
                    "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json"),
                    "DOCSEARCH_STATE_DIR": str(state_dir),
                },
            )
            # Should still deny (state write failure is silent)
            assert exit_code == 2
        finally:
            state_dir.chmod(0o755)  # Restore for cleanup
class TestSessionIsolation:
    """Tests for session state isolation between concurrent sessions."""

    def test_different_sessions_have_isolated_state(self, tmp_path):
        """State from session A should not affect session B."""
        state_dir = tmp_path / "state"
        state_dir.mkdir()
        # Create state for session A (previous denial)
        state_file_a = state_dir / "docsearch-state-session-A.json"
        state_file_a.write_text(json.dumps({
            "last_denied": {
                "query": "gitlab ci setup",
                "allowed_domains": [],
                "blocked_domains": [],
                "timestamp": int(time.time()),
            }
        }))
        # Session B with the SAME query should be denied (no escape hatch)
        hook_input = {
            "tool_name": "WebSearch",
            "tool_input": {
                "query": "gitlab ci setup",  # Same query as A's state
                "allowed_domains": [],
                "blocked_domains": [],
            },
            "session_id": "session-B",  # Different session
        }
        exit_code, stdout, stderr = run_hook(
            hook_input,
            env={
                **os.environ,
                "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json"),
                "DOCSEARCH_STATE_DIR": str(state_dir),
            },
        )
        # Session B should be denied (its own first request)
        assert exit_code == 2
        # Session A's state should be unchanged
        state_a = json.loads(state_file_a.read_text())
        assert state_a["last_denied"]["query"] == "gitlab ci setup"
        # Session B should have its own state file
        state_file_b = state_dir / "docsearch-state-session-B.json"
        assert state_file_b.exists()
    def test_session_escape_hatch_only_affects_own_session(self, tmp_path):
        """Escape hatch retry should only work for the session that was denied."""
        state_dir = tmp_path / "state"
        state_dir.mkdir()
        # Create state for session A
        state_file_a = state_dir / "docsearch-state-session-A.json"
        state_file_a.write_text(json.dumps({
            "last_denied": {
                "query": "gitlab ci setup",
                "allowed_domains": [],
                "blocked_domains": [],
                "timestamp": int(time.time()),
            }
        }))
        # Session A retries the same query - should be allowed (escape hatch)
        hook_input_a = {
            "tool_name": "WebSearch",
            "tool_input": {
                "query": "gitlab ci setup",
                "allowed_domains": [],
                "blocked_domains": [],
            },
            "session_id": "session-A",
        }
        exit_code_a, _, _ = run_hook(
            hook_input_a,
            env={
                **os.environ,
                "DOCSEARCH_CONFIG_PATH": str(FIXTURES_DIR / "valid_config.json"),
                "DOCSEARCH_STATE_DIR": str(state_dir),
            },
        )
        assert exit_code_a == 0  # Escape hatch works
        # Session A's state should be cleared
        state_a = json.loads(state_file_a.read_text())
        assert state_a.get("last_denied") is None
Fail-open design: All errors result in allowing WebSearch through
Word boundary matching: Uses \b regex to prevent partial matches
Session isolation: State files named with sanitized session_id
5-minute expiry: Stale state entries are ignored
Set comparison: Domain arrays compared order-independently
Config validation: missing required fields and incorrect field types are logged as warnings (Task 3a)
Validates keywords is a non-empty array
Validates all keyword elements are strings (Task 3b)
Warns on relative paths (but allows)
Skips invalid database entries entirely
Stale file cleanup: Periodic cleanup of expired state files (Task 7a)
Session ID sanitization: Prevents path traversal attacks (Task 6a)
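Several of the behaviors listed above come together in the matching step. A minimal sketch of case-insensitive, word-boundary keyword matching (the `matching_databases` helper name is illustrative, not the implementation's):

```python
import re

def matching_databases(query: str, databases: list[dict]) -> list[dict]:
    """Return every configured database with a keyword in the query.

    Matching is case-insensitive and uses \\b word boundaries, so the
    keyword "gl" matches "gl runner docs" but not "glossary".
    """
    matches = []
    q = query.lower()
    for db in databases:
        for kw in db.get("keywords", []):
            if re.search(rf"\b{re.escape(kw.lower())}\b", q):
                matches.append(db)
                break  # one keyword hit is enough for this database
    return matches

dbs = [{"keywords": ["gitlab", "gl"], "description": "GitLab docs"}]
print(len(matching_databases("How do I set up GitLab CI?", dbs)))  # 1
print(len(matching_databases("glossary of terms", dbs)))           # 0
```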
Testing Commands
# Run all tests
pytest tests/test_hook.py -v
# Run specific test class
pytest tests/test_hook.py::TestKeywordMatching -v
# Run with coverage
pytest tests/test_hook.py -v --cov=docsearch
# Run new validation tests
pytest tests/test_hook.py::TestConfigValidation -v
# Run new cleanup tests
pytest tests/test_hook.py::TestSessionStartCleanup -v
Total Tasks: 18
Phase
Tasks (in execution order)
Status
P0 - Core
2, 3, 4, 6a, 6, 12
0/6 complete
P1 - Enhanced
5, 7, 7a, 3a, 3b, 8, 9, 9a, 9b
0/9 complete
P2 - Polish
1, 10, 11
0/3 complete
Total
18 tasks
0/18 complete
IMPORTANT: Task 6a (Session ID Sanitization) MUST be completed before Task 6 (Session State Management) for security reasons.
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Build a Claude Code PreToolUse hook that intercepts WebSearch calls and redirects documentation queries to local RAG databases with intelligent escape hatch for retries.
Architecture: Python 3.12+ script reads hook input from stdin, matches queries against configured keywords, stores per-session state for retry detection, and outputs structured JSON denial with MCP tool guidance.
Tech Stack: Python 3.12+ stdlib (json, re, sys, pathlib), pytest for testing
cat > docsearch.py << 'EOF'
#!/usr/bin/env python3
# ABOUTME: Claude Code PreToolUse hook that redirects WebSearch to local RAG databases
# ABOUTME: Intercepts documentation queries and routes them to LEANN MCP server
import sys

if __name__ == "__main__":
    # Placeholder - will be implemented via TDD
    sys.exit(0)
EOF
chmod +x docsearch.py
Step 5: Create placeholder test file
cat > tests/test_hook.py << 'EOF'
# ABOUTME: Unit tests for docsearch PreToolUse hook
# ABOUTME: Tests keyword matching, state management, and escape hatch logic
import pytest

# Tests will be added incrementally via TDD
EOF
Expected: FAIL (exit code 0 but test expects proper filtering)
Step 3: Implement minimal pass-through logic
Replace docsearch.py content:
#!/usr/bin/env python3
# ABOUTME: Claude Code PreToolUse hook that redirects WebSearch to local RAG databases
# ABOUTME: Intercepts documentation queries and routes them to LEANN MCP server
import json
import sys

def main():
    try:
        hook_input = json.loads(sys.stdin.read())
        # Pass through non-WebSearch tools
        if hook_input.get("tool_name") != "WebSearch":
            sys.exit(0)
        # Placeholder for WebSearch handling
        sys.exit(0)
    except Exception:
        # Fail open - allow tool through on any error
        sys.exit(0)

if __name__ == "__main__":
    main()
Expected: Tests may pass trivially since current code exits 0 - verify logic is correct
Step 4: Implement config loading with fail-open
Update docsearch.py:
#!/usr/bin/env python3
# ABOUTME: Claude Code PreToolUse hook that redirects WebSearch to local RAG databases
# ABOUTME: Intercepts documentation queries and routes them to LEANN MCP server
import json
import sys
from pathlib import Path

def load_config():
    """Load config file from ~/.claude/hooks/docsearch-config.json

    Returns config dict or None if missing/invalid (fail open)
    """
    config_path = Path.home() / ".claude" / "hooks" / "docsearch-config.json"
    if not config_path.exists():
        return None
    try:
        with open(config_path) as f:
            config = json.load(f)
        # Validate basic structure
        if not isinstance(config.get("databases"), list):
            sys.stderr.write("Invalid config: databases must be an array\n")
            return None
        return config
    except json.JSONDecodeError as e:
        sys.stderr.write(f"Invalid config JSON: {e}\n")
        return None
    except Exception as e:
        sys.stderr.write(f"Error loading config: {e}\n")
        return None

def main():
    try:
        hook_input = json.loads(sys.stdin.read())
        # Pass through non-WebSearch tools
        if hook_input.get("tool_name") != "WebSearch":
            sys.exit(0)
        # Load config - fail open if missing/invalid
        config = load_config()
        if config is None:
            sys.exit(0)
        # Placeholder for keyword matching
        sys.exit(0)
    except Exception:
        # Fail open - allow tool through on any error
        sys.exit(0)

if __name__ == "__main__":
    main()
def test_retry_with_different_domains_denies_again(tmp_path, monkeypatch):
    """Retry with different domain filters should deny again"""
    monkeypatch.setenv("HOME", str(tmp_path))
    config_dir = tmp_path / ".claude" / "hooks"
    config_dir.mkdir(parents=True)
    (config_dir / "docsearch-config.json").write_text(
        Path("tests/fixtures/valid_config.json").read_text()
    )
    # First call with allowed_domains
    result1 = subprocess.run(
        ["python3", "docsearch.py"],
        input=json.dumps({
            "hookEventName": "PreToolUse",
            "tool_name": "WebSearch",
            "tool_input": {
                "query": "gitlab ci setup",
                "allowed_domains": ["docs.gitlab.com"]
            },
            "session_id": "domain-test"
        }),
        capture_output=True,
        text=True
    )
    assert result1.returncode == 2
    # Second call with different allowed_domains
    result2 = subprocess.run(
        ["python3", "docsearch.py"],
        input=json.dumps({
            "hookEventName": "PreToolUse",
            "tool_name": "WebSearch",
            "tool_input": {
                "query": "gitlab ci setup",
                "allowed_domains": ["stackoverflow.com"]
            },
            "session_id": "domain-test"
        }),
        capture_output=True,
        text=True
    )
    assert result2.returncode == 2  # Should deny again (different params)

def test_retry_with_same_domains_allows_through(tmp_path, monkeypatch):
    """Retry with same domain filters should allow through"""
    monkeypatch.setenv("HOME", str(tmp_path))
    config_dir = tmp_path / ".claude" / "hooks"
    config_dir.mkdir(parents=True)
    (config_dir / "docsearch-config.json").write_text(
        Path("tests/fixtures/valid_config.json").read_text()
    )
    hook_input = {
        "hookEventName": "PreToolUse",
        "tool_name": "WebSearch",
        "tool_input": {
            "query": "gitlab ci setup",
            "allowed_domains": ["docs.gitlab.com"],
            "blocked_domains": ["spam.com"]
        },
        "session_id": "domain-test-2"
    }
    # First call should deny
    result1 = subprocess.run(
        ["python3", "docsearch.py"],
        input=json.dumps(hook_input),
        capture_output=True,
        text=True
    )
    assert result1.returncode == 2
    # Second call with same params should allow
    result2 = subprocess.run(
        ["python3", "docsearch.py"],
        input=json.dumps(hook_input),
        capture_output=True,
        text=True
    )
    assert result2.returncode == 0
Step 2: Run tests to verify they pass
pytest tests/test_hook.py -k "domain" -v
Expected: PASS (already implemented in Task 5)
Step 3: Commit tests
git add tests/test_hook.py
git commit -m "test: add domain filtering test coverage for state comparison"
Task 8: Error Handling Edge Cases
Files:
Modify: tests/test_hook.py
Step 1: Write tests for error scenarios
Add to tests/test_hook.py:
def test_corrupted_state_file_continues(tmp_path, monkeypatch):
    """Corrupted state file should be treated as no previous denial"""
    monkeypatch.setenv("HOME", str(tmp_path))
    config_dir = tmp_path / ".claude" / "hooks"
    config_dir.mkdir(parents=True)
    (config_dir / "docsearch-config.json").write_text(
        Path("tests/fixtures/valid_config.json").read_text()
    )
    # Create corrupted state file
    (config_dir / "docsearch-state-corrupt.json").write_text("invalid{json")
    hook_input = {
        "hookEventName": "PreToolUse",
        "tool_name": "WebSearch",
        "tool_input": {"query": "gitlab ci"},
        "session_id": "corrupt"
    }
    result = subprocess.run(
        ["python3", "docsearch.py"],
        input=json.dumps(hook_input),
        capture_output=True,
        text=True
    )
    # Should deny (not crash)
    assert result.returncode == 2

def test_invalid_hook_input_fails_open(tmp_path, monkeypatch):
    """Invalid hook input JSON should allow through"""
    result = subprocess.run(
        ["python3", "docsearch.py"],
        input="invalid json{",
        capture_output=True,
        text=True
    )
    assert result.returncode == 0

def test_missing_query_field_fails_open(tmp_path, monkeypatch):
    """Missing query field should allow through"""
    monkeypatch.setenv("HOME", str(tmp_path))
    config_dir = tmp_path / ".claude" / "hooks"
    config_dir.mkdir(parents=True)
    (config_dir / "docsearch-config.json").write_text(
        Path("tests/fixtures/valid_config.json").read_text()
    )
    hook_input = {
        "hookEventName": "PreToolUse",
        "tool_name": "WebSearch",
        "tool_input": {},  # No query field
        "session_id": "no-query"
    }
    result = subprocess.run(
        ["python3", "docsearch.py"],
        input=json.dumps(hook_input),
        capture_output=True,
        text=True
    )
    assert result.returncode == 0
Step 2: Run tests to verify they pass
pytest tests/test_hook.py -k "corrupted or invalid or missing" -v
Expected: PASS (already implemented with fail-open strategy)
Step 3: Commit tests
git add tests/test_hook.py
git commit -m "test: add error handling test coverage for edge cases"
Task 9: Documentation and Installation Instructions
Files:
Modify: README.md
Step 1: Write comprehensive README
Replace README.md:
# DocSearch Hook for Claude Code
A Claude Code PreToolUse hook that intelligently redirects documentation-related WebSearch queries to local RAG databases via the LEANN MCP server.
## Features

- **Automatic Search Interception**: Detects documentation queries and redirects them to local RAG databases
- **Intelligent Escape Hatch**: Allows Claude to retry web searches if RAG results are insufficient
- **Multi-Database Support**: Query multiple documentation sources in parallel
- **Session Isolation**: Per-session state management prevents cross-session interference
- **Fail-Open Design**: Never breaks Claude's functionality - errors allow searches through

## Prerequisites

1. **Python 3.12+** (tested with Python 3.14)
2. **LEANN** installed and configured
3. **LEANN MCP server** configured in Claude Code's MCP settings
4. **RAG databases** built using LEANN tools

## Installation

### 1. Install the Hook Script

```bash
# Clone or download this repository
git clone https://github.com/yourusername/docsearch-hook.git
cd docsearch-hook

# Copy hook to Claude Code hooks directory
mkdir -p ~/.claude/hooks/PreToolUse
cp docsearch.py ~/.claude/hooks/PreToolUse/docsearch.py
chmod +x ~/.claude/hooks/PreToolUse/docsearch.py
```
### 2. Create Configuration File

```bash
# Copy example config
cp config.example.json ~/.claude/hooks/docsearch-config.json
# Edit with your database paths and keywords
```

Example config structure:

```json
{
  "databases": [
    {
      "keywords": ["gitlab", "gl", "gitlab-ci"],
      "path": "/Users/yourname/.leann/databases/gitlab",
      "mcp_tool_name": "mcp__leann__search",
      "description": "GitLab documentation from docs.gitlab.com"
    }
  ]
}
```

### 3. Verify LEANN MCP Server Configuration

Ensure your `~/.claude/mcp-config.json` includes the LEANN server.

### 4. Verify the Installation

```bash
# Run tests
pytest tests/

# Start Claude Code and try a query
# Example: "How do I configure GitLab CI runners?"
# The hook should intercept and suggest using the RAG database
```
Each database entry in `docsearch-config.json` supports these fields:

- `keywords` (required): Array of strings to match in queries (case-insensitive, word-boundary matching)
- `path` (required): Absolute path to the LEANN database directory
- `mcp_tool_name` (required): Exact MCP tool name (usually `mcp__leann__search`)
- `description` (required): Description shown to Claude in the denial context
## How It Works
User asks: "How to configure GitLab CI runners?"
↓
Hook detects "gitlab" keyword → Denies WebSearch
↓
Claude receives denial + context about RAG database
↓
Claude calls mcp__leann__search with GitLab database
↓
If successful → User gets RAG-based answer
If unsuccessful → Claude retries WebSearch → Hook allows through
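The denial step in this flow writes structured JSON to stdout before exiting with code 2. A rough sketch of building that payload, using the `permissionDecision` and `additionalContext` field names from this design; the exact envelope Claude Code expects may differ by version, and the `build_denial` helper name is illustrative:

```python
import json

def build_denial(databases: list[dict]) -> dict:
    """Assemble the denial payload that points Claude at local RAG databases."""
    context = "; ".join(
        f"Search locally first with {db['mcp_tool_name']} "
        f"against {db['path']} ({db['description']})"
        for db in databases
    )
    return {"permissionDecision": "deny", "additionalContext": context}

payload = build_denial([{
    "mcp_tool_name": "mcp__leann__search",
    "path": "/Users/yourname/.leann/databases/gitlab",
    "description": "GitLab documentation from docs.gitlab.com",
}])
print(json.dumps(payload, indent=2))
```

The hook would emit this JSON and then `sys.exit(2)` so Claude Code treats the call as denied.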
## State Management

- **Per-session state**: Each Claude Code session has isolated state in `~/.claude/hooks/docsearch-state-{session_id}.json`
- **Escape hatch**: If Claude retries the exact same search (same query and domain filters), the hook allows it through
- **Automatic cleanup**: State is cleared after a successful retry
## Testing

```bash
# Run all tests
pytest tests/ -v

# Run specific test categories
pytest tests/ -k "keyword" -v   # Keyword matching tests
pytest tests/ -k "retry" -v     # Escape hatch tests
pytest tests/ -k "session" -v   # Session isolation tests

# Run with coverage
pytest tests/ --cov=docsearch --cov-report=html
```
## Troubleshooting

- State files location: `~/.claude/hooks/docsearch-state-*.json`
- Delete stale state files manually if needed
- Each session creates its own state file
## Development

### Running Tests During Development

```bash
# Install pytest
pip install pytest

# Run tests with output
pytest tests/ -v -s

# Run specific test
pytest tests/test_hook.py::test_single_keyword_match_denies -v
```
### Adding New Databases

1. Build a LEANN database using LEANN tools
2. Add an entry to `~/.claude/hooks/docsearch-config.json`
# Integration Tests
These tests require a real LEANN MCP server configuration and database.
## Setup

1. Ensure LEANN is installed
2. Build a test database
3. Configure the MCP server in Claude Code
4. Run integration tests manually (not in CI)

## Running

```bash
# Skip in normal test runs
pytest tests/test_hook.py

# Run integration tests manually
pytest tests/integration/ -v
```

## Note

Integration tests are provided as examples and documentation.
They require manual setup and are not run in automated CI.
**Step 3: Create example integration test**
Create `tests/integration/test_full_flow.py`:
```python
# ABOUTME: Integration tests for full docsearch hook flow with real LEANN MCP server
# ABOUTME: Requires manual setup - not run in automated CI
import json
import subprocess
import pytest
# Mark all tests in this file as integration tests
pytestmark = pytest.mark.integration
@pytest.mark.skip(reason="Requires manual LEANN setup")
def test_full_flow_with_real_mcp():
"""
End-to-end test with real LEANN MCP server
Manual setup required:
1. Build LEANN database for a test documentation site
2. Configure ~/.claude/hooks/docsearch-config.json
3. Ensure LEANN MCP server is running
4. Update this test with your actual config
"""
# This is a template - customize for your setup
hook_input = {
"hookEventName": "PreToolUse",
"tool_name": "WebSearch",
"tool_input": {"query": "your test query here"},
"session_id": "integration-test"
}
result = subprocess.run(
["python3", "docsearch.py"],
input=json.dumps(hook_input),
capture_output=True,
text=True
)
# First call should deny
assert result.returncode == 2
# At this point, you would manually verify:
# 1. Claude Code receives the denial context
# 2. Claude calls the MCP tool
# 3. MCP returns results or fails
# 4. If MCP fails, Claude retries WebSearch
# 5. Hook allows the retry through
# Retry should allow through
result2 = subprocess.run(
["python3", "docsearch.py"],
input=json.dumps(hook_input),
capture_output=True,
text=True
)
    assert result2.returncode == 0
```
Step 4: Update pytest configuration
Create pytest.ini:
[pytest]
markers =
    integration: marks tests as integration tests (deselect with '-m "not integration"')

# By default, skip integration tests
addopts = -m "not integration"
Step 5: Commit integration test setup
git add tests/integration/ pytest.ini
git commit -m "test: add integration test framework and documentation"