Important
This is experimental. Works on my machine but may need adjustments for your environment. You need a decent GPU with 8GB+ VRAM and ~5GB disk space. Local models are less capable and slower than remote models.
Use Claude Code's tooling with a local model instead of Anthropic's API.
Normal Claude Code:
Claude Code ──────────────────────────────→ Anthropic API (api.anthropic.com)
                                                  ↓
                                            Claude Model
With local model:
Claude Code → claude-code-router [1] → llama-server [2] → Qwen2.5-Coder [3]
              (port 3456)              (port 8080)        (local GPU)
claude-code-router intercepts requests and translates Anthropic's Messages API to OpenAI-compatible format. llama.cpp runs quantized models with GPU acceleration.
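Roughly, the translation looks like this. Once the servers from the steps below are running, you can exercise both hops by hand (the payloads here are trimmed illustrations, not the exact requests the tools emit):

# Anthropic-style request, as Claude Code sends it to the router
# (the model name is a placeholder; the Router config below decides where it goes):
curl -s http://127.0.0.1:3456/v1/messages \
  -H 'content-type: application/json' \
  -d '{"model": "claude-sonnet", "max_tokens": 64,
       "messages": [{"role": "user", "content": "Say hi"}]}'

# What reaches llama-server, in OpenAI chat-completions form:
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{"model": "qwen2.5-coder-7b-instruct", "max_tokens": 64,
       "messages": [{"role": "user", "content": "Say hi"}]}'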
Once set up, everything runs locally: no API calls, no internet connection (you can work on a plane, for example). The tradeoff: local models are much less capable than Claude, resource-hungry (a decent GPU with 8GB+ VRAM), and slower (~50 tokens/sec).
Fedora:
# Add NVIDIA CUDA repo [4]
sudo dnf config-manager --add-repo \
https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
# Install CUDA 12.8 (supports GCC 14)
sudo dnf install -y cuda-nvcc-12-8 cuda-cudart-devel-12-8 libcublas-devel-12-8

macOS:
No CUDA needed. llama.cpp uses Metal automatically.
brew install cmake

cd ~/development # or wherever you keep projects
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Fedora:
export PATH=/usr/local/cuda-12.8/bin:$PATH
cmake -B build \
-DGGML_CUDA=ON \
-DCUDAToolkit_ROOT=/usr/local/cuda-12.8 \
-DCMAKE_CUDA_ARCHITECTURES=86 # RTX 30xx series
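CMAKE_CUDA_ARCHITECTURES has to match your GPU; 86 covers the RTX 30xx series. If you're on a different card, recent NVIDIA drivers can report the compute capability directly (drop the dot: 8.9 becomes 89):

nvidia-smi --query-gpu=compute_cap --format=csv,noheader   # e.g. prints "8.6" for an RTX 3080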
cmake --build build -j4

macOS:
cmake -B build -DGGML_METAL=ON
cmake --build build -j$(sysctl -n hw.ncpu)

Binary location: ~/development/llama.cpp/build/bin/llama-server
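A quick way to confirm the build produced a working binary; this should just print version and build info:

~/development/llama.cpp/build/bin/llama-server --version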
mkdir -p ~/models
curl -L -o ~/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
"https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/resolve/main/qwen2.5-coder-7b-instruct-q4_k_m.gguf"~4.5GB download. Q4_K_M quantization fits in 8GB VRAM.
npm install -g @musistudio/claude-code-router --prefix ~/.local

Create ~/.claude-code-router/config.json:
mkdir -p ~/.claude-code-router
cat > ~/.claude-code-router/config.json << 'EOF'
{
  "LOG": true,
  "API_TIMEOUT_MS": 600000,
  "Providers": [
    {
      "name": "llama",
      "api_base_url": "http://127.0.0.1:8080/v1/chat/completions",
      "api_key": "not-needed",
      "models": ["qwen2.5-coder-7b-instruct"]
    }
  ],
  "Router": {
    "default": "llama,qwen2.5-coder-7b-instruct",
    "background": "llama,qwen2.5-coder-7b-instruct"
  }
}
EOF
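A stray character in that heredoc means invalid JSON that only shows up later as confusing router errors; cheap to check now:

python3 -m json.tool ~/.claude-code-router/config.json   # prints the parsed config, or a parse error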
Fedora:
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
~/development/llama.cpp/build/bin/llama-server \
--model ~/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
--port 8080 \
--n-gpu-layers 99 \
--ctx-size 32768

macOS:
~/development/llama.cpp/build/bin/llama-server \
--model ~/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
--port 8080 \
--n-gpu-layers 99 \
--ctx-size 32768

node ~/.local/lib/node_modules/@musistudio/claude-code-router/dist/cli.js start

export ANTHROPIC_AUTH_TOKEN="no-key-needed"
export ANTHROPIC_BASE_URL="http://127.0.0.1:3456"
export API_TIMEOUT_MS="600000"
claude
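If claude hangs or errors, check the chain from the bottom up. llama.cpp's server has a /health endpoint, and any HTTP response at all from port 3456 at least confirms the router process is listening:

curl http://127.0.0.1:8080/health                                  # model loaded and ready?
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:3456/    # router answering at all?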