Important
This is experimental. Works on my machine but may need adjustments for your environment. You need a decent GPU with 8GB+ VRAM and ~5GB disk space. Local models are less capable and slower than remote models.
Use Claude Code's tooling with a local model instead of Anthropic's API.
Normal Claude Code:
Claude Code ──────────────────────────────→ Anthropic API (api.anthropic.com)
                                                  ↓
                                            Claude Model
With local model:
Claude Code → claude-code-router [1] → llama-server [2] → Qwen2.5-Coder [3]
              (port 3456)              (port 8080)        (local GPU)
claude-code-router intercepts requests and translates Anthropic's Messages API to OpenAI-compatible format. llama.cpp runs quantized models with GPU acceleration.
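Roughly, the translation looks like this. Once the servers from the steps below are running, you can exercise both hops by hand (the payloads here are trimmed illustrations, not the exact requests the tools emit):

# Anthropic-style request, as Claude Code sends it to the router
# (the model name is a placeholder; the Router config below decides where it goes):
curl -s http://127.0.0.1:3456/v1/messages \
  -H 'content-type: application/json' \
  -d '{"model": "claude-sonnet", "max_tokens": 64,
       "messages": [{"role": "user", "content": "Say hi"}]}'

# What reaches llama-server, in OpenAI chat-completions form:
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{"model": "qwen2.5-coder-7b-instruct", "max_tokens": 64,
       "messages": [{"role": "user", "content": "Say hi"}]}'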
Once set up, everything runs locally: no API calls, no internet connection (you can work on a plane, for example). The tradeoff: local models are much less capable than Claude, resource-hungry (a decent GPU with 8GB+ VRAM), and slower (~50 tokens/sec).
Fedora:
# Add NVIDIA CUDA repo [4]
sudo dnf config-manager --add-repo \
https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
# Install CUDA 12.8 (supports GCC 14)
sudo dnf install -y cuda-nvcc-12-8 cuda-cudart-devel-12-8 libcublas-devel-12-8

macOS:
No CUDA needed. llama.cpp uses Metal automatically.
brew install cmake

cd ~/development # or wherever you keep projects
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Fedora:
export PATH=/usr/local/cuda-12.8/bin:$PATH
cmake -B build \
-DGGML_CUDA=ON \
-DCUDAToolkit_ROOT=/usr/local/cuda-12.8 \
-DCMAKE_CUDA_ARCHITECTURES=86 # RTX 30xx series
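CMAKE_CUDA_ARCHITECTURES has to match your GPU; 86 covers the RTX 30xx series. If you're on a different card, recent NVIDIA drivers can report the compute capability directly (drop the dot: 8.9 becomes 89):

nvidia-smi --query-gpu=compute_cap --format=csv,noheader   # e.g. prints "8.6" for an RTX 3080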
cmake --build build -j4

macOS:
cmake -B build -DGGML_METAL=ON
cmake --build build -j$(sysctl -n hw.ncpu)

Binary location: ~/development/llama.cpp/build/bin/llama-server
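A quick way to confirm the build produced a working binary; this should just print version and build info:

~/development/llama.cpp/build/bin/llama-server --version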
mkdir -p ~/models
curl -L -o ~/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
"https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/resolve/main/qwen2.5-coder-7b-instruct-q4_k_m.gguf"~4.5GB download. Q4_K_M quantization fits in 8GB VRAM.
npm install -g @musistudio/claude-code-router --prefix ~/.local

Create ~/.claude-code-router/config.json:
mkdir -p ~/.claude-code-router
cat > ~/.claude-code-router/config.json << 'EOF'
{
  "LOG": true,
  "API_TIMEOUT_MS": 600000,
  "Providers": [
    {
      "name": "llama",
      "api_base_url": "http://127.0.0.1:8080/v1/chat/completions",
      "api_key": "not-needed",
      "models": ["qwen2.5-coder-7b-instruct"]
    }
  ],
  "Router": {
    "default": "llama,qwen2.5-coder-7b-instruct",
    "background": "llama,qwen2.5-coder-7b-instruct"
  }
}
EOF
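A stray character in that heredoc means invalid JSON that only shows up later as confusing router errors; cheap to check now:

python3 -m json.tool ~/.claude-code-router/config.json   # prints the parsed config, or a parse error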
Fedora:
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
~/development/llama.cpp/build/bin/llama-server \
--model ~/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
--port 8080 \
--n-gpu-layers 99 \
--ctx-size 32768

macOS:
~/development/llama.cpp/build/bin/llama-server \
--model ~/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
--port 8080 \
--n-gpu-layers 99 \
--ctx-size 32768

node ~/.local/lib/node_modules/@musistudio/claude-code-router/dist/cli.js start

export ANTHROPIC_AUTH_TOKEN="no-key-needed"
export ANTHROPIC_BASE_URL="http://127.0.0.1:3456"
export API_TIMEOUT_MS="600000"
claude
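If claude hangs or errors, check the chain from the bottom up. llama.cpp's server has a /health endpoint, and any HTTP response at all from port 3456 at least confirms the router process is listening:

curl http://127.0.0.1:8080/health                                  # model loaded and ready?
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:3456/    # router answering at all?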