Running Claude Code with local LLMs

Important

This is experimental. Works on my machine but may need adjustments for your environment. You need a decent GPU with 8GB+ VRAM and ~5GB disk space. Local models are less capable and slower than remote models.

Use Claude Code's tooling with a local model instead of Anthropic's API.

How it works

Normal Claude Code:

Claude Code ──────────────────────────────→ Anthropic API (api.anthropic.com)
                                                    ↓
                                              Claude Model

With local model:

Claude Code → claude-code-router [1] → llama-server [2] → Qwen2.5-Coder [3]
                  (port 3456)             (port 8080)         (local GPU)

claude-code-router intercepts requests and translates Anthropic's Messages API to OpenAI-compatible format. llama.cpp runs quantized models with GPU acceleration.
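
As a concrete sketch of that translation: Claude Code speaks Anthropic's Messages format to the router, which forwards an OpenAI-style request like the one below to llama-server. The payloads here are illustrative, not captured traffic; you can try the second request yourself once llama-server is running (Step 6).

# Anthropic-style request from Claude Code to the router (sketch):
#   POST http://127.0.0.1:3456/v1/messages
#   {"model": "...", "max_tokens": 1024,
#    "messages": [{"role": "user", "content": "Hello"}]}

# OpenAI-style request the router forwards to llama-server (sketch):
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-7b-instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'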

Once set up, everything runs locally with no API calls or internet connection required (you can work on a plane, for example). The tradeoff: local models are much less capable than Claude, resource-hungry (a decent GPU with 8GB+ VRAM), and slower (~50 tokens/sec).


Step 1: install CUDA toolkit

Fedora 40+

# Add NVIDIA CUDA repo [4]
sudo dnf config-manager --add-repo \
  https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo

# Install CUDA 12.8 (supports GCC 14)
sudo dnf install -y cuda-nvcc-12-8 cuda-cudart-devel-12-8 libcublas-devel-12-8
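
To check that the toolkit landed where the build step expects it (adjust the path if you installed a different CUDA version):

/usr/local/cuda-12.8/bin/nvcc --version  # should report CUDA 12.8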

macOS (Apple Silicon)

No CUDA needed. llama.cpp uses Metal automatically.

brew install cmake
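
Optionally, confirm the GPU reports Metal support (a quick sanity check, not required for the build):

system_profiler SPDisplaysDataType | grep -i metal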

Step 2: clone and build llama.cpp

cd ~/development  # or wherever you keep projects
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Fedora (NVIDIA GPU)

export PATH=/usr/local/cuda-12.8/bin:$PATH
cmake -B build \
  -DGGML_CUDA=ON \
  -DCUDAToolkit_ROOT=/usr/local/cuda-12.8 \
  -DCMAKE_CUDA_ARCHITECTURES=86  # RTX 30xx series
cmake --build build -j4

macOS (Metal)

cmake -B build -DGGML_METAL=ON
cmake --build build -j$(sysctl -n hw.ncpu)

Binary location: ~/development/llama.cpp/build/bin/llama-server
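
A quick smoke test that the binary runs (the exact output varies by llama.cpp version):

~/development/llama.cpp/build/bin/llama-server --version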


Step 3: download model

mkdir -p ~/models
curl -L -o ~/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
  "https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/resolve/main/qwen2.5-coder-7b-instruct-q4_k_m.gguf"

~4.5GB download. Q4_K_M quantization fits in 8GB VRAM.
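
Verify the file downloaded completely before starting the server; a truncated download is the most common cause of model load failures:

ls -lh ~/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf  # expect roughly 4.5GB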


Step 4: install claude-code-router

npm install -g @musistudio/claude-code-router --prefix ~/.local
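
With --prefix ~/.local, the package lands under ~/.local/lib/node_modules. Confirm the CLI entry point used in Step 6 exists:

ls ~/.local/lib/node_modules/@musistudio/claude-code-router/dist/cli.js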

Step 5: configure router

Create ~/.claude-code-router/config.json:

mkdir -p ~/.claude-code-router
cat > ~/.claude-code-router/config.json << 'EOF'
{
  "LOG": true,
  "API_TIMEOUT_MS": 600000,
  "Providers": [
    {
      "name": "llama",
      "api_base_url": "http://127.0.0.1:8080/v1/chat/completions",
      "api_key": "not-needed",
      "models": ["qwen2.5-coder-7b-instruct"]
    }
  ],
  "Router": {
    "default": "llama,qwen2.5-coder-7b-instruct",
    "background": "llama,qwen2.5-coder-7b-instruct"
  }
}
EOF
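
A heredoc typo (a stray comma, an unclosed brace) will make the router fail at startup, so it's worth validating the JSON before continuing, using node, which is already installed for the router:

node -e 'JSON.parse(require("fs").readFileSync(process.env.HOME + "/.claude-code-router/config.json", "utf8")); console.log("config.json OK")'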

Step 6: start services

Terminal 1: llama-server

Fedora:

export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
~/development/llama.cpp/build/bin/llama-server \
  --model ~/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
  --port 8080 \
  --n-gpu-layers 99 \
  --ctx-size 32768

macOS:

~/development/llama.cpp/build/bin/llama-server \
  --model ~/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
  --port 8080 \
  --n-gpu-layers 99 \
  --ctx-size 32768
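
Before starting the router, check that llama-server is up. Recent llama.cpp builds expose a /health endpoint; if yours doesn't, sending a request to /v1/chat/completions (see the sketch under "How it works") serves the same purpose:

curl -s http://127.0.0.1:8080/health  # should return a small JSON status object once the model is loaded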

Terminal 2: claude-code-router

node ~/.local/lib/node_modules/@musistudio/claude-code-router/dist/cli.js start
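
To confirm the full chain before launching Claude Code, you can send an Anthropic-style request to the router yourself. This is a sketch, assuming the router exposes the Messages endpoint at /v1/messages (which is what Claude Code calls); the reply should be generated by the local model:

curl -s http://127.0.0.1:3456/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: not-needed" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "qwen2.5-coder-7b-instruct",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": "Say hello"}]
  }'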

Step 7: run Claude Code

export ANTHROPIC_AUTH_TOKEN="no-key-needed"
export ANTHROPIC_BASE_URL="http://127.0.0.1:3456"
export API_TIMEOUT_MS="600000"
claude
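
Optionally, wrap the exports in a small launcher so you don't have to set them in every shell. The script name and location here are my own choice, assuming ~/.local/bin is on your PATH:

mkdir -p ~/.local/bin
cat > ~/.local/bin/claude-local << 'EOF'
#!/usr/bin/env bash
# Point Claude Code at the local router instead of api.anthropic.com
export ANTHROPIC_AUTH_TOKEN="no-key-needed"
export ANTHROPIC_BASE_URL="http://127.0.0.1:3456"
export API_TIMEOUT_MS="600000"
exec claude "$@"
EOF
chmod +x ~/.local/bin/claude-local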
