Full guide for Pop!_OS with 32 GB RAM + Core Ultra 9. Everything runs locally, zero cloud calls.
Path 1: Ollama + OpenCode on the CPU

- Works immediately, great reliability
- Ollama as the local model server, OpenCode as the agent on top
- Lowest-friction route — 7B-9B models in 4-bit run well with 32 GB RAM
- This is your reliable baseline. Get this working first.
Path 2: llama.cpp with Intel's SYCL backend on the Arc GPU (experimental)

- For Intel GPUs, the most "official" route is llama.cpp's SYCL backend
- Supports discrete Arc GPUs and the integrated Arc graphics in Core Ultra chips
- Reality check: performance varies a lot by driver + kernel + model + quant. Some people report it's great, others report it's disappointing depending on stack maturity
- Worth trying, but keep CPU as your fallback
Recommended approach: Get Path 1 fully working first. Then experiment with Path 2 if you want faster inference. You can always fall back to CPU.
curl -fsSL https://ollama.com/install.sh | sh

Verify installation:
ollama --version

Start the Ollama server:

ollama serve

This runs in the foreground. Leave this terminal open, or run it in the background:
ollama serve &

To have Ollama start automatically on boot (systemd):
sudo systemctl enable ollama
sudo systemctl start ollama

Check it's running:
curl http://localhost:11434/api/tags

You should get a JSON response (empty models list if fresh install).
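If you want the JSON readable, you can pipe it through Python's built-in formatter (a small convenience, assuming `python3` is available, which Pop!_OS ships by default):

```bash
# Same check, pretty-printed; a fresh install shows an empty "models" list
curl -s http://localhost:11434/api/tags | python3 -m json.tool
```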
ollama pull qwen2.5:7b

This downloads ~4.7 GB. Wait for it to complete.

Optional alternatives:
ollama pull llama3.1:8b # more general-purpose
ollama pull deepseek-coder:6.7b # code-focused
ollama pull codellama:7b       # Meta's code model

Test that generation works:

ollama run qwen2.5:7b "Write a hello world in Python"

You should see a Python snippet in the output. (Running `ollama run qwen2.5:7b` without a prompt starts an interactive session instead; press Ctrl+D to exit it.)
Confirm the model is listed:

ollama list

Install OpenCode:

curl -fsSL https://opencode.ai/install | bash

Restart your shell (or source your profile):
exec $SHELL

Verify:
opencode --version

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1.
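Before wiring up OpenCode, you can hit that endpoint directly to confirm chat completions work end to end. A minimal sketch, assuming the qwen2.5:7b model pulled earlier:

```bash
# One-off request against Ollama's OpenAI-compatible chat endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5:7b",
        "messages": [{"role": "user", "content": "Reply with the single word: pong"}]
      }'
```

If a JSON completion comes back, OpenCode only needs to be pointed at the same endpoint, which is what the environment variables below do.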
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=ollama
export OPENAI_MODEL=qwen2.5:7b

Add to ~/.bashrc or ~/.zshrc:
# OpenCode + Ollama (local LLM)
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=ollama
export OPENAI_MODEL=qwen2.5:7b

Then reload:
source ~/.bashrc   # or: source ~/.zshrc

Create ~/.config/opencode/opencode.json:
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama (local)",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "qwen2.5:7b": {
          "name": "Qwen 2.5 7B"
        }
      }
    }
  },
  "model": "ollama/qwen2.5:7b"
}

Note:
`provider` must be an object (not a string). Model references use the `provider/model` format. If tool calls aren't working, try increasing `num_ctx` in Ollama to 16k-32k.
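One way to raise `num_ctx` without touching every request is to bake it into a model variant with a Modelfile. A minimal sketch, assuming the qwen2.5:7b model pulled earlier; the `qwen2.5-16k` name is just an example label:

```bash
# Create a variant of qwen2.5:7b with a 16k context window
cat > /tmp/Modelfile.qwen-16k <<'EOF'
FROM qwen2.5:7b
PARAMETER num_ctx 16384
EOF
ollama create qwen2.5-16k -f /tmp/Modelfile.qwen-16k
ollama list   # the new variant should appear alongside qwen2.5:7b
```

If you switch to the variant, reference its name in opencode.json and OPENAI_MODEL. With the config in place, try OpenCode on a real project: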
cd ~/your-project
opencode

You should see OpenCode start up and connect to your local Ollama instance.
Inside OpenCode, try:
> explain this codebase
> list all files
> edit main.py — add a docstring to the main function
If it responds coherently and proposes edits, the setup is working.
# Check Ollama is running
curl http://localhost:11434/api/tags
# If not running, start it
ollama serve

# List available models
ollama list
# Make sure OPENAI_MODEL matches exactly
echo $OPENAI_MODEL

If responses are slow:

- 7B models on CPU can take 10-30s per response
- Check if Ollama is using the GPU: `ollama ps` (shows VRAM usage)
- For the Intel Arc iGPU, Ollama may need additional setup for GPU offloading
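If you installed the systemd service, the Ollama startup log usually notes whether a GPU was detected; a quick way to check (assumes the systemd setup from the install step):

```bash
# Look for GPU/compute detection messages near service startup
journalctl -u ollama --no-pager | tail -n 50
```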
If your settings don't seem to be picked up, check the environment variables:

# Verify they're set
env | grep OPENAI
# Should show:
# OPENAI_BASE_URL=http://localhost:11434/v1
# OPENAI_API_KEY=ollama
# OPENAI_MODEL=qwen2.5:7b

With 32 GB RAM + Core Ultra 9:
| Model | Size | Strength | Notes |
|---|---|---|---|
| qwen2.5:7b | ~4.7 GB | Code edits, instruction following | Best default ⭐ |
| llama3.1:8b | ~4.7 GB | General purpose | Good all-rounder |
| deepseek-coder:6.7b | ~3.8 GB | Code generation | Code-specialized |
| codellama:7b | ~3.8 GB | Code completion | Good for fill-in tasks |
- 7B models — fast, fit easily in 32 GB RAM
- 13B models (4-bit) — usable but noticeably slower
- < 3B models — avoid for OpenCode; they struggle with multi-file edits
What a local 7B handles well:

- Single-file edits and explanations
- Small refactors (1-3 files)
- Code review and suggestions
- "Edit these 2 files" style tasks
Where it struggles:

- Long multi-step plans across many files
- Very large refactors in one shot
- Complex reasoning-heavy architecture decisions
Tips for better results:

- Scope tasks narrowly
- Ask it to explain before editing
- Review diffs before accepting
- Break big changes into small steps
- Ollama models typically have 4K-8K context (some support 32K+)
- OpenCode chunks intelligently, but large repos need narrower questions
- To extend context: set `num_ctx` in a Modelfile (see the sketch earlier) or, inside an interactive `ollama run qwen2.5:7b` session, run `/set parameter num_ctx 16384`
Where local models shine:

- Private repos (work code, side projects)
- Offline or VPN-restricted environments
- Fast "edit → test → edit" loops
- Predictable behavior and zero token cost
Where a cloud model is still the better tool:

- Massive refactors across many subsystems
- Extremely long context (RFCs + many files at once)
- Complex reasoning-heavy design work
This is the experimental path. Only attempt after Path 1 is solid.
- Intel's official GPU compute path for llama.cpp
- Can offload model layers to the Arc iGPU for faster token generation
- Potentially 2-5x speedup over CPU-only for supported models
# Install Intel GPU drivers (if not already present)
sudo apt update
sudo apt install -y intel-opencl-icd intel-level-zero-gpu level-zero
# Verify GPU is detected
sudo apt install -y clinfo
clinfo | head -20
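If the iGPU is visible, an Intel platform and device should appear in that output. A narrower check, if you prefer (the grep pattern is a convenience filter, not an exact guarantee of field names across clinfo versions):

```bash
# Should list an Intel OpenCL platform and the Arc graphics device
clinfo | grep -iE 'platform name|device name'
```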
# Add Intel repo
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] \
https://apt.repos.intel.com/oneapi all main" \
| sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install -y intel-oneapi-base-toolkit
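Before building, it's worth confirming the GPU is visible as a SYCL device; `sycl-ls` ships with the oneAPI toolkit:

```bash
source /opt/intel/oneapi/setvars.sh
sycl-ls   # expect a Level Zero / OpenCL entry for the Intel Arc graphics
```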
# Source the oneAPI environment
source /opt/intel/oneapi/setvars.sh
# Clone and build
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j$(nproc)

# Example: Qwen 2.5 7B in Q4_K_M quantization
# Download from HuggingFace (use huggingface-cli or wget)
mkdir -p ~/models
cd ~/models
# Download the GGUF file for your chosen model
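As one concrete sketch of the download step using `huggingface-cli`: the repo and file names below are assumptions, so check the model page for the actual GGUF file listing and adjust the path used in the server command to whatever you downloaded.

```bash
# Example only: verify repo/filename on Hugging Face before downloading
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen2.5-7B-Instruct-GGUF \
  qwen2.5-7b-instruct-q4_k_m.gguf \
  --local-dir ~/models
```

With a GGUF file in ~/models, start the SYCL build of llama-server: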
source /opt/intel/oneapi/setvars.sh
cd ~/llama.cpp
./build/bin/llama-server \
-m ~/models/qwen2.5-7b-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 99 \
-c 8192

- `-ngl 99`: offload all model layers to the GPU
- `-c 8192`: context size
- The server exposes an OpenAI-compatible API at http://localhost:8080/v1
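A quick way to confirm the server is up before switching OpenCode over (a sketch against llama-server's OpenAI-compatible API):

```bash
# Should return a JSON model listing from the running llama-server
curl http://localhost:8080/v1/models
```

If that responds, point OpenCode at this server instead of Ollama: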
export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=unused
export OPENAI_MODEL=qwen2.5-7b

| Aspect | Status |
|---|---|
| Driver maturity | Improving but not as stable as NVIDIA CUDA |
| Performance vs CPU | Often 2-5x faster for token generation |
| Compatibility | Not all quant formats work equally well |
| Debugging | Errors can be cryptic, stack traces unhelpful |
| Community support | Smaller than CUDA ecosystem |
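Rather than trusting anyone's numbers, measure your own stack. llama.cpp ships `llama-bench`, which makes the CPU vs GPU comparison concrete; a sketch using the same model file as above:

```bash
source /opt/intel/oneapi/setvars.sh
cd ~/llama.cpp
./build/bin/llama-bench -m ~/models/qwen2.5-7b-q4_k_m.gguf -ngl 99   # full GPU offload
./build/bin/llama-bench -m ~/models/qwen2.5-7b-q4_k_m.gguf -ngl 0    # CPU only
```

Compare the tokens-per-second figures from the two runs to see what the iGPU actually buys you.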
If the SYCL path misbehaves:

- Check `sycl-ls` to confirm your GPU is visible
- Try fewer layers on GPU: `-ngl 20` instead of `-ngl 99`
- Try a different quantization (Q4_0 is most compatible)
- Check llama.cpp GitHub issues for Intel Arc-specific threads
- Fall back to Ollama on CPU — it's reliable and still fast enough for 7B models