Setting Up OpenCode with Ollama (Local-Only)

Full guide for Pop!_OS with 32 GB RAM + Core Ultra 9. Everything runs locally, zero cloud calls.


Strategy: Two Paths for This Hardware

Path 1: CPU-local via Ollama (start here)

  • Works immediately, great reliability
  • Ollama as the local model server, OpenCode as the agent on top
  • Lowest-friction route — 7B-9B models in 4-bit run well with 32 GB RAM
  • This is your reliable baseline. Get this working first.

Path 2: Arc iGPU via llama.cpp SYCL (optional, experimental)

  • For Intel GPUs, the most "official" route is llama.cpp's SYCL backend
  • Supports both discrete Arc GPUs and integrated Intel Arc graphics
  • Reality check: performance varies widely with driver, kernel, model, and quantization; reports range from great to disappointing depending on how mature the stack is
  • Worth trying, but keep CPU as your fallback

Recommended approach: Get Path 1 fully working first. Then experiment with Path 2 if you want faster inference. You can always fall back to CPU.


Path 1: Ollama + OpenCode (CPU)

Step 1: Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

Verify installation:

ollama --version

Start the Ollama server

ollama serve

This runs in the foreground. Leave this terminal open, or run it in the background:

ollama serve &

To have Ollama start automatically on boot (systemd):

sudo systemctl enable ollama
sudo systemctl start ollama

Check it's running:

curl http://localhost:11434/api/tags

You should get a JSON response (empty models list if fresh install).
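
Concretely, a fresh install returns something like this; the models array fills in as you pull models:

{"models":[]}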


Step 2: Pull a Coding Model

ollama pull qwen2.5:7b

This downloads ~4.7 GB. Wait for it to complete.

Alternative models

ollama pull llama3.1:8b       # more general-purpose
ollama pull deepseek-coder:6.7b  # code-focused
ollama pull codellama:7b       # Meta's code model

Verify the model works

ollama run qwen2.5:7b "Write a hello world in Python"

You should see a Python snippet in the output. With a one-shot prompt like this, the command exits on its own; if you start an interactive session instead (ollama run qwen2.5:7b with no prompt), press Ctrl+D or type /bye to exit.

List downloaded models

ollama list
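
The output is a small table, roughly like this (the ID and timestamp will differ on your machine):

NAME          ID            SIZE      MODIFIED
qwen2.5:7b    <model id>    4.7 GB    2 minutes ago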

Step 3: Install OpenCode

curl -fsSL https://opencode.ai/install | bash

Restart your shell (or source your profile):

exec $SHELL

Verify:

opencode --version

Step 4: Configure OpenCode to Use Ollama

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1.
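
You can confirm that endpoint responds before touching OpenCode. A minimal sketch, assuming the qwen2.5:7b model from Step 2 is already pulled:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Say hello in one word"}]
  }'

A JSON reply containing a choices array means the /v1 endpoint is ready for OpenCode.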

Option A: Environment variables (quick)

export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=ollama
export OPENAI_MODEL=qwen2.5:7b

Option B: Persist in shell profile (if you stick with env vars)

Add to ~/.bashrc or ~/.zshrc:

# OpenCode + Ollama (local LLM)
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=ollama
export OPENAI_MODEL=qwen2.5:7b

Then reload:

source ~/.bashrc   # or source ~/.zshrc

Option C: OpenCode config file (recommended)

Create ~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama (local)",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "qwen2.5-coder:7b": {
          "name": "Qwen 2.5 Coder 7B"
        }
      }
    }
  },
  "model": "ollama/qwen2.5-coder:7b"
}

Note: provider must be an object (not a string), and the model reference uses provider/model format. The example above lists qwen2.5-coder:7b; either pull that model (ollama pull qwen2.5-coder:7b) or swap in the model you pulled in Step 2 (qwen2.5:7b). If tool calls aren't working, try increasing num_ctx in Ollama to 16k-32k (see the Context window section below for how).


Step 5: Run OpenCode

cd ~/your-project
opencode

You should see OpenCode start up and connect to your local Ollama instance.

Quick smoke test

Inside OpenCode, try:

> explain this codebase
> list all files
> edit main.py — add a docstring to the main function

If it responds coherently and proposes edits, the setup is working.


Troubleshooting

"Connection refused" errors

# Check Ollama is running
curl http://localhost:11434/api/tags

# If not running, start it
ollama serve
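
If you installed the systemd service, the service state and recent logs usually point at the cause:

# Check the service state
systemctl status ollama

# Show the last 50 lines of Ollama's logs
journalctl -u ollama -n 50 --no-pager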

Model not found

# List available models
ollama list

# Make sure OPENAI_MODEL matches exactly
echo $OPENAI_MODEL

Slow responses

  • 7B models on CPU can take 10-30s per response
  • Check whether Ollama is using the GPU: ollama ps shows whether the loaded model is running on CPU or GPU
  • For Intel Arc iGPU, Ollama may need additional setup for GPU offloading

OpenCode not picking up env vars

# Verify they're set
env | grep OPENAI

# Should show:
# OPENAI_BASE_URL=http://localhost:11434/v1
# OPENAI_API_KEY=ollama
# OPENAI_MODEL=qwen2.5:7b

Recommended Models

With 32 GB RAM + Core Ultra 9:

Model                 Size      Strength                             Notes
qwen2.5:7b            ~4.7 GB   Code edits, instruction following    Best default ⭐
llama3.1:8b           ~4.7 GB   General purpose                      Good all-rounder
deepseek-coder:6.7b   ~3.8 GB   Code generation                      Code-specialized
codellama:7b          ~3.8 GB   Code completion                      Good for fill-in tasks

Sizing guidelines

  • 7B models — fast, fits easily in 32 GB RAM
  • 13B models (4-bit) — usable but noticeably slower
  • < 3B models — avoid for OpenCode; they struggle with multi-file edits

Local-Model Caveats

What local models handle well

  • Single-file edits and explanations
  • Small refactors (1-3 files)
  • Code review and suggestions
  • "Edit these 2 files" style tasks

What local models struggle with

  • Long multi-step plans across many files
  • Very large refactors in one shot
  • Complex reasoning-heavy architecture decisions

Best workflow with local models

  1. Scope tasks narrowly
  2. Ask it to explain before editing
  3. Review diffs before accepting
  4. Break big changes into small steps

Context window

  • Ollama defaults to a fairly small context window (a few thousand tokens) even when the underlying model supports 32K or more
  • OpenCode chunks intelligently, but large repos still need narrower questions
  • To extend context, raise num_ctx: either interactively with /set parameter num_ctx 16384 inside an ollama run session, or via a Modelfile, as sketched below
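
A minimal Modelfile sketch that bakes a larger context window into a named variant (the variant name qwen2.5-16k is arbitrary):

# Modelfile
FROM qwen2.5:7b
PARAMETER num_ctx 16384

# Build the variant from the Modelfile above
ollama create qwen2.5-16k -f Modelfile

Then point OPENAI_MODEL (or the model entry in opencode.json) at qwen2.5-16k to use the larger context.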

When This Setup Shines

  • Private repos (work code, side projects)
  • Offline or VPN-restricted environments
  • Fast "edit → test → edit" loops
  • Predictable behavior and zero token cost

When Cloud Models Still Win

  • Massive refactors across many subsystems
  • Extremely long context (RFCs + many files at once)
  • Complex reasoning-heavy design work

Path 2: Arc iGPU via llama.cpp SYCL (Optional)

This is the experimental path. Only attempt after Path 1 is solid.

Why llama.cpp SYCL?

  • Intel's official GPU compute path for llama.cpp
  • Can offload model layers to the Arc iGPU for faster token generation
  • Potentially 2-5x speedup over CPU-only for supported models

Prerequisites

# Install Intel GPU drivers (if not already present)
sudo apt update
sudo apt install -y intel-opencl-icd intel-level-zero-gpu level-zero

# Verify GPU is detected
sudo apt install -y clinfo
clinfo | head -20
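
For a more compact view, clinfo -l prints just the platform/device tree; you want to see an Intel graphics platform with the Arc device listed (exact names vary by driver version):

clinfo -l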

Install Intel oneAPI Base Toolkit

# Add Intel repo
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
  | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null

echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] \
  https://apt.repos.intel.com/oneapi all main" \
  | sudo tee /etc/apt/sources.list.d/oneAPI.list

sudo apt update
sudo apt install -y intel-oneapi-base-toolkit
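
After the toolkit installs, confirm the SYCL runtime can see the iGPU by sourcing the oneAPI environment and listing devices:

source /opt/intel/oneapi/setvars.sh
sycl-ls

You should see a level_zero (or opencl) entry naming the Intel Arc graphics device; if only CPU/host devices show up, fix the GPU drivers before building llama.cpp.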

Build llama.cpp with SYCL

# Source the oneAPI environment
source /opt/intel/oneapi/setvars.sh

# Clone and build
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j$(nproc)
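
If the build succeeds, the binaries land under build/bin; a quick sanity check that the server was built:

ls build/bin/ | grep -i llama
# Expect llama-server (and llama-cli) in the list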

Download a GGUF model

# Example: Qwen 2.5 7B in Q4_K_M quantization
# Download from HuggingFace (use huggingface-cli or wget)
mkdir -p ~/models
cd ~/models
# Download the GGUF file for your chosen model
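
One way to fetch a GGUF is huggingface-cli. The repo and file names below are assumptions; check the model page for the exact file name, since some quants are split into multiple parts:

# Install the Hugging Face CLI if needed
pip install -U "huggingface_hub[cli]"

# Example repo/file names; verify them on huggingface.co first
huggingface-cli download Qwen/Qwen2.5-7B-Instruct-GGUF \
  qwen2.5-7b-instruct-q4_k_m.gguf \
  --local-dir ~/models

Whatever file you end up with, adjust the -m path in the next step to match its actual name.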

Run with GPU offloading

source /opt/intel/oneapi/setvars.sh

cd ~/llama.cpp
./build/bin/llama-server \
  -m ~/models/qwen2.5-7b-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -c 8192

  • -ngl 99 — offload all layers to GPU
  • -c 8192 — context size
  • The server exposes an OpenAI-compatible API at http://localhost:8080/v1
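
Before repointing OpenCode, a quick smoke test against the new server (assuming it started without errors):

# Health check
curl http://localhost:8080/health

# OpenAI-compatible chat request against the loaded model
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one word"}]}'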

Point OpenCode to llama.cpp instead of Ollama

export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=unused
export OPENAI_MODEL=qwen2.5-7b

Reality check: what to expect

Aspect               Status
Driver maturity      Improving, but not as stable as NVIDIA CUDA
Performance vs CPU   Often 2-5x faster for token generation
Compatibility        Not all quant formats work equally well
Debugging            Errors can be cryptic, stack traces unhelpful
Community support    Smaller than the CUDA ecosystem

If it doesn't work

  • Check sycl-ls to confirm your GPU is visible
  • Try fewer layers on GPU: -ngl 20 instead of -ngl 99
  • Try a different quantization (Q4_0 is most compatible)
  • Check llama.cpp GitHub issues for Intel Arc-specific threads
  • Fall back to Ollama on CPU — it's reliable and still fast enough for 7B models