Full guide for Pop!_OS with 32 GB RAM + Core Ultra 9. Everything runs locally, zero cloud calls.
Path 1: Ollama + OpenCode on the CPU

- Works immediately, great reliability
- Ollama as the local model server, OpenCode as the agent on top
- Lowest-friction route — 7B-9B models in 4-bit run well with 32 GB RAM
- This is your reliable baseline. Get this working first.
Path 2: llama.cpp with Intel's SYCL backend on the Arc GPU (experimental)

- For Intel GPUs, the most "official" route is llama.cpp's SYCL backend
- Supports discrete Arc GPUs and the integrated Arc graphics in Core Ultra chips
- Reality check: performance varies a lot by driver + kernel + model + quant. Some people report it's great, others report it's disappointing depending on stack maturity
- Worth trying, but keep CPU as your fallback
Recommended approach: Get Path 1 fully working first. Then experiment with Path 2 if you want faster inference. You can always fall back to CPU.
curl -fsSL https://ollama.com/install.sh | sh

Verify installation:
ollama --version

Start the Ollama server:

ollama serve

This runs in the foreground. Leave this terminal open, or run it in the background:
ollama serve &

To have Ollama start automatically on boot (systemd):
sudo systemctl enable ollama
sudo systemctl start ollama

Check it's running:
curl http://localhost:11434/api/tags

You should get a JSON response (empty models list if fresh install).
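If you want the JSON readable, you can pipe it through Python's built-in formatter (a small convenience, assuming `python3` is available, which Pop!_OS ships by default):

```bash
# Same check, pretty-printed; a fresh install shows an empty "models" list
curl -s http://localhost:11434/api/tags | python3 -m json.tool
```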
ollama pull qwen2.5:7b

This downloads ~4.7 GB. Wait for it to complete.

Optional alternatives:
ollama pull llama3.1:8b # more general-purpose
ollama pull deepseek-coder:6.7b # code-focused
ollama pull codellama:7b       # Meta's code model

Test that generation works:

ollama run qwen2.5:7b "Write a hello world in Python"

You should see a Python snippet in the output. (Running `ollama run qwen2.5:7b` without a prompt starts an interactive session instead; press Ctrl+D to exit it.)
Confirm the model is listed:

ollama list

Install OpenCode:

curl -fsSL https://opencode.ai/install | bash

Restart your shell (or source your profile):
exec $SHELL

Verify:
opencode --version

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1.
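Before wiring up OpenCode, you can hit that endpoint directly to confirm chat completions work end to end. A minimal sketch, assuming the qwen2.5:7b model pulled earlier:

```bash
# One-off request against Ollama's OpenAI-compatible chat endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5:7b",
        "messages": [{"role": "user", "content": "Reply with the single word: pong"}]
      }'
```

If a JSON completion comes back, OpenCode only needs to be pointed at the same endpoint, which is what the environment variables below do.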
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=ollama
export OPENAI_MODEL=qwen2.5:7b

Add to ~/.bashrc or ~/.zshrc:
# OpenCode + Ollama (local LLM)
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=ollama
export OPENAI_MODEL=qwen2.5:7b

Then reload:
source ~/.bashrc   # or: source ~/.zshrc

Create ~/.config/opencode/opencode.json:
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama (local)",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "qwen2.5:7b": {
          "name": "Qwen 2.5 7B"
        }
      }
    }
  },
  "model": "ollama/qwen2.5:7b"
}

Note:
`provider` must be an object (not a string). Model references use the `provider/model` format. If tool calls aren't working, try increasing `num_ctx` in Ollama to 16k-32k.
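One way to raise `num_ctx` without touching every request is to bake it into a model variant with a Modelfile. A minimal sketch, assuming the qwen2.5:7b model pulled earlier; the `qwen2.5-16k` name is just an example label:

```bash
# Create a variant of qwen2.5:7b with a 16k context window
cat > /tmp/Modelfile.qwen-16k <<'EOF'
FROM qwen2.5:7b
PARAMETER num_ctx 16384
EOF
ollama create qwen2.5-16k -f /tmp/Modelfile.qwen-16k
ollama list   # the new variant should appear alongside qwen2.5:7b
```

If you switch to the variant, reference its name in opencode.json and OPENAI_MODEL. With the config in place, try OpenCode on a real project: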
cd ~/your-project
opencode

You should see OpenCode start up and connect to your local Ollama instance.
Inside OpenCode, try:
> explain this codebase
> list all files
> edit main.py — add a docstring to the main function
If it responds coherently and proposes edits, the setup is working.
# Check Ollama is running
curl http://localhost:11434/api/tags
# If not running, start it
ollama serve

# List available models
ollama list
# Make sure OPENAI_MODEL matches exactly
echo $OPENAI_MODEL

If responses are slow:

- 7B models on CPU can take 10-30s per response
- Check if Ollama is using the GPU: `ollama ps` (shows VRAM usage)
- For the Intel Arc iGPU, Ollama may need additional setup for GPU offloading
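If you installed the systemd service, the Ollama startup log usually notes whether a GPU was detected; a quick way to check (assumes the systemd setup from the install step):

```bash
# Look for GPU/compute detection messages near service startup
journalctl -u ollama --no-pager | tail -n 50
```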
If your settings don't seem to be picked up, check the environment variables:

# Verify they're set
env | grep OPENAI
# Should show:
# OPENAI_BASE_URL=http://localhost:11434/v1
# OPENAI_API_KEY=ollama
# OPENAI_MODEL=qwen2.5:7b

With 32 GB RAM + Core Ultra 9:
| Model | Size | Strength | Notes |
|---|---|---|---|
| qwen2.5:7b | ~4.7 GB | Code edits, instruction following | Best default ⭐ |
| llama3.1:8b | ~4.7 GB | General purpose | Good all-rounder |
| deepseek-coder:6.7b | ~3.8 GB | Code generation | Code-specialized |
| codellama:7b | ~3.8 GB | Code completion | Good for fill-in tasks |
- 7B models — fast, fit easily in 32 GB RAM
- 13B models (4-bit) — usable but noticeably slower
- < 3B models — avoid for OpenCode; they struggle with multi-file edits
What a local 7B handles well:

- Single-file edits and explanations
- Small refactors (1-3 files)
- Code review and suggestions
- "Edit these 2 files" style tasks
Where it struggles:

- Long multi-step plans across many files
- Very large refactors in one shot
- Complex reasoning-heavy architecture decisions
Tips for better results:

- Scope tasks narrowly
- Ask it to explain before editing
- Review diffs before accepting
- Break big changes into small steps
- Ollama models typically have 4K-8K context (some support 32K+)
- OpenCode chunks intelligently, but large repos need narrower questions
- To extend context: set `num_ctx` in a Modelfile (see the sketch earlier) or, inside an interactive `ollama run qwen2.5:7b` session, run `/set parameter num_ctx 16384`
Where local models shine:

- Private repos (work code, side projects)
- Offline or VPN-restricted environments
- Fast "edit → test → edit" loops
- Predictable behavior and zero token cost
Where a cloud model is still the better tool:

- Massive refactors across many subsystems
- Extremely long context (RFCs + many files at once)
- Complex reasoning-heavy design work
This is the experimental path. Only attempt after Path 1 is solid.
- Intel's official GPU compute path for llama.cpp
- Can offload model layers to the Arc iGPU for faster token generation
- Potentially 2-5x speedup over CPU-only for supported models
# Install Intel GPU drivers (if not already present)
sudo apt update
sudo apt install -y intel-opencl-icd intel-level-zero-gpu level-zero
# Verify GPU is detected
sudo apt install -y clinfo
clinfo | head -20
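If the iGPU is visible, an Intel platform and device should appear in that output. A narrower check, if you prefer (the grep pattern is a convenience filter, not an exact guarantee of field names across clinfo versions):

```bash
# Should list an Intel OpenCL platform and the Arc graphics device
clinfo | grep -iE 'platform name|device name'
```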
# Add Intel repo
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] \
https://apt.repos.intel.com/oneapi all main" \
| sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install -y intel-oneapi-base-toolkit
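Before building, it's worth confirming the GPU is visible as a SYCL device; `sycl-ls` ships with the oneAPI toolkit:

```bash
source /opt/intel/oneapi/setvars.sh
sycl-ls   # expect a Level Zero / OpenCL entry for the Intel Arc graphics
```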
# Source the oneAPI environment
source /opt/intel/oneapi/setvars.sh
# Clone and build
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j$(nproc)

# Example: Qwen 2.5 7B in Q4_K_M quantization
# Download from HuggingFace (use huggingface-cli or wget)
mkdir -p ~/models
cd ~/models
# Download the GGUF file for your chosen model
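As one concrete sketch of the download step using `huggingface-cli`: the repo and file names below are assumptions, so check the model page for the actual GGUF file listing and adjust the path used in the server command to whatever you downloaded.

```bash
# Example only: verify repo/filename on Hugging Face before downloading
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen2.5-7B-Instruct-GGUF \
  qwen2.5-7b-instruct-q4_k_m.gguf \
  --local-dir ~/models
```

With a GGUF file in ~/models, start the SYCL build of llama-server: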
source /opt/intel/oneapi/setvars.sh
cd ~/llama.cpp
./build/bin/llama-server \
-m ~/models/qwen2.5-7b-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 99 \
-c 8192

- `-ngl 99`: offload all model layers to the GPU
- `-c 8192`: context size
- The server exposes an OpenAI-compatible API at http://localhost:8080/v1
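A quick way to confirm the server is up before switching OpenCode over (a sketch against llama-server's OpenAI-compatible API):

```bash
# Should return a JSON model listing from the running llama-server
curl http://localhost:8080/v1/models
```

If that responds, point OpenCode at this server instead of Ollama: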
export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=unused
export OPENAI_MODEL=qwen2.5-7b

| Aspect | Status |
|---|---|
| Driver maturity | Improving but not as stable as NVIDIA CUDA |
| Performance vs CPU | Often 2-5x faster for token generation |
| Compatibility | Not all quant formats work equally well |
| Debugging | Errors can be cryptic, stack traces unhelpful |
| Community support | Smaller than CUDA ecosystem |
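Rather than trusting anyone's numbers, measure your own stack. llama.cpp ships `llama-bench`, which makes the CPU vs GPU comparison concrete; a sketch using the same model file as above:

```bash
source /opt/intel/oneapi/setvars.sh
cd ~/llama.cpp
./build/bin/llama-bench -m ~/models/qwen2.5-7b-q4_k_m.gguf -ngl 99   # full GPU offload
./build/bin/llama-bench -m ~/models/qwen2.5-7b-q4_k_m.gguf -ngl 0    # CPU only
```

Compare the tokens-per-second figures from the two runs to see what the iGPU actually buys you.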
If the SYCL path misbehaves:

- Check `sycl-ls` to confirm your GPU is visible
- Try fewer layers on GPU: `-ngl 20` instead of `-ngl 99`
- Try a different quantization (Q4_0 is most compatible)
- Check llama.cpp GitHub issues for Intel Arc-specific threads
- Fall back to Ollama on CPU — it's reliable and still fast enough for 7B models