For environments that are locked down and where code/data must not be sent to an external service, a locally served LLM can still be used as the backend for agentic AI coding tools. This gist details the steps to use the Cline AI coding agent in VSCode with a locally served LLM running in an Ollama Docker container (assuming you use VSCode, with or without Remote SSH, on the same machine that will serve the model):
- start the Ollama Docker container: `docker run -d --rm --gpus='"device=0"' -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama`
  - requires Docker with the NVIDIA Container Toolkit installed
  - set the GPU device index to control which GPU is used on multi-GPU systems
  - to confirm the container and API are up, see the verification sketch after this list
- serve a capable agentic coding model inside the container (at the time of writing, Cline suggests Qwen3 Coder 30B at 8-bit quantization): `docker exec ollama ollama run qwen3-coder:30b-a3b-q8_0`
  - 8-bit quantization fits in <80GB of VRAM with a 32k-token context window (tested on an A100 80GB)
  - 4-bit quantization is also available and fits in <40GB of VRAM: see the ollama qwen3-coder tags
  - a quick smoke test of the served model is sketched after this list
- install the Cline extension from the VSCode Extensions marketplace (a CLI alternative is sketched after this list)
- configure Cline:
- use Ollama as the API provider
- select the served model
- select the checkbox for "Use compact prompt"
- change the auto-approve permissions as desired
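
To confirm the container and API are up (referenced from the first step above), a minimal shell check, assuming the container name `ollama` and port `11434` from the `docker run` command:

```sh
# confirm the container is running and check its startup logs
docker ps --filter name=ollama
docker logs ollama

# the Ollama HTTP API should respond on the mapped port;
# /api/tags lists the models available to the server
curl http://localhost:11434/api/tags
```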
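
To smoke-test the served model (referenced from the model step above) before pointing Cline at it, a hedged sketch using Ollama's `/api/generate` endpoint; the prompt text is arbitrary, and the model tag should match the variant you pulled:

```sh
# list models currently loaded in memory and their reported size
docker exec ollama ollama ps

# send a one-off, non-streaming generation request
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3-coder:30b-a3b-q8_0",
  "prompt": "Write a Python function that reverses a string.",
  "stream": false
}'

# check GPU memory headroom on the host
nvidia-smi
```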
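
As a CLI alternative to installing Cline through the Extensions view (the extension ID below is an assumption; verify it on the VSCode Marketplace):

```sh
# install the Cline extension from the command line
# (extension ID assumed to be saoudrizwan.claude-dev; confirm on the Marketplace)
code --install-extension saoudrizwan.claude-dev
```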