Note
Using RamaLama might be an easier option; it seems to be a convenience wrapper that does pretty much what I've described here. In fact, I've even used their container image.
Here's a conference talk introducing it: https://www.youtube.com/watch?v=53NZFC-ReWs
Install Podman with libkrun backend for GPU acceleration:
brew tap slp/krunkit # https://github.com/slp/homebrew-krunkit
brew install krunkit
brew install podman

Initialize Podman machine with libkrun provider:
export CONTAINERS_MACHINE_PROVIDER="libkrun"
podman machine init --cpus=4 --memory=16384 # adjust cpu and memory settings as needed
podman machine info # verify vmtype is "libkrun"

Start the Podman machine:
podman machine start

Note
You can double check by connecting to the machine via podman machine ssh
and running ls -al /dev/dri to see if the GPU devices are available.
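For example, the check can be run as a one-liner from the host (a quick sketch; the exact device nodes listed may differ, and if your Podman version does not accept a command argument here, open the interactive session first as described above):

# run ls inside the Podman machine without opening an interactive shell
podman machine ssh ls -al /dev/dri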
Tip
You can stop the machine with podman machine stop when not in use.
Set up a local cache directory so we do not end up pulling models every time:
export LLAMA_CPP_CACHE_DIR="$(pwd)/.llama.cpp/cache"
mkdir -p "$LLAMA_CPP_CACHE_DIR"

Note
We are using a container image that is patched to support GPU acceleration and has llama.cpp installed: https://quay.io/repository/ramalama/ramalama?tab=tags (source)
Run llama.cpp in a container and check GPU support:
podman run --device /dev/dri --rm quay.io/ramalama/ramalama:latest llama-cli --list-devices

This should list the GPU devices available in the container.
Run model in a container (interactive):
podman run \
--device /dev/dri \
--volume "$LLAMA_CPP_CACHE_DIR":/root/.cache/llama.cpp/ \
--interactive \
--tty \
--rm \
quay.io/ramalama/ramalama:latest \
llama-cli \
-hf ggml-org/gemma-3-1b-it-GGUF

Run server:
podman run \
--device /dev/dri \
--volume "$LLAMA_CPP_CACHE_DIR":/root/.cache/llama.cpp/ \
--publish 9999:9999 \
--interactive \
--tty \
--rm \
quay.io/ramalama/ramalama:latest \
llama-server \
--host 0.0.0.0 \
--port 9999 \
-hf ggml-org/gemma-3-1b-it-GGUF

Open http://localhost:9999 in your browser to access the web interface.
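The web UI is not the only way to talk to it: llama-server also exposes an HTTP API, including an OpenAI-compatible chat completions endpoint. A quick smoke test from the host could look like this (a sketch; the model name in the payload is just a placeholder):

# check that the server is up
curl http://localhost:9999/health

# send a simple chat request to the OpenAI-compatible endpoint
curl http://localhost:9999/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'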