
Containerize llama.cpp on macOS with GPU support

Note

Using RamaLama might be an easier option. It is essentially a convenience wrapper for what I describe here; in fact, I even use their container image below.

Here's a conference talk introducing it: https://www.youtube.com/watch?v=53NZFC-ReWs

Preparation

Install Podman with libkrun backend for GPU acceleration:

brew tap slp/krunkit # https://github.com/slp/homebrew-krunkit
brew install krunkit
brew install podman

Initialize Podman machine with libkrun provider:

export CONTAINERS_MACHINE_PROVIDER="libkrun"
podman machine init --cpus=4 --memory=16384 # adjust CPU and memory settings as needed
podman machine info # verify vmtype is "libkrun"

Start the Podman machine:

podman machine start

Note

You can double-check by connecting to the machine via podman machine ssh and running ls -al /dev/dri to see whether the GPU devices are available.
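
For example (the exact device nodes listed may vary):

podman machine ssh # connect to the VM
ls -al /dev/dri    # inside the VM: GPU render nodes (e.g. renderD128) should be listed
exit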

Tip

You can stop the machine with podman machine stop when not in use.
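
For example (assuming CONTAINERS_MACHINE_PROVIDER is still exported in your shell):

podman machine stop  # shut down the VM when you are done
podman machine start # boot it up again later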

Set up a local cache directory so we do not end up pulling models every time:

export LLAMA_CPP_CACHE_DIR="$(pwd)/.llama.cpp/cache"
mkdir -p "$LLAMA_CPP_CACHE_DIR"

Running llama.cpp

Note

We are using a container image that comes with llama.cpp installed and is patched to support GPU acceleration: https://quay.io/repository/ramalama/ramalama?tab=tags (source)

Run llama.cpp in a container and check GPU support:

podman run --device /dev/dri --rm quay.io/ramalama/ramalama:latest llama-cli --list-devices

This should list the GPU devices available in the container.

Running models

Run a model in a container (interactive):

podman run \
  --device /dev/dri \
  --volume "$LLAMA_CPP_CACHE_DIR":/root/.cache/llama.cpp/ \
  --interactive \
  --tty \
  --rm \
  quay.io/ramalama/ramalama:latest \
  llama-cli \
  -hf ggml-org/gemma-3-1b-it-GGUF
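
Because the cache directory is mounted into the container, the downloaded GGUF files end up on the host and are reused on subsequent runs. A quick way to confirm:

ls -R "$LLAMA_CPP_CACHE_DIR" # the model files should show up here after the first run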

Running llama-server

Run the server:

podman run \
  --device /dev/dri \
  --volume "$LLAMA_CPP_CACHE_DIR":/root/.cache/llama.cpp/ \
  --publish 9999:9999 \
  --interactive \
  --tty \
  --rm \
  quay.io/ramalama/ramalama:latest \
  llama-server \
  --host 0.0.0.0 \
  --port 9999 \
  -hf ggml-org/gemma-3-1b-it-GGUF

Open http://localhost:9999 in your browser to access the web interface.
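
You can also query the server from the terminal via its OpenAI-compatible API (a minimal sketch; /v1/chat/completions is a default llama-server endpoint):

curl http://localhost:9999/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'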
