Note
Using RamaLama might be an easier option; it seems to be a convenience wrapper that does pretty much what I've described here. In fact, I've even used their container image.
Here's a conference talk introducing it: https://www.youtube.com/watch?v=53NZFC-ReWs
Install Podman with libkrun backend for GPU acceleration:
brew tap slp/krunkit # https://github.com/slp/homebrew-krunkit
brew install krunkit
brew install podman

Initialize Podman machine with libkrun provider:
export CONTAINERS_MACHINE_PROVIDER="libkrun"
podman machine init --cpus=4 --memory=16384 # adjust cpu and memory settings as needed
podman machine info # verify vmtype is "libkrun"

Start the Podman machine:
podman machine start

Note
You can double check by connecting to the machine via podman machine ssh
and running ls -al /dev/dri to see if the GPU devices are available.
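For example, the check can be run as a one-liner from the host (a quick sketch; the exact device nodes listed may differ, and if your Podman version does not accept a command argument here, open the interactive session first as described above):

# run ls inside the Podman machine without opening an interactive shell
podman machine ssh ls -al /dev/dri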
Tip
You can stop the machine with podman machine stop when not in use.
Set up a local cache directory so we do not end up pulling models every time:
export LLAMA_CPP_CACHE_DIR="$(pwd)/.llama.cpp/cache"
mkdir -p "$LLAMA_CPP_CACHE_DIR"

Note
We are using a container image that is patched to support GPU acceleration and has llama.cpp installed: https://quay.io/repository/ramalama/ramalama?tab=tags (source)
Run llama.cpp in a container and check GPU support:
podman run --device /dev/dri --rm quay.io/ramalama/ramalama:latest llama-cli --list-devices

This should list the GPU devices available in the container.
Run model in a container (interactive):
podman run \
--device /dev/dri \
--volume "$LLAMA_CPP_CACHE_DIR":/root/.cache/llama.cpp/ \
--interactive \
--tty \
--rm \
quay.io/ramalama/ramalama:latest \
llama-cli \
-hf ggml-org/gemma-3-1b-it-GGUF

Run server:
podman run \
--device /dev/dri \
--volume "$LLAMA_CPP_CACHE_DIR":/root/.cache/llama.cpp/ \
--publish 9999:9999 \
--interactive \
--tty \
--rm \
quay.io/ramalama/ramalama:latest \
llama-server \
--host 0.0.0.0 \
--port 9999 \
-hf ggml-org/gemma-3-1b-it-GGUF

Open http://localhost:9999 in your browser to access the web interface.
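The web UI is not the only way to talk to it: llama-server also exposes an HTTP API, including an OpenAI-compatible chat completions endpoint. A quick smoke test from the host could look like this (a sketch; the model name in the payload is just a placeholder):

# check that the server is up
curl http://localhost:9999/health

# send a simple chat request to the OpenAI-compatible endpoint
curl http://localhost:9999/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'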