| ./bin/llama-server --help | |
| -h, --help, --usage print usage and exit | |
| --version show version and build info | |
| --license show source code license and dependencies | |
| -cl, --cache-list show list of models in cache | |
| --completion-bash print source-able bash completion script for llama.cpp | |
| --verbose-prompt print a verbose prompt before generation (default: false) | |
| -t, --threads N number of CPU threads to use during generation (default: -1) | |
| (env: LLAMA_ARG_THREADS) | |
| -tb, --threads-batch N number of threads to use during batch and prompt processing (default: | |
| same as --threads) | |
| -C, --cpu-mask M CPU affinity mask: arbitrarily long hex. Complements cpu-range | |
| (default: "") | |
| -Cr, --cpu-range lo-hi range of CPUs for affinity. Complements --cpu-mask | |
| --cpu-strict <0|1> use strict CPU placement (default: 0) | |
| --prio N set process/thread priority : low(-1), normal(0), medium(1), high(2), | |
| realtime(3) (default: 0) | |
| --poll <0...100> use polling level to wait for work (0 - no polling, default: 50) | |
| -Cb, --cpu-mask-batch M CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch | |
| (default: same as --cpu-mask) | |
| -Crb, --cpu-range-batch lo-hi ranges of CPUs for affinity. Complements --cpu-mask-batch | |
| --cpu-strict-batch <0|1> use strict CPU placement (default: same as --cpu-strict) | |
| --prio-batch N set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime | |
| (default: 0) | |
| --poll-batch <0|1> use polling to wait for work (default: same as --poll) | |
| -c, --ctx-size N size of the prompt context (default: 0, 0 = loaded from model) | |
| (env: LLAMA_ARG_CTX_SIZE) | |
| -n, --predict, --n-predict N number of tokens to predict (default: -1, -1 = infinity) | |
| (env: LLAMA_ARG_N_PREDICT) | |
| -b, --batch-size N logical maximum batch size (default: 2048) | |
| (env: LLAMA_ARG_BATCH) | |
| -ub, --ubatch-size N physical maximum batch size (default: 512) | |
| (env: LLAMA_ARG_UBATCH) | |
| --keep N number of tokens to keep from the initial prompt (default: 0, -1 = | |
| all) | |
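
For example, a minimal launch that sets threads, context size, and batch sizes could look like the sketch below; the model path is a placeholder and the numbers are illustrative.

```bash
# Minimal sketch; ./models/model.gguf is a placeholder path.
# -t sets generation threads, -c the context size (0 = read from the model),
# -b / -ub the logical / physical batch sizes (2048 / 512 are the defaults above).
./bin/llama-server \
  -m ./models/model.gguf \
  -t 8 \
  -c 8192 \
  -b 2048 -ub 512
```
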
| --swa-full use full-size SWA cache (default: false) | |
| [(more info)](https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055) | |
| (env: LLAMA_ARG_SWA_FULL) | |
| -fa, --flash-attn [on|off|auto] set Flash Attention use ('on', 'off', or 'auto', default: 'auto') | |
| (env: LLAMA_ARG_FLASH_ATTN) | |
| --perf, --no-perf whether to enable internal libllama performance timings (default: | |
| false) | |
| (env: LLAMA_ARG_PERF) | |
| -e, --escape, --no-escape whether to process escape sequences (\n, \r, \t, \', \", \\) | |
| (default: true) | |
| --rope-scaling {none,linear,yarn} RoPE frequency scaling method, defaults to linear unless specified by | |
| the model | |
| (env: LLAMA_ARG_ROPE_SCALING_TYPE) | |
| --rope-scale N RoPE context scaling factor, expands context by a factor of N | |
| (env: LLAMA_ARG_ROPE_SCALE) | |
| --rope-freq-base N RoPE base frequency, used by NTK-aware scaling (default: loaded from | |
| model) | |
| (env: LLAMA_ARG_ROPE_FREQ_BASE) | |
| --rope-freq-scale N RoPE frequency scaling factor, expands context by a factor of 1/N | |
| (env: LLAMA_ARG_ROPE_FREQ_SCALE) | |
| --yarn-orig-ctx N YaRN: original context size of model (default: 0 = model training | |
| context size) | |
| (env: LLAMA_ARG_YARN_ORIG_CTX) | |
| --yarn-ext-factor N YaRN: extrapolation mix factor (default: -1.00, 0.0 = full | |
| interpolation) | |
| (env: LLAMA_ARG_YARN_EXT_FACTOR) | |
| --yarn-attn-factor N YaRN: scale sqrt(t) or attention magnitude (default: -1.00) | |
| (env: LLAMA_ARG_YARN_ATTN_FACTOR) | |
| --yarn-beta-slow N YaRN: high correction dim or alpha (default: -1.00) | |
| (env: LLAMA_ARG_YARN_BETA_SLOW) | |
| --yarn-beta-fast N YaRN: low correction dim or beta (default: -1.00) | |
| (env: LLAMA_ARG_YARN_BETA_FAST) | |
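
As an illustration of the RoPE/YaRN flags above, the sketch below extends a model trained on a 32k context to 128k. The numbers are illustrative and the model path is a placeholder; whether a given model tolerates this depends on the model.

```bash
# Illustrative only: rope-scale = target context / original training context
# (here 131072 / 32768 = 4).
./bin/llama-server \
  -m ./models/model.gguf \
  -c 131072 \
  --rope-scaling yarn \
  --rope-scale 4 \
  --yarn-orig-ctx 32768
```
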
| -kvo, --kv-offload, -nkvo, --no-kv-offload | |
| whether to enable KV cache offloading (default: enabled) | |
| (env: LLAMA_ARG_KV_OFFLOAD) | |
| --repack, -nr, --no-repack whether to enable weight repacking (default: enabled) | |
| (env: LLAMA_ARG_REPACK) | |
| --no-host bypass the host buffer, allowing extra buffers to be used | |
| (env: LLAMA_ARG_NO_HOST) | |
| -ctk, --cache-type-k TYPE KV cache data type for K | |
| allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1 | |
| (default: f16) | |
| (env: LLAMA_ARG_CACHE_TYPE_K) | |
| -ctv, --cache-type-v TYPE KV cache data type for V | |
| allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1 | |
| (default: f16) | |
| (env: LLAMA_ARG_CACHE_TYPE_V) | |
| -dt, --defrag-thold N KV cache defragmentation threshold (DEPRECATED) | |
| (env: LLAMA_ARG_DEFRAG_THOLD) | |
| --mlock force system to keep model in RAM rather than swapping or compressing | |
| (env: LLAMA_ARG_MLOCK) | |
| --mmap, --no-mmap whether to memory-map the model (if mmap is disabled, loading is slower | |
| but may reduce pageouts if not using mlock) (default: enabled) | |
| (env: LLAMA_ARG_MMAP) | |
| -dio, --direct-io, -ndio, --no-direct-io | |
| use DirectIO if available. (default: disabled) | |
| (env: LLAMA_ARG_DIO) | |
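
A sketch combining the memory-related flags above: quantizing the KV cache and pinning the model in RAM. The model path is a placeholder, and how much quality a q8_0 KV cache costs depends on the model and workload.

```bash
# q8_0 K/V roughly halves KV-cache memory relative to f16; a quantized V cache
# generally needs flash attention, so it is forced on here.
./bin/llama-server \
  -m ./models/model.gguf \
  -fa on \
  -ctk q8_0 -ctv q8_0 \
  --mlock
```
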
| --numa TYPE attempt optimizations that help on some NUMA systems | |
| - distribute: spread execution evenly over all nodes | |
| - isolate: only spawn threads on CPUs on the node that execution | |
| started on | |
| - numactl: use the CPU map provided by numactl | |
| if previously run without this option, it is recommended to drop the | |
| system page cache before using it | |
| see https://github.com/ggml-org/llama.cpp/issues/1437 | |
| (env: LLAMA_ARG_NUMA) | |
| -dev, --device <dev1,dev2,..> comma-separated list of devices to use for offloading (none = don't | |
| offload) | |
| use --list-devices to see a list of available devices | |
| (env: LLAMA_ARG_DEVICE) | |
| --list-devices print list of available devices and exit | |
| -ot, --override-tensor <tensor name pattern>=<buffer type>,... | |
| override tensor buffer type | |
| (env: LLAMA_ARG_OVERRIDE_TENSOR) | |
| -cmoe, --cpu-moe keep all Mixture of Experts (MoE) weights in the CPU | |
| (env: LLAMA_ARG_CPU_MOE) | |
| -ncmoe, --n-cpu-moe N keep the Mixture of Experts (MoE) weights of the first N layers in the | |
| CPU | |
| (env: LLAMA_ARG_N_CPU_MOE) | |
| -ngl, --gpu-layers, --n-gpu-layers N max. number of layers to store in VRAM, either an exact number, | |
| 'auto', or 'all' (default: auto) | |
| (env: LLAMA_ARG_N_GPU_LAYERS) | |
| -sm, --split-mode {none,layer,row} how to split the model across multiple GPUs, one of: | |
| - none: use one GPU only | |
| - layer (default): split layers and KV across GPUs | |
| - row: split rows across GPUs | |
| (env: LLAMA_ARG_SPLIT_MODE) | |
| -ts, --tensor-split N0,N1,N2,... fraction of the model to offload to each GPU, comma-separated list of | |
| proportions, e.g. 3,1 | |
| (env: LLAMA_ARG_TENSOR_SPLIT) | |
| -mg, --main-gpu INDEX the GPU to use for the model (with split-mode = none), or for | |
| intermediate results and KV (with split-mode = row) (default: 0) | |
| (env: LLAMA_ARG_MAIN_GPU) | |
| -fit, --fit [on|off] whether to adjust unset arguments to fit in device memory ('on' or | |
| 'off', default: 'on') | |
| (env: LLAMA_ARG_FIT) | |
| -fitt, --fit-target MiB0,MiB1,MiB2,... | |
| target margin per device for --fit, comma-separated list of values, | |
| single value is broadcast across all devices, default: 1024 | |
| (env: LLAMA_ARG_FIT_TARGET) | |
| -fitc, --fit-ctx N minimum ctx size that can be set by --fit option, default: 4096 | |
| (env: LLAMA_ARG_FIT_CTX) | |
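
A multi-GPU offloading sketch using the device and split flags above. The 3:1 split and layer count are placeholders; run --list-devices to see the device names on your system.

```bash
# Show the available devices first.
./bin/llama-server --list-devices

# Offload all layers, split by layer across the GPUs with a 3:1 tensor split,
# and keep the MoE expert weights of the first 10 layers on the CPU.
./bin/llama-server \
  -m ./models/model.gguf \
  -ngl all \
  -sm layer \
  -ts 3,1 \
  -ncmoe 10
```
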
| --check-tensors check model tensor data for invalid values (default: false) | |
| --override-kv KEY=TYPE:VALUE,... advanced option to override model metadata by key. to specify multiple | |
| overrides, use comma-separated values. | |
| types: int, float, bool, str. example: --override-kv | |
| tokenizer.ggml.add_bos_token=bool:false,tokenizer.ggml.add_eos_token=bool:false | |
| --op-offload, --no-op-offload whether to offload host tensor operations to device (default: true) | |
| --lora FNAME path to LoRA adapter (use comma-separated values to load multiple | |
| adapters) | |
| --lora-scaled FNAME:SCALE,... path to LoRA adapter with user defined scaling (format: | |
| FNAME:SCALE,...) | |
| note: use comma-separated values | |
| --control-vector FNAME add a control vector | |
| note: use comma-separated values to add multiple control vectors | |
| --control-vector-scaled FNAME:SCALE,... | |
| add a control vector with user defined scaling SCALE | |
| note: use comma-separated values (format: FNAME:SCALE,...) | |
| --control-vector-layer-range START END | |
| layer range to apply the control vector(s) to, start and end inclusive | |
| -m, --model FNAME model path to load | |
| (env: LLAMA_ARG_MODEL) | |
| -mu, --model-url MODEL_URL model download url (default: unused) | |
| (env: LLAMA_ARG_MODEL_URL) | |
| -dr, --docker-repo [<repo>/]<model>[:quant] | |
| Docker Hub model repository. repo is optional and defaults to ai/; quant | |
| is optional and defaults to :latest. | |
| example: gemma3 | |
| (default: unused) | |
| (env: LLAMA_ARG_DOCKER_REPO) | |
| -hf, -hfr, --hf-repo <user>/<model>[:quant] | |
| Hugging Face model repository; quant is optional, case-insensitive, | |
| defaults to Q4_K_M, or falls back to the first file in the repo if | |
| Q4_K_M doesn't exist. | |
| mmproj is also downloaded automatically if available. to disable, add | |
| --no-mmproj | |
| example: unsloth/phi-4-GGUF:q4_k_m | |
| (default: unused) | |
| (env: LLAMA_ARG_HF_REPO) | |
| -hfd, -hfrd, --hf-repo-draft <user>/<model>[:quant] | |
| Same as --hf-repo, but for the draft model (default: unused) | |
| (env: LLAMA_ARG_HFD_REPO) | |
| -hff, --hf-file FILE Hugging Face model file. If specified, it will override the quant in | |
| --hf-repo (default: unused) | |
| (env: LLAMA_ARG_HF_FILE) | |
| -hfv, -hfrv, --hf-repo-v <user>/<model>[:quant] | |
| Hugging Face model repository for the vocoder model (default: unused) | |
| (env: LLAMA_ARG_HF_REPO_V) | |
| -hffv, --hf-file-v FILE Hugging Face model file for the vocoder model (default: unused) | |
| (env: LLAMA_ARG_HF_FILE_V) | |
| -hft, --hf-token TOKEN Hugging Face access token (default: value from HF_TOKEN environment | |
| variable) | |
| (env: HF_TOKEN) | |
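
The model-source flags above can be combined as in the sketch below. The Hugging Face repo is the example cited in the --hf-repo help text; the local path is a placeholder.

```bash
# Load a local GGUF file.
./bin/llama-server -m ./models/model.gguf

# Or pull from Hugging Face (quant tag optional, defaults to Q4_K_M);
# a gated repo would additionally need --hf-token or the HF_TOKEN variable.
./bin/llama-server -hf unsloth/phi-4-GGUF:q4_k_m
```
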
| --log-disable Disable logging | |
| --log-file FNAME Log to file | |
| (env: LLAMA_LOG_FILE) | |
| --log-colors [on|off|auto] Set colored logging ('on', 'off', or 'auto', default: 'auto') | |
| 'auto' enables colors when output is to a terminal | |
| (env: LLAMA_LOG_COLORS) | |
| -v, --verbose, --log-verbose Set verbosity level to infinity (i.e. log all messages, useful for | |
| debugging) | |
| --offline Offline mode: forces use of cache, prevents network access | |
| (env: LLAMA_OFFLINE) | |
| -lv, --verbosity, --log-verbosity N Set the verbosity threshold. Messages with a higher verbosity will be | |
| ignored. Values: | |
| - 0: generic output | |
| - 1: error | |
| - 2: warning | |
| - 3: info | |
| - 4: debug | |
| (default: 3) | |
| (env: LLAMA_LOG_VERBOSITY) | |
| --log-prefix Enable prefix in log messages | |
| (env: LLAMA_LOG_PREFIX) | |
| --log-timestamps Enable timestamps in log messages | |
| (env: LLAMA_LOG_TIMESTAMPS) | |
| -ctkd, --cache-type-k-draft TYPE KV cache data type for K for the draft model | |
| allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1 | |
| (default: f16) | |
| (env: LLAMA_ARG_CACHE_TYPE_K_DRAFT) | |
| -ctvd, --cache-type-v-draft TYPE KV cache data type for V for the draft model | |
| allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1 | |
| (default: f16) | |
| (env: LLAMA_ARG_CACHE_TYPE_V_DRAFT) | |
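
Most of the flags above also have an LLAMA_ARG_* environment-variable form, which is convenient for containers or service units. A sketch with a placeholder model path:

```bash
# Environment variables mirror the flags (names as listed in the help above).
LLAMA_ARG_CTX_SIZE=8192 \
LLAMA_ARG_N_GPU_LAYERS=all \
LLAMA_ARG_PORT=8080 \
./bin/llama-server -m ./models/model.gguf
```
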
| ----- sampling params ----- | |
| --samplers SAMPLERS samplers that will be used for generation, in the given order, separated | |
| by ';' | |
| (default: | |
| penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature) | |
| -s, --seed SEED RNG seed (default: -1, use random seed for -1) | |
| --sampler-seq, --sampling-seq SEQUENCE | |
| simplified sequence for samplers that will be used (default: | |
| edskypmxt) | |
| --ignore-eos ignore end of stream token and continue generating (implies | |
| --logit-bias EOS-inf) | |
| --temp N temperature (default: 0.80) | |
| --top-k N top-k sampling (default: 40, 0 = disabled) | |
| (env: LLAMA_ARG_TOP_K) | |
| --top-p N top-p sampling (default: 0.95, 1.0 = disabled) | |
| --min-p N min-p sampling (default: 0.05, 0.0 = disabled) | |
| --top-nsigma N top-n-sigma sampling (default: -1.00, -1.0 = disabled) | |
| --xtc-probability N xtc probability (default: 0.00, 0.0 = disabled) | |
| --xtc-threshold N xtc threshold (default: 0.10, 1.0 = disabled) | |
| --typical N locally typical sampling, parameter p (default: 1.00, 1.0 = disabled) | |
| --repeat-last-n N last n tokens to consider for penalization (default: 64, 0 = disabled, -1 | |
| = ctx_size) | |
| --repeat-penalty N penalize repeat sequence of tokens (default: 1.00, 1.0 = disabled) | |
| --presence-penalty N repeat alpha presence penalty (default: 0.00, 0.0 = disabled) | |
| --frequency-penalty N repeat alpha frequency penalty (default: 0.00, 0.0 = disabled) | |
| --dry-multiplier N set DRY sampling multiplier (default: 0.00, 0.0 = disabled) | |
| --dry-base N set DRY sampling base value (default: 1.75) | |
| --dry-allowed-length N set allowed length for DRY sampling (default: 2) | |
| --dry-penalty-last-n N set DRY penalty for the last n tokens (default: -1, 0 = disable, -1 = | |
| context size) | |
| --dry-sequence-breaker STRING add sequence breaker for DRY sampling, clearing out default breakers | |
| ('\n', ':', '"', '*') in the process; use "none" to not use any | |
| sequence breakers | |
| --adaptive-target N adaptive-p: select tokens near this probability (valid range 0.0 to | |
| 1.0; negative = disabled) (default: -1.00) | |
| [(more info)](https://github.com/ggml-org/llama.cpp/pull/17927) | |
| --adaptive-decay N adaptive-p: decay rate for target adaptation over time. lower values | |
| are more reactive, higher values are more stable. | |
| (valid range 0.0 to 0.99) (default: 0.90) | |
| --dynatemp-range N dynamic temperature range (default: 0.00, 0.0 = disabled) | |
| --dynatemp-exp N dynamic temperature exponent (default: 1.00) | |
| --mirostat N use Mirostat sampling. | |
| Top K, Nucleus and Locally Typical samplers are ignored if used. | |
| (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0) | |
| --mirostat-lr N Mirostat learning rate, parameter eta (default: 0.10) | |
| --mirostat-ent N Mirostat target entropy, parameter tau (default: 5.00) | |
| -l, --logit-bias TOKEN_ID(+/-)BIAS modifies the likelihood of a token appearing in the completion, | |
| i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello', | |
| or `--logit-bias 15043-1` to decrease likelihood of token ' Hello' | |
| --grammar GRAMMAR BNF-like grammar to constrain generations (see samples in grammars/ | |
| dir) (default: '') | |
| --grammar-file FNAME file to read grammar from | |
| -j, --json-schema SCHEMA JSON schema to constrain generations (https://json-schema.org/), e.g. | |
| `{}` for any JSON object | |
| For schemas w/ external $refs, use --grammar + | |
| example/json_schema_to_grammar.py instead | |
| -jf, --json-schema-file FILE File containing a JSON schema to constrain generations | |
| (https://json-schema.org/), e.g. `{}` for any JSON object | |
| For schemas w/ external $refs, use --grammar + | |
| example/json_schema_to_grammar.py instead | |
| -bs, --backend-sampling enable backend sampling (experimental) (default: disabled) | |
| (env: LLAMA_ARG_BACKEND_SAMPLING) | |
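
A sketch of the sampling and constraint flags above; the specific values are illustrative, not recommendations, and the model path is a placeholder.

```bash
# Conservative sampling plus a mild repetition penalty.
./bin/llama-server \
  -m ./models/model.gguf \
  --temp 0.7 --top-k 40 --top-p 0.9 --min-p 0.05 \
  --repeat-penalty 1.1

# Constrain generations to JSON objects via a JSON schema ({} = any object).
./bin/llama-server -m ./models/model.gguf -j '{}'
```
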
| ----- example-specific params ----- | |
| -lcs, --lookup-cache-static FNAME path to static lookup cache to use for lookup decoding (not updated by | |
| generation) | |
| -lcd, --lookup-cache-dynamic FNAME path to dynamic lookup cache to use for lookup decoding (updated by | |
| generation) | |
| --ctx-checkpoints, --swa-checkpoints N | |
| max number of context checkpoints to create per slot (default: | |
| 8) [(more info)](https://github.com/ggml-org/llama.cpp/pull/15293) | |
| (env: LLAMA_ARG_CTX_CHECKPOINTS) | |
| -cram, --cache-ram N set the maximum cache size in MiB (default: 8192, -1 = no limit, 0 = | |
| disable) [(more info)](https://github.com/ggml-org/llama.cpp/pull/16391) | |
| (env: LLAMA_ARG_CACHE_RAM) | |
| -kvu, --kv-unified, -no-kvu, --no-kv-unified | |
| use single unified KV buffer shared across all sequences (default: | |
| enabled if number of slots is auto) | |
| (env: LLAMA_ARG_KV_UNIFIED) | |
| --context-shift, --no-context-shift whether to use context shift on infinite text generation (default: | |
| disabled) | |
| (env: LLAMA_ARG_CONTEXT_SHIFT) | |
| -r, --reverse-prompt PROMPT halt generation at PROMPT, return control in interactive mode | |
| -sp, --special special tokens output enabled (default: false) | |
| --warmup, --no-warmup whether to perform warmup with an empty run (default: enabled) | |
| --spm-infill use Suffix/Prefix/Middle pattern for infill (instead of | |
| Prefix/Suffix/Middle) as some models prefer this. (default: disabled) | |
| --pooling {none,mean,cls,last,rank} pooling type for embeddings, use model default if unspecified | |
| (env: LLAMA_ARG_POOLING) | |
| -np, --parallel N number of server slots (default: -1, -1 = auto) | |
| (env: LLAMA_ARG_N_PARALLEL) | |
| -cb, --cont-batching, -nocb, --no-cont-batching | |
| whether to enable continuous batching (a.k.a. dynamic batching) | |
| (default: enabled) | |
| (env: LLAMA_ARG_CONT_BATCHING) | |
| -mm, --mmproj FILE path to a multimodal projector file. see tools/mtmd/README.md | |
| note: if -hf is used, this argument can be omitted | |
| (env: LLAMA_ARG_MMPROJ) | |
| -mmu, --mmproj-url URL URL to a multimodal projector file. see tools/mtmd/README.md | |
| (env: LLAMA_ARG_MMPROJ_URL) | |
| --mmproj-auto, --no-mmproj, --no-mmproj-auto | |
| whether to use multimodal projector file (if available), useful when | |
| using -hf (default: enabled) | |
| (env: LLAMA_ARG_MMPROJ_AUTO) | |
| --mmproj-offload, --no-mmproj-offload whether to enable GPU offloading for multimodal projector (default: | |
| enabled) | |
| (env: LLAMA_ARG_MMPROJ_OFFLOAD) | |
| --image-min-tokens N minimum number of tokens each image can take, only used by vision | |
| models with dynamic resolution (default: read from model) | |
| (env: LLAMA_ARG_IMAGE_MIN_TOKENS) | |
| --image-max-tokens N maximum number of tokens each image can take, only used by vision | |
| models with dynamic resolution (default: read from model) | |
| (env: LLAMA_ARG_IMAGE_MAX_TOKENS) | |
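
A multimodal sketch using the projector flags above. Both file names are placeholders; with -hf the projector is fetched automatically when available.

```bash
# Local model plus its multimodal projector file (see tools/mtmd/README.md).
./bin/llama-server \
  -m ./models/vision-model.gguf \
  --mmproj ./models/mmproj.gguf
# Add --no-mmproj-offload to keep the projector on the CPU,
# or --no-mmproj to disable it entirely.
```
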
| -otd, --override-tensor-draft <tensor name pattern>=<buffer type>,... | |
| override tensor buffer type for draft model | |
| -cmoed, --cpu-moe-draft keep all Mixture of Experts (MoE) weights in the CPU for the draft | |
| model | |
| (env: LLAMA_ARG_CPU_MOE_DRAFT) | |
| -ncmoed, --n-cpu-moe-draft N keep the Mixture of Experts (MoE) weights of the first N layers in the | |
| CPU for the draft model | |
| (env: LLAMA_ARG_N_CPU_MOE_DRAFT) | |
| -a, --alias STRING set alias for model name (to be used by REST API) | |
| (env: LLAMA_ARG_ALIAS) | |
| --host HOST IP address to listen on, or bind to a UNIX socket if the address ends | |
| with .sock (default: 127.0.0.1) | |
| (env: LLAMA_ARG_HOST) | |
| --port PORT port to listen (default: 8080) | |
| (env: LLAMA_ARG_PORT) | |
| --path PATH path to serve static files from (default: ) | |
| (env: LLAMA_ARG_STATIC_PATH) | |
| --api-prefix PREFIX prefix path the server serves from, without the trailing slash | |
| (default: ) | |
| (env: LLAMA_ARG_API_PREFIX) | |
| --webui-config JSON JSON that provides default WebUI settings (overrides WebUI defaults) | |
| (env: LLAMA_ARG_WEBUI_CONFIG) | |
| --webui-config-file PATH JSON file that provides default WebUI settings (overrides WebUI | |
| defaults) | |
| (env: LLAMA_ARG_WEBUI_CONFIG_FILE) | |
| --webui, --no-webui whether to enable the Web UI (default: enabled) | |
| (env: LLAMA_ARG_WEBUI) | |
| --embedding, --embeddings restrict to only support embedding use case; use only with dedicated | |
| embedding models (default: disabled) | |
| (env: LLAMA_ARG_EMBEDDINGS) | |
| --rerank, --reranking enable reranking endpoint on server (default: disabled) | |
| (env: LLAMA_ARG_RERANKING) | |
| --api-key KEY API key to use for authentication, multiple keys can be provided as a | |
| comma-separated list (default: none) | |
| (env: LLAMA_API_KEY) | |
| --api-key-file FNAME path to file containing API keys (default: none) | |
| --ssl-key-file FNAME path to a file containing a PEM-encoded SSL private key | |
| (env: LLAMA_ARG_SSL_KEY_FILE) | |
| --ssl-cert-file FNAME path to a file containing a PEM-encoded SSL certificate | |
| (env: LLAMA_ARG_SSL_CERT_FILE) | |
| --chat-template-kwargs STRING sets additional params for the json template parser, must be a valid | |
| json object string, e.g. '{"key1":"value1","key2":"value2"}' | |
| (env: LLAMA_CHAT_TEMPLATE_KWARGS) | |
| -to, --timeout N server read/write timeout in seconds (default: 600) | |
| (env: LLAMA_ARG_TIMEOUT) | |
| --threads-http N number of threads used to process HTTP requests (default: -1) | |
| (env: LLAMA_ARG_THREADS_HTTP) | |
| --cache-prompt, --no-cache-prompt whether to enable prompt caching (default: enabled) | |
| (env: LLAMA_ARG_CACHE_PROMPT) | |
| --cache-reuse N min chunk size to attempt reusing from the cache via KV shifting, | |
| requires prompt caching to be enabled (default: 0) | |
| [(card)](https://ggml.ai/f0.png) | |
| (env: LLAMA_ARG_CACHE_REUSE) | |
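
Pulling together the server networking flags above, a sketch for exposing the server on the LAN with an API key. The key value is a placeholder, and the health-check endpoint is an assumption for illustration only.

```bash
# Bind to all interfaces, require an API key, and run 4 parallel slots.
./bin/llama-server \
  -m ./models/model.gguf \
  --host 0.0.0.0 --port 8080 \
  --api-key changeme-placeholder \
  -np 4

# Assumed endpoint for a quick check; see the llama-server README for the full API.
curl -H "Authorization: Bearer changeme-placeholder" http://localhost:8080/health
```
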
| --metrics enable prometheus compatible metrics endpoint (default: disabled) | |
| (env: LLAMA_ARG_ENDPOINT_METRICS) | |
| --props enable changing global properties via POST /props (default: disabled) | |
| (env: LLAMA_ARG_ENDPOINT_PROPS) | |
| --slots, --no-slots expose slots monitoring endpoint (default: enabled) | |
| (env: LLAMA_ARG_ENDPOINT_SLOTS) | |
| --slot-save-path PATH path to save slot kv cache (default: disabled) | |
| --media-path PATH directory for loading local media files; files can be accessed via | |
| file:// URLs using relative paths (default: disabled) | |
| --models-dir PATH directory containing models for the router server (default: disabled) | |
| (env: LLAMA_ARG_MODELS_DIR) | |
| --models-preset PATH path to INI file containing model presets for the router server | |
| (default: disabled) | |
| (env: LLAMA_ARG_MODELS_PRESET) | |
| --models-max N for router server, maximum number of models to load simultaneously | |
| (default: 4, 0 = unlimited) | |
| (env: LLAMA_ARG_MODELS_MAX) | |
| --models-autoload, --no-models-autoload | |
| for router server, whether to automatically load models (default: | |
| enabled) | |
| (env: LLAMA_ARG_MODELS_AUTOLOAD) | |
| --jinja, --no-jinja whether to use jinja template engine for chat (default: enabled) | |
| (env: LLAMA_ARG_JINJA) | |
| --reasoning-format FORMAT controls whether thought tags are allowed and/or extracted from the | |
| response, and in which format they're returned; one of: | |
| - none: leaves thoughts unparsed in `message.content` | |
| - deepseek: puts thoughts in `message.reasoning_content` | |
| - deepseek-legacy: keeps `<think>` tags in `message.content` while | |
| also populating `message.reasoning_content` | |
| (default: auto) | |
| (env: LLAMA_ARG_THINK) | |
| --reasoning-budget N controls the amount of thinking allowed; currently only one of: -1 for | |
| unrestricted thinking budget, or 0 to disable thinking (default: -1) | |
| (env: LLAMA_ARG_THINK_BUDGET) | |
| --chat-template JINJA_TEMPLATE set custom jinja chat template (default: template taken from model's | |
| metadata) | |
| if suffix/prefix are specified, template will be disabled | |
| only commonly used templates are accepted (unless --jinja is set | |
| before this flag): | |
| list of built-in templates: | |
| bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml, | |
| command-r, deepseek, deepseek2, deepseek3, exaone-moe, exaone3, | |
| exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, grok-2, | |
| hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, | |
| llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, | |
| mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, | |
| mistral-v7-tekken, monarch, openchat, orion, pangu-embedded, phi3, | |
| phi4, rwkv-world, seed_oss, smolvlm, solar-open, vicuna, vicuna-orca, | |
| yandex, zephyr | |
| (env: LLAMA_ARG_CHAT_TEMPLATE) | |
| --chat-template-file JINJA_TEMPLATE_FILE | |
| set custom jinja chat template file (default: template taken from | |
| model's metadata) | |
| if suffix/prefix are specified, template will be disabled | |
| only commonly used templates are accepted (unless --jinja is set | |
| before this flag): | |
| list of built-in templates: | |
| bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml, | |
| command-r, deepseek, deepseek2, deepseek3, exaone-moe, exaone3, | |
| exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, grok-2, | |
| hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, | |
| llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, | |
| mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, | |
| mistral-v7-tekken, monarch, openchat, orion, pangu-embedded, phi3, | |
| phi4, rwkv-world, seed_oss, smolvlm, solar-open, vicuna, vicuna-orca, | |
| yandex, zephyr | |
| (env: LLAMA_ARG_CHAT_TEMPLATE_FILE) | |
| --prefill-assistant, --no-prefill-assistant | |
| whether to prefill the assistant's response if the last message is an | |
| assistant message (default: prefill enabled) | |
| when this flag is set, if the last message is an assistant message | |
| then it will be treated as a full message and not prefilled | |
| (env: LLAMA_ARG_PREFILL_ASSISTANT) | |
| -sps, --slot-prompt-similarity SIMILARITY | |
| how much the prompt of a request must match the prompt of a slot in | |
| order to use that slot (default: 0.10, 0.0 = disabled) | |
| --lora-init-without-apply load LoRA adapters without applying them (apply later via POST | |
| /lora-adapters) (default: disabled) | |
| --sleep-idle-seconds SECONDS number of seconds of idleness after which the server will sleep | |
| (default: -1; -1 = disabled) | |
| -td, --threads-draft N number of threads to use during generation (default: same as | |
| --threads) | |
| -tbd, --threads-batch-draft N number of threads to use during batch and prompt processing (default: | |
| same as --threads-draft) | |
| --draft, --draft-n, --draft-max N number of tokens to draft for speculative decoding (default: 16) | |
| (env: LLAMA_ARG_DRAFT_MAX) | |
| --draft-min, --draft-n-min N minimum number of draft tokens to use for speculative decoding | |
| (default: 0) | |
| (env: LLAMA_ARG_DRAFT_MIN) | |
| --draft-p-min P minimum speculative decoding probability (greedy) (default: 0.75) | |
| (env: LLAMA_ARG_DRAFT_P_MIN) | |
| -cd, --ctx-size-draft N size of the prompt context for the draft model (default: 0, 0 = loaded | |
| from model) | |
| (env: LLAMA_ARG_CTX_SIZE_DRAFT) | |
| -devd, --device-draft <dev1,dev2,..> comma-separated list of devices to use for offloading the draft model | |
| (none = don't offload) | |
| use --list-devices to see a list of available devices | |
| -ngld, --gpu-layers-draft, --n-gpu-layers-draft N | |
| max. number of draft model layers to store in VRAM, either an exact | |
| number, 'auto', or 'all' (default: auto) | |
| (env: LLAMA_ARG_N_GPU_LAYERS_DRAFT) | |
| -md, --model-draft FNAME draft model for speculative decoding (default: unused) | |
| (env: LLAMA_ARG_MODEL_DRAFT) | |
| --spec-replace TARGET DRAFT translate the string in TARGET into DRAFT if the draft model and main | |
| model are not compatible | |
| --spec-type [none|ngram-cache|ngram-simple|ngram-map-k|ngram-map-k4v|ngram-mod] | |
| type of speculative decoding to use when no draft model is provided | |
| (default: none) | |
| --spec-ngram-size-n N ngram size N for ngram-simple/ngram-map speculative decoding, length | |
| of lookup n-gram (default: 12) | |
| --spec-ngram-size-m N ngram size M for ngram-simple/ngram-map speculative decoding, length | |
| of draft m-gram (default: 48) | |
| --spec-ngram-check-rate N ngram check rate for ngram-simple/ngram-map speculative decoding | |
| (default: 1) | |
| --spec-ngram-min-hits N minimum hits for ngram-map speculative decoding (default: 1) | |
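
A speculative-decoding sketch using the draft-model flags above. Both model paths are placeholders and must be a compatible large/small pair.

```bash
# Classic draft-model speculation.
./bin/llama-server \
  -m ./models/big-model.gguf \
  -md ./models/small-draft.gguf \
  --draft-max 16 --draft-min 0 \
  --draft-p-min 0.75 \
  -ngld all

# Draft-free alternative: n-gram based speculation on the main model only.
./bin/llama-server -m ./models/big-model.gguf --spec-type ngram-simple
```
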
| -mv, --model-vocoder FNAME vocoder model for audio generation (default: unused) | |
| --tts-use-guide-tokens Use guide tokens to improve TTS word recall | |
| --embd-gemma-default use default EmbeddingGemma model (note: can download weights from the | |
| internet) | |
| --fim-qwen-1.5b-default use default Qwen 2.5 Coder 1.5B (note: can download weights from the | |
| internet) | |
| --fim-qwen-3b-default use default Qwen 2.5 Coder 3B (note: can download weights from the | |
| internet) | |
| --fim-qwen-7b-default use default Qwen 2.5 Coder 7B (note: can download weights from the | |
| internet) | |
| --fim-qwen-7b-spec use Qwen 2.5 Coder 7B + 0.5B draft for speculative decoding (note: can | |
| download weights from the internet) | |
| --fim-qwen-14b-spec use Qwen 2.5 Coder 14B + 0.5B draft for speculative decoding (note: | |
| can download weights from the internet) | |
| --fim-qwen-30b-default use default Qwen 3 Coder 30B A3B Instruct (note: can download weights | |
| from the internet) | |
| --gpt-oss-20b-default use gpt-oss-20b (note: can download weights from the internet) | |
| --gpt-oss-120b-default use gpt-oss-120b (note: can download weights from the internet) | |
| --vision-gemma-4b-default use Gemma 3 4B QAT (note: can download weights from the internet) | |
| --vision-gemma-12b-default use Gemma 3 12B QAT (note: can download weights from the internet) |
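
Finally, the `-default` shortcut flags above bundle a model choice with preset settings; note that they may download weights from the internet. For example:

```bash
# Code-completion (FIM) preset.
./bin/llama-server --fim-qwen-1.5b-default

# gpt-oss-20b chat preset.
./bin/llama-server --gpt-oss-20b-default
```
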