When tested properly, using the same glm-simple-evals reference implementation provided by Z.ai, the evaluation produced the following scores:
{
"chars": 970.0044191919192,
"chars:std": 153.57443776558713,
"Chemistry": 72.1774193548387,
"Chemistry:std": 44.81252136132964,
"score:std": 39.232085839012754,
"Physics": 92.87790697674419,
"Physics:std": 25.71935250533481,
"Biology": 70.39473684210526,
"Biology:std": 45.65144805086992,
"score": 80.99747474747475
}
Z.ai reports that GLM-4.6 scores 81% on GPQA Diamond.
In other words, GLM-4.6 as served on chutes.ai matches the reference GPQA Diamond score: 80.997% versus the reported 81%.
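As a quick consistency check, the aggregate score is exactly the question-count-weighted mean of the three per-domain scores. The snippet below verifies this; the per-domain question counts (93 chemistry, 86 physics, 19 biology, 198 total) are our assumption based on the standard GPQA Diamond composition, not something emitted by the harness:

# Verify that "score" is the question-count-weighted mean of the domain scores.
# The per-domain counts are an assumption (standard GPQA Diamond breakdown).
scores = {"Chemistry": 72.1774193548387, "Physics": 92.87790697674419, "Biology": 70.39473684210526}
counts = {"Chemistry": 93, "Physics": 86, "Biology": 19}

total = sum(counts.values())  # 198 questions
weighted = sum(scores[d] * counts[d] for d in counts) / total
print(weighted)  # 80.99747474747475 -> matches "score" above

# The std fields are consistent with the std of a 0/100 correctness indicator.
p = weighted / 100
print(100 * (p * (1 - p)) ** 0.5)  # 39.232... -> matches "score:std" above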
As with all models, the source code and configuration used to serve this model are available for anyone to inspect: https://chutes.ai/app/chute/579ca543-dda4-51d0-83ef-5667d1a5ed5f?tab=source
In this case, it uses a helper function, build_vllm_chute, which you can see here: https://github.com/chutesai/chutes/blob/main/chutes/chute/template/vllm.py
We run this model with vllm inside a TEE (trusted execution environment), meaning the model runs in a secure, encrypted environment where even root users on the host can't snoop, using the following config:
"--tool-call-parser glm45 "
"--reasoning-parser glm45 "
"--enable-auto-tool-choice "
"--speculative-config.method mtp "
"--speculative-config.num_speculative_tokens 1 "
"--max-num-batched-tokens 8192 "
"--gpu-memory-utilization 0.83 "
"--max-num-seqs 40 "
"--max-completion-tokens 8192 "
"--max-stream-completion-tokens 65536"
And these additional flags added by the build_vllm_chute helper:
--tensor-parallel-size 8
--enable-prompt-tokens-details
--api-key {random uuid}
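Because the endpoint is OpenAI-compatible, you can smoke-test the deployment directly with the openai Python client before running any evals. The base URL, model name, and API key environment variable below are the same ones used by the evaluation command later in this post; the prompt is just an example:

# Quick smoke test of the chutes.ai GLM-4.6 deployment (same endpoint the
# evaluation uses). Requires the `openai` package and a CHUTES_API_KEY.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.chutes.ai/v1",
    api_key=os.environ["CHUTES_API_KEY"],
)

resp = client.chat.completions.create(
    model="zai-org/GLM-4.6",
    messages=[{"role": "user", "content": "In one sentence, what does GPQA Diamond measure?"}],
    max_tokens=256,
    temperature=1.0,
    top_p=0.95,
)
print(resp.choices[0].message.content)
print(resp.usage)  # should include prompt token details, given --enable-prompt-tokens-details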
To reproduce these results yourself, first create an environment (note: this requires Python 3.10, because one dependency uses the inspect module in a deprecated way).
git clone https://github.com/zai-org/glm-simple-evals
cd glm-simple-evals
python3.10 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Next, download the dataset:
huggingface-cli download --repo-type=dataset zai-org/glm-simple-evals-dataset --local-dir data
To properly reproduce the GPQA Diamond scores, or to use the glm-simple-evals project in general, you must also run an answer-extraction model; Z.ai used meta-llama/Llama-3.1-70B-Instruct for this purpose.
For our reproduction, we served it with vllm==0.13.0 on an 8x NVIDIA RTX 6000 Ada GPU node:
vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 8 --port 8812 --host 0.0.0.0 --served-model-name Meta-Llama-3.1-70B-Instruct
This evaluation took nearly 24 hours (we could have optimized the evaluation framework a bit, but for reproducibility we did not change it).
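For intuition, "answer extraction" means asking the checker model to map GLM-4.6's free-form answer onto one of the four GPQA options. The real prompt and parsing logic live inside glm-simple-evals; the call below is only an illustrative sketch against the Llama endpoint started above, with a made-up prompt and example answer:

# Illustrative only: the actual extraction prompt/parsing is defined by
# glm-simple-evals. This just shows the shape of a call to the checker model.
from openai import OpenAI

checker = OpenAI(base_url="http://127.0.0.1:8812/v1", api_key="EMPTY")  # address of the vllm Llama instance

model_answer = "...long GLM-4.6 reasoning ending in 'therefore the intermediate is a carbocation'..."
options = {"A": "carbanion", "B": "carbocation", "C": "free radical", "D": "carbene"}

prompt = (
    "Given a model's answer to a multiple-choice question, reply with only the "
    f"letter of the option it chose.\n\nOptions: {options}\n\nModel answer: {model_answer}\n\nLetter:"
)
resp = checker.chat.completions.create(
    model="Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=4,
    temperature=0.0,
)
print(resp.choices[0].message.content.strip())  # e.g. "B"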
Before running the evaluation, set two environment variables:
export CHECKER_MODEL_URL=http://{ip address of vllm llama}:{port of vllm llama}/v1
export CHUTES_API_KEY={replace with your chutes api key}
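Before launching the roughly 24-hour run, it can be worth confirming that both endpoints are reachable. A minimal check, assuming both expose the standard OpenAI-compatible /models route:

# Optional reachability check for both endpoints before starting the long run.
import os
import requests

checker_url = os.environ["CHECKER_MODEL_URL"]   # e.g. http://<llama host>:8812/v1
chutes_key = os.environ["CHUTES_API_KEY"]

# The vllm checker instance exposes an OpenAI-compatible /models endpoint.
print(requests.get(f"{checker_url}/models", timeout=10).json())

# The chutes.ai endpoint requires the API key in an Authorization header.
r = requests.get(
    "https://llm.chutes.ai/v1/models",
    headers={"Authorization": f"Bearer {chutes_key}"},
    timeout=10,
)
print(r.status_code)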
Then run the evaluation:
python evaluate.py \
--model_name "zai-org/GLM-4.6" \
--backbone "openai" \
--checker_url $CHECKER_MODEL_URL \
--openai_base_url https://llm.chutes.ai/v1 \
--openai_api_key $CHUTES_API_KEY \
--save_dir "glm_46_gpqa_2025122600" \
--tasks gpqa \
--proc_num 32 \
--auto_extract_answer \
--max_new_tokens 128000 \
--temperature 1.0 \
--top_p 0.95 \
--stream
See the raw results from our run here: glm_46_gpqa_2025122600.tar.gz
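If you want to compare a run of your own against ours, the summary metrics can be recomputed from whatever the harness writes into --save_dir. The snippet below is only a sketch: the metrics filename is a placeholder, so adjust it to whatever summary file glm-simple-evals actually produces in your run directory:

# Sketch: compare a run's summary metrics against the reference numbers above.
# NOTE: "metrics.json" is a placeholder filename; point this at whatever summary
# file glm-simple-evals writes inside your --save_dir.
import json

with open("glm_46_gpqa_2025122600/metrics.json") as f:
    metrics = json.load(f)

reference = {"Chemistry": 72.1774193548387, "Physics": 92.87790697674419,
             "Biology": 70.39473684210526, "score": 80.99747474747475}

for key, ref in reference.items():
    print(f"{key}: ours={metrics.get(key)} reference={ref}")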