When tested properly, using the same glm-simple-evals reference implementation provided by Z.ai, the evaluation produced the following scores:
{
"chars": 970.0044191919192,
"chars:std": 153.57443776558713,
"Chemistry": 72.1774193548387,
"Chemistry:std": 44.81252136132964,
"score:std": 39.232085839012754,
"Physics": 92.87790697674419,
"Physics:std": 25.71935250533481,
"Biology": 70.39473684210526,
"Biology:std": 45.65144805086992,
"score": 80.99747474747475
}
Z.ai reports that GLM-4.6 scores 81% on GPQA Diamond.
In other words, GLM-4.6 as served on chutes.ai matches the reference GPQA Diamond score: 80.997% versus the reported 81%.
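As a quick consistency check, the aggregate score is exactly the question-count-weighted mean of the three per-domain scores. The snippet below verifies this; the per-domain question counts (93 chemistry, 86 physics, 19 biology, 198 total) are our assumption based on the standard GPQA Diamond composition, not something emitted by the harness:

# Verify that "score" is the question-count-weighted mean of the domain scores.
# The per-domain counts are an assumption (standard GPQA Diamond breakdown).
scores = {"Chemistry": 72.1774193548387, "Physics": 92.87790697674419, "Biology": 70.39473684210526}
counts = {"Chemistry": 93, "Physics": 86, "Biology": 19}

total = sum(counts.values())  # 198 questions
weighted = sum(scores[d] * counts[d] for d in counts) / total
print(weighted)  # 80.99747474747475 -> matches "score" above

# The std fields are consistent with the std of a 0/100 correctness indicator.
p = weighted / 100
print(100 * (p * (1 - p)) ** 0.5)  # 39.232... -> matches "score:std" above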
As with all models, the source code and configuration used to serve this model are available for anyone to inspect: https://chutes.ai/app/chute/579ca543-dda4-51d0-83ef-5667d1a5ed5f?tab=source
In this case, it uses a helper function, build_vllm_chute, which you can see here: https://github.com/chutesai/chutes/blob/main/chutes/chute/template/vllm.py
We run this model with vllm inside a TEE (trusted execution environment), meaning the model runs in a secure, encrypted environment where even root users on the host can't snoop, using the following config:
"--tool-call-parser glm45 "
"--reasoning-parser glm45 "
"--enable-auto-tool-choice "
"--speculative-config.method mtp "
"--speculative-config.num_speculative_tokens 1 "
"--max-num-batched-tokens 8192 "
"--gpu-memory-utilization 0.83 "
"--max-num-seqs 40 "
"--max-completion-tokens 8192 "
"--max-stream-completion-tokens 65536"
And these additional flags added by the build_vllm_chute helper:
--tensor-parallel-size 8
--enable-prompt-tokens-details
--api-key {random uuid}
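Because the endpoint is OpenAI-compatible, you can smoke-test the deployment directly with the openai Python client before running any evals. The base URL, model name, and API key environment variable below are the same ones used by the evaluation command later in this post; the prompt is just an example:

# Quick smoke test of the chutes.ai GLM-4.6 deployment (same endpoint the
# evaluation uses). Requires the `openai` package and a CHUTES_API_KEY.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.chutes.ai/v1",
    api_key=os.environ["CHUTES_API_KEY"],
)

resp = client.chat.completions.create(
    model="zai-org/GLM-4.6",
    messages=[{"role": "user", "content": "In one sentence, what does GPQA Diamond measure?"}],
    max_tokens=256,
    temperature=1.0,
    top_p=0.95,
)
print(resp.choices[0].message.content)
print(resp.usage)  # should include prompt token details, given --enable-prompt-tokens-details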
To reproduce these results yourself, first create an environment (note: this requires Python 3.10, because one dependency uses the inspect module in a deprecated way).
git clone https://github.com/zai-org/glm-simple-evals
cd glm-simple-evals
python3.10 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Next, download the dataset:
huggingface-cli download --repo-type=dataset zai-org/glm-simple-evals-dataset --local-dir data
To properly reproduce the GPQA Diamond scores, or to use the glm-simple-evals project in general, you must also run an answer-extraction model; Z.ai used meta-llama/Llama-3.1-70B-Instruct for this purpose.
For our reproduction, we served it with vllm==0.13.0 on an 8x NVIDIA RTX 6000 Ada GPU node:
vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 8 --port 8812 --host 0.0.0.0 --served-model-name Meta-Llama-3.1-70B-Instruct
This evaluation took nearly 24 hours (we could have optimized the evaluation framework a bit, but for reproducibility we did not change it).
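For intuition, "answer extraction" means asking the checker model to map GLM-4.6's free-form answer onto one of the four GPQA options. The real prompt and parsing logic live inside glm-simple-evals; the call below is only an illustrative sketch against the Llama endpoint started above, with a made-up prompt and example answer:

# Illustrative only: the actual extraction prompt/parsing is defined by
# glm-simple-evals. This just shows the shape of a call to the checker model.
from openai import OpenAI

checker = OpenAI(base_url="http://127.0.0.1:8812/v1", api_key="EMPTY")  # address of the vllm Llama instance

model_answer = "...long GLM-4.6 reasoning ending in 'therefore the intermediate is a carbocation'..."
options = {"A": "carbanion", "B": "carbocation", "C": "free radical", "D": "carbene"}

prompt = (
    "Given a model's answer to a multiple-choice question, reply with only the "
    f"letter of the option it chose.\n\nOptions: {options}\n\nModel answer: {model_answer}\n\nLetter:"
)
resp = checker.chat.completions.create(
    model="Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=4,
    temperature=0.0,
)
print(resp.choices[0].message.content.strip())  # e.g. "B"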
Before running the evaluation, set two environment variables:
export CHECKER_MODEL_URL=http://{ip address of vllm llama}:{port of vllm llama}/v1
export CHUTES_API_KEY={replace with your chutes api key}
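Before launching the roughly 24-hour run, it can be worth confirming that both endpoints are reachable. A minimal check, assuming both expose the standard OpenAI-compatible /models route:

# Optional reachability check for both endpoints before starting the long run.
import os
import requests

checker_url = os.environ["CHECKER_MODEL_URL"]   # e.g. http://<llama host>:8812/v1
chutes_key = os.environ["CHUTES_API_KEY"]

# The vllm checker instance exposes an OpenAI-compatible /models endpoint.
print(requests.get(f"{checker_url}/models", timeout=10).json())

# The chutes.ai endpoint requires the API key in an Authorization header.
r = requests.get(
    "https://llm.chutes.ai/v1/models",
    headers={"Authorization": f"Bearer {chutes_key}"},
    timeout=10,
)
print(r.status_code)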
Then run the evaluation:
python evaluate.py \
--model_name "zai-org/GLM-4.6" \
--backbone "openai" \
--checker_url $CHECKER_MODEL_URL \
--openai_base_url https://llm.chutes.ai/v1 \
--openai_api_key $CHUTES_API_KEY \
--save_dir "glm_46_gpqa_2025122600" \
--tasks gpqa \
--proc_num 32 \
--auto_extract_answer \
--max_new_tokens 128000 \
--temperature 1.0 \
--top_p 0.95 \
--stream
See the raw results from our run here: glm_46_gpqa_2025122600.tar.gz
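If you want to compare a run of your own against ours, the summary metrics can be recomputed from whatever the harness writes into --save_dir. The snippet below is only a sketch: the metrics filename is a placeholder, so adjust it to whatever summary file glm-simple-evals actually produces in your run directory:

# Sketch: compare a run's summary metrics against the reference numbers above.
# NOTE: "metrics.json" is a placeholder filename; point this at whatever summary
# file glm-simple-evals writes inside your --save_dir.
import json

with open("glm_46_gpqa_2025122600/metrics.json") as f:
    metrics = json.load(f)

reference = {"Chemistry": 72.1774193548387, "Physics": 92.87790697674419,
             "Biology": 70.39473684210526, "score": 80.99747474747475}

for key, ref in reference.items():
    print(f"{key}: ours={metrics.get(key)} reference={ref}")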