This is a comparison between https://github.com/shisa-ai/ja-mt-bench-harness, which aims to be faithful to the original JA MT-Bench, and the version used in Swallow Evaluation Instruct v202510: https://github.com/swallow-llm/swallow-evaluation-instruct/releases/tag/v202510
Both frameworks use an OpenAI-compatible API, but they run and score JA MT‑Bench in materially different ways. The FastChat-based harness stays close to the original MT‑Bench pipeline (question file layout, judge prompts, and single-sample judging), while Swallow's lighteval task intentionally modifies the evaluation: Japanese-enforced judge prompts, a Japanese system prompt for model generation, multi-sample averaging (N=5), output truncation by character length, a different judge model, and additional metrics. These differences alone can easily move scores by multiple points.
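As a rough illustration of the structural difference (not code from either repo), the sketch below contrasts single-sample judging with Swallow-style multi-sample averaging plus character truncation. The function names `generate_answer` and `judge_score` are placeholders, and `max_chars` is an illustrative value, not the actual limit used by Swallow.

```python
# Conceptual sketch only; the real harnesses use FastChat's answer/judgment
# scripts and Swallow's lighteval task code respectively.
from statistics import mean


def score_single_sample(question: str, generate_answer, judge_score) -> float:
    """Original MT-Bench style: one answer per question, one judge call."""
    answer = generate_answer(question)                # default system prompt
    return judge_score(question, answer)              # judge sees the full answer


def score_multi_sample(
    question: str,
    generate_answer,
    judge_score,
    n_samples: int = 5,       # Swallow averages over N=5 generations
    max_chars: int = 2048,    # illustrative cap; the actual limit differs
) -> float:
    """Swallow-style: N generations, character truncation, mean judge score."""
    scores = []
    for _ in range(n_samples):
        answer = generate_answer(question)            # Japanese system prompt in Swallow
        answer = answer[:max_chars]                   # truncate by character length
        scores.append(judge_score(question, answer))  # Japanese-enforced judge prompt
    return mean(scores)
```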
Key takeaways:
- Prompting and judging are different (a language constraint in the judge prompts and a Japanese system prompt for generation in Swallow's version)