Analysis of differences between JA MT-Bench Tests

This is a comparison between https://github.com/shisa-ai/ja-mt-bench-harness, which aims to be faithful to the original JA MT-Bench, and the version used in Swallow Evaluation Instruct v202510 (https://github.com/swallow-llm/swallow-evaluation-instruct/releases/tag/v202510).

JA-MT-Harness vs Swallow-Evaluation (JA MT-Bench)

High-level summary

Both frameworks use an OpenAI-compatible API, but they run and score JA MT‑Bench in materially different ways. The FastChat-based harness is closer to the original MT‑Bench pipeline (question file layout, judge prompts, and single-sample judging), while Swallow's lighteval task intentionally modifies the evaluation: Japanese-enforced judge prompts, a Japanese system prompt for model generation, multi-sample averaging (N=5), output truncation by character length, a different judge model, and additional metrics. These differences alone can easily move scores by multiple points.

Key takeaways:

  • Prompting and judging are different (language constraints and system prompts change model behavior and judge expectations).
  • Scoring logic differs (single-sample vs 5-sample average, and /10 normalization in Swallow).
  • Data and references differ (FastChat uses per-judge reference files; Swallow uses a single edited reference set from a different provenance).
  • Output processing differs (Swallow truncates outputs by character length and has special handling for reasoning models).

Major differences (table)

| Area | ja-mt-bench-harness (FastChat fork) | swallow-evaluation-instruct-202510 (lighteval) | Likely impact |
|---|---|---|---|
| Runner / flow | Two-step pipeline: generate answers, then judge (plus optional analysis) (fastchat/llm_judge/run.sh, fastchat/llm_judge/gen_api_answer.py, fastchat/llm_judge/gen_judgment.py) | Single lighteval run that generates and judges in one pipeline (run-jamt-local-vllm.sh, lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py) | Different orchestration; fewer moving parts in Swallow, but also different defaults |
| Dataset source | Local question.jsonl and per-judge reference files (fastchat/llm_judge/data/ja_mt_bench/question.jsonl, fastchat/llm_judge/data/ja_mt_bench/reference_answer/*) | HF dataset tokyotech-llm/swallow_japanese_mt_bench with Swallow-edited references; provenance from wandb artifacts (lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py, BENCHMARKS.md) | Different question/reference sources and edits change scores |
| System prompt for model answers | FastChat conversation templates (often "You are a helpful assistant.", plus a special GPT-4-turbo template) (fastchat/conversation.py, fastchat/model/model_adapter.py) | Explicit Japanese system prompt by default ("あなたは誠実で優秀な日本人のアシスタントです。") (run-jamt-local-vllm.sh, run-jamt-shisa-api.sh) | Strongly shifts model tone and output language |
| Judge prompts | English judge instructions; only a generic language mention (fastchat/llm_judge/data/judge_prompts.jsonl) | Japanese-enforcing judge prompts; judge explanation must be in Japanese (lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/judge_prompt_templates.py) | Judge scoring is stricter on language and style |
| Judge model | Configurable; run script defaults to gpt-4.1-2025-04-14 (fastchat/llm_judge/run.sh) | Fixed to gpt-4o-2024-08-06 in the task (lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py) | Different judge model => non-comparable scoring |
| Sampling | Default one answer per turn (--num-choices=1) (fastchat/llm_judge/gen_api_answer.py) | Always N=5 samples, averaged (NUM_SAMPLES=5) (lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py, lighteval/src/lighteval/metrics/metrics_sample.py) | Averaging changes score distribution and variance |
| Output length | max_tokens (default 8000 tokens) (fastchat/llm_judge/gen_api_answer.py) | Character truncation: 8192 or 6144 chars (max_gen_text_length) (lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py, lighteval/src/lighteval/models/litellm_model.py) | Truncation can remove critical content |
| Reasoning handling | Strips <think> / <reason> tags before judging (fastchat/llm_judge/gen_judgment.py) | Optional vLLM reasoning parser or custom parser; otherwise raw content (lighteval/src/lighteval/models/vllm/vllm_model.py, lighteval/src/lighteval/models/litellm_model.py) | Different content fed to the judge |
| Scoring / aggregation | Scores 1-10 per turn; no /10 normalization in outputs (fastchat/llm_judge/show_result.py) | Averages 5 samples, per-turn + per-category + overall; divides by 10 for corpus metrics (lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py, lighteval/src/lighteval/metrics/metrics_sample.py) | Scores are scaled and computed differently |
| Extra metrics | None | Adds a Japanese character ratio metric (lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py) | Additional reporting (not comparable) |

Detailed differences

1) Execution flow and runner

Harness (FastChat fork)

  • Generates answers via fastchat/llm_judge/gen_api_answer.py and judges separately via fastchat/llm_judge/gen_judgment.py.
  • fastchat/llm_judge/run.sh is a convenience wrapper that runs answers → judgments → visualization and judge comparisons.
  • Output files are plain JSONL in data/ja_mt_bench/... (answers and judgments).

Swallow (lighteval)

  • Uses lighteval endpoint litellm to run the full pipeline in one command, including model generation + judge calls.
  • Provided scripts run-jamt-local-vllm.sh and run-jamt-shisa-api.sh define defaults and environment requirements.
  • Outputs go to lighteval/outputs with --save-details to keep prompt and judge outputs.

Why it matters: the harness is architected like the original FastChat MT‑Bench: separate answer and judge phases, with explicit per‑judge outputs. Lighteval bundles them, which makes it harder to fully align control flags (sampling, truncation, system prompt) with the FastChat defaults.

References: fastchat/llm_judge/run.sh, fastchat/llm_judge/gen_api_answer.py, fastchat/llm_judge/gen_judgment.py, run-jamt-local-vllm.sh.

2) Dataset and reference answers

Harness

  • Questions are from a local JSONL: fastchat/llm_judge/data/ja_mt_bench/question.jsonl (80 questions, 2 turns each).
  • Reference answers are per judge model, and the judge name selects the reference file (e.g., gpt-4.1-2025-04-14.jsonl, gpt-4o-2024-08-06.jsonl) (fastchat/llm_judge/data/ja_mt_bench/reference_answer/*).
  • The judging pipeline requires that the judge model has a matching reference file (enforced in fastchat/llm_judge/common.py).

Swallow

  • Dataset is tokyotech-llm/swallow_japanese_mt_bench loaded from HF (lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py).
  • Provenance: questions + prompt + reference answers come from wandb‑japan artifacts, and the reference answers were edited by the Swallow team (BENCHMARKS.md).
  • References are embedded in each dataset item (not per judge).
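
A minimal sketch of the two loading paths described above, assuming FastChat-style JSONL rows keyed by question_id and the standard Hugging Face datasets API; field names and splits are illustrative, not copied from either codebase:

```python
import json
from datasets import load_dataset

# Harness: per-judge reference answers live in separate JSONL files, selected by
# the judge model name (paths from the repo; exact field names may differ).
def load_harness_references(judge_model: str) -> dict:
    path = f"fastchat/llm_judge/data/ja_mt_bench/reference_answer/{judge_model}.jsonl"
    refs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            refs[row["question_id"]] = row  # one reference set per question, per judge
    return refs

# Swallow: a single HF dataset carries questions, prompts, and the Swallow-edited
# references together, independent of which judge model is used.
def load_swallow_dataset():
    return load_dataset("tokyotech-llm/swallow_japanese_mt_bench")
```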

Why it matters: The harness uses per-judge references, which can be generated with the same judge model. Swallow uses a single edited reference set regardless of judge. Different references and data edits can shift scores materially.

References: fastchat/llm_judge/data/ja_mt_bench/question.jsonl, fastchat/llm_judge/data/ja_mt_bench/reference_answer, fastchat/llm_judge/common.py, lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py, BENCHMARKS.md.

3) System prompt for model answers (model under test)

Harness

  • Uses FastChat conversation templates, e.g. the default ChatGPT template has system message "You are a helpful assistant." (fastchat/conversation.py).
  • For GPT‑4 Turbo specifically, FastChat injects a longer system prompt (with knowledge cutoff and current date) via gpt-4-turbo-2024-04-09 template (fastchat/conversation.py, fastchat/model/model_adapter.py).
  • For local models, the system prompt depends on the selected FastChat template for that model (adapter‑dependent).

Swallow

  • The run scripts pass a Japanese system prompt by default:
    • "あなたは誠実で優秀な日本人のアシスタントです。" (run-jamt-local-vllm.sh, run-jamt-shisa-api.sh).
  • Swallow runs with --use-chat-template, so the model’s tokenizer chat template is applied to multi‑turn contexts (lighteval/src/lighteval/tasks/prompt_manager.py).
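
The practical effect is easiest to see as plain chat messages. A minimal sketch using the generic OpenAI-style message format (not either framework's internal representation); the example question is made up:

```python
# The two default system prompts, as cited above.
FASTCHAT_DEFAULT_SYSTEM = "You are a helpful assistant."
SWALLOW_SYSTEM = "あなたは誠実で優秀な日本人のアシスタントです。"

def build_turn1_messages(question: str, system_prompt: str) -> list[dict]:
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]

# Same question, two very different conditioning signals for the model under test.
harness_msgs = build_turn1_messages("AIの倫理について説明してください。", FASTCHAT_DEFAULT_SYSTEM)
swallow_msgs = build_turn1_messages("AIの倫理について説明してください。", SWALLOW_SYSTEM)
```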

Why it matters: The system prompt is a strong conditioning signal. Swallow explicitly pushes the model toward Japanese output and a particular tone, while FastChat defaults are English and generic. This alone can shift evaluation.

References: fastchat/conversation.py, fastchat/model/model_adapter.py, run-jamt-local-vllm.sh, lighteval/src/lighteval/tasks/prompt_manager.py.

4) Judge prompt differences (single and multi‑turn)

Harness judge prompts (fastchat/llm_judge/data/judge_prompts.jsonl):

  • Example (single‑turn):
    • “Please act as an impartial judge… consider helpfulness, relevance, accuracy, depth… also consider whether the prompt responded in the correct language and the fluency and naturalness of this response.”
  • System prompt is generic ("You are a helpful assistant.") for single‑turn.
  • No explicit instruction that the explanation must be in Japanese.

Swallow judge prompts (lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/judge_prompt_templates.py):

  • Adds explicit language constraints:
    • “The expected language is Japanese… Responses in languages other than Japanese will incur score deductions… explanation of judgement should be in Japanese.”
    • Exception: not mandatory when output is only Python scripts or calculation results.
  • Similar constraints appear in the multi‑turn templates.

Why it matters: Swallow’s judge is explicitly more strict about Japanese output and even the judge’s own explanation language. This is a meaningful modification of the original MT‑Bench judging prompt and will alter scores.

References: fastchat/llm_judge/data/judge_prompts.jsonl, lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/judge_prompt_templates.py.

5) Judge model and call parameters

Harness

  • fastchat/llm_judge/run.sh defaults to gpt-4.1-2025-04-14 as judge.
  • Uses OpenAI ChatCompletion with temperature=0 and max_tokens=2048 in the judge call (fastchat/llm_judge/common.py).

Swallow

  • Hard‑codes judge model to gpt-4o-2024-08-06 in the JA MT‑Bench task (lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py).
  • OpenAI judge calls use temperature=0 and max_tokens=4096 (lighteval/src/lighteval/metrics/llm_as_judge.py).
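
A hedged sketch of a single judge call with these parameters, using the openai>=1.x client; the judge_once helper and the rating regex are illustrative rather than code from either repository (both pipelines do, however, ask the judge to emit a "Rating: [[N]]" verdict):

```python
import os
import re
from openai import OpenAI  # assumes the official openai>=1.x client

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def judge_once(system_prompt: str, user_prompt: str,
               judge_model: str = "gpt-4o-2024-08-06",  # Swallow's fixed judge
               max_tokens: int = 4096) -> float | None:
    """Harness-style usage would pass gpt-4.1-2025-04-14 and max_tokens=2048."""
    resp = client.chat.completions.create(
        model=judge_model,
        temperature=0,             # both frameworks judge at temperature 0
        max_tokens=max_tokens,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    text = resp.choices[0].message.content or ""
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", text)  # extract "Rating: [[7]]"
    return float(match.group(1)) if match else None
```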

Why it matters: A different judge model and higher max tokens will change scoring behavior. Even if prompts were identical, scores will differ.

References: fastchat/llm_judge/run.sh, fastchat/llm_judge/common.py, lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py, lighteval/src/lighteval/metrics/llm_as_judge.py.

6) Sampling and decoding differences

Harness

  • Default --num-choices=1, i.e., one completion per question and turn (fastchat/llm_judge/gen_api_answer.py).
  • Uses a category‑based temperature mapping (0.7 for writing/roleplay, 0.0 for math/reasoning/coding, 0.1 for stem/humanities; sketched after this list) (fastchat/llm_judge/common.py).
  • max_tokens default 8000 for model generation (fastchat/llm_judge/gen_api_answer.py).
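
A sketch of the category-to-temperature mapping referenced above; category names follow MT-Bench conventions, and the extraction entry plus the 0.7 fallback are assumptions rather than values taken from the cited file:

```python
# Sampling temperatures by question category (values from the description above).
TEMPERATURE_BY_CATEGORY = {
    "writing": 0.7,
    "roleplay": 0.7,
    "math": 0.0,
    "reasoning": 0.0,
    "coding": 0.0,
    "extraction": 0.0,  # assumption: not listed above, but pinned low upstream
    "stem": 0.1,
    "humanities": 0.1,
}

def temperature_for(category: str, fallback: float = 0.7) -> float:
    # Fallback for categories outside the map is an assumption.
    return TEMPERATURE_BY_CATEGORY.get(category, fallback)
```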

Swallow

  • Always generates 5 samples (NUM_SAMPLES=5) and averages judge scores (lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py, lighteval/src/lighteval/metrics/metrics_sample.py).
  • Uses the same category‑based temperature map, but only per‑category (no required_temperature override) (lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py).
  • If temperature is 0, Swallow calls once and duplicates the output across samples (lighteval/src/lighteval/models/litellm_model.py).
  • No explicit max_tokens limit in the task; relies on backend defaults and truncation (see below).

Important subtlety: multi‑turn sampling correlation (Swallow)

  • In Swallow’s litellm multi‑turn generation, all turn‑2 samples are conditioned on the first turn‑1 sample only (lighteval/src/lighteval/models/litellm_model.py).
  • In FastChat, if num_choices > 1, each sample’s turn‑2 is conditioned on its own turn‑1 (the conversation template carries that state) (fastchat/llm_judge/gen_api_answer.py).
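
The conditioning difference is easier to see in code. A sketch where generate(history) stands in for one chat-completion call; it is not a real function from either framework:

```python
def fastchat_style(question_1, question_2, generate, num_choices=5):
    """Each sample's turn 2 is conditioned on that sample's own turn-1 answer."""
    samples = []
    for _ in range(num_choices):
        a1 = generate([("user", question_1)])
        a2 = generate([("user", question_1), ("assistant", a1), ("user", question_2)])
        samples.append((a1, a2))
    return samples

def swallow_litellm_style(question_1, question_2, generate, num_samples=5):
    """All turn-2 samples share the *first* turn-1 sample as context."""
    turn1_samples = [generate([("user", question_1)]) for _ in range(num_samples)]
    shared_a1 = turn1_samples[0]
    turn2_samples = [
        generate([("user", question_1), ("assistant", shared_a1), ("user", question_2)])
        for _ in range(num_samples)
    ]
    return list(zip(turn1_samples, turn2_samples))
```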

Why it matters: Multi‑sample averaging plus sample correlation differences can substantially affect scoring and variance.

References: fastchat/llm_judge/gen_api_answer.py, fastchat/llm_judge/common.py, lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py, lighteval/src/lighteval/models/litellm_model.py.

7) Output length control and truncation

Harness

  • Uses token‑based generation limit (max_tokens=8000) but does not truncate output text afterwards (fastchat/llm_judge/gen_api_answer.py).

Swallow

  • Truncates generated text by character length using max_gen_text_length (8192 chars default, 6144 for a shorter variant) (lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py).
  • Truncation happens inside both vLLM and litellm backends (lighteval/src/lighteval/models/vllm/vllm_model.py, lighteval/src/lighteval/models/litellm_model.py).
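
The truncation itself amounts to a character slice. A minimal sketch of the behavior (not Swallow's actual code path), contrasted with the harness, which caps generation at max_tokens=8000 and never trims the returned text:

```python
def truncate_by_chars(text: str, max_gen_text_length: int = 8192) -> str:
    # Swallow-style: hard cut at a character budget (8192 by default, 6144 for
    # the shorter variant), applied to the generated text after the fact.
    return text[:max_gen_text_length]
```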

Why it matters: Truncation can remove important content (especially in turn‑2) and systematically lower scores for verbose models.

References: fastchat/llm_judge/gen_api_answer.py, lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py, lighteval/src/lighteval/models/litellm_model.py.

8) Reasoning content handling

Harness

  • Strips <think>...</think> or <reason>...</reason> tags from answers before judging (fastchat/llm_judge/gen_judgment.py).
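
A rough equivalent of that stripping step (the real implementation is in fastchat/llm_judge/gen_judgment.py and may handle edge cases differently):

```python
import re

# Remove <think>...</think> or <reason>...</reason> spans so that only the
# final answer is sent to the judge.
THINK_TAG_RE = re.compile(r"<(think|reason)>.*?</\1>", re.DOTALL | re.IGNORECASE)

def strip_reasoning_tags(answer: str) -> str:
    return THINK_TAG_RE.sub("", answer).strip()

strip_reasoning_tags("<think>まず前提を整理する…</think>答えは42です。")  # -> "答えは42です。"
```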

Swallow

  • Supports reasoning parsers in vLLM; if enabled, it extracts reasoning_content and content and uses the final content for evaluation (lighteval/src/lighteval/models/vllm/vllm_model.py, lighteval/src/lighteval/models/vllm/utils.py).
  • For litellm, custom reasoning parsing can be enabled; otherwise the raw content is used (lighteval/src/lighteval/models/litellm_model.py).
  • Evaluation policy strongly encourages reasoning extraction for reasoning‑type models (EVALUATION_POLICY.md).

Why it matters: The judged text can differ significantly when reasoning output is present (especially if the model outputs internal chain-of-thought markers).

References: fastchat/llm_judge/gen_judgment.py, lighteval/src/lighteval/models/vllm/vllm_model.py, lighteval/src/lighteval/models/litellm_model.py, EVALUATION_POLICY.md.

9) Scoring and aggregation

Harness

  • Outputs a single score per question/turn (1–10). show_result.py averages per turn and overall for display (fastchat/llm_judge/show_result.py).
  • Uses canonicalization to de‑duplicate repeated judgments (fastchat/llm_judge/show_result.py).
  • Does not divide by 10, so outputs remain in 1–10 scale.

Swallow

  • Produces scores per sample, per turn; then computes:
    • judge_score_*_turn_1_avg, judge_score_*_turn_2_avg, and overall averages (lighteval/src/lighteval/metrics/metrics_sample.py).
  • Corpus metrics divide by 10 (mt_bench_corpus_level_fn) giving a 0–1 scale (lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py).
  • Adds a Japanese‑character ratio metric for fluency/language coverage (lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py).
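
A minimal sketch of the aggregation and the extra metric, under stated assumptions: the averaging and /10 normalization mirror the description above, while japanese_char_ratio is only a rough stand-in whose exact definition may differ from Swallow's metric:

```python
import statistics
import unicodedata

def aggregate_turn_scores(sample_scores: list[float]) -> dict:
    """Average the N=5 judge scores for one question/turn; report both the 1-10
    average and the /10-normalized value used for corpus-level metrics."""
    avg = statistics.mean(sample_scores)          # e.g. [8, 7, 9, 8, 8] -> 8.0
    return {"judge_score_avg": avg, "corpus_metric": avg / 10.0}

def japanese_char_ratio(text: str) -> float:
    """Share of characters whose Unicode names mark them as CJK/Hiragana/Katakana
    (illustrative; not Swallow's exact implementation)."""
    if not text:
        return 0.0
    def is_japanese(ch: str) -> bool:
        return unicodedata.name(ch, "").startswith(("CJK UNIFIED", "HIRAGANA", "KATAKANA"))
    return sum(is_japanese(ch) for ch in text) / len(text)
```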

Why it matters: Even if raw scores were identical, Swallow reports averaged and normalized numbers with extra metrics.

References: fastchat/llm_judge/show_result.py, lighteval/src/lighteval/metrics/metrics_sample.py, lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py.

10) Output locations and reporting

Harness

  • Answers: data/ja_mt_bench/model_answer/<model>.jsonl
  • Judgments: data/ja_mt_bench/model_judgment/<judge>_single.jsonl
  • Visualization and comparison scripts operate directly on these JSONL files (fastchat/llm_judge/visualize-results.py, fastchat/llm_judge/compare-judges.py).

Swallow

  • Lighteval outputs to lighteval/outputs with structured results per run (run-jamt-local-vllm.sh).
  • Optional aggregation via scripts/aggregate_results.py.

Why it matters: The harness is designed for per‑judge comparisons and judge‑level analysis, while Swallow is designed for centralized multi‑task aggregation.

References: fastchat/llm_judge/gen_api_answer.py, fastchat/llm_judge/gen_judgment.py, run-jamt-local-vllm.sh, scripts/aggregate_results.py.

Which is closer to canonical MT‑Bench?

The FastChat‑based harness is closer to the canonical MT‑Bench flow:

  • Same script structure as FastChat (answer → judge), same prompt formats, and minimal additional constraints.
  • Uses the standard FastChat judge prompts (with only generic language guidance).
  • Single‑sample evaluation by default, which is how MT‑Bench is typically reported.

Swallow intentionally modifies several components:

  • Japanese‑enforced judge prompts and Japanese system prompt.
  • Fixed judge model (gpt‑4o) instead of user‑selectable.
  • Multi‑sample averaging and score normalization.
  • Output truncation by character count.
  • Additional metrics (Japanese ratio).

References: fastchat/llm_judge/run.sh, fastchat/llm_judge/data/judge_prompts.jsonl, lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/main.py, run-jamt-local-vllm.sh.

Why scores can differ so much

The following differences are large enough to shift scores significantly:

  1. System prompt mismatch (generic English vs explicit Japanese). This changes model outputs, especially for bilingual models.
  2. Judge prompt language enforcement (Swallow penalizes non‑Japanese and requires Japanese explanations).
  3. Judge model mismatch (gpt‑4.1 vs gpt‑4o) and different max token limits for the judge.
  4. Sampling (single‑sample vs 5‑sample average) and Swallow’s sampling correlation across turns.
  5. Output truncation (Swallow truncates to 8192/6144 characters, not tokens).
  6. Reference answer provenance (per‑judge refs vs single edited ref set).
  7. Reasoning content handling (strip tags vs parser‑based extraction).

Each of these can move the score distribution; together they explain large discrepancies.

Appendix: Prompt differences (summary + excerpts)

Summary highlights:

  • The harness uses the FastChat canonical prompts; Swallow adds explicit Japanese language requirements and requires the judge explanation to be in Japanese.
  • The harness uses the default ChatGPT system prompt (model‑template dependent); Swallow injects a Japanese system prompt by default.
  • Multi‑turn formatting is structurally the same, but Swallow adds language constraints and includes a literal extra quote prefix in its system prompt string.

Italic text below marks deviations from the FastChat canonical prompts (harness baseline).

Summary table (prompts at a glance)

| Prompt | JA MT-Bench Harness (FastChat) | Swallow Evaluation (lighteval) |
|---|---|---|
| Model system prompt (generation) | "You are a helpful assistant." (default chatgpt template). For gpt-4-turbo-2024-04-09, FastChat uses a longer system message with cutoff/date. | "あなたは誠実で優秀な日本人のアシスタントです。" (explicit Japanese system prompt passed by the run scripts). |
| Judge prompt (single-turn, general) | English judge instructions; includes a generic "correct language/fluency" consideration. | English judge instructions + explicit Japanese-only requirement and Japanese judge-explanation mandate (with exceptions for code/calculation-only outputs). |
| Judge prompt (single-turn, with reference) | English judge instructions with reference answer; no language enforcement beyond the "correct language" consideration. | English judge instructions with reference + explicit Japanese-only requirement and Japanese judge-explanation mandate (same exceptions). |
| Judge prompt (multi-turn, general) | English instructions; focuses scoring on the turn-2 answer. | Same structure + explicit Japanese-only requirement and Japanese judge-explanation mandate. |

A1) Model system prompt (generation)

Harness (FastChat default chatgpt template) Source: fastchat/conversation.py

You are a helpful assistant.

Harness (GPT‑4‑Turbo template used when model name matches gpt-4-turbo-2024-04-09) Source: fastchat/conversation.py

You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture.
Knowledge cutoff: 2023-11
Current date: {{currentDateTime}}

Image input capabilities: Enabled
Personality: v2

Swallow (default system prompt passed by run scripts) Source: run-jamt-local-vllm.sh, run-jamt-shisa-api.sh

あなたは誠実で優秀な日本人のアシスタントです。

Notes:

  • The harness system prompt varies by model adapter/template (fastchat/model/model_adapter.py).
  • Swallow uses --use-chat-template and injects the system prompt at the start of the multi‑turn context (lighteval/src/lighteval/tasks/prompt_manager.py).

A2) Judge prompt: single‑turn (general)

Harness (single‑v1) Source: fastchat/llm_judge/data/judge_prompts.jsonl

System: You are a helpful assistant.

[Instruction]
Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Your evaluation should also consider whether the prompt responded in the correct language and the fluency and naturalness of this response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".

[Question]
{question}

[The Start of Assistant's Answer]
{answer}
[The End of Assistant's Answer]

Swallow (single‑v1, Japanese‑enforced) Source: lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/judge_prompt_templates.py

System: You are a helpful assistant.
[Instruction]
Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. The expected language is Japanese. Responses in languages other than Japanese will incur score deductions unless specifically required. Failure to use Japanese at all will result in the lowest evaluation. However, using Japanese is not mandatory when providing only Python scripts or calculation results, where Japanese is not essential. Additionally, your explanation of judgement should be in Japanese. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".
[Question]
{question}
[The Start of Assistant's Answer]
{answer}
[The End of Assistant's Answer]

Key deltas:

  • Swallow adds explicit Japanese‑language requirements and mandates Japanese judge explanations.
  • Harness only has a generic “correct language” consideration.

A3) Judge prompt: single‑turn with reference (math/reasoning/coding)

Harness (single‑math‑v1) Source: fastchat/llm_judge/data/judge_prompts.jsonl

System: You are a helpful assistant.

[Instruction]
Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider correctness and helpfulness. You will be given a reference answer and the assistant's answer. Begin your evaluation by comparing the assistant's answer with the reference answer. Identify and correct any mistakes. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".

[Question]
{question}

[The Start of Reference Answer]
{ref_answer_1}
[The End of Reference Answer]

[The Start of Assistant's Answer]
{answer}
[The End of Assistant's Answer]

Swallow (single‑v1 with reference, Japanese‑enforced) Source: lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/judge_prompt_templates.py

System: You are a helpful assistant.
[Instruction]
Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider correctness and helpfulness. You will be given a reference answer and the assistant's answer. Begin your evaluation by comparing the assistant's answer with the reference answer. Identify and correct any mistakes. Be as objective as possible. The expected language is Japanese. Responses in languages other than Japanese will incur score deductions unless specifically required. Failure to use Japanese at all will result in the lowest evaluation. However, using Japanese is not mandatory when providing only Python scripts or calculation results, where Japanese is not essential. Additionally, your explanation of judgement should be in Japanese. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".
[Question]
{question}
[The Start of Reference Answer]
{gold}
[The End of Reference Answer]
[The Start of Assistant's Answer]
{answer}
[The End of Assistant's Answer]

Key deltas:

  • Swallow enforces Japanese language and Japanese judge explanations even for ref‑based scoring.
  • Reference placeholder name differs (ref_answer_1 vs gold) but is functionally equivalent.

A4) Judge prompt: multi‑turn (general)

Harness (single‑v1‑multi‑turn) Source: fastchat/llm_judge/data/judge_prompts.jsonl

System:
Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. You evaluation should focus on the assistant's answer to the second user question. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".

User:
<|The Start of Assistant A's Conversation with User|>

### User:
{question_1}

### Assistant A:
{answer_1}

### User:
{question_2}

### Assistant A:
{answer_2}

<|The End of Assistant A's Conversation with User|>

Swallow (single‑v1‑multi‑turn, Japanese‑enforced) Source: lighteval/src/lighteval/tasks/swallow/japanese_mt_bench/judge_prompt_templates.py

System:
Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. You evaluation should focus on the assistant's answer to the second user question. Begin your evaluation by providing a short explanation. Be as objective as possible. The expected language is Japanese. Responses in languages other than Japanese will incur score deductions unless specifically required. Failure to use Japanese at all will result in the lowest evaluation. However, using Japanese is not mandatory when providing only Python scripts or calculation results, where Japanese is not essential. Additionally, your explanation of judgement should be in Japanese. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".
User:
<|The Start of Assistant A's Conversation with User|>
...
<|The End of Assistant A's Conversation with User|>

Key deltas:

  • Same structural format, but Swallow adds Japanese requirements and Japanese explanation mandate.
  • Swallow’s multi‑turn system prompt string includes an extra leading quote sequence (""""), which is a literal text difference in the system message.