EvalArenaHead-to-Head Benchmarking Workbench

Side-by-Side Comparison

LLM-as-a-Judge

Chain-of-Thought Analysis

Beyond "Vibes" Based Eval

Most teams pick LLMs based on generic leaderboards (MMLU, HumanEval). But for specialized domains like Legal Contract Review or Medical Summarization, generic scores are meaningless.

**EvalArena** is a workbench for empirical selection. It allows me to run identical prompts across Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4o simultaneously, blind-vote on the outputs, and automatically score them using a rigorous rubric.

System Architecture

EvalArena operates as a multi-model orchestration layer, routing identical prompts to multiple providers and collecting responses in parallel. The judge LLM then scores each response using a rubric-based evaluation.

LLM-as-a-Judge: Rubric-Based Scoring

Instead of relying on subjective human voting alone, EvalArena uses GPT-4 as an automated judge with a structured rubric. This enables reproducible, scalable evaluation across hundreds of test cases.

eval_judge.py

1# Judge LLM Prompt Template with Rubric
2SYSTEM_PROMPT = """
3You are an expert evaluator for AI-generated responses.
4Score the following response on a scale of 1-10 based on these criteria:
5
61. **Accuracy** (0-3 points): Factual correctness, no hallucinations
72. **Relevance** (0-3 points): Directly addresses the user's question
83. **Clarity** (0-2 points): Well-structured, easy to understand
94. **Completeness** (0-2 points): Covers all aspects of the question
10
11Provide:
12- A breakdown score for each criterion
13- A total score (0-10)
14- 2-3 sentence reasoning
15
16Format your response as JSON:
17{
18  "accuracy": 3,
19  "relevance": 2,
20  "clarity": 2,
21  "completeness": 1,
22  "total": 8,
23  "reasoning": "Response is factually correct but lacks depth..."
24}
25"""
26
27# Example evaluation call
28async def judge_response(prompt: str, response: str, rubric: dict) -> EvalResult:
29    judge_prompt = f"""
30    **User Prompt:** {prompt}
31    
32    **Model Response:** {response}
33    
34    **Evaluation Rubric:** {json.dumps(rubric)}
35    
36    Provide your scoring and reasoning.
37    """
38    
39    result = await openai.chat.completions.create(
40        model="gpt-4o",
41        messages=[
42            {"role": "system", "content": SYSTEM_PROMPT},
43            {"role": "user", "content": judge_prompt}
44        ],
45        response_format={"type": "json_object"},
46        temperature=0.1  # Low temp for consistency
47    )
48    
49    return EvalResult.parse_raw(result.choices[0].message.content)

Why GPT-4 as Judge?

GPT-4 shows highest correlation with human preferences in academic studies. We use structured JSON output mode for reliable parsing.

Temperature = 0.1

Low temperature ensures consistent scoring across runs. Critical for reproducible benchmarks.

Custom Rubrics

Rubric weights can be adjusted per domain (e.g., legal docs prioritize accuracy 50%).

Parallel Model Execution

All three models are called simultaneously using async/await patterns. This reduces total wait time from ~15 seconds (sequential) to ~5 seconds (parallel, limited by slowest model).

arena_controller.py

1import asyncio
2from typing import List
3
4async def call_model(model_name: str, prompt: str) -> ModelResponse:
5    """Call a single model and track timing/tokens"""
6    start = time.time()
7    
8    if model_name == "gemini":
9        response = await gemini_client.generate_content(prompt)
10    elif model_name == "claude":
11        response = await anthropic_client.messages.create(
12            model="claude-3-5-sonnet-20241022",
13            messages=[{"role": "user", "content": prompt}]
14        )
15    elif model_name == "gpt4":
16        response = await openai.chat.completions.create(
17            model="gpt-4o",
18            messages=[{"role": "user", "content": prompt}]
19        )
20    
21    latency = time.time() - start
22    
23    return ModelResponse(
24        model=model_name,
25        text=response.text,
26        latency_ms=latency * 1000,
27        tokens=response.usage.total_tokens,
28        cost_usd=calculate_cost(model_name, response.usage)
29    )
30
31async def arena_compare(prompt: str) -> ArenaResult:
32    """Execute all models in parallel"""
33    models = ["gemini", "claude", "gpt4"]
34    
35    # Run all models concurrently
36    responses = await asyncio.gather(*[
37        call_model(model, prompt) for model in models
38    ])
39    
40    # Anonymize for blind voting (shuffle identifiers)
41    shuffled = random.sample(list(enumerate(responses)), len(responses))
42    
43    return ArenaResult(
44        responses={f"Model {chr(65+i)}": resp for i, (_, resp) in enumerate(shuffled)},
45        identity_mapping={f"Model {chr(65+i)}": models[orig_idx] for i, (orig_idx, _) in enumerate(shuffled)}
46    )

Performance Metrics (Typical)

~5.2s

Parallel execution time

$0.03

Total cost per comparison

Speed improvement vs sequential

1. The "Arena" Mode

Inspired by LMSYS. Two models fight anonymously. You vote for "Model A" or "Model B". Only after voting are the identities revealed. This eliminates brand bias.

2. Token Observability

Real-time tracking of Token/Sec and Cost per Query. Often, I find that Phi-3-medium ($0.50/1M tokens) is "good enough" for tasks where I assumed I needed GPT-4 ($5.00/1M).

3. Synthesis Engine

Can't decide? The "Synthesis" button fuses the best parts of all three model outputs into a single, superior answer using a recursive summarization capability.

This tool is my "daily driver" for AI engineering. It's how I choose the right model for every client problem.

Watch Demo