Bad Cases of Top Models

chatgpt-4o-latest
o1-mini
o3-mini
gemini-2.0-pro
gemini-2.0-flash-thinking
Claude-3.5-Sonnet
qwen2.5-max
ERNIE-4.0
ERNIE-4.0-Turbo
xunfei-4.0Ultra
MiniMax-Text-01
Baichuan4
GLM-4-Plus
GLM-Zero-Preview
kimi-latest
Doubao-1.5-pro
deepseek-chat-v3
DeepSeek-R1
Llama-3.3-70B-Instruct
qwen2.5-72b-instruct

Bad Cases of Benchmarks

Reasoning & Math

Language & Instruction Following

Education

Law & Civil Service

Medical

BBH subtasks