AI Reasoning Models Drive Up Benchmarking Costs
Recent analysis reveals that AI reasoning models, such as OpenAI's o1 and Anthropic's Claude 3.7 Sonnet, are significantly more expensive to benchmark than their non-reasoning counterparts. Evaluating OpenAI's o1 costs approximately $2,767, while Claude 3.7 Sonnet costs about $1,485. Because reasoning models work through problems step by step, they generate far more output tokens than standard models, and since usage is typically billed per token, evaluation costs climb accordingly. As AI labs release more reasoning models, benchmarking expenses are projected to rise further. The trend also raises concerns about the replicability of published performance results: independent verification is costly, so many evaluators rely on subsidized model access provided by the labs themselves.
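
To make the per-token arithmetic concrete, here is a minimal sketch of how an evaluation bill scales with reasoning output. All numbers below (token counts and per-million-token prices) are illustrative assumptions, not the actual figures behind the costs cited above.

```python
# Illustrative cost model: evaluation spend = tokens consumed x per-token price.
# All prices and token counts are hypothetical assumptions for this sketch,
# not the real pricing or usage behind the $2,767 / $1,485 figures above.

def eval_cost(prompt_tokens: int, output_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the dollar cost of one benchmark run, billed per million tokens."""
    return (prompt_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# Hypothetical benchmark suite: both models see the same prompts, but the
# reasoning model emits long chains of thought, multiplying its output tokens.
PROMPT_TOKENS = 5_000_000  # assumed total size of the benchmark prompts

standard = eval_cost(PROMPT_TOKENS, output_tokens=2_000_000,
                     input_price_per_m=3.0, output_price_per_m=15.0)
reasoning = eval_cost(PROMPT_TOKENS, output_tokens=40_000_000,
                      input_price_per_m=3.0, output_price_per_m=15.0)

print(f"standard model:  ${standard:,.2f}")   # $45.00
print(f"reasoning model: ${reasoning:,.2f}")  # $615.00
```

Under these assumed numbers, identical prompts cost over ten times more to evaluate on the reasoning model purely because of the extra output tokens, which is the mechanism driving the benchmark cost increases described above.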