A Meta executive has denied allegations that the company manipulated benchmark scores for its new AI models, Llama 4 Maverick and Llama 4 Scout. Ahmad Al-Dahle, VP of generative AI at Meta, said claims that the models were trained on test sets are false. The rumors gained traction after reports suggested the models performed poorly on certain tasks. Al-Dahle acknowledged that users are experiencing inconsistent quality across cloud providers but said improvements are underway. He emphasized that the models were released as soon as they were ready and that bug fixes are ongoing.