A recent study presented at NeurIPS found that leading large language models (LLMs), including OpenAI's GPT-4, Meta's Llama, and Google's Gemini, perform poorly on historical questions. The researchers built a benchmark, Hist-LLM, that tests these models against the Seshat Global History Databank. GPT-4 Turbo achieved only 46% accuracy, exposing a significant gap in nuanced historical understanding. The study also identified biases in training data, particularly concerning underrepresented regions. While LLMs handle basic factual recall well, they remain inadequate for advanced historical analysis; even so, the researchers are optimistic about their future applications in historical research.