PUBLICATIONS

2026

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation.

Mubashara Akhtar,* Anka Reuel,* Prajna Soni, Sanchit Ahuja, Pawan Ammanamanchi, Ruchit Rawal, Vilém Zouhar, et al.

Accepted at ICML 2026.

Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations.

Anka Reuel et al.

Accepted at ICML 2026.

Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking.

Mubashara Akhtar, Michael Schlichtkrull, Andreas Vlachos

Transactions of the Association for Computational Linguistics (TACL), 2026. Presented at EACL 2026.

Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning.

Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan.

ICLR 2026.

LEXam: Benchmarking Legal Reasoning on 340 Law Exams.

Yu Fan, Jingwei Ni, Jakob Merane, Etienne Salimbeni, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, et al.

ICLR 2026.

Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads.

Jingwei Ni, Ekaterina Fadeeva, Tianyi Wu, Mubashara Akhtar, Jiaheng Zhang, Elliott Ash, Markus Leippold, Timothy Baldwin, See-Kiong Ng, Artem Shelmanov, Mrinmaya Sachan.

ACL 2026.

2025

TANQ: An open domain dataset of table answered questions.

Mubashara Akhtar,* Chenxi Pang,* Andreea Marzoca, Yasemin Altun, Julian Martin Eisenschlos

Transactions of the Association for Computational Linguistics (TACL), 2025. Presented at ACL 2025.

AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons.

Shaona Ghosh et al.

arXiv, 2025.

The 2nd automated verification of textual claims (AVeriTeC) shared task: Open-weights, reproducible and efficient systems.

Mubashara Akhtar, Rami Aly, Yulong Chen, Zhenyun Deng, Michael Schlichtkrull, Chenxi Whitehouse, Andreas Vlachos

Proceedings of the Eighth Fact Extraction and VERification Workshop (FEVER), 2025.

LEXam: Benchmarking Legal Reasoning on 340 Law Exams.

Yu Fan, Jingwei Ni, Jakob Merane, Etienne Salimbeni, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, et al.

arXiv, 2025.

Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads.

Jingwei Ni, Ekaterina Fadeeva, Tianyi Wu, Mubashara Akhtar, Jiaheng Zhang, Elliott Ash, Markus Leippold, Timothy Baldwin, See-Kiong Ng, Artem Shelmanov, Mrinmaya Sachan.

arXiv, 2025.

Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning.

Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan.

arXiv, 2025.

Chimera: Diagnosing Shortcut Learning in Visual-Language Understanding.

Ziheng Chi, Yifan Hou, Chenxi Pang, Shaobo Cui, Mubashara Akhtar, Mrinmaya Sachan.

arXiv, 2025.

Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations.

Anka Reuel et al.

arXiv, 2025.

2024

Croissant: A Metadata Format for ML-Ready Datasets.

Mubashara Akhtar,* Omar Benjelloun,* Costanza Conforti,* et al.

Advances in Neural Information Processing Systems 37 (NeurIPS 2024). Spotlight (top ~3% of submissions).

Croissant: A Metadata Format for ML-Ready Datasets.

Mubashara Akhtar,* Omar Benjelloun,* Costanza Conforti,* et al.

DEEM workshop @SIGMOD, 2024. (Best Short Paper award 🥇)

A Standardized Machine-readable Dataset Documentation Format for Responsible AI.

Nitisha Jain,* Mubashara Akhtar,* Joan Giner-Miguelez,* Rajat Shinde,* et al.

arXiv, 2024.

ChartCheck: Explainable Fact-Checking over Real-World Chart Images.

Mubashara Akhtar, Nikesh Subedi, Vivek Gupta, Sahar Tahmasebi, Oana Cocarascu, Elena Simperl

Findings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024.

The Automated Verification of Textual Claims (AVeriTeC) Shared Task.

Michael Schlichtkrull, Yulong Chen, Chenxi Whitehouse, Zhenyun Deng, Mubashara Akhtar, et al.

Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER), 2024.

Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER).

Michael Schlichtkrull, Yulong Chen, Chenxi Whitehouse, Zhenyun Deng, Mubashara Akhtar, et al.

Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER), 2024.

2022

PubHealthTab: A public health table-based dataset for evidence-based fact checking.

Mubashara Akhtar, Oana Cocarascu, Elena Simperl

Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2022.

2023

Exploring the Numerical Reasoning Capabilities of Language Models: A Comprehensive Analysis on Tabular Data.

Multimodal Automated Fact-Checking: A Survey.

Reading and Reasoning over Chart Images for Evidence-based Automated Fact-Checking.

Proceedings of the Sixth Fact Extraction and VERification Workshop (FEVER).
