  • Mubashara Akhtar

NAACL 2021: Summaries of selected papers

Updated: May 20, 2022

Just one day left until the start of this year’s NAACL conference! If you’d like to attend and haven’t registered yet, late registration ($50 for students and $175 for all others) remains open throughout the conference.

With 477 papers accepted for the main conference and several sessions taking place at the same time, deciding which paper presentations to attend can be a challenge.

In this blog post I briefly present some papers I found interesting.

Here you can find the NAACL proceedings.


This paper discusses an alternative to very large Language Models (LMs) like GPT-3, which perform well on few-shot tasks. Because of the huge amount of data required to train models like GPT-3 and the resulting carbon footprint, the authors investigate whether small models can be used for few-shot learning instead. They identify key factors for solving NLU tasks with small models.

The paper argues that priming, the technique GPT-3 uses to achieve few-shot learning from a handful of examples, has major drawbacks. As an alternative, pattern-exploiting training (PET) is proposed. PET uses gradient-based fine-tuning and reformulates tasks as cloze questions.

Both training methods are compared on the SuperGLUE benchmark using FewGLUE, a few-shot training set that provides a small number of labeled examples per task together with unlabeled examples.
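To make the cloze reformulation more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `ClozePattern` class, the template, the verbalizer words and the word scores are all made up for illustration. In PET the scores would come from a masked language model; here they are hard-coded so the example runs on its own.

```python
from dataclasses import dataclass

# Placeholder mask token (assumption); the real token depends on the model.
MASK = "[MASK]"


@dataclass(frozen=True)
class ClozePattern:
    """Hypothetical helper: a template plus a label-to-word verbalizer."""
    template: str       # contains the slots {text} and {mask}
    verbalizer: dict    # maps each label to one word that can fill the mask

    def to_cloze(self, text):
        # Turn a raw input into a cloze question with one masked slot.
        return self.template.format(text=text, mask=MASK)

    def label_for(self, word_scores):
        # Pick the label whose verbalizer word received the highest score.
        return max(
            self.verbalizer,
            key=lambda label: word_scores.get(self.verbalizer[label], float("-inf")),
        )


# Illustrative sentiment pattern (assumption, not taken from the paper).
sentiment = ClozePattern(
    template="{text} All in all, it was {mask}.",
    verbalizer={"positive": "great", "negative": "terrible"},
)

cloze = sentiment.to_cloze("The pizza was cold and bland.")

# Toy scores standing in for a masked LM's predictions at the mask position.
prediction = sentiment.label_for({"great": 0.1, "terrible": 0.7})
```

The point of the sketch is only the reformulation step: classification becomes a question of which word best fills the gap, which is a task a pretrained masked language model has already seen during pretraining.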


As the title already indicates, this paper is a survey of numeracy-related NLP tasks and methods. The authors discuss why numbers are an integral part of natural language and mention several unexplored questions related to numeracy. Previous NLP research mostly ignored numbers, either filtering them out or treating them like any other word occurring in a text.

Seven numeracy tasks are presented alongside methods for encoding (numbers => embeddings) and decoding (embeddings => numbers) numbers. The following graphic provides an overview of NLP methods used in previous research for numeracy. The taxonomy and each method are described in more detail in section 3 of the paper.

The paper mentions various open challenges related to numeracy and NLP. They are definitely worth checking out in more detail!
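As a rough illustration of what encoding and decoding a number can mean, the sketch below maps a number to a tiny hand-crafted vector (sign, base-10 exponent, mantissa) and back. This is a toy example under my own assumptions, not a method taken from the paper, and the function names `encode_number` and `decode_number` are made up; learned approaches would produce trainable embeddings instead of fixed features.

```python
import math


def encode_number(value):
    """Encode a number as [sign, base-10 exponent, mantissa] (toy features)."""
    if value == 0:
        return [0.0, 0.0, 0.0]
    sign = 1.0 if value > 0 else -1.0
    exponent = math.floor(math.log10(abs(value)))
    mantissa = abs(value) / 10 ** exponent
    return [sign, float(exponent), mantissa]


def decode_number(vector):
    """Invert encode_number: rebuild the number from its three features."""
    sign, exponent, mantissa = vector
    return sign * mantissa * 10 ** exponent


encoded = encode_number(1500.0)
decoded = decode_number(encoded)
```

Separating magnitude (the exponent) from precision (the mantissa) is one way to keep very large and very small numbers on a comparable scale, in contrast to treating "1500" as just another word.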


This paper discusses the ethical implications of NLP research and criticizes the scant attention this topic has received in past NLP research. The authors analyse papers published at major NLP conference venues (ACL, EMNLP, NAACL) from 2015 to 2020.

They discuss crowdsourcing in NLP and its ethical implications in great detail. While crowdsourcing was used in approximately 10% of the papers, only 17% of those mention the workers’ payment and 2% state that ethical clearance from an Institutional Review Board was requested and granted. The paper provides various arguments for why crowdworkers should be considered human subjects in research. Thus, aspects such as their safety, privacy and security need to be considered by NLP researchers.


Dynabench is a platform for human-and-model-in-the-loop dataset creation. Motivated by the fact that many SOTA models quickly claim “super-human” performance on “challenging” benchmarks but fail to solve simple problems, the authors emphasize the need for more robust benchmarks. (I also recommend checking out the work by [Bender and Koller, 2020], a very interesting paper discussing the super-human performance of NLP models.)

The tool’s interface is shown below. In its initial state, the Dynabench platform can be used to collect adversarial examples (as demonstrated in the graphic), but future use is not limited to that. To start with, four tasks have been selected: NLI, QA, sentiment analysis and hate speech detection.


This paper proposes four criteria for NLU benchmarks. Given limitations such as bias and ambiguity in widely used benchmarks, the authors argue that adversarial data (recently very popular) is not sufficient to address the problem. Rather, certain criteria have to be considered earlier, e.g. during benchmark design and creation. These criteria are summarized in the graphic below, which has been extracted from the paper.

I also recommend looking at the following papers, which are related to this topic: “‘Everyone wants to do the model work, not the data work’: Data Cascades in High-Stakes AI” (CHI 2021 best paper award) and “Small and synthetic benchmarks”.


The authors introduce several tasks and a benchmark dataset based on Wikipedia revisions. The dataset has been created from the most popular Wikipedia articles and COVID-19-related articles.

Multiple tasks have been proposed alongside the data:

  1. Factual revision flagging: determining which revisions change factual content

  2. Fact verification

  3. Determining word-level rationales: anchor words supporting the claim. Changing these words flips the fact label from true to false or the other way around.

  4. Generating facts based on the Wikipedia revisions

Overall, this paper proposes a challenging fact-checking benchmark, definitely worth checking out!
