Conference Insights: NAACL 2022
Updated: July 28
The 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) was run as a hybrid conference in Seattle, WA and online.
After ACL, this was my second in-person conference. Similar to ACL 2022, the conference experimented with a new reviewing process based on ACL Rolling Review (ARR). Overall, 442 papers were accepted for the main proceedings and 209 for publication in the "Findings of ACL: NAACL 2022".
In addition to various oral and poster sessions, NAACL offered very interesting panels, keynotes, tutorials, and workshops. The first conference day kicked off with six tutorials and on the last two days 23 different workshops covered a wide range of topics.
This blog post summarizes NAACL papers and talks that I found particularly interesting.
Panel: The Place of Linguistics and Symbolic Structures (w/ Emily Bender, Dilek Hakkani-Tür, Chitta Baral, and Chris Manning)
A very interesting session at the conference was the panel on “The Place of Linguistics and Symbolic Structures” with Emily Bender, Dilek Hakkani-Tür, Chitta Baral, and Chris Manning. Each panellist gave a short pitch on the topic at the start of the panel. Chitta Baral argued that symbolic structures are important if we want to go beyond solving a single dataset and concentrate on the underlying task. Symbolic knowledge and structure can help in creating better datasets for that purpose. While there are quite a few open challenges involving symbolic aspects, current research concentrates on only a few of them; for example, it focuses on common-sense facts, although common-sense reasoning involves many more challenges. Emily Bender called for a partnership between NLP and linguistics so that NLP can draw on the deep scholarship available in linguistics research. For example, prior work in sociolinguistics can be useful for reasoning about the potential harms of today’s language technologies. Dilek Hakkani-Tür emphasized challenges related to dialog systems and how symbolic structure and knowledge grounding can help to address them. Finally, Chris Manning highlighted the difference between language as a symbolic structure and the human brain as a processor of these symbols, which is not itself implemented as a physical symbol system. He also mentioned that fundamental concepts of linguistics are becoming more important in deep learning research in general and are used to understand human intelligence.
During the session, panellists mentioned grounded learning as well as social learning (motivated by how humans acquire language) multiple times as interesting directions for future research.
As at ACL, multimodality and grounded language learning were well represented, and the conference likewise offered a tutorial and a workshop on multimodal machine learning.
The tutorial by Louis-Philippe Morency, Paul Pu Liang, and Amir Zadeh started with a discussion of terms and concepts central to multimodal research. Louis-Philippe provided a historical overview of multimodal research and the tasks on which the community has focused in the past five years.
During the rest of the tutorial, the presenters discussed six core challenges of multimodal ML:
Representation learning: Learning multimodal representations that capture cross-modal interaction between data points of different modalities.
Alignment: Aligning data points such that interconnection and dependencies between different modalities become apparent, e.g. which object relates to which word in image captioning.
Reasoning: Using the multimodal knowledge acquired through representation learning and alignment for reasoning in a multistep fashion.
Generation: Producing raw modality that captures information from other modalities. Exemplary generation tasks are summarization, translation, and creation.
Summarization: Summarizing information content from multiple modalities into a smaller, compressed set of data and modalities.
Translation: Aiming to maintain the information content while translating from one modality to another.
Creation: Expanding the information content, e.g. going from latent representation to image.
Transfer: Transferring knowledge between different modalities to overcome cases of noisy or limited resources by using data from other modalities.
Quantification: Encapsulating the previous challenges by studying heterogeneity in data, cross-modal interaction, and multimodal learning.
Tutorial on Self-supervised representation learning for speech processing
Another tutorial concentrated on self-supervised representation learning for speech processing, motivated by the fact that speech representation learning comes with unique challenges, such as:
Models are required to deal with orthogonal information for different speech tasks (e.g. extracting content vs. capturing the speaker’s style).
No predefined dictionary of units is available to the model.
Segment boundaries are missing in the data.
To train machine hearing models in describing environmental sounds with text, models require human-annotated datasets with audio sounds and parallel textual descriptions of the audio (e.g. “the sounds of heavy rain”).
However, available datasets for audio-text alignment are limited and small.
To overcome the scarcity of parallel audio-text data for model training, this paper proposes using visual data as a pivot to connect audio and text implicitly in the embedding space. This allows utilising co-occurring image-text and video-audio data, which are available to a great extent on the web.
The proposed approach learns a shared, tri-modal (image, audio, text) representation space without the need for aligned and annotated image-audio-text data.
Modality-specific representation vectors are obtained with separate encoders. The vectors are mapped to a shared space such that co-occurring data points, e.g. an image and its associated caption text, have a higher similarity in that space.
The authors implicitly align audio and text data via the visual modality by sharing a single image encoder between the vision-audio and vision-text alignment models.
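The pivoting idea can be sketched as follows. This is a minimal illustration, assuming a symmetric contrastive (InfoNCE-style) objective; the loss form, the temperature value, and all function names are assumptions made for the example, not the paper's exact setup.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE-style loss: row i of `a` co-occurs with row i of `b`."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature

    def cross_entropy_on_diagonal(scores):
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

def pivot_loss(img_from_captions, text, img_from_videos, audio):
    """Sum of an image-text and an image-audio loss.

    Assumption: both image batches are embedded by the SAME image encoder,
    so text and audio end up in one space without any audio-text pairs.
    """
    return (contrastive_loss(img_from_captions, text)
            + contrastive_loss(img_from_videos, audio))
```

Because audio and text never appear in the same loss term, any audio-text similarity learned this way is a side effect of both being tied to the shared image space.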
Multimodal NER (MNER) and multimodal relation extraction (MRE) extend the text-only task variants with the visual modality to enhance text representations with visual clues from images. Especially for multi-sense words, additional visual clues can provide valuable insights.
To overcome current challenges in MNER and MRE, Chen et al. introduce an approach that uses object-level image representations as “visual prefixes” for text embeddings. The visual prefixes are used as additional input at each self-attention BERT layer (see Hierarchical Visual Prefix in image below) for the MNER/MRE tasks.
In detail, their approach (i) extracts image objects, (ii) obtains visual representations at different abstraction levels (low- to high-level features) using ResNet50, (iii) scales the visual representations to a uniform shape, (iv) routes the visual feature vectors to the different Transformer layers using a dynamic gate module, and (v) finally concatenates the visual representations as a prefix to the text sequence at each self-attention layer.
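Step (v) can be illustrated with a single attention head. This is a simplified sketch, assuming the prefix is prepended to the keys and values while only text tokens issue queries; the shared projection matrices and the function name `prefix_attention` are assumptions for the example, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prefix_attention(text_h, visual_prefix, w_q, w_k, w_v):
    """Single-head self-attention over [visual prefix; text].

    text_h:        (n_text, d) hidden states of the text tokens
    visual_prefix: (n_prefix, d) visual features already scaled to width d
    """
    q = text_h @ w_q                                   # queries: text only
    kv_input = np.concatenate([visual_prefix, text_h], axis=0)
    k = kv_input @ w_k                                 # keys: prefix + text
    v = kv_input @ w_v                                 # values: prefix + text
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v, weights
```

The output keeps the length of the text sequence, so the prefix injects visual information without changing the shape that later layers expect.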
They evaluate their approach on three benchmarks and demonstrate state-of-the-art performance compared to different text-only and multimodal baselines. The model also outperforms the baselines in low-resource settings.
The paper proposes an approach that directly uses images’ grid features obtained through CNNs to learn joint multimodal representations without requiring object detectors.
It introduces two object-related pretraining tasks applied in addition to the widely-used image-text matching and masked language modelling objectives.
Using a Transformer model, they co-learn object detection during pretraining. At inference time, no two-stage architecture with an external object detector is required. The new pretraining objectives are:
Object guided masked vision modelling: The authors sample objects in images and mask out their grid-level features in the image vector.
Phrase-region alignment: This objective aims to pull positive pairs of text phrases and image regions closer together in the embedding space than negative pairs.
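The first objective can be sketched as a masking step on a grid of CNN features. This is a minimal illustration, assuming an object is given as a pixel bounding box and that every grid cell touched by the box is masked; the function name and the zero mask value are assumptions, not the paper's exact procedure.

```python
import numpy as np

def mask_object_grid_features(grid_feats, box, image_size, mask_value=0.0):
    """Mask all grid cells that overlap an object's bounding box.

    grid_feats: (H, W, D) grid-level image features
    box:        (x0, y0, x1, y1) in pixels
    image_size: (width, height) in pixels
    """
    h, w, _ = grid_feats.shape
    img_w, img_h = image_size
    cell_w, cell_h = img_w / w, img_h / h
    x0, y0, x1, y1 = box
    col0, col1 = int(np.floor(x0 / cell_w)), int(np.ceil(x1 / cell_w))
    row0, row1 = int(np.floor(y0 / cell_h)), int(np.ceil(y1 / cell_h))
    mask = np.zeros((h, w), dtype=bool)
    mask[row0:row1, col0:col1] = True
    masked = grid_feats.copy()          # leave the input untouched
    masked[mask] = mask_value
    return masked, mask
```

The model would then be trained to reconstruct or classify the masked cells, which is what ties the grid features to object-level information.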
While achieving comparable or better performance than the baseline models, the proposed approach no longer requires object detectors during fine-tuning and testing.
This work proposes an evaluation framework for probing video-text models. The aim is to analyse whether models (which achieve a high accuracy on video retrieval) can correctly differentiate between similar entities and actions.
The authors construct contrast sets using an automated pipeline to probe models on their understanding of entities and actions. They replace verbs and entities in textual descriptions such that a caption’s semantics changes but the new caption is still plausible given the image. The authors also create a small set of hard negatives using crowdsourcing. Models show similar behaviour on the automatic and manually-created contrast sets.
Results indicate that recent CLIP-based methods as well as earlier models struggle with the multiple-choice-style probing task. Models particularly struggle with contrast sets where verbs are replaced by their antonyms.
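A verb-swap contrast set can be sketched in a few lines. This is a toy illustration: the antonym table, the whitespace tokenisation, and the `score_fn` interface are all assumptions for the example, whereas the authors use an automated pipeline that is not reproduced here.

```python
# Hypothetical antonym table; a real pipeline would use a lexical resource.
ANTONYMS = {"opens": "closes", "enters": "leaves", "lifts": "drops"}

def make_contrast_caption(caption, swaps=ANTONYMS):
    """Replace the first swappable verb; return None if nothing can be swapped."""
    tokens = caption.split()
    for i, tok in enumerate(tokens):
        if tok in swaps:
            return " ".join(tokens[:i] + [swaps[tok]] + tokens[i + 1:])
    return None

def prefers_original(score_fn, video, caption, contrast):
    """True if the model scores the original caption above the contrast caption.

    `score_fn(video, text)` stands in for any video-text similarity model.
    """
    return score_fn(video, caption) > score_fn(video, contrast)
```

A model that truly understands the action should prefer the original caption; accuracy over many such pairs is the probing score.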
This paper provides an overview of work that uses human-generated explanations as the gold standard for evaluating models’ performance in explanation generation.
It discusses the limitations of human explanations and argues that these must be considered if explanations are used as uniform ground truth labels.
This TACL paper provides an overview of explanation-based human debugging of NLP models, a research area interlinking previous work in explainability, human-in-the-loop learning, and knowledge integration.
The authors review papers that use explanations by humans as feedback for model performance and improvement. The survey organizes SOTA work across the dimensions of (1) explanation generation, (2) human feedback, and (3) model update.
Chen et al. examine the robustness of rationale models against adversarial attacks in the form of noisy and adversarial text.
Rationale models are models that first generate rationales (e.g. by selection of input tokens) before making predictions based on the selected rationales only.
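The select-then-predict structure can be sketched as below. This is a toy illustration: the token scores, the word lists, and both function names are assumptions for the example, and in a real rationale model both the selector and the predictor are learned.

```python
# Hypothetical toy lexicon standing in for a learned predictor.
POSITIVE = {"great", "good", "enjoyable"}
NEGATIVE = {"bad", "boring", "awful"}

def select_rationale(tokens, scores, k):
    """Keep the k highest-scoring tokens, preserving their original order."""
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]

def predict_from_rationale(rationale):
    """Predict using ONLY the selected tokens; the rest of the input is unseen."""
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in rationale)
    return "positive" if score >= 0 else "negative"
```

Since the predictor only sees the rationale, inserted adversarial text can affect the prediction only if the selector picks it up.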
While rationale models show improved robustness compared to their non-rationale counterparts, the experiments indicate that such models are sensitive to certain attack factors such as the position of the adversarial text and the added words.
This work studies explanation generation in the context of toxic speech using different knowledge sources. The authors integrate expert knowledge (i.e. annotations), as well as implicit (e.g. generative models) and explicit knowledge (e.g. knowledge graphs) for explaining why a text is hateful.
Data collection and benchmarking
Panel on the future of data collection @DADC workshop
I attended the DADC workshop and especially enjoyed the panel on the future of data collection with panelists Anna Rogers, Jordan Boyd-Graber, Sam Bowman, Sherry Tongshuang Wu, Lora Aroyo, Douwe Kiela, and Swabha Swayamdipta.
Starting with the status quo in data collection and evaluation, Lora emphasized that the goal of data collection should be capturing the natural expressions, perception, and diversity in humans instead of aiming for answers that fit well into models. She currently sees a lack of evaluation approaches that ensure we reach this goal.
Swabha highlighted that a common pitfall of adversarial data collection is collecting data samples that are no longer meaningful for the task at hand. As adversarial examples can be very diverse, we as a community need to take a step back and consider what adversarial examples actually are. We also need to define certain terms that are commonly used in adversarial data collection but only vaguely defined.
Douwe raised the interesting point that in data collection a key point is to understand our data better, mentioning the data cartography work by Swayamdipta et al. (2020). Moreover, from an academic perspective it is useful that we ask the question of what kind of data we want to collect and work on as a community to measure our progress.
Douwe and Sherry both also mentioned interactive data collection with humans and different models in the loop as an effective method that can help data collection become more model-agnostic. Sam criticized one aspect of adversarial data collection, namely that the data becomes biased towards the model used during collection.
The panel also had an interesting discussion on the role of experts and crowdworkers in data collection, emphasizing the diversity in annotator groups. Concentrating only on expert datasets (with experts from our research community or neighbouring research communities), we might miss things due to the lack of diversity in the annotators for some dimensions/variables. Using experts, crowdworkers, and models jointly in data collection would lead to more diverse data.
While state-of-the-art models perform remarkably well on vision-and-language navigation (VLN) benchmarks, this work is motivated by the fact that it remains mostly unclear how agents perceive their multimodal input. This paper discusses how different agents perceive multimodal input in indoor and outdoor VLN tasks.
This question is studied with the following counterfactual interventions:
Altering instructions by removing or replacing object/direction/numeric tokens in instructions.
Changing the visual environment by masking objects or flipping the image view horizontally.
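Both kinds of intervention can be sketched as simple input transformations. This is a minimal illustration: the direction word list, the `[MASK]` placeholder, and the function names are assumptions for the example, not the paper's implementation.

```python
import numpy as np

# Hypothetical list of direction words.
DIRECTION_TOKENS = {"left", "right", "forward", "straight", "back"}

def mask_tokens(instruction, vocab=DIRECTION_TOKENS, mask="[MASK]"):
    """Replace every token found in `vocab` with a mask placeholder.

    Punctuation attached to a masked token is dropped along with it.
    """
    return " ".join(
        mask if tok.lower().strip(".,") in vocab else tok
        for tok in instruction.split()
    )

def flip_view(image):
    """Horizontally flip an (H, W, C) image array."""
    return image[:, ::-1, :]
```

Comparing an agent's success rate on the original and the altered inputs shows how much it relies on the removed information.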
Moreover, they compare Transformer-based agents to non-Transformer agents on these interventions and investigate how models perform differently in indoor and outdoor navigation settings.
The results show that in indoor settings agents mostly rely on object and direction tokens of instructions, while in outdoor scenarios they mostly concentrate on direction instructions. Indoor agents can align object tokens to their visual environment to a certain extent, whereas outdoor agents poorly understand visual objects. Finally, Transformer-based models outperform non-Transformer models in terms of numerical reasoning and cross-modal understanding.
Kasai et al. introduce a new evaluation approach for image-captioning models considering (i) the trade-off between precision and recall (salient information), and (ii) the text quality of the generated caption.
The authors evaluate models on the MSCOCO dataset and measure text quality with the help of human evaluators analysing the text for fluency, conciseness, and inclusive language.
While human performance ranks 250th on the MSCOCO leaderboard, their evaluations suggest that automatically generated captions lack quality compared to human-written ones on several levels.
Tutorial on Human-Centered Evaluation of Explanations
While there has been a growing interest in the NLP community in methods for explanation generation, there is no consensus on how to evaluate explanations.
The tutorial on human-centered evaluation of explanations focused on human-centered approaches, with the aim of explaining model predictions in a way that humans can easily understand.
Bias & Fairness
This paper surveys metrics for measuring bias and fairness in pretrained language models and evaluates their compatibility. The authors find that measures proposed in the literature are difficult to compare, as they strongly depend on design choices such as probing templates, target seeds, and the embeddings that are probed.
Lalor et al. evaluate intersectional bias in pretrained models across multiple demographic factors, i.e. gender, race, age, education, and income.
The paper also discusses the trade-off between fairness and predictive performance of models after applying debiasing strategies. While models’ predictive power is well preserved, intersectional biases are not truly mitigated, although debiased models are relatively “fair” in terms of single-dimensional biases.
This paper studies how annotators’ identities and beliefs bias annotations for toxic language detection. The authors conduct two annotation experiments with more than 600 annotators whose attitudes are measured across seven attitude types: valuing the freedom of offensive speech, perceiving the harm of hate speech, endorsement of racist beliefs, traditionalism, language purism (lingpurism) (i.e. there is a “correct” way of using English), empathy, and altruism.
Crowdworkers annotate posts with three characteristics: anti-Black language, African American English (AAE) dialect, and vulgarity. The results indicate strong associations between annotator attributes and the toxicity ratings of the posts.
Some further papers I enjoyed reading!
New pretrained models, trained on ever more data, are released every few days, and the carbon footprint of these models has recently become a subject of discussion.
This paper evaluates techniques for reducing energy consumption while maintaining model performance and computation time. For example, one of the proposed methods is power capping, which limits the maximum power consumption of GPUs, resulting in a 15% reduction in energy consumption.
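The post does not say which tooling the paper used. The sketch below assumes NVIDIA's `nvidia-smi` command-line tool, whose `-pl` option sets a power limit in watts, and adds the basic energy arithmetic. The helper names and all numbers in the comments are hypothetical.

```python
def build_power_cap_command(gpu_index, watts):
    """Build (but do not run) the nvidia-smi call that sets a GPU power limit.

    Running it typically requires administrator rights, e.g. via
    subprocess.run(build_power_cap_command(0, 200), check=True).
    """
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]

def energy_kwh(avg_power_watts, hours):
    """Energy = average power x time, converted from watt-hours to kWh."""
    return avg_power_watts * hours / 1000.0

# Hypothetical comparison: a capped run draws less power but takes longer.
# It saves energy only if the runtime grows less than the power drops.
uncapped = energy_kwh(avg_power_watts=250, hours=10)
capped = energy_kwh(avg_power_watts=200, hours=11)
```

The comparison makes the trade-off explicit: a power cap is worthwhile when the slowdown is proportionally smaller than the reduction in power draw.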
"Prompts allow faster learning in the same way that humans understand tasks faster when given instructions." This paper evaluates this argument, which is commonly used as a motivation for prompt-based learning, by experimenting with over 30 manually written prompt templates for NLI.
Each prompt belongs to one of the following five categories:
Instructive templates: Positive examples describing the NLI task similarly to human instructions for NLI.
Misleading-moderate: Prompts intended for other related/tangential tasks to NLI. If models understand instructions, they will perform poorly on the NLI task given these prompts.
Misleading-extreme: Prompts that instruct the models to perform a task unrelated to NLI.
Irrelevant: Simple concatenation of the premise, an unrelated sentence, and the hypothesis.
Null: The premise and hypothesis concatenated without any additional text.
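The five categories can be made concrete with one template each. These templates are made up for illustration and are not taken from the paper; only the category names come from the list above.

```python
# One hypothetical template per category; {premise} and {hypothesis} are slots.
TEMPLATES = {
    "instructive": '{premise} Based on that, is it true that "{hypothesis}"?',
    "misleading-moderate": '{premise} Is this a paraphrase of "{hypothesis}"?',
    "misleading-extreme": '{premise} Is this text about sports? "{hypothesis}"',
    "irrelevant": "{premise} The weather was pleasant that day. {hypothesis}",
    "null": "{premise} {hypothesis}",
}

def build_prompt(category, premise, hypothesis):
    """Fill the template of the given category with an NLI example."""
    return TEMPLATES[category].format(premise=premise, hypothesis=hypothesis)

# Same NLI example rendered under every category.
prompts = {
    name: build_prompt(name, "A dog runs.", "An animal moves.")
    for name in TEMPLATES
}
```

Holding the NLI example fixed and varying only the template isolates the effect of the instruction text on model performance.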
The results indicate that models perform similarly with irrelevant or misleading prompts and with meaningful ones. While the paper acknowledges the strong performance of prompt-based models, its results question how much of this can be attributed to understanding instructions the way humans do.
This work investigates how multilingual pre-trained models learn representations across languages in the absence of explicit signal, e.g. parallel texts.
The authors test their assumption that models learn language-universal abstractions about grammar by probing two multilingual pre-trained models, m-BERT and XLMR, across 43 languages and fourteen morphosyntactic categories. Using neuron-level probes, they evaluate whether models encode morphosyntactic information in the same subset of neurons across different languages.
The results indicate that a cross-lingual overlap of neurons exists. The extent might vary across categories and depends on language proximity (e.g. languages with similar typological features) and the size of pre-training corpora.
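Cross-lingual neuron overlap can be sketched as a set comparison. This is a simplified illustration, assuming each language has a per-neuron importance score (here the absolute weight of a hypothetical linear probe) and measuring overlap with the Jaccard index; the paper's actual probe and overlap measure may differ.

```python
import numpy as np

def top_k_neurons(probe_weights, k):
    """Indices of the k neurons with the largest absolute probe weight."""
    return set(np.argsort(-np.abs(probe_weights))[:k].tolist())

def neuron_overlap(neurons_a, neurons_b):
    """Jaccard overlap between two sets of neuron indices."""
    return len(neurons_a & neurons_b) / len(neurons_a | neurons_b)

# Hypothetical probe weights for one morphosyntactic category in two languages.
weights_lang_a = np.array([0.9, 0.1, -0.8, 0.0])
weights_lang_b = np.array([0.7, -0.6, 0.05, 0.0])
shared = neuron_overlap(top_k_neurons(weights_lang_a, 2),
                        top_k_neurons(weights_lang_b, 2))
```

An overlap well above what random neuron subsets would give suggests that the two languages encode the category in shared neurons.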
This paper received the “Best efficient NLP paper” award at NAACL.
It demonstrates that Transformer training can be sped up by a large margin by replacing the self-attention sublayers with unparameterized Fourier transform sublayers. While still reaching 92-97% of the accuracy of BERT-base on the GLUE benchmark, the Fourier transform counterpart trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths.
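An unparameterized Fourier mixing sublayer can be sketched in NumPy. This is a minimal illustration, assuming the sublayer applies a discrete Fourier transform along the hidden dimension and along the sequence dimension and keeps only the real part; residual connections, layer normalisation, and the feed-forward sublayer are omitted.

```python
import numpy as np

def fourier_mixing(x):
    """Mix tokens without learned parameters.

    x: (seq_len, hidden) array of token representations.
    Applies a DFT along the hidden axis, then along the sequence axis,
    and returns the real part, so the output has the same shape as x.
    """
    return np.fft.fft(np.fft.fft(x, axis=-1), axis=-2).real

# Usage: every output position depends on every input position.
tokens = np.random.default_rng(0).normal(size=(8, 16))
mixed = fourier_mixing(tokens)
```

The function has no weights to train, which is where the speed-up over self-attention would come from in this kind of design.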
I hope you enjoyed this blog post!