- Mubashara Akhtar
Conference Insights: ACL 2022
Updated: Sept. 17, 2022
This post offers an overview of multimodal research at ACL 2022 and summarizes some conference highlights.
Following two online versions (2020 and 2021), this year's ACL was a hybrid conference taking place in Dublin and online. Approximately 1,500 in-person attendees were registered.
The first day kicked off with tutorials, including one on multimodal training: "Vision-Language Pretraining: Current Trends and the Future". For the main conference, two new formats of invited talks were introduced: the Spotlight Talks by Young Rising Stars (STIRS) and the Next Big Ideas talks. I briefly summarize some of the talks below. The two final (workshop) days covered a wide range of topics across 28 different workshops.
At this year's ACL, we find papers that use various combinations of modalities for different tasks, probing, model training, dataset creation, etc. The post summarizes a few papers that I found particularly interesting while scrolling through the ACL proceedings.
This blog post is the first in a series that will offer an overview of papers related to multimodality and NLP. It is inspired by Michael Galkin's blog on knowledge graphs, which I strongly recommend if you are interested in knowledge graphs!
I'd love to hear your thoughts and feedback on this post, so please feel free to reach out!
Special thanks also go to Sebastian Ruder for providing very useful comments on this blog post!
Next Big Ideas Talks
The Next Big Ideas talks formed a plenary session moderated by Iryna Gurevych, featuring talks by Marco Baroni, Eduard Hovy, Heng Ji, Mirella Lapata, Hang Li, Dan Roth, and Thamar Solorio. The talks were very inspiring and thought-provoking. Here, I will offer more details on two of them but recommend checking out the others on Underline.
Mirella Lapata
Mirella's talk was on story understanding, inspired by the fact that we find stories everywhere in our daily lives, e.g. in books, movies, and podcasts. She highlighted that the NLP research community should care more about stories if we aim to master NLU.
She also mentioned concrete challenges related to story understanding, such as the availability of annotated data or reasoning over long text spans. Moreover, story understanding includes multiple subtasks such as understanding characters, the context/setting of an event, commonsense knowledge about the world, causality, temporal structure, etc.
Eduard Hovy
I also found the talk by Eduard Hovy particularly interesting. Perhaps you have seen one or another Twitter post quoting statements from his talk, e.g. "Stop being lazy." or "Let's start thinking again!".
Knowledge gaps in large pretrained language models were at the centre of his talk. Even after training LMs on a large share of the web, certain knowledge types are not captured. This knowledge is either not written explicitly, rare in the training data, or hidden somewhere in the data.
To overcome these gaps, we need to:
Find out and specify what type of knowledge models fail to capture.
Develop appropriate approaches to fill the gaps.
He mentioned some exemplary knowledge gaps linked to our commonsense knowledge:
Knowledge about goals/incentives of human beings, such as the fact that humans might have conflicting goals.
The structure and schema of events. As an example, he mentioned the structure of sub-events involved in ordering food at a restaurant, which is obvious to humans.
People’s roles in groups, i.e. the notion that humans take up different social roles within a group.
The talk concluded with a call to action for NLP researchers: to stop being lazy and start thinking again, e.g. by looking at knowledge structures and applying our human knowledge to build typology, a set of features, etc.
Spotlight Talks by Young Rising Stars (STIRS)
Another new initiative was the Spotlight Talks by Young Rising Stars plenary session. It featured ten-minute talks by Eunsol Choi, Ryan Cotterell, Sebastian Ruder, Swabha Swayamdipta, and Diyi Yang. Here, I briefly summarize the talks by Diyi Yang and Swabha Swayamdipta, but definitely recommend checking out the other talks as well!
Diyi Yang
Diyi Yang gave a talk on socially aware NLP. While we have seen super-human performance of large models on different tasks, we see issues related to social norms, culture, religion, underrepresented groups, etc. when probing and evaluating them.
These issues can pose a challenge when deploying models in the real world. Diyi called for incorporating social awareness into NLP progress. She proposed extending the common NLP pipeline with components inspired by social science research (highlighted in red in the figure below).

She introduced seven social factors that NLP systems need to be aware of to overcome current limitations (see her and Dirk Hovy's paper "The Importance of Modeling Social Factors of Language: Theory and Practice" for more details):

She concluded the talk by highlighting the impact of socially aware NLP on current challenges; for example, in bridging the divide between SOTA tasks used for benchmarking and human-centered tasks, or between knowledge available to the models today and social knowledge required for the success of NLP technology.
Swabha Swayamdipta
Swabha gave a talk on “Mapping and Generating Datasets for Robust Generalization,” placing the question of whether data scale is really the key to NL generalization at the center of her talk.
While large-scale datasets like StanfordNLI and MultiNLI have triggered lots of research, they contain biases and artifacts, such as premises with the token “cat” being correlated with the label “contradiction”.
Swabha highlighted that we need better tools to analyse model-data relationships and go beyond accuracy. She discussed data maps as a tool to discover ambiguous data samples, i.e. samples with high variability (= standard deviation of the true class probability). She also showed that models trained on samples that are “ambiguous” to the model performed much better on out-of-distribution datasets.
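For readers who want to play with the idea, here is a minimal sketch of how data-map coordinates could be computed from per-epoch probabilities of the gold label (array shapes and variable names are my own, not from the paper):

```python
import numpy as np

def data_map_coordinates(gold_probs_per_epoch):
    """Compute data-map coordinates from per-epoch gold-class probabilities.

    gold_probs_per_epoch: array of shape (num_epochs, num_examples), where each
    entry is the probability the model assigned to the true label at that epoch.
    """
    confidence = gold_probs_per_epoch.mean(axis=0)   # mean true-class probability
    variability = gold_probs_per_epoch.std(axis=0)   # std. dev. = "ambiguity"
    return confidence, variability

# Toy example: 3 training epochs, 4 training examples
probs = np.array([[0.90, 0.20, 0.50, 0.95],
                  [0.92, 0.30, 0.80, 0.97],
                  [0.95, 0.25, 0.30, 0.99]])
confidence, variability = data_map_coordinates(probs)
most_ambiguous_first = np.argsort(-variability)   # candidates for robust training
```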
Below you can see an exemplary Data Map for the SNLI dataset:

Special Theme on Language Diversity
The special theme of the conference was “Language Diversity: from Low-Resource to Endangered Languages”, aligned with the 60th anniversary of ACL.
In addition to the best theme paper and a paper by Leong et al. (2022) discussed in the Data scarcity section, I came across two very interesting theme papers during the conference.
Below you can find recommendations from both papers on how to conduct research on endangered or low-resource languages.
One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia (Aji et al.)
Indonesia is the fourth most populous country in the world with more than 700 languages, of which 440 are listed as endangered and twelve as extinct. Aji et al. (2022) offer an overview of the current state of NLP for Indonesian languages and discuss how we should develop NLP technology for underrepresented languages in general.
Dialect metadata. Inspired by regional and dialect variations of Indonesian languages, the authors suggest adding dialect metadata to datasets and models. This can help in clearly communicating systems’ capabilities to users and stakeholders.
Data efficiency. Data collection is a well-known and commonly discussed challenge for low-resource languages. They suggest a research focus on data-efficient approaches, few-shot learning, and learning from related languages for which data is available.
Compute efficiency. While large models may work well on particular benchmarks, the authors suggest developing more lightweight and faster neural networks that local communities can adopt.
NLP beyond text. Given that limited written text is available for many languages, they suggest exploring research directions which are less text-focused, e.g. spoken language understanding.
How can NLP Help Revitalize Endangered Languages? A Case Study and Roadmap for the Cherokee Language (Zhang et al.)
This position paper discusses how NLP practitioners can (i) understand and collaborate with local language communities and (ii) assist in language education. The authors also discuss their case study of Cherokee, an endangered Native American language.
They suggest following three principles while conducting research on endangered indigenous languages:
Understanding basic needs and building respect between the researchers and language communities.
Placing the voice of the community and their needs at the center of the research process.
Building a community with speakers of the language by finding common interests and setting common goals.
Tutorial on Vision-Language Pretraining: Current Trends and the Future
The first conference day kicked off with tutorials, including one organized by Aishwarya Agrawal, Damien Teney, and Aida Nematzadeh on VL pretraining (slides available).
Aishwarya offered a historical overview of the VL landscape before the pretraining era. She discussed common tasks that the community has been working on since the early years (i.e. image retrieval, grounding expressions in images, VQA, etc.), the basic skeleton of VL models, and benchmarking for VL tasks.
Aishwarya highlighted several challenges in training for VL tasks and discussed the following future research directions:
Biases and artifacts present in many datasets
Distribution shifts in datasets
Poor evaluation metrics for image captioning
Counting visual entities in images
Reasoning over text in images
Compositional reasoning
Commonsense and knowledge based reasoning
In her part of the talk, Aida discussed more recent models and pretraining approaches. She highlighted that most early transformer-based VL models (i.e. ViLBERT, LXMERT, etc.) shared similar architectures and pretraining objectives.
Moreover, she discussed in more detail how modeling decisions and data selection influence model performance; for example, models using attention between modalities outperform those where there is no cross-attention between language and image.
She also highlighted previous work discussing the redundancy of image loss for VL pretraining (see Frank et al., 2021), and criteria for “good” VL pretraining datasets (see Hendricks et al., 2021).
Multimodal papers @ACL
Multimodal representations
Several ACL papers focus on multimodal representation learning.
Learning multimodal representations requires encoding heterogeneous data from different modalities such that they complement each other on the one hand while avoiding redundancy on the other.
Liu et al. propose an approach to learn representations that are modality-independent and capture information on a finer granularity level. Wang et al. introduce a notion to measure whether multimodal models treat different languages equally. Milewski et al. probe multimodal BERTs for structural knowledge in text and images.
Cross-Modal Discrete Representation Learning (Liu et al.)
This paper presents a new approach for generating cross-modal representations. Unlike previous approaches, which mostly concentrate on generating high-level cross-modal representations, the approach captures semantic concepts independent of the input data modalities and encodes information at different levels of granularity.
The figure below offers an overview of the method, which comprises:
Encoding data into latent features using unimodal encoders (encoder fine) and generating high-level feature vectors for each modality (encoder high).
Adopting an objective function (L_mms) to maximise the similarity between the high-level representations of aligned pairs while reducing the similarity between "negative" pairs.
Obtaining more fine-grained representations by projecting features from 1) into a “shared discrete embedding space” (green).
The approach uses a cross-modal code matching objective (in addition to the similarity loss L_mms) to learn a modality-independent space; for example, using audio and image data. A codebook is introduced to capture cross-modal correspondence between the unimodal features.
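To make the two components more tangible, here is a rough PyTorch sketch (not the authors' implementation; sizes and names are illustrative) of a shared codebook that discretises unimodal features, plus a simple contrastive loss over aligned high-level pairs:

```python
import torch
import torch.nn.functional as F

# Shared discrete embedding space ("codebook"), used by all modalities.
codebook = torch.nn.Embedding(num_embeddings=512, embedding_dim=256)

def quantise(features):
    """Assign each fine-grained feature vector to its nearest codebook entry.
    features: (batch, seq_len, 256) unimodal encoder outputs."""
    flat = features.reshape(-1, features.size(-1))
    codes = torch.cdist(flat, codebook.weight).argmin(dim=-1)
    codes = codes.reshape(features.shape[:-1])
    return codebook(codes), codes          # discrete representations + code indices

def similarity_loss(high_a, high_b, temperature=0.07):
    """Contrastive loss in the spirit of L_mms: aligned pairs (the diagonal) are
    pulled together, all other pairs in the batch act as negatives."""
    a = F.normalize(high_a, dim=-1)
    b = F.normalize(high_b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return F.cross_entropy(logits, targets)
```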

They demonstrate the approach’s applicability to different modality pairs on video-text, video-audio, and image-audio retrieval tasks (see table below).

Through ablation studies and comparing to baselines without the second cross-modal objective, the paper shows that fine-grained cross-modal features complement high-level ones.
I think that in the future we might see more and more representation-learning approaches that are applicable to a wider range of modalities.
Assessing Multilingual Fairness in Pre-trained Multimodal Representations (Wang et al.)
Let us stay with multimodal representations but focus on another aspect: multilingual fairness in a multimodal context.
This paper discusses the following questions at the intersection of multilinguality and multimodality: (i) Do multilingual multimodal models treat different languages equally? (ii) Is the performance of multimodal models biased towards particular languages?
To evaluate multilingual fairness in multimodal representations, the authors introduce two notions:
a) Multilingual individual fairness. The intention is that semantically similar translations should be equally similar/dissimilar to the grounding image of the source text. For example, given the sentence "this is a cat" and its German translation "das ist eine Katze", a fair multilingual VL model should perceive both sentences as equally similar to the image of a cat.
b) Multilingual group fairness. Multilingual group fairness is given if multimodal models have a similar predictive performance across languages for VL tasks.
The paper evaluates multilingual CLIP using these fairness metrics and discusses accuracy disparities across languages.
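As a back-of-the-envelope illustration (not the paper's exact metric), the individual-fairness gap for a single image-caption-translation triple could be measured like this, given embeddings from a multilingual CLIP-style encoder:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def individual_fairness_gap(image_emb, caption_emb, translation_emb):
    """Difference in image-text similarity between a caption and its translation.
    A perfectly 'individually fair' model yields a gap of 0: "this is a cat" and
    "das ist eine Katze" are then equally similar to the cat image."""
    return abs(cosine(image_emb, caption_emb) - cosine(image_emb, translation_emb))

# The embeddings would come from a multilingual CLIP text/image encoder; the gap
# is then averaged over a dataset to compare languages.
```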
Finding Structural Knowledge in Multimodal-BERT (Milewski et al.)
This paper probes multimodal BERTs to analyse whether models can store structure in their learned representations. The authors refer to structure as (i) the grammatical structure found in language; (ii) the structural dependency of objects in images (they introduce the term scene tree).
Given images' textual descriptions, the linguistic structure is extracted with dependency parsing. Dependencies between objects in the image are identified by mapping tokens of the textual dependency tree to object regions in the image (see figure below).

The paper offers interesting insights on structural knowledge in multimodal BERTs.
First, although the probed models were initialised with BERT weights, additional training on multimodal data hinders preserving grammatical structure. Second, providing additional visual information allows the multimodal BERTs to achieve similar results and helps recover textual structure. However, experiments indicate that the visual representations do not encode the tree depths of the proposed scene trees.
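To make the scene-tree idea above a bit more concrete, here is a toy sketch of one way to derive region-level edges from a textual dependency parse and a token-to-region grounding (a simplification of the paper's construction; the example names are made up):

```python
def scene_tree_edges(dependency_edges, token_to_region):
    """dependency_edges: (head_token, dependent_token) pairs from a parser.
    token_to_region: tokens grounded in the image, mapped to region ids.
    A dependency between two grounded tokens induces an edge between regions."""
    edges = set()
    for head, dependent in dependency_edges:
        if head in token_to_region and dependent in token_to_region:
            edges.add((token_to_region[head], token_to_region[dependent]))
    return edges

# "the man's hat is red": 'man' modifies 'hat', both are grounded in the image
deps = [("hat", "man"), ("is", "hat"), ("is", "red")]
regions = {"man": "region_1", "hat": "region_2"}
print(scene_tree_edges(deps, regions))   # {('region_2', 'region_1')}
```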
Data scarcity
Let us continue with another topic that is addressed in various papers at ACL: data scarcity and how to overcome it for multimodal tasks.
Fang et al. use phrase-level, region-based image retrieval to overcome data limitations for multimodal machine translation (MMT). Inspired by ε-bounded image perturbation (commonly used in CV for data augmentation), Gokhale et al. augment video-/image-inference datasets with linguistically transformed sentences. Leong et al. jointly use audio and text data to train NER models for low-resource languages. Pine et al. also address challenges related to data availability for low-resource languages, but concentrate on models for text-to-speech synthesis.
Neural Machine Translation with Phrase-Level Universal Visual Representations (Fang et al.)
While we have seen considerable improvement in multimodal machine translation (MMT), training MMT models is often bound to the availability of paired source/translated sentences and the grounding image.
To overcome this limitation and have a higher variety of grounding images available, this paper introduces an approach to learn grounded representations on a phrase instead of sentence level, which is achieved by matching phrases to regions of images.

The authors start with building a phrase-level image set based on Multi30K: a dataset containing 29k bilingual sentence-image pairs. Using source sentences and their grounding images, they extract <noun phrase, image region> pairs for all noun phrases.
During MMT, this phrase-level image set is used to extract matching top image regions for the source sentences (to be translated). Regions are matched given the semantic similarity between the phrase and regions available in the image set.
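A hedged sketch of the retrieval step (my own simplification, assuming phrase and region embeddings have already been computed):

```python
import numpy as np

def retrieve_regions(phrase_emb, phrase_bank, region_feats, k=5):
    """Return the visual features of the top-k regions whose paired noun phrases
    are most similar to the query phrase.
    phrase_bank: (N, d) embeddings of noun phrases in the phrase-level image set.
    region_feats: (N, d_v) visual features of the paired image regions."""
    sims = phrase_bank @ phrase_emb
    sims /= np.linalg.norm(phrase_bank, axis=1) * np.linalg.norm(phrase_emb) + 1e-9
    top_k = np.argsort(-sims)[:k]
    return region_feats[top_k], sims[top_k]
```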
While their results do not demonstrate high performance gains compared to image-level baselines, I think that the phrase-level retrieval approach might be interesting for other task settings, e.g. low-resource domains/languages/etc.
Semantically Distributed Robust Optimization for Vision-and-Language Inference (Gokhale et al.)
This paper transfers adversarial training to the context of VL tasks.
The authors address VL models' sensitivity to linguistic phenomena like paraphrasing, negation, textual entailment, and word substitutions by including linguistically augmented text in the training process.

They use a pre-defined set of linguistic transformations (see table below) as their perturbation set and generate (1) positive, semantics-preserving (SP) and (2) negative, semantics-inverting (SI) transformations.

Experiments on inference benchmarks with images (NLVR2) and video (VIOLIN) demonstrate performance and robustness improvements. The table below shows how models perform on the initial task in different training settings:
Using the task’s training data.
Using SI and/or SP samples in addition.
I would also like to highlight the two different approaches that they apply for selecting adversarial samples during training: sampling at a (linguistic) group level vs. sample level (see Section 2.3 in the paper for further details).
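To illustrate the two transformation types (the paper's actual perturbation set is larger and rule-based; these are toy stand-ins):

```python
def si_negate(hypothesis, label):
    """Semantics-inverting (SI) transform: negate the hypothesis, flip the label."""
    flipped = {"True": "False", "False": "True"}[label]
    return "It is not the case that " + hypothesis.lower(), flipped

def sp_paraphrase(hypothesis, label):
    """Semantics-preserving (SP) transform: reword without changing the label."""
    return "It is true that " + hypothesis.lower(), label

print(si_negate("The two dogs are playing outside.", "True"))
print(sp_paraphrase("The two dogs are playing outside.", "True"))
```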
Phone-ing it in: Towards Flexible, Multi-Modal Language Model Training using Phonetic Representations of Data (Leong et al.)
Also tackling the challenge of data limitations, this paper jointly uses audio and text data for multimodal model training.
Given that only a fraction of today’s spoken languages have sufficient text data available for training purposes, the authors develop a multimodal pre-training approach. They use whatever speech and text data is available to pre-train models for Swahili NER.
They evaluate their approach using audio/text data from "low-resource" languages, namely Swahili and Kinyarwanda. The approach first converts audio and text data into IPA phonetic symbols before training a character-based model in the phonetic space.
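A minimal sketch of the text side of this preprocessing, using epitran as one possible grapheme-to-phoneme tool (the paper's exact tooling may differ, and 'swa-Latn' as the Swahili code is my assumption):

```python
import epitran

epi = epitran.Epitran("swa-Latn")              # Swahili written in Latin script
ipa = epi.transliterate("habari ya asubuhi")   # text mapped into IPA characters
print(ipa)

# Audio would be mapped into the same IPA space with a phone recogniser, so that
# speech and text share a single character vocabulary for pre-training.
```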
They evaluate their models on three Swahili NER tasks (NER 1 – 3) with different complexity levels (see screenshot below for details on the tasks):

Their experiments show limited viability of models trained on phonetic (audio/text) data for the simpler NER tasks (i.e. NER 1 & 2). However, performance gains increase (compared to the fine-tuning-only model) for the more complex NER 3 task.
Requirements and Motivations of Low-Resource Speech Synthesis for Language Revitalization (Pine et al.)
This paper also addresses data scarcity for low-resource languages and received the Best Theme Paper award at ACL!
Most indigenous languages spoken in Canada have fewer than 500 fluent speakers today. Thus, collecting data to train text-to-speech synthesis (TTS) models is a challenge. The authors evaluate the common conception that tens of hours of audio and transcribed text are necessary to train TTS models that produce speech with a sufficient level of naturalness.
The authors conduct two different types of experiments:
First, by training TTS models for three indigenous languages with training data ranging from 25 minutes to 3.5 hours.
Second, to analyse the performance of state-of-the-art TTS models in different training settings, they experiment with English models trained on 1, 3, 5, 10, and 24 (full corpus) hours of speech.
Experiments show that a FastSpeech2 voice model trained on 3 hours of English data can achieve subjective naturalness that is not significantly different from a Tacotron2 voice trained on 24 hours.
Evaluating the models for the indigenous languages with human judges, they conclude that models can produce acceptable speech even with less data.
As an interesting side note, they mention limitations related to evaluating TTS models for languages with limited native speakers, which possibly reflects an interesting direction for future work.
Few-shot and prompt-based learning for multimodal models
Papers in this section also study multimodal models in low-resource settings. Song et al. evaluate whether CLIP's strong zero-shot performance on vision tasks transfers to vision-language (VL) tasks as well, while Jin et al. propose a method for prompt-based learning of VL tasks.
CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment (Song et al.)
This paper evaluates whether the strong zero-shot ability of CLIP can be transferred to vision-language understanding tasks. Inspired by CLIP’s great zero-shot performance on vision tasks, the authors probe its few-shot capabilities on vision-language tasks.
Two VL understanding tasks are used for the experiments: VQA and visual entailment. Moreover, to turn CLIP into a few-shot learner for VQA, the authors use prompt-based learning to reduce the gap between CLIP's pre-training tasks and the VQA task format.
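Roughly, zero-/few-shot VQA with CLIP can be set up by scoring answer candidates written into text prompts against the image; the sketch below uses the Hugging Face CLIP interface with a deliberately simple template (not the paper's prompts):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                 # placeholder image path
candidates = ["kitchen", "bathroom", "bedroom"]   # answer vocabulary for the question
prompts = [f"a photo of a {answer}" for answer in candidates]   # toy template

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
scores = model(**inputs).logits_per_image        # shape (1, num_candidates)
print(candidates[scores.argmax(dim=-1).item()])  # highest-scoring answer wins
```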
A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models (Jin et al.)
This paper evaluates the usefulness of prompt-based learning for VL tasks. More concretely, the authors discuss the following questions:
Q1) How does prompt design affect zero-/few-shot learning on new VL tasks?
To answer this, the authors test several ad-hoc prompts on vision-language tasks and analyse how strongly zero- and few-shot performance is affected by different prompt designs, namely hand-crafted and noisy prompts.
Q2) Does prompt design still matter given larger amounts of training data?
The authors train models with different sizes of training data and prompts, and compare their performance.
Q3) How do different pre-training objectives affect zero-/few-shot learning?
To study the impact of pre-training objectives on few-shot VL tasks, the paper compares prefix language modelling (PrefixLM) with masked language modelling (MaskedLM) - see figure below.

They experiment with VQA, image captioning, and the mini-ImageNet (Vinyals et al., 2016) task and report:
Q1) While zero-shot performance is significantly affected by prompt design, the effect is smaller for few-shot performance.
Q2) Their results imply that models with noisy prompts require more training data to catch up with hand-crafted prompts.
Q3) No single pre-training objective works best for all tasks; rather, the best objective depends on the downstream task. The results indicate that MaskedLM helps VQA tasks, while PrefixLM boosts captioning performance.
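For intuition, the kinds of prompt variants being contrasted look roughly like this (the concrete templates below are my own illustration, not taken from the paper):

```python
question = "what color is the car?"

handcrafted = f"question: {question} answer:"   # task-aware, hand-crafted prompt
noisy       = f"xkcd foo {question} bar baz"    # irrelevant/noisy prompt
no_prompt   = question                          # no template at all

# PrefixLM-style training predicts the answer as a continuation of the prompt,
# while MaskedLM-style training predicts the answer at a masked position, e.g.
masked_lm_input = f"question: {question} answer: <mask>"
```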
New multimodal datasets
At ACL, we also find a variety of new and exciting multimodal tasks, a few of which I discuss in this section.
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (Parcalabescu et al.)
This paper studies how well models capture fine-grained information in images to handle the following linguistic phenomena: existence, plurality, counting, relations, actions, and coreference. To this end, the authors propose VALSE (Vision And Language Structured Evaluation).
The benchmark comprises six sub-tasks, each of which follows the same structure: given a visual input, the model is asked to distinguish real captions from foil ones.
The foil captions are constructed by altering phrases in the original caption such that a specific linguistic phenomenon is targeted, e.g. the semantic number of nouns, verb argument structure, or coreference. For each example, the model must capture the linguistic phenomenon to distinguish the original and the altered caption from each other.
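As a toy illustration of a foil for the counting phenomenon (VALSE's real foils are generated and validated far more carefully):

```python
def counting_foil(caption, true_count="three", foil_count="five"):
    """Swap the numeral so that only the counting phenomenon changes."""
    return caption.replace(true_count, foil_count)

caption = "three dogs are running on the beach"
print(counting_foil(caption))   # "five dogs are running on the beach"
# Given the image, a model should score the original caption above the foil.
```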
The figure below provides an overview of the different tasks and examples for each phenomenon they include:

WIKIDiverse: A Multimodal Entity Linking Dataset with Diversified Contextual Topics and Entity Types (Wang et al.)
Multimodal entity linking (MEL) links entities mentioned in texts with their multimodal contexts found in knowledge bases (e.g. Wikipedia).
The paper mentions limitations of existing MEL benchmarks, such as a small coverage of entity types and a strong concentration on social media posts and movie reviews.
The WikiDiverse dataset has 8k image-caption pairs from WikiNews. The task is to link entity mentions from WikiNews articles to the corresponding Wikipedia entity based on the visual context available in the article’s image and Wikipedia page (see example below).

FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework (Castro et al.)
This paper introduces a new 28k video-caption dataset named FIBER for its video fill-in-the-blank task.
The authors address two limitations of previous video-language understanding tasks:
1) Existing multiple-choice video QA tasks are prone to language biases, allowing models to select the correct answer without the video.
2) Video captioning tasks use open-ended evaluation and might score correct answers as incorrect if they differ from the ground truth.
The FIBER benchmark tests models' understanding of videos with a fill-in-the-blank task. Models must fill blanked noun phrases in video captions using the video and the surrounding caption text.
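A toy example of how such an instance looks (the phrase-selection procedure in FIBER is more involved than this string replacement):

```python
caption = "A woman pours coffee into a red mug."
blanked = caption.replace("a red mug", "_____")   # noun phrase removed
answer  = "a red mug"                             # to be generated from video + text
print(blanked)                                    # "A woman pours coffee into _____."
```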

xGQA: Cross-Lingual Visual Question Answering (Pfeiffer et al.)
Pfeiffer et al. introduce xGQA to study challenges related to cross-lingual VQA, in contrast to most previous VQA benchmarks, which mainly focus on English.
The dataset extends the English GQA dataset to seven new languages from different language families:

As baselines, the authors extend VL models pretrained on English text with an adapter-based approach. While the baseline outperforms the multilingual multimodal M3P model in a zero-shot setting, there is considerable room for improvement.
Compared to the English VQA task, accuracy declines by 38 points on average for their baselines.
Visual grounding
I conclude this post with a brief section on visual grounding. Jin et al. study the impact of integrating visual knowledge into LMs on language-only tasks. Li et al. evaluate different vision encoders for multimodal machine translation.
Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer (Jin et al.)
This paper studies the impact of visual knowledge integration in LMs and whether this contributes to better performance on language-only tasks.
The authors explore two types of knowledge transfer: (i) text knowledge transfer using the image captions; and (ii) cross-modal knowledge transfer that uses both the image and its caption for vision-language training objectives.
For text knowledge transfer, they apply the following objectives on image captions:
a) Masked language modelling (on visual clues)
b) Text contrastive learning
For cross-modal transfer:
c) Voken classification matches tokens to related images (a "voken" is a visual token).
d) Cross-modal contrastive learning (CMCL) maximises the agreement between correct image-caption pairs versus random pairs.
e) Cross-modal knowledge distillation distils knowledge from a teacher model that is trained using CMCL to a student LM.
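As one concrete example of these objectives, here is a hedged sketch of what the distillation step (e) could look like (temperature and weighting are illustrative choices, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """A text-only student LM matches the softened output distribution of a
    teacher that was trained with cross-modal contrastive learning (CMCL)."""
    teacher = F.softmax(teacher_logits / temperature, dim=-1)
    student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean") * temperature ** 2
```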
The figure below provides an overview of all objectives:

Their results show that:
Text knowledge transfer (MLM, TCL) improves LM performance in (i) low-resource as well as (ii) fully supervised settings.
Cross-modal knowledge transfer objectives (c – e) are also useful in both low-resource and fully supervised training.
CMCL in particular improves performance on downstream tasks. The authors also observe that CMCL training improves with an adversarial negative-sampling strategy and with augmenting the data with positive samples.
Li et al. contribute another interesting paper on MMT. The authors probe VL models and analyse the impact of different vision encoders on MMT.
Interestingly, they do not simply plug in different encoders and assess performance using BLEU; rather, they use three different probing tasks to assess models' ability to infer correct translations for masked tokens by extracting information from the grounding image.
The paper uses the following masked language modelling tasks for probing:
Colour-based probing masks all words referring to colours of image objects with a special token, [MASK_C].
Character-based probing masks tokens referring to characters in the image ("man", "woman", "people", "men", "girl", and "boy") with a [MASK_P] token.
Noun-based probing masks varying numbers of noun phrases in the source sentence with a [MASK_N] token.
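A toy version of the colour-based probe, to make the setup concrete (the word list and tokenisation are simplified):

```python
COLOURS = {"red", "blue", "green", "yellow", "black", "white", "brown"}

def colour_probe(sentence, mask_token="[MASK_C]"):
    """Replace colour words with a special mask token; the model must recover
    them with the help of the grounding image."""
    return " ".join(mask_token if tok.lower() in COLOURS else tok
                    for tok in sentence.split())

print(colour_probe("a man in a red shirt holds a blue umbrella"))
# "a man in a [MASK_C] shirt holds a [MASK_C] umbrella"
```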
The table below provides examples for each probe:

The authors experiment with English-German and English-French benchmarks, concluding that the choice of vision model has an impact on model performance.
It is particularly interesting that the impact is not measurable by simply looking at the BLEU score, which remains approximately the same. However, models perform differently on the probing tasks when different vision models are used for image encoding. For example, the transformer-based ViT encoder beats ResNet50 (still commonly used for MMT) on all probing tasks.
I hope you enjoyed this blog post!
Feel free to contact me with feedback, suggestions for future editions, or your thoughts on the discussed papers: mubashara.akhtar@kcl.ac.uk.