This post offers an overview of multimodal research at ACL 2022 and summarizes some conference highlights.
Following two online versions (2020 and 2021), this year's ACL was a hybrid conference taking place in Dublin and online. Approximately 1,500 in-person attendees were registered.
The first day kicked off with tutorials, including one on multimodal training: "Vision-Language Pretraining: Current Trends and the Future". For the main conference, two new invited-talk formats were introduced: the Spotlight Talks by Young Rising Stars (STIRS) and the Next Big Ideas talks. I briefly summarize some of the talks below. The two final (workshop) days covered a wide range of topics across 28 different workshops.
At this year's ACL, we find papers that use various combinations of modalities for different tasks, probing, model training, dataset creation, etc. The post summarizes a few papers that I found particularly interesting while scrolling through the ACL proceedings.
This blog post is the first in a series that will offer an overview of papers related to multimodality and NLP. It is inspired by Michael Galkin's blog on knowledge graphs, which I strongly recommend if you are interested in knowledge graphs!
I'd love to hear your thoughts and feedback on this post, so please feel free to reach out!
Special thanks go to Sebastian Ruder for providing very useful comments on this blog post!
Next Big Ideas Talks
The Next Big Ideas Talks formed a plenary session moderated by Iryna Gurevych, with talks by Marco Baroni, Eduard Hovy, Heng Ji, Mirella Lapata, Hang Li, Dan Roth, and Thamar Solorio. The talks were very inspiring and thought-provoking. Here, I will offer more details on two of them, but recommend checking out the others on Underline.
Mirella Lapata
Mirella's talk was on story understanding, inspired by the fact that we find stories everywhere in our daily lives, e.g. in books, movies, and podcasts. She highlighted that the NLP research community should care more about stories if we aim to master NLU.
She also mentioned concrete challenges related to story understanding, such as the availability of annotated data or reasoning over long text spans. Moreover, story understanding includes multiple subtasks such as understanding characters, the context/setting of an event, commonsense knowledge about the world, causality, temporal structure, etc.
Eduard Hovy
I also found the talk by Eduard Hovy particularly interesting. Perhaps you have seen one or another Twitter post quoting statements from his talk, e.g. “Stop being lazy.” or “Let’s start thinking again!”.
Knowledge gaps in large pretrained language models were at the center of his talk. Even after training LMs on a large share of the web, certain knowledge types are not captured. This knowledge is either not written down explicitly, rare in the training data, or hidden somewhere in the data.
To overcome these gaps, we need to:
Find out and specify what type of knowledge models fail to capture.
Develop appropriate approaches to fill the gaps.
He mentioned some exemplary knowledge gaps linked to our commonsense knowledge:
Knowledge about goals/incentives of human beings, such as the fact that humans might have conflicting goals.
The structure and schema of events. As an example, he mentioned the structure of sub-events involved in ordering food at a restaurant, which is obvious to humans.
People’s roles in groups, i.e. the notion that humans take up different social roles within a group.
The talk concluded with a call to action for NLP researchers: to stop being lazy and start thinking again, e.g. by looking at knowledge structures and applying our human knowledge to build a typology, a set of features, etc.
Spotlight Talks by Young Rising Stars (STIRS)
Another new initiative was the Spotlight Talks by Young Rising Stars plenary session. It featured ten-minute talks by Eunsol Choi, Ryan Cotterell, Sebastian Ruder, Swabha Swayamdipta, and Diyi Yang. Here, I briefly summarize the talks by Diyi Yang and Swabha Swayamdipta, but definitely recommend checking out the other talks as well!
Diyi Yang
Diyi Yang gave a talk on socially aware NLP. While we have seen super-human performance of large models on different tasks, we see issues related to social norms, culture, religions, underrepresented groups, etc. when probing and evaluating them.
These issues can pose a challenge when adopting models in the real world. Diyi called for incorporating social awareness into NLP. She proposed extending the common NLP pipeline with components inspired by social science research (highlighted in red in the figure below).
She introduced seven social factors that NLP systems need to be aware of to overcome current limitations (see her and Dirk Hovy's paper "The Importance of Modeling Social Factors of Language: Theory and Practice" for more details):
She concluded the talk by highlighting the impact of socially aware NLP on current challenges; for example, in bridging the divide between SOTA tasks used for benchmarking and human-centered tasks, or between knowledge available to the models today and social knowledge required for the success of NLP technology.
Swabha Swayamdipta
Swabha gave a talk on “Mapping and Generating Datasets for Robust Generalization,” placing the question of whether data scale is really the key to robust generalization at the center of her talk.
While large-scale datasets like SNLI (Stanford NLI) and MultiNLI have triggered lots of research, they contain biases and artifacts, such as premises with the token “cat” being correlated with the label “contradiction”.
Swabha highlighted that we need better tools to analyse model-data relationships and go beyond accuracy. She discussed data maps as a tool to discover ambiguous data samples, i.e. samples with high variability (= standard deviation of the true class probability). She also showed that models trained on samples that are “ambiguous” to the model performed much better on out-of-distribution datasets.
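To make the data-map idea more concrete, here is a minimal sketch of how the two coordinates (confidence and variability) can be computed from per-epoch predictions. The function and array names are my own, and this is not the authors' implementation:

```python
import numpy as np

def data_map_coordinates(true_class_probs: np.ndarray):
    """Compute data-map coordinates from an array of shape (num_epochs, num_examples)
    holding the model's probability of the gold label for every training example
    at the end of each epoch."""
    confidence = true_class_probs.mean(axis=0)   # how confidently the model predicts the gold label
    variability = true_class_probs.std(axis=0)   # how much that confidence fluctuates across epochs
    return confidence, variability

# Hypothetical log of 5 epochs for 4 training examples.
probs = np.array([
    [0.90, 0.20, 0.50, 0.80],
    [0.95, 0.30, 0.20, 0.85],
    [0.92, 0.25, 0.70, 0.90],
    [0.97, 0.20, 0.30, 0.88],
    [0.96, 0.15, 0.60, 0.90],
])
confidence, variability = data_map_coordinates(probs)
ambiguous = np.argsort(-variability)  # high-variability examples are the "ambiguous" ones
```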
Below you can see an exemplary Data Map for the SNLI dataset:
Special Theme on Language Diversity
The special theme of the conference was “Language Diversity: from Low-Resource to Endangered Languages”, aligned with the 60th anniversary of ACL.
In addition to the best theme paper and a paper by Leong et al. (2022) discussed in the Data scarcity section, I came across two very interesting theme papers during the conference.
Below you can find recommendations from both papers on how to conduct research on endangered or low-resource languages.
One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia (Aji et al.)
Indonesia is the fourth most populous country in the world with more than 700 languages, of which 440 are listed as endangered and twelve as extinct. Aji et al. (2022) offer an overview of the current state of NLP for Indonesian languages and discuss how we should develop NLP technology for underrepresented languages in general.
Dialect metadata. Inspired by regional and dialect variations of Indonesian languages, the authors suggest adding dialect metadata to datasets and models. This can help in clearly communicating systems’ capabilities to users and stakeholders.
Data efficiency. Data collection is a well-known and commonly discussed challenge for low-resource languages. The authors suggest a research focus on data-efficient approaches, few-shot learning, and learning from related languages for which data is available.
Compute efficiency. While large models may work well on particular benchmarks, the authors suggest developing more lightweight and faster neural networks that can be adopted by local communities.
NLP beyond text. Given that limited written text is available for many languages, they suggest exploring research directions which are less text-focused, e.g. spoken language understanding.
How can NLP Help Revitalize Endangered Languages? A Case Study and Roadmap for the Cherokee Language (Zhang et al.)
Tutorial on Vision-Language Pretraining: Current Trends and the Future
The first conference day kicked off with tutorials, including one organized by Aishwarya Agrawal, Damien Teney, and Aida Nematzadeh on VL pretraining (slides available).
Aishwarya offered a historical overview of the VL landscape before the pretraining era. She discussed common tasks that the community has been working on since the early years (i.e. image retrieval, grounding expressions in images, VQA, etc.), the basic skeleton of VL models, and benchmarking for VL tasks.
Aishwarya highlighted several challenges in training for VL tasks and discussed the following future research directions:
Biases and artifacts present in many datasets
Distribution shifts in datasets
Poor evaluation metrics for image captioning
Counting visual entities in images
Reasoning over text in images
Compositional reasoning
Commonsense and knowledge based reasoning
In her part of the talk, Aida discussed more recent models and pretraining approaches. She highlighted that most early transformer-based VL models (e.g. ViLBERT, LXMERT) shared similar architectures and pretraining objectives.
Moreover, she discussed in more detail how modeling decisions and data selection influence model performance; for example, models using attention between modalities outperform those where there is no cross-attention between language and image.
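To illustrate what "attention between modalities" refers to here, below is a rough sketch of one direction of a generic co-attention layer, with text tokens attending over image-region features. It is not tied to any specific model from the tutorial:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Text tokens attend over image-region features (one direction of co-attention)."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, num_tokens,  dim)
        # image_feats: (batch, num_regions, dim)
        attended, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        return self.norm(text_feats + attended)  # residual connection

layer = CrossModalAttention()
text = torch.randn(2, 12, 768)     # e.g. 12 subword tokens
regions = torch.randn(2, 36, 768)  # e.g. 36 detected image regions
fused = layer(text, regions)       # (2, 12, 768)
```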
She also highlighted previous work discussing the redundancy of image loss for VL pretraining (see Frank et al., 2021), and criteria for “good” VL pretraining datasets (see Hendricks et al., 2021).
Multimodal papers @ACL
Multimodal representations
Several ACL papers focus on multimodal representation learning.
Learning multimodal representations requires encoding heterogeneous data from different modalities such that the representations complement each other while avoiding redundancy.
Liu et al. propose an approach to learn representations that are modality-independent and capture information on a finer granularity level. Wang et al. introduce a notion to measure whether multimodal models treat different languages equally. Milewski et al. probe multimodal BERTs for structural knowledge in text and images.
Cross-Modal Discrete Representation Learning (Liu et al.)
This paper presents a new approach for generating cross-modal representations. Unlike previous approaches, which mostly concentrate on generating high-level cross-modal representations, it captures semantic concepts independent of the input data modality and encodes information at different levels of granularity.
The figure below offers an overview of the method, which comprises:
1) Encoding data into latent features using unimodal encoders (encoder fine) and generating high-level feature vectors for each modality (encoder high).
2) Adopting an objective function (L_mms) to maximise the similarity between the high-level representations of aligned pairs while reducing the similarity between “negative” pairs.
3) Obtaining more fine-grained representations by projecting the features from 1) into a “shared discrete embedding space” (green).
The approach uses a cross-modal code matching objective (in addition to the similarity loss L_mms) to learn a modality-independent space; for example, using audio and image data. A codebook is introduced to capture cross-modal correspondence between the unimodal features.
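The “shared discrete embedding space” is essentially a codebook shared across modalities. Below is a heavily simplified sketch of quantising unimodal features against such a shared codebook (vector quantisation with a straight-through gradient estimator); class and variable names are my own, and this is not the authors' implementation:

```python
import torch
import torch.nn as nn

class SharedCodebook(nn.Module):
    """Quantise fine-grained features from any modality against one shared set of codewords."""
    def __init__(self, num_codes: int = 512, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, feats):
        # feats: (batch, seq_len, dim) fine-grained unimodal features
        # Squared distances to every codeword: (batch, seq_len, num_codes)
        dists = (feats.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        codes = dists.argmin(dim=-1)          # index of the nearest codeword
        quantised = self.codebook(codes)
        # Straight-through estimator: gradients flow back to the encoder
        # as if quantisation were the identity function.
        quantised = feats + (quantised - feats).detach()
        return quantised, codes

codebook = SharedCodebook()
audio_feats = torch.randn(2, 50, 256)   # e.g. audio frames from the fine encoder
image_feats = torch.randn(2, 36, 256)   # e.g. image regions from the fine encoder
q_audio, audio_codes = codebook(audio_feats)
q_image, image_codes = codebook(image_feats)
# The cross-modal code matching objective then encourages aligned audio/image pairs
# to activate similar codewords.
```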
They demonstrate the approach’s applicability to different modality pairs on video-text, video-audio, and image-audio retrieval tasks (see table below).
Through ablation studies and comparing to baselines without the second cross-modal objective, the paper shows that fine-grained cross-modal features complement high-level ones.
I think that in the future we might see more and more representation learning approaches that are applicable to a wider range of modalities.
Assessing Multilingual Fairness in Pre-trained Multimodal Representations (Wang et al.)
Finding Structural Knowledge in Multimodal-BERT (Milewski et al.)
Data scarcity
Let us continue with another topic that is addressed in various papers at ACL: data scarcity and how to overcome it for multimodal tasks.
Fang et al. use phrase-level, region-based image retrieval to overcome data limitations for multimodal machine translation (MMT). Inspired by ε-bounded image perturbation (commonly used in CV for data augmentation), Gokhale et al. augment video-/image-inference datasets with linguistically transformed sentences. Leong et al. jointly use audio and text data to train NER models for low-resource languages. Pine et al. also address challenges related to data availability for low-resource languages, but concentrate on models for text-to-speech synthesis.
Neural Machine Translation with Phrase-Level Universal Visual Representations (Fang et al.)
While we have seen considerable improvements in multimodal machine translation (MMT), training MMT models is often limited by the availability of triples of source sentence, translation, and grounding image.
To overcome this limitation and make a greater variety of grounding images available, this paper introduces an approach to learn grounded representations at the phrase rather than the sentence level, achieved by matching phrases to regions of images.
The authors start by building a phrase-level image set based on Multi30K, a dataset containing 29k bilingual sentence-image pairs. Using the source sentences and their grounding images, they extract <noun phrase, image region> pairs for all noun phrases.
During MMT, this phrase-level image set is used to retrieve the top matching image regions for the source sentences to be translated. Regions are matched based on the semantic similarity between the phrase and the regions available in the image set.
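A rough sketch of this retrieval step, under the simplifying assumption that phrase and region embeddings have already been computed and cosine similarity is used as the matching score (the paper's actual scoring may differ):

```python
import torch
import torch.nn.functional as F

def retrieve_top_regions(phrase_emb: torch.Tensor, region_embs: torch.Tensor, top_k: int = 5):
    """Return the image regions most similar to a source-sentence noun phrase.

    phrase_emb:  (dim,)              embedding of one noun phrase
    region_embs: (num_regions, dim)  embeddings of all regions in the phrase-level image set
    """
    sims = F.cosine_similarity(phrase_emb.unsqueeze(0), region_embs, dim=-1)
    scores, indices = sims.topk(min(top_k, region_embs.size(0)))
    return indices, scores

# Hypothetical usage: the retrieved regions act as visual grounding for the phrase during translation.
phrase_emb = torch.randn(512)
region_embs = torch.randn(10_000, 512)
top_idx, top_scores = retrieve_top_regions(phrase_emb, region_embs)
```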
While their results do not demonstrate high performance gains compared to image-level baselines, I think that the phrase-level retrieval approach might be interesting for other task settings, e.g. low-resource domains/languages/etc.
Semantically Distributed Robust Optimization for Vision-and-Language Inference (Gokhale et al.)
Phone-ing it in: Towards Flexible, Multi-Modal Language Model Training using Phonetic Representations of Data (Leong et al.)
Requirements and Motivations of Low-Resource Speech Synthesis for Language Revitalization (Pine et al.)
Few-shot and prompt-based learning for multimodal models
Papers in this section also study multimodal models in low-resource settings. Song et al. evaluate whether CLIP's strong zero-shot performance on vision tasks transfers to vision-language (VL) tasks as well, while Jin et al. propose a method for prompt-based learning on VL tasks.
CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment (Song et al.)
This paper evaluates whether the strong zero-shot ability of CLIP can be transferred to vision-language understanding tasks. Inspired by CLIP’s great zero-shot performance on vision tasks, the authors probe its few-shot capabilities on vision-language tasks.
Two VL understanding tasks are used for the experiments: VQA and visual entailment. To turn CLIP into a few-shot learner for VQA, the authors use prompt-based learning to reduce the gap between CLIP's pre-training task and the VQA task format.
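The general recipe (score each candidate answer by inserting it into a prompt and letting CLIP rank the resulting image-text pairs) can be sketched as follows; the prompt template and answer set below are placeholders, not the ones used in the paper:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("kitchen.jpg")                    # hypothetical VQA image
question = "What is on the table"
candidate_answers = ["a cup", "a laptop", "a cat"]   # placeholder answer vocabulary

# Turn each (question, answer) pair into a statement that CLIP can score against the image.
prompts = [f"{question}? {answer}." for answer in candidate_answers]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image   # shape (1, num_answers)

predicted_answer = candidate_answers[logits_per_image.argmax(dim=-1).item()]
```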
A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models (Jin et al.)
New multimodal datasets
At ACL, we also find a variety of new and exciting multimodal datasets and tasks, a few of which I discuss in this section.
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (Parcalabescu et al.)
This paper studies models’ ability to capture fine-grained information in images in order to deal with the following linguistic phenomena: existence, plurality, counting, relations, actions, and coreference.
To this end, the authors propose VALSE (Vision And Language Structured Evaluation).
The benchmark comprises six sub-tasks, each constructed using the same structure: given a visual input, the model is asked to distinguish real captions from foil ones.
The foil captions are constructed by altering phrases in the original caption such that a specific linguistic phenomenon is targeted, e.g. the semantic number of nouns, verb argument structure, or coreference. For each example, models must capture the linguistic phenomenon to distinguish the original and the altered caption from each other.
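To give a flavour of the setup, a single benchmark instance can be thought of as a record like the following; this is a made-up illustration for the counting phenomenon, not an example taken from VALSE:

```python
# Hypothetical VALSE-style instance (not from the actual benchmark).
instance = {
    "image": "park_scene.jpg",                         # placeholder image identifier
    "caption": "Two dogs are playing on the grass.",   # matches the image
    "foil": "Four dogs are playing on the grass.",     # count altered, no longer matches
    "phenomenon": "counting",
}
# A model is scored on whether, given the image, it prefers the caption over the foil.
```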
The figure below provides an overview of the different tasks and examples of each phenomenon they include:
WIKIDiverse: A Multimodal Entity Linking Dataset with Diversified Contextual Topics and Entity Types (Wang et al.)
FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework (Castro et al.)
xGQA: Cross-Lingual Visual Question Answering (Pfeiffer et al.)
Visual grounding
I conclude this post with a brief section on visual grounding. Jin et al. study the impact of integrating visual knowledge into LMs on language-only tasks. Li et al. evaluate different vision encoders for multimodal machine translation.
Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer (Jin et al.)
This paper studies the impact of visual knowledge integration in LMs and whether this contributes to better performance on language-only tasks.
The authors explore two types of knowledge transfer: (i) text knowledge transfer using the image captions; and (ii) cross-modal knowledge transfer that uses both the image and its caption for vision-language training objectives.
For text knowledge transfer, they apply the following objectives on image captions:
a) Masked language modelling (on visual clues)
b) Text contrastive learning
For cross-modal transfer:
c) Voken classification matches tokens to related images (a "voken" is a visual token).
d) Cross-modal contrastive learning (CMCL) maximises the agreement between correct image-caption pairs versus random pairs (a minimal sketch follows after this list).
e) Cross-modal knowledge distillation distils knowledge from a teacher model that is trained using CMCL to a student LM.
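For objective d), a bare-bones version of a symmetric image-caption contrastive loss with in-batch negatives might look as follows; this is a generic InfoNCE-style sketch, not the authors' exact setup:

```python
import torch
import torch.nn.functional as F

def cmcl_loss(image_embs, caption_embs, temperature: float = 0.07):
    """Symmetric contrastive loss over a batch of aligned image-caption pairs.

    image_embs, caption_embs: (batch, dim); row i of each is an aligned pair,
    all other rows in the batch act as negatives.
    """
    image_embs = F.normalize(image_embs, dim=-1)
    caption_embs = F.normalize(caption_embs, dim=-1)
    logits = image_embs @ caption_embs.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0))                  # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

loss = cmcl_loss(torch.randn(8, 256), torch.randn(8, 256))
```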
The figure below provides an overview of all objectives:
Their results show that:
Text knowledge transfer (MLM, TCL) improves LM performance in both low-resource and fully supervised settings.
Cross-modal knowledge transfer objectives (c – e) are also useful in both low-resource and fully supervised training.
In particular, CMCL improves performance on downstream tasks. The authors also observe that CMCL training benefits from an adversarial negative sampling strategy and from augmenting the data with positive samples.
I hope you enjoyed this blog post!
Feel free to contact me with feedback, suggestions for future editions, or your thoughts on the discussed papers: mubashara.akhtar@kcl.ac.uk.