Welcome to the website of EXIST 2024, the fourth edition of the sEXism Identification in Social neTworks task at CLEF 2024.
EXIST is a series of scientific events and shared tasks on sexism identification in social networks. EXIST aims to capture sexism in a broad sense, from explicit misogyny to other subtle expressions that involve implicit sexist behaviours (EXIST 2021, EXIST 2022, EXIST 2023). The fourth edition of the EXIST shared task will be held as a Lab at CLEF 2024, on September 9-12, 2024, at the University of Grenoble Alpes, Grenoble, France.
Social networks are the main platforms for social complaint, activism, etc. Movements like #MeToo, #8M or #Time’sUp have spread rapidly. Under the umbrella of social networks, many women all around the world have reported abuse, discrimination and other sexist experiences suffered in real life. Social networks are also contributing to the transmission of sexism and other disrespectful and hateful behaviours. In this context, automatic tools may help not only to detect and alert against sexist behaviours and discourses, but also to estimate how often sexist and abusive situations are found in social media platforms, which forms of sexism are most frequent, and how sexism is expressed in these media. This Lab will contribute to developing applications to detect sexism.
While the three previous editions focused solely on detecting and classifying sexist textual messages, this new edition incorporates new tasks centered around images, particularly memes. Memes are images, typically humorous in nature, that spread rapidly through social networks and among Internet users. With this addition, we aim to encompass a broader spectrum of sexist manifestations in social networks, especially those disguised as humor. Consequently, it becomes imperative to develop automated multimodal tools capable of detecting sexism in both text and memes.
Similar to the approach in the 2023 edition, this edition will also embrace the Learning With Disagreement (LeWiDi) paradigm for both the development of the dataset and the evaluation of the systems. The LeWiDi paradigm doesn’t rely on a single “correct” label for each example. Instead, the model is trained to handle and learn from conflicting or diverse annotations. This enables the system to consider various annotators’ perspectives, biases, or interpretations, resulting in a fairer learning process.
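Purely as an illustration of the LeWiDi idea, and not as the official EXIST setup, the sketch below turns the raw labels of several annotators into a probability distribution and trains against it with a soft cross-entropy loss; the function name, the label names and the six example annotations are assumptions made for the example.

```python
from collections import Counter

import torch
import torch.nn.functional as F

def soft_label(annotations, classes=("NO", "YES")):
    """Turn the labels given by several annotators into a probability
    distribution over classes (a 'soft' gold label), instead of a single
    majority-vote label."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return torch.tensor([counts[c] / total for c in classes])

# Hypothetical example: six annotators, four judge the message sexist.
target = soft_label(["YES", "YES", "NO", "YES", "NO", "YES"])  # tensor([0.3333, 0.6667])

# One training step against the soft target: the model is pushed towards the
# annotators' distribution rather than towards a single "correct" label.
logits = torch.randn(1, 2, requires_grad=True)       # stand-in for a model's output
loss = F.cross_entropy(logits, target.unsqueeze(0))  # cross_entropy accepts soft targets
loss.backward()
```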
In previous editions, 75 teams from more than 20 countries submitted their runs, achieving impressive results, especially in the sexism detection task. However, there is still room for improvement, especially when the problem is addressed under the LeWiDi paradigm.
Participants will be asked to classify “tweets” or “memes” (in English and Spanish) according to the following six tasks:
The first task is a binary classification task. Systems have to decide whether or not a given tweet contains sexist expressions or behaviours (i.e., it is sexist itself, describes a sexist situation or criticizes a sexist behaviour). The following tweets show examples of sexist and non-sexist messages.
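Purely as an illustration of what a minimal Task 1 system might look like (not an official baseline), the sketch below trains a simple text classifier; the placeholder texts and the "YES"/"NO" label names are assumptions made for the example, not dataset content or the official label format.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder data; in practice, the EXIST training tweets and their labels.
train_texts = ["placeholder tweet 1", "placeholder tweet 2",
               "placeholder tweet 3", "placeholder tweet 4"]
train_labels = ["YES", "NO", "YES", "NO"]   # sexist / not sexist

# Character n-grams behave reasonably for both English and Spanish without
# language-specific preprocessing.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)
print(model.predict(["another placeholder tweet"]))
```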
Once a message has been classified as sexist, the second task aims to categorize the message according to the intention of the author, which provides insights into the role played by social networks in the emission and dissemination of sexist messages. In this task, we propose a ternary classification task:
DIRECT: the intention was to write a message that is sexist by itself or incites others to be sexist, as in:
REPORTED: the intention is to report and share a sexist situation suffered by a woman or women in first or third person, as in:
JUDGEMENTAL: the intention was to judge, since the tweet describes sexist situations or behaviours with the aim of condemning them.
Many facets of a woman’s life may be the focus of sexist attitudes, including domestic and parenting roles, career opportunities, sexual image, and life expectations, to name a few. Automatically detecting which of these facets are most frequently attacked in social networks will facilitate the development of policies to fight against sexism. Accordingly, each sexist tweet must be categorized into one or more of the following categories (a minimal multi-label sketch is provided after the list):
IDEOLOGICAL AND INEQUALITY: The text discredits the feminist movement, rejects inequality between men and women, or presents men as victims of gender-based oppression.
STEREOTYPING AND DOMINANCE: The text expresses false ideas about women that suggest they are more suitable to fulfill certain roles (mother, wife, family caregiver, faithful, tender, loving, submissive, etc.), or inappropriate for certain tasks (driving, hard work, etc.), or claims that men are somehow superior to women.
OBJECTIFICATION: The text presents women as objects apart from their dignity and personal aspects, or assumes or describes certain physical qualities that women must have in order to fulfill traditional gender roles (compliance with beauty standards, hypersexualization of female attributes, women’s bodies at the disposal of men, etc.).
SEXUAL VIOLENCE: Sexual suggestions, requests for sexual favors or harassment of a sexual nature (rape or sexual assault) are made.
MISOGYNY AND NON-SEXUAL VIOLENCE: The text expresses hatred and violence towards women.
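As referenced above, the following is a minimal, purely illustrative multi-label sketch for this categorization; the placeholder texts and label assignments are invented for the example, and this is not the official baseline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

CATEGORIES = [
    "IDEOLOGICAL AND INEQUALITY",
    "STEREOTYPING AND DOMINANCE",
    "OBJECTIFICATION",
    "SEXUAL VIOLENCE",
    "MISOGYNY AND NON-SEXUAL VIOLENCE",
]

# Placeholder data: each sexist tweet may receive one or more categories.
texts = ["first placeholder tweet", "second placeholder tweet", "third placeholder tweet"]
labels = [
    ["IDEOLOGICAL AND INEQUALITY", "STEREOTYPING AND DOMINANCE"],
    ["OBJECTIFICATION", "SEXUAL VIOLENCE"],
    ["MISOGYNY AND NON-SEXUAL VIOLENCE"],
]

mlb = MultiLabelBinarizer(classes=CATEGORIES)
Y = mlb.fit_transform(labels)                      # one binary column per category

model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(texts, Y)
predicted = mlb.inverse_transform(model.predict(["fourth placeholder tweet"]))
```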
This is a binary classification task that consists of deciding whether or not a given meme is sexist. The following figures show some examples of sexist and non-sexist memes, respectively.
As in Task 2, this task aims to categorize the meme according to the intention of the author, which provides insights into the role played by social networks in the emission and dissemination of sexist messages. Due to the characteristics of memes, the REPORTED label is virtually absent, so in this task systems should only classify memes with the DIRECT or JUDGEMENTAL labels. The following figures show some examples of each, respectively.
This task aims to classify sexist memes according to the categorization provided for Task 3: (i) IDEOLOGICAL AND INEQUALITY, (ii) STEREOTYPING AND DOMINANCE, (iii) OBJECTIFICATION, (iv) SEXUAL VIOLENCE and (v) MISOGYNY AND NON-SEXUAL VIOLENCE. The following figures are some examples of categorized memes.
[Example memes: (a) Stereotyping, (b) Sexual violence, (c) Objectification, (d) Misogyny, (e) Ideological]
If you want to participate in the EXIST 2024 shared task at CLEF 2024, please proceed to register for the lab at CLEF 2024 Labs Registration site. You will receive information about how to join our Google Group, where EXIST-Datasets, EXIST-Communications, EXIST-Questions/Answers, and EXIST-Guidelines will be provided to the participants.
Participants will be required to submit their runs and will have the possibility to provide a technical report that should include a brief description of their approach, focusing on the adopted algorithms, models and resources, a summary of their experiments, and an analysis of the obtained results. Although we recommend participating in all tasks, participants are allowed to take part in just one of them (e.g. Task 1).
Technical reports will be published in CLEF 2024 Proceedings at CEUR-WS.org.
Note: All deadlines are 11:59PM UTC-12:00 (“anywhere on Earth”).
Since 2021, the primary objective of EXIST campaigns has been the identification of sexism in tweets. Three corpora of annotated tweets have been collected for different EXIST tasks.
As in previous editions, the focus of EXIST 2024 is to detect sexism in text, using the EXIST 2023 dataset, but we also extend the focus to memes. Memes are images, usually with text captions, that typically carry humor and spread through social media, forums, or other digital platforms. They can be used to spread false information, perpetuate stereotypes or humiliate people.
In EXIST 2024, we have curated a lexicon of terms and expressions likely to lead to sexist memes, derived from expressions that proved representative for identifying sexism in previous EXIST editions. The set of seeds encompasses diverse topics, incorporating terms with varying degrees of use in both sexist and non-sexist contexts, all centered around women. The final set contains 250 terms: 112 in English and 138 in Spanish.
The terms were used as search queries on Google Images to obtain the top 100 images per term. Rigorous manual cleaning procedures were then applied: we defined what counts as a meme and removed noise such as textless images, text-only images, ads, and duplicates. The final set consists of more than 3,000 memes per language.
Since the proportion of memes per term is very heterogeneous, we discarded the most unbalanced seeds and made sure that all remaining seeds have at least five memes. The final dataset was then built to obtain the most equitable distribution of memes per seed. To avoid introducing selection bias, we randomly selected memes, adhering to the appropriate distribution per seed. As a result, we have more than 2,000 memes per language for the training set and more than 500 memes per language for the test set.
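A simplified sketch of this kind of per-seed balancing is shown below; the threshold of five memes comes from the description above, while the per-seed cap, the data layout and the function name are assumptions made for the example.

```python
import random
from collections import defaultdict

def balance_by_seed(memes, min_per_seed=5, cap_per_seed=30, seed=0):
    """Discard seed terms with too few memes and randomly sample at most
    `cap_per_seed` memes per seed term, to even out the distribution.
    `memes` is an iterable of (seed_term, meme_id) pairs."""
    rng = random.Random(seed)
    by_seed = defaultdict(list)
    for term, meme_id in memes:
        by_seed[term].append(meme_id)

    selected = []
    for term, items in by_seed.items():
        if len(items) < min_per_seed:      # drop the most unbalanced seeds
            continue
        rng.shuffle(items)                 # random selection avoids hand-picking bias
        selected.extend((term, m) for m in items[:cap_per_seed])
    return selected
```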
As in the previous edition, we have considered some sources of “label bias”. Label bias may be introduced by the socio-demographic differences of the persons that participate in the annotation process, but also when more than one possible correct label exists or when the decision on the label is highly subjective. In order to mitigate label bias, we consider two different social and demographic parameters: gender (MALE/FEMALE) and age (18-22 y.o. / 23-45 y.o. / 46+ y.o.). Each meme was annotated by 6 crowdsourcing annotators selected through the Prolific app, following the guidelines developed by two experts in gender issues.
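Our reading of this setup is that the two parameters define six demographic strata, with one annotation per stratum for each meme, as in the short sketch below; the exact assignment procedure is not described here and is an assumption.

```python
from itertools import product

GENDERS = ["FEMALE", "MALE"]
AGE_GROUPS = ["18-22", "23-45", "46+"]

# Six (gender, age) strata; under our assumption, each meme receives one
# annotation from each stratum, i.e. six annotations in total.
STRATA = list(product(GENDERS, AGE_GROUPS))
assert len(STRATA) == 6
```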
As a new feature of both the 2023 and 2024 datasets, we will include three additional demographic characteristics of each annotator: level of education, ethnicity, and country of residence.
The assumption that natural language expressions have a single and clearly identifiable interpretation in a given context is a convenient idealization, but far from reality, especially in highly subjective tasks such as sexism identification. The learning with disagreements paradigm aims to deal with this by letting systems learn from datasets where no gold annotations are provided, but rather the annotations from all annotators, in an attempt to capture the diversity of views. Following methods proposed for training directly from data with disagreements, instead of using an aggregated label we will provide all annotations per instance for the 6 different strata of annotators.
More details about the dataset are provided in the task overview (bias consideration, annotation process, quality experiments, inter-annotator agreement, etc.).
If you want to access the EXIST datasets for research purposes, please fill in this form.
From the point of view of evaluation metrics, our six tasks can be described as: binary classification (Tasks 1 and 4), mono-label multi-class classification (Tasks 2 and 5), and multi-label classification (Tasks 3 and 6).
The learning with disagreements paradigm can be considered on both sides of the evaluation process:
For each of the tasks, two types of evaluation will be reported:
For all tasks and all types of evaluation (hard-hard and soft-soft) we will use the same official metric: ICM (Information Contrast Measure) (Amigó and Delgado, 2022). ICM is a similarity function that generalizes Pointwise Mutual Information (PMI), and can be used to evaluate system outputs in classification problems by computing their similarity to the ground truth categories. As there is not, to the best of our knowledge, any current metric that fits hierarchical multi-label classification problems in a learning with disagreement scenario, we have defined an extension of ICM (ICM-soft) that accepts both soft system outputs and soft ground truth assignments.
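For reference, our reading of the ICM definition from the cited paper is sketched below in LaTeX; the parameter values mentioned are the ones we understand to be commonly used, and should be checked against the paper and the task overview.

```latex
% Our reading of ICM (Amigó and Delgado, 2022); see the paper for the exact definition.
% A: set of categories assigned by the system to an item
% B: set of gold categories for that item
% IC(X) = -log P(X): information content of observing all categories in X together
\mathrm{ICM}(A, B) = \alpha_1\,\mathrm{IC}(A) + \alpha_2\,\mathrm{IC}(B) - \beta\,\mathrm{IC}(A \cup B)
% With \alpha_1 = \alpha_2 = \beta = 1 this reduces to pointwise mutual information;
% we understand \alpha_1 = \alpha_2 = 2, \beta = 3 to be the setting typically used for evaluation.
```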
For each of the tasks, the evaluation will be performed in the two modes described above, as follows:
Enrique Amigó and Agustín Delgado. 2022. Evaluating Extreme Hierarchical Multi-label Classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5809–5819, Dublin, Ireland. Association for Computational Linguistics.
Below are the official leaderboards for all participants and tasks in all evaluation contexts:
EXIST 2024 is co-located with the CLEF Conference and will be held face-to-face on Wednesday September 11th 2024 and Thursday September 12th 2024.
11:15 – 12:45 Overview of EXIST 2024 – Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes. Laura Plaza, Jorge Carrillo-de-Albornoz, Víctor Ruiz, Alba Maeso, Berta Chulvi, Paolo Rosso, Enrique Amigó, Julio Gonzalo, Roser Morante, Damiano Spina
15:30 – 16:30 Poster Session
11:15 – 12:50 EXIST 2024 Parallel Session: Sexism Detection and Categorization in Memes
12:50 – 14:00 LUNCH
14:00 – 15:30 EXIST 2024 Parallel Session: Sexism Detection and Categorization in Tweets
Overview Paper:
Extended Overview Paper:
Working Notes:
If you have any specific question about EXIST 2024, we kindly ask you to let us know through the Google Group EXIST 2024 at CLEF 2024.
For any other question that does not directly concern the shared task, please write to Jorge Carrillo-de-Albornoz.