Why EXIST?

Welcome to the website of EXIST 2026, the sixth edition of the sEXism Identification in Social neTworks task at CLEF 2026.

EXIST is a series of scientific events and shared tasks on sexism identification in social networks. EXIST aims to foster the automatic detection of sexism in a broad sense, from explicit misogyny to other subtle expressions that involve implicit sexist behaviours (EXIST 2021, EXIST 2022, EXIST 2023, EXIST 2024, EXIST 2025). The sixth edition of the EXIST shared task will be held as a Lab in CLEF 2026, on September 21-24, 2026, at Friedrich-Schiller-Universität Jena, Germany.

Sexism remains a pervasive form of social discrimination, reflected across multiple dimensions such as sexual violence, economic inequality, and online harassment. Recent data show that women represent around 85%-90% of sexual violence victims in the USA, Europe, Spain, and Australia. The gender pay gap continues to disadvantage women, who earn on average between 8.7% and 21.8% less than men across these same regions. In the digital sphere, women also experience disproportionate levels of harassment and discrimination, with reported rates ranging from 16% in the USA to 41% in Australia, compared to 5-26% for men.

In this context, the development of AI systems capable of detecting sexism on social media presents a particularly relevant challenge. The perception of what constitutes sexist behavior or expression involves a certain degree of subjectivity, as it may be influenced by cultural norms, personal experiences, and emotional reactions that cannot be fully captured through linguistic data alone. Despite significant advances in computational modeling, the mechanisms underlying human decision-making remain only partially understood. Empirical evidence suggests that human judgments are shaped not only by conscious factors (such as socio-demographic background, prior experiences, and explicit beliefs) but also by unconscious cues, including emotions, physiological states, and sensory responses that subtly guide perception and evaluation. Current AI models, largely trained on textual or visual data, lack access to these deeper layers of cognitive and affective information, limiting their ability to replicate or interpret complex social phenomena. To bridge this gap, it becomes essential to explore new training paradigms that integrate human-centered and sensor-based data to provide richer insights into how individuals consciously and unconsciously perceive sexist content.

In EXIST 2026, we take a significant step forward by integrating the principles of Human-Centered AI (HCAI) into the development of automatic tools for detecting sexism online. Recognizing that no single interpretation can fully capture the diversity of human perception, we go beyond traditional annotation paradigms by combining Learning With Disagreement (LeWiDi) with sensor-based data (EEG, heart rate, and eye-tracking signals) collected from subjects exposed to potentially sexist content, with the aim of capturing unconscious responses to sexism. This dual approach represents a breakthrough in dataset creation for sensitive and value-laden tasks: for the first time, datasets will include not only divergent judgments from annotators, but also the embodied traces of how this content affects them. This richer, multidimensional annotation process will enable the development of more inclusive, equitable, and socially aware AI systems for detecting sexism in complex multimedia formats like memes and short videos, where ambiguity and affect play a critical role.

In past editions, teams from over 50 countries submitted more than 1,700 runs, achieving remarkable outcomes, especially in the sexism detection task. However, there is still room for improvement, especially when the problem is addressed under the LeWiDi paradigm in a multimedia context.


Tasks

Building upon the EXIST 2025 dataset, this edition focuses exclusively on multimedia formats, comprising six experimental subtasks applied to images (memes) and videos (TikToks). Participants are challenged to address three main objectives: sexism identification (x.1), source intention detection (x.2), and sexism categorization (x.3).

A groundbreaking feature of this lab is the integration of Human-Centered AI principles. In the new experimental framework introduced in EXIST 2026, selected subjects were exposed to the multimedia content while their physiological and behavioral responses were continuously recorded. These multimodal signals (including eye tracking, heart rate, and EEG) enrich the traditional annotation labels, providing a deeper window into how users unconsciously process and react to sexist content in English and Spanish.

See the next sections for details and examples on each subtask (numbering is consistent with EXIST 2025).

Subtask 2.1: Sexism Identification in Memes

This is a binary classification subtask that consists of determining whether a meme describes a sexist situation or criticizes a sexist behaviour, and classifying it into one of two categories: YES or NO. The following figures are some examples of both types of memes, respectively.

(a) YES (sexist)
(b) NO (not sexist)

Subtask 2.2: Source Intention in Memes

Once a meme has been classified as sexist, the second subtask aims to categorize it according to the intention of the author, which provides insights into the role played by social networks in the emission and dissemination of sexist messages. Due to the characteristics of memes, systems should only classify memes with the DIRECT or JUDGEMENTAL labels.

  • DIRECT: the intention was to create a message that is sexist by itself or incites sexism.
  • JUDGEMENTAL: the intention was to judge, since the meme describes sexist situations or behaviours with the aim of condemning them.

The following figures are some examples of them, respectively.

(a) Direct
(b) Judgemental

Subtask 2.3: Sexism Categorization in Memes

Many facets of a woman’s life may be the focus of sexist attitudes, including domestic and parenting roles, career opportunities, sexual image, and life expectations, to name a few. Automatically detecting which of these facets are most frequently attacked in social networks will facilitate the development of policies to fight against sexism. Accordingly, each sexist meme must be assigned to one or more of the following categories:

  • IDEOLOGICAL AND INEQUALITY: The text discredits the feminist movement, rejects inequality between men and women, or presents men as victims of gender-based oppression.
  • STEREOTYPING AND DOMINANCE: The text expresses false ideas about women that suggest they are more suitable to fulfill certain roles (mother, wife, family caregiver, faithful, tender, loving, submissive, etc.), or inappropriate for certain tasks (driving, hard work, etc.), or claims that men are somehow superior to women.
  • OBJECTIFICATION: The text presents women as objects apart from their dignity and personal aspects, or assumes or describes certain physical qualities that women must have in order to fulfill traditional gender roles (compliance with beauty standards, hypersexualization of female attributes, women’s bodies at the disposal of men, etc.).
  • SEXUAL VIOLENCE: Sexual suggestions, requests for sexual favors or harassment of a sexual nature (rape or sexual assault) are made.
  • MISOGYNY AND NON-SEXUAL VIOLENCE: The text expresses hatred and violence towards women.

The following figures are some examples of categorized memes.

(a) Stereotyping

(e) Ideological

(c) Objectification

(d) Misogyny

(b) Sexual violence

Subtask 3.1: Sexism Identification in Videos

This subtask is the same as subtask 2.1, applied to videos. The following are some examples of videos classified as YES or NO.

@cayleecresta #stitch with @goodbrobadbro easy should never be the word used to describe womanhood #fyp #foryou #foryoupage #womenempowerment #women #feminism ♬ original sound - Caylee Cresta
(a) YES
@dailyhealth2 #haha #kidnapped #bigredswifesarmy #oregon #victimcard #victimblaming #bodyguard #loved #smile #lagrandeoregon ♬ original sound - รⒶ︎я︎Ⓐ︎𝔥 ģⒶ︎เ︎ᒪ︎🫦
(b) NO

Subtask 3.2: Source Intention in Videos

This subtask replicates subtask 2.2 for memes, but takes videos as its source. The following are some example videos representing each category.

@yourgirlhaylie #duet with @michaelkoz #sexist #foryou #FitCheck #throwhimaway ♬ original sound - Mike Koz
(a) Direct
@grandtheftangel remember it clearly #malegaze #feminism #objectification #womenempowerment #relatable ♬ original sound - 🖍
(b) Judgemental

Subtask 3.3: Sexism Categorization in Videos

This subtask aims to classify sexist videos according to the categorization provided for subtask 2.3: (i) IDEOLOGICAL AND INEQUALITY, (ii) STEREOTYPING AND DOMINANCE, (iii) OBJECTIFICATION, (iv) SEXUAL VIOLENCE and (v) MISOGYNY AND NON-SEXUAL VIOLENCE. The following figures are some examples of categorized videos.

@streaminfreedom I’m an idiot! @streaminfreedom #truestory #menvswomen #relationshipcomedy ♬ original sound - leanne_lou
(a) Stereotyping
@laanaintw #ViolenciaMachista #misoginia #patriarcado #91ColoursPullandBear #parati #hazmeviral ♬ sonido original - LaAnain Tw
(b) Ideological and Inequality
@zo3tv #duet with @lenatheplug #noJumper #dunked #in #theRight #goal #she #is #beautiful & #babygirl #isTo #swimsuit #never #gotTight #bodySnatched #congrats ♬ Aesthetic Girl - Yusei
(c) Objectification
@alt_acc393 IT'S A JOKEEEEE. #fyp #foryoupage #foryou ♬ original sound - alt acc
(d) Misogyny
@janetmild #niunamenos #noeslaropa #ylaculpanoeramia #violenciadegenero #violenciamachista ♬ sonido original - Yami Safdie
(e) Sexual violence

How to participate

If you want to participate in the EXIST 2026 shared task at CLEF 2026, please register for the lab at the CLEF 2026 Labs Registration site. Once you have filled out the form, you will receive an email with information on how to join the EXIST 2026 Discord Forum, where EXIST-Datasets, EXIST-Communications, EXIST-Questions/Answers, and EXIST-Guidelines will be made available to participants. This is a manual process, so please don’t worry if it takes some time :-).

Participants will be required to submit their runs and will have the possibility to provide a technical report that should include a brief description of their approach, focusing on the adopted algorithms, models and resources, a summary of their experiments, and an analysis of the obtained results. Although we recommend participating in all subtasks and in both languages, participants are allowed to participate in just one subtask (e.g. subtask 2.1) and in one language (e.g. English).

Publications

Technical reports will be published in CLEF 2026 Proceedings at CEUR-WS.org.

Important dates

  • 17 November 2025: Registration opens.
  • 26 February 2026: Training set available.
  • 9 April 2026: Test set available.
  • 23 April 2026: Registration closes.
  • 7 May 2026 (extended to 14 May 2026): Runs submission due to organizers.
  • 28 May 2026: Results notification to participants.
  • 4 June 2026: Submission of Working Notes by participants.
  • 30 June 2026: Notification of acceptance (peer reviews).
  • 6 July 2026: Camera-ready participant papers due to organizers.
  • 21-24 September 2026: EXIST 2026 at CLEF Conference.

Note: All deadlines are 11:59PM UTC-12:00 (“anywhere on Earth”).

Dataset

The EXIST 2026 Memes Dataset will be used in Task 2.X (subtasks 1–3), while the EXIST 2026 Videos Dataset will be used in Task 3.X (subtasks 1–3).

This edition builds upon the EXIST 2025 dataset and extends it with additional sensor-based information collected under a Human-Centered AI framework, enriching the original annotations with complementary physiological data.

This collection integrates and extends the datasets developed for EXIST 2024 (memes) and EXIST 2025 (TikTok videos), which have been reused within a novel experimental setting designed to explore how individuals perceive and interpret sexism in online media. In this new dataset, subjects’ conscious judgments are reflected in the labels they assigned to each instance, while unconscious or implicit reactions are captured through sensor data such as eye tracking, heart rate, and EEG activity. The sensor data captured during the annotation process will be provided to the participants to use (if desired) in the training process. Detailed descriptions of the annotation methodologies for the EXIST 2024 and 2025 datasets are available in Overview of EXIST 2025 and Overview of EXIST 2024.

In the new experimental setting, each session followed a structured protocol comprising several stages. First, subjects were fitted with physiological sensors for electrocardiography (ECG), eye tracking, and electroencephalography (EEG), specifically using Pupil Labs Neon glasses (binocular gaze recording at 200 Hz), a Garmin Venu 3 smartwatch (continuous heart-rate and inter-beat interval monitoring), and an OpenBCI Cyton 16-channel EEG headset (10–20 system, 250 Hz). A two-minute resting-state baseline was recorded before exposure to the stimuli. Second, subjects completed a demographic questionnaire, administered only once, that collected information such as age, gender, education level, country of residence, and occupation. Third, they reported their average daily social media usage (in hours) and the percentage of time spent on platforms such as TikTok, Instagram, X/Twitter, and Facebook.

The next stage involved completing a cognitive thinking style questionnaire composed of 24 items rated on a six-point Likert scale (from “totally disagree” to “totally agree”). The items were grouped into four cognitive dimensions: open thought, closed thought, intuitive thought, and effortful thought, capturing different aspects of individual reasoning and information processing tendencies. Finally, before each experimental session, subjects completed a visual analogue scale (0–100) to self-assess their current emotional state, indicating the degree to which they felt happy, sad, calm, tense, energetic, or sleepy.

During the experiment, subjects were seated comfortably while stimuli were displayed on a screen until they provided a response. A 3-second pause followed each stimulus to minimize carry-over effects between consecutive items. Each session lasted approximately 45 minutes, during which subjects viewed 100–170 stimuli. After each stimulus, they answered brief control questions to ensure engagement and comprehension. All participants provided consent for the anonymous use of their data for research purposes.

The EXIST 2026 dataset comprises 8,294 multimedia instances, including memes and short videos, in both English and Spanish. The EXIST 2026 Memes Dataset contains more than 5,000 labeled memes, both in English and Spanish. In particular, the training set contains 3,984 memes and the test set contains 1,053 memes. Distribution between both languages has been balanced. Additionally, a small number of memes from the 2025 edition have been removed after detecting duplicated instances that were not identified during the initial dataset creation process.

Evaluation

From the point of view of evaluation metrics, our six subtasks can be described as:

  • Subtasks 2.1 and 3.1 (sexism identification): binary classification, mono label.
  • Subtasks 2.2 and 3.2 (source intention): multiclass hierarchical classification, mono label. The hierarchy of classes has a first level with YES/NO, and a second level for the sexist category with two mutually exclusive subcategories: direct and judgemental. A suitable evaluation metric must reflect the fact that a confusion between not sexist and a sexist category is more severe than a confusion between two sexist subcategories.
  • Subtasks 2.3 and 3.3 (sexism categorization): multiclass hierarchical classification, multi-label. Again, the first level is a binary distinction between YES/NO, and there is a second level for the sexist category that includes “ideological and inequality”, “stereotyping and dominance”, “objectification”, “sexual violence” and “misogyny and non-sexual violence”. These classes are not mutually exclusive: a meme/video may belong to several subcategories at the same time.
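The difference in severity described for subtasks 2.2 and 3.2 can be pictured with a small sketch. The encoding below (the PARENT map, the ROOT node and the tree_distance helper) is hypothetical and only illustrates why a NO/DIRECT confusion crosses more of the hierarchy than a DIRECT/JUDGEMENTAL one; it is not the official metric.

```python
# Hypothetical encoding of the subtask x.2 class hierarchy as child -> parent.
# "ROOT" is an illustrative node name, not part of the task definition.
PARENT = {
    "NO": "ROOT",
    "YES": "ROOT",
    "DIRECT": "YES",
    "JUDGEMENTAL": "YES",
}


def ancestors(label):
    """Return the path from a label up to the root, label included."""
    path = [label]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def tree_distance(a, b):
    """Number of edges between two labels in the hierarchy."""
    path_a, path_b = ancestors(a), ancestors(b)
    common = set(path_a) & set(path_b)
    # Steps from each label up to the first shared ancestor.
    up_a = next(i for i, node in enumerate(path_a) if node in common)
    up_b = next(i for i, node in enumerate(path_b) if node in common)
    return up_a + up_b


# Two sexist subcategories share the YES parent, so they are closer to each
# other than either of them is to NO.
print(tree_distance("DIRECT", "JUDGEMENTAL"))  # 2
print(tree_distance("NO", "DIRECT"))           # 3
```

A hierarchy-aware metric is expected to penalize the second kind of confusion more heavily than the first.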

The learning with disagreements paradigm can be considered on both sides of the evaluation process:

  • The ground truth. In a “hard” setting, variability in the human annotations is reduced to a gold standard set of categories (hard labels) that are assigned to each item (e.g., using majority vote). In a “soft” setting, the gold standard is the full set of human annotations with their variability. Therefore, the evaluation metric incorporates the proportion of human annotators that have selected each category (soft labels). Note that in subtasks 2.1, 3.1, 2.2 and 3.2, which are mono label problems, the sum of the probabilities of each class must be one. But in subtasks 2.3 and 3.3, which are multi-label, each annotator may select more than one category for a single item. Therefore, the sum of the probabilities of each class may be larger than one.
  • The system output. In a “hard”, traditional setting, the system predicts one or more categories for each item. In a “soft” setting, the system predicts a probability for each category, for each item. The evaluation score is maximized when the probabilities predicted match the actual probabilities in a soft ground truth. Again, note that in subtasks 2.3 and 3.3, which are multi-label problems, the probabilities predicted by the system for each of the categories do not necessarily add up to one.
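As an illustration of how soft labels relate to annotator votes, the following minimal sketch computes the proportion of annotators choosing each class. The function names, the number of annotators and the example votes are assumptions made for illustration only; they are not taken from the released dataset format.

```python
from collections import Counter


def soft_labels_mono(votes, classes):
    """Mono-label case: each annotator contributes exactly one class."""
    counts = Counter(votes)
    n = len(votes)
    return {c: counts[c] / n for c in classes}


def soft_labels_multi(votes, classes):
    """Multi-label case: each annotator contributes a set of classes."""
    n = len(votes)
    return {c: sum(c in v for v in votes) / n for c in classes}


# Hypothetical item from a mono-label subtask (e.g. x.1), six annotators.
mono = soft_labels_mono(["YES", "YES", "YES", "YES", "NO", "NO"], ["YES", "NO"])

# Hypothetical item from a multi-label subtask (e.g. x.3), four annotators.
multi = soft_labels_multi(
    [
        {"OBJECTIFICATION", "SEXUAL VIOLENCE"},
        {"OBJECTIFICATION"},
        {"OBJECTIFICATION", "SEXUAL VIOLENCE"},
        {"NO"},
    ],
    ["NO", "OBJECTIFICATION", "SEXUAL VIOLENCE"],
)

print(sum(mono.values()))   # probabilities of a mono-label item add up to one
print(sum(multi.values()))  # 1.5: a multi-label item may exceed one
```

In the multi-label example, two annotators selected two categories each, which is why the per-class proportions add up to more than one.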

For each of the tasks, two types of evaluation will be reported:

  • Hard-hard: hard system output and hard ground truth.
  • Soft-soft: soft system output and soft ground truth.

For all tasks and all types of evaluation (hard-hard and soft-soft) we will use the same official metric: ICM (Information Contrast Measure) (Amigó and Delgado, 2022). ICM is a similarity function that generalizes Pointwise Mutual Information (PMI), and can be used to evaluate system outputs in classification problems by computing their similarity to the ground truth categories. As there is not, to the best of our knowledge, any current metric that fits hierarchical multi-label classification problems in a learning with disagreement scenario, we have defined an extension of ICM (ICM-soft) that accepts both soft system outputs and soft ground truth assignments. The evaluation framework was implemented and executed using PyEvALL, an open-source toolkit for evaluation.
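For orientation, the relation between PMI and ICM can be written as below. This is a sketch of the general form as we recall it from Amigó and Delgado (2022), where A and B denote the category sets being compared and IC is the information content; the parameter values noted in the comment are our recollection of the recommended setting and should be checked against the paper and the PyEvALL documentation.

```latex
\mathrm{IC}(A) = -\log_2 P(A)

\mathrm{PMI}(A, B) = \mathrm{IC}(A) + \mathrm{IC}(B) - \mathrm{IC}(A \cup B)

\mathrm{ICM}(A, B) = \alpha_1\,\mathrm{IC}(A) + \alpha_2\,\mathrm{IC}(B) - \beta\,\mathrm{IC}(A \cup B)

% PMI is the special case \alpha_1 = \alpha_2 = \beta = 1.
% Assumed recommended setting: \alpha_1 = \alpha_2 = 2, \beta = 3.
```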

For each of the tasks, the evaluation will be performed in the two modes described above, as follows:

  • Hard-hard evaluation. For systems that provide a hard, conventional output, we will provide a hard-hard evaluation. To derive the hard labels in the ground truth from the different annotators’ labels, we use a probabilistic threshold computed for each task. As a result, for subtask 2.1, the class annotated by more than 3 annotators is selected; for subtask 2.2, the class annotated by more than 2 annotators is selected; and for subtask 2.3 (multi-label), the classes annotated by more than 1 annotator are selected. Due to the nature of subtasks 3.1, 3.2 and 3.3 and the complexity of video labeling, the labelling methodology changed for these subtasks, so the hard labels included are those annotated by more than 1 annotator. Items for which there is no majority class (i.e., no class receives more probability than the threshold) will be removed from this evaluation scheme. The official metric will be the original ICM (as defined in Amigó and Delgado, 2022). We will also report and compare systems with F1 (the harmonic mean of precision and recall). In subtasks 2.1 and 3.1, we will use F1 for the positive class. In the remaining subtasks, we will use the average of F1 for all classes. Note, however, that F1 is not ideal in our experimental setting: although it can handle multi-label situations, it does not consider the relationships between classes. A mistake between not sexist and any of the sexist subclasses and a mistake between two of the positive subclasses are penalized equally, although the former is a more severe error.
  • Soft-soft evaluation. For systems that provide probabilities for each category, we will provide a soft-soft evaluation that compares the probabilities assigned by the system with the probabilities assigned by the set of human annotators. As in the previous case, we will use ICM-soft as the official evaluation metric in this variant. We may also report additional metrics in the final report.
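The thresholds described for the hard-hard setting can be sketched as follows. The thresholds come from the text above; the function name, the input format (one set of labels per annotator) and the handling of ties in mono-label subtasks are assumptions made for illustration, not the organizers' implementation.

```python
from collections import Counter

# "More than N annotators" thresholds as stated in the text.
MORE_THAN = {"2.1": 3, "2.2": 2, "2.3": 1, "3.1": 1, "3.2": 1, "3.3": 1}
MULTI_LABEL = {"2.3", "3.3"}


def hard_labels(subtask, votes):
    """Derive hard labels from annotator votes.

    votes: list with one set of labels per annotator.
    Returns a sorted list of labels; an empty list means the item has no
    majority class and would be removed from the hard-hard evaluation.
    """
    counts = Counter(label for vote in votes for label in vote)
    winners = sorted(
        label for label, c in counts.items() if c > MORE_THAN[subtask]
    )
    # Assumption: in mono-label subtasks, several classes passing the
    # threshold at once is treated as "no majority class".
    if subtask not in MULTI_LABEL and len(winners) > 1:
        return []
    return winners


# Hypothetical votes from six annotators.
print(hard_labels("2.1", [{"YES"}] * 4 + [{"NO"}] * 2))  # ['YES']
print(hard_labels("2.1", [{"YES"}] * 3 + [{"NO"}] * 3))  # []
print(
    hard_labels(
        "2.3",
        [{"OBJECTIFICATION", "SEXUAL VIOLENCE"}] * 2
        + [{"OBJECTIFICATION"}] * 2
        + [{"NO"}] * 2,
    )
)  # ['NO', 'OBJECTIFICATION', 'SEXUAL VIOLENCE']
```

The last call shows why the per-subtask rules matter: with a threshold of "more than 1", a bare vote count can select NO together with sexist categories, so a real pipeline would presumably resolve the first level (YES/NO) before the second.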

Enrique Amigó and Agustín Delgado. 2022. Evaluating Extreme Hierarchical Multi-label Classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5809–5819, Dublin, Ireland. Association for Computational Linguistics.

Results

A total of 247 registrations from 28 countries were received for EXIST 2026. Of these, 122 teams from 16 countries submitted at least one run, resulting in 1,306 submitted runs overall: 783 hard-label and 523 soft-label submissions.

Participation was especially strong in the meme-based tasks, which received 939 runs in total. In particular, Task 2.1 attracted the highest number of submissions, with 351 runs, followed by Task 2.3 with 295 runs and Task 2.2 with 293 runs. The video-based tasks also received substantial participation, with 367 runs across Task 3, reflecting the growing interest in multimodal approaches for sexism detection beyond text.

Overall, EXIST 2026 shows a strong engagement with multimodal sexism analysis. The high number of submissions for both hard-label and soft-label evaluation highlights the relevance of studying not only final categorical decisions, but also systems’ ability to model uncertainty in subjective and socially sensitive content.

Below are the official leaderboards for all participants and tasks in all evaluation contexts:

Link to Subtask 2.1 Leaderboard

Link to Subtask 2.2 Leaderboard

Link to Subtask 2.3 Leaderboard

Link to Subtask 3.1 Leaderboard

Link to Subtask 3.2 Leaderboard

Link to Subtask 3.3 Leaderboard

Details:

  • Hard-hard: hard system output and hard ground truth.
    • Metrics:
      • ICM-Hard: ICM is the official metric for the ranking (as defined in Amigó and Delgado, 2022).
      • ICM-Hard Norm: ICM hard normalized.
      • F1: in subtasks 2.1 and 3.1, we provide results for F1 for the positive class, “YES”. In the remaining subtasks (2.2, 2.3, 3.2 and 3.3), we provide results for the average of F1 for all classes.
    • Baselines:
      • Majority class: non-informative baseline that classifies all instances as the majority class.
      • Minority class: non-informative baseline that classifies all instances as the minority class.
  • Soft-soft: soft system output and soft ground truth.
    • Metrics:
      • ICM-Soft: ICM soft is the official metric for the ranking (as adapted from Amigó and Delgado, 2022).
      • ICM-Soft Norm: ICM soft normalized.
      • Cross Entropy: in subtasks 2.1, 3.1, 2.2 and 3.2, we provide results for the cross-entropy measure.
    • Baselines:
      • Majority class: non-informative baseline that classifies all instances as the majority class. Note that the probability of the class has been set to 1.
      • Minority class: non-informative baseline that classifies all instances as the minority class. Note that the probability of the class has been set to 1.
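To make the soft-soft baselines concrete, the sketch below builds a majority-class baseline with probability 1 and scores it with cross entropy against soft gold labels. The way the majority class is chosen (largest total soft mass), the clipping constant eps and all names are assumptions made for illustration; the official scores are produced by PyEvALL, whose exact computation may differ.

```python
import math


def cross_entropy(gold, pred, eps=1e-12):
    """Average cross entropy between soft gold labels and predictions.

    gold, pred: lists of {class: probability} dicts, one per item.
    Predicted probabilities are clipped at eps so that log(0) is avoided.
    """
    total = 0.0
    for g, p in zip(gold, pred):
        for cls, g_prob in g.items():
            total -= g_prob * math.log(max(p.get(cls, 0.0), eps))
    return total / len(gold)


def majority_baseline(gold):
    """Assign probability 1 to the overall majority class for every item."""
    mass = {}
    for g in gold:
        for cls, prob in g.items():
            mass[cls] = mass.get(cls, 0.0) + prob
    top = max(mass, key=mass.get)
    return [{cls: (1.0 if cls == top else 0.0) for cls in g} for g in gold]


# Hypothetical soft gold labels for two items of a binary subtask.
gold = [{"YES": 0.5, "NO": 0.5}, {"YES": 1.0, "NO": 0.0}]
baseline = majority_baseline(gold)

# The baseline is heavily penalized on the first item, where half of the
# annotators chose the class it assigns probability 0 to.
print(cross_entropy(gold, baseline) > cross_entropy(gold, gold))  # True
```

Because the baseline puts all its probability on one class, any item on which annotators disagree contributes a large penalty, which is what makes it a non-informative reference point in the soft setting.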

Organizers

Damiano Spina

RMIT University

Senior Lecturer

Iván Arcos

Universitat Politècnica de València

Researcher in Computational Linguistics

Jorge Carrillo-de-Albornoz

UNED

RMIT University

Associate Professor

Laura Plaza

UNED

RMIT University

Associate Professor

Maria Aloy Mayo

UPV

Researcher in Computational Linguistics

Paolo Rosso

Universitat Politècnica de València

Full Professor

Sponsors

ANNOTATE Project

(PID2024-156022OB-C31, PID2024-156022OB-C32)

Spanish Ministry of Science, Innovation and Universities funded by MICIU/AEI/10.13039/501100011033 and the European Social Fund Plus (ESF+)

ARC Centre of Excellence for Automated Decision-Making and Society (ADM+S) (CE200100005)

RMIT University

Pattern Recognition and Human Language Technologies (PRHLT) Research Center

Universitat Politècnica de València

Contact

For any questions concerning the shared task, please write to Jorge Carrillo-de-Albornoz.

Related Work

Overviews previous LeWiDi EXIST editions:

Extended Overviews previous LeWiDi EXIST editions:

Working Notes previous LeWiDi EXIST editions:

Video and Meme related work

Sensor Data and NLP related work