Why EXIST?

Welcome to the website of EXIST 2023, the third edition of the sEXism Identification in Social neTworks task at CLEF 2023.

EXIST is a series of scientific events and shared tasks on sexism identification in social networks. EXIST aims to capture sexism in a broad sense, from explicit misogyny to other subtle expressions that involve implicit sexist behaviours (EXIST 2021, EXIST 2022). The third edition of the EXIST shared task will be held as a Lab at CLEF 2023, on September 18-21, 2023, at the Centre for Research & Technology Hellas (CERTH), Thessaloniki, Greece.

Social networks are the main platforms for social complaint, activism, etc. Movements like #MeToo, #8M or #TimesUp have spread rapidly. Under the umbrella of social networks, many women all around the world have reported abuse, discrimination and other sexist experiences suffered in real life. However, social networks are also contributing to the transmission of sexism and other disrespectful and hateful behaviours. Even though social platforms such as Twitter are continuously creating new ways to identify and eradicate hateful content, they face many difficulties when dealing with the huge amount of data generated by users. In this context, automatic tools may help not only to detect and alert against sexist behaviours and discourses, but also to estimate how often sexist and abusive situations occur on social media platforms, which forms of sexism are most frequent and how sexism is expressed in these media. This Lab will contribute to developing applications to detect sexism.

In previous editions, 50 teams from more than 15 countries submitted runs and achieved impressive results, especially in the sexism detection task. However, there is still room for improvement, especially in the task of categorizing sexism according to the facet of women’s lives that is undermined.

In this new edition, we also introduce a new task and approach sexism identification from the perspective of the learning with disagreements paradigm.

Tasks

Participants will be asked to classify “tweets” (in English and Spanish) according to the following three tasks:

TASK 1: Sexism Identification

The first task is a binary classification. The systems have to decide whether or not a given tweet contains sexist expressions or behaviours (i.e., it is sexist itself, describes a sexist situation or criticizes a sexist behaviour). The following tweets show examples of sexist and not sexist messages.

  • SEXIST:
    • “Mujer al volante, tenga cuidado!"
    • “People really try to convince women with little to no ass that they should go out and buy a body. Like bih, I don’t need a fat ass to get a man. Never have."
  • NOT SEXIST:
    • “Alguien me explica que zorra hace la gente en el cajero que se demora tanto."
    • "@messyworldorder it’s honestly so embarrassing to watch and they’ll be like “not all white women are like that”"

TASK 2: Source Intention

Once a message has been classified as sexist, the second task aims to categorize the message according to the intention of the author, which provides insight into the role played by social networks in the emission and dissemination of sexist messages. We propose a ternary classification for this task:

  • DIRECT: the intention was to write a message that is sexist by itself or incites others to be sexist, as in:

    • A woman needs love, to fill the fridge, if a man can give this to her in return for her services (housework, cooking, etc), I don’t see what else she needs.
    • Women shouldn’t code…perhaps be influencer/creator instead…it’s their natural strength.
  • REPORTED: the intention is to report and share a sexist situation suffered by a woman or women in first or third person, as in:

    • I doze in the subway, I open my eyes feeling something weird: the hand of the man sat next to me on my leg #SquealOnYourPig.
    • Today, one of my year 1 class pupils could not believe he’d lost a race against a girl.
  • JUDGEMENTAL: the intention was to judge, since the tweet describes sexist situations or behaviours with the aim of condemning them, as in:

    • As usual, the woman was the one quitting her job for the family’s welfare…
    • 21st century and we are still earning 25% less than men #Idonotrenounce.

TASK 3: Sexism Categorization

Many facets of a woman’s life may be the focus of sexist attitudes, including domestic and parenting roles, career opportunities, sexual image, and life expectations, to name a few. Automatically detecting which of these facets are most frequently attacked in social networks will facilitate the development of policies to fight against sexism. Accordingly, each sexist tweet must be assigned to one or more of the following categories:

  • IDEOLOGICAL AND INEQUALITY: The text discredits the feminist movement, denies the existence of inequality between men and women, or presents men as victims of gender-based oppression.

    • “Mi hermana y mi madre se burlan de mí por defender todo el tiempo los derechos de todos y me acaban de decir feminazi, la completaron”.
    • “I think the whole equality thing is getting out of hand. We are different, thats how were made!”.
  • STEREOTYPING AND DOMINANCE: The text expresses false ideas about women that suggest they are more suitable to fulfill certain roles (mother, wife, family caregiver, faithful, tender, loving, submissive, etc.), or inappropriate for certain tasks (driving, hard work, etc.), or claims that men are somehow superior to women.

    • “@Paula2R @faber_acuria A las mujeres hay que amarlas…solo eso… Nunca las entenderás.”.
    • “Most women no longer have the desire or the knowledge to develop a high quality character, even if they wanted to.”.
  • OBJECTIFICATION: The text presents women as objects apart from their dignity and personal aspects, or assumes or describes certain physical qualities that women must have in order to fulfill traditional gender roles (compliance with beauty standards, hypersexualization of female attributes, women’s bodies at the disposal of men, etc.).

    • ““Pareces una puta con ese pantalón” - Mi hermano de 13 cuando me vio con un pantalón de cuero”.
    • “Don’t get married than blame all woman for your poor investment. You should of got a hooker but instead you choose to go get a wedding ring.”.
  • SEXUAL VIOLENCE: Sexual suggestions, requests for sexual favors or harassment of a sexual nature (rape or sexual assault) are made.

    • “#MeToo Estas 4 no han conseguido su objetivo.El juez estima que se abrieron de patas https://t.co/GSHiiwqY6Aánta lagartona hay en este \metoo"!👇🏻👇🏻🤔🤔🤔 https://t.co/8t5VmFIUFn””
    • “fuck that cunt, I would with my fist”.
  • MISOGYNY AND NON-SEXUAL VIOLENCE: The text expresses hatred and violence towards women.

    • “Las mujeres de hoy en dia te enseñar a querer… estar soltero”
    • “Some woman are so toxic they don’t even know they are draining everyone around them in poison. If you lack self awareness you won’t even notice how toxic you really are”.

How to participate

If you want to participate in the EXIST 2023 shared task at CLEF 2023, please register for the lab at the CLEF 2023 Labs Registration site. You will receive information about how to join our Google Group, where EXIST-Datasets, EXIST-Communications, EXIST-Questions/Answers, and EXIST-Guidelines will be provided to the participants.

Participants will be required to submit their runs and will have the possibility to provide a technical report that should include a brief description of their approach, focusing on the adopted algorithms, models and resources, a summary of their experiments, and an analysis of the obtained results. Although we recommend participating in all tasks, participants may take part in just one of them (e.g. Task 1).

Publications

Technical reports will be published in CLEF 2023 Proceedings at CEUR-WS.org.

Important dates

  • 14 November 2022 Registration open.
  • 13 February 2023 Training set available.
  • 27 March 2023 Development set available.
  • 10 April 2023 Test set available.
  • 28 April 2023 Registration closes.
  • 10 May 2023 Runs submission due to organizers (extended deadline: 15 May 2023).
  • 26 May 2023 Results notification to participants.
  • 5 June 2023 Submission of Working Notes by participants.
  • 23 June 2023 Notification of acceptance (peer-reviews).
  • 7 July 2023 Camera-ready participant papers due to organizers.
  • 18-21 September 2023 EXIST 2023 at CLEF Conference.

Note: All deadlines are 11:59PM UTC-12:00 (“anywhere on Earth”).

Dataset

Sexism comprises any form of oppression or prejudice against women because of their sex. As stated in Rodríguez-Sánchez et al. (2020), sexism is frequently found in social networks. It covers a wide range of behaviours (such as stereotyping, ideological issues, sexual violence, etc.) (Donoso-Vázquez and Rebollo-Catalán, 2018; Manne, 2018), and may be expressed in different forms: direct, indirect, descriptive or reported (Miller, 2009; Chiril et al. 2020). While previous studies have focused on identifying explicit hatred or violence towards women (Zeerak and Dirk, 2016; Anzovino et al., 2018; Frenda et al., 2019), the aim of the EXIST campaigns is to cover sexism in a broad sense, from explicit misogyny to other subtle expressions that involve implicit sexist behaviours. The EXIST dataset incorporates any type of sexist expression or related phenomenon, including descriptive or reported assertions where the sexist message is a report or a description of a sexist behaviour.

To this end, and following the methodology used in previous EXIST editions (Rodríguez-Sánchez et al. 2021), we collected different popular expressions and terms, both in English and Spanish, commonly used to underestimate the role of women in our society. To mitigate the seed bias, we included seeds that are commonly employed in both sexist and non-sexist contexts. The final set contains more than 400 expressions.

Crawling

The final set of seeds was used to extract tweets in both English and Spanish (more than 8,000,000 tweets were downloaded). Crawling was performed from 1st September 2021 to 30th September 2022. To ensure an appropriate balance between seeds, we removed those with fewer than 60 tweets. After this filtering, the set contains 183 seeds for Spanish and 163 seeds for English.

To mitigate terminology and temporal bias, the final sets were selected as follows: for each seed, approximately 20 tweets were randomly selected within the period from 1st September 2021 to 28th February 2022 for the training set, taking into account a representative temporal distribution between tweets of the same seed. Similarly, 3 tweets per seed within the period from 1st to 31st May 2022 were selected for the development set, and 6 tweets per seed within the period from 1st August 2022 to 30th September 2022 were selected for the test set. Only one tweet per author was included in the final selection to avoid author bias. Finally, tweets containing fewer than 5 words were removed. As a result, we have more than 3,200 tweets per language for the training set, around 500 per language for the development set, and nearly 1,000 tweets per language for the test set.
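
The selection procedure described above can be sketched roughly as follows. This is a minimal illustration and not the pipeline actually used to build the dataset: the tweet fields (`seed`, `author`, `text`, `date`), the function name `sample_split` and the order in which the filters are applied are assumptions, and the "representative temporal distribution" step is not modelled.

```python
import random
from collections import defaultdict
from datetime import date


def sample_split(tweets, start, end, per_seed, min_words=5, rng=None):
    """Pick up to `per_seed` tweets per seed inside the window [start, end].

    Hypothetical sketch: `tweets` is a list of dicts with the assumed keys
    'seed', 'author', 'text' and 'date' (a datetime.date).
    """
    rng = rng or random.Random(0)

    # Group the tweets that fall inside the time window and are long enough.
    by_seed = defaultdict(list)
    for tweet in tweets:
        in_window = start <= tweet["date"] <= end
        long_enough = len(tweet["text"].split()) >= min_words
        if in_window and long_enough:
            by_seed[tweet["seed"]].append(tweet)

    selected, seen_authors = [], set()
    for seed in sorted(by_seed):
        candidates = by_seed[seed][:]
        rng.shuffle(candidates)  # random selection within the seed
        picked = 0
        for tweet in candidates:
            if picked == per_seed:
                break
            if tweet["author"] in seen_authors:
                continue  # keep only one tweet per author
            seen_authors.add(tweet["author"])
            selected.append(tweet)
            picked += 1
    return selected
```

Under these assumptions, the training split would correspond to a call such as `sample_split(tweets, date(2021, 9, 1), date(2022, 2, 28), per_seed=20)`, with `per_seed=3` and `per_seed=6` and the corresponding windows for the development and test splits.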

Labeling process

During the annotation process we also considered some sources of “label bias”. Label bias may be introduced by socio-demographic differences among the people who take part in the annotation process, but also when more than one correct label is possible or when the decision on the label is highly subjective. To mitigate label bias, we considered two social and demographic parameters: gender (MALE/FEMALE) and age (18-22 y.o./23-45 y.o./46+ y.o.). Each tweet was annotated by 6 crowdsourcing annotators selected through the Prolific app, following the guidelines developed by two experts in gender issues.

Learning with disagreements

The assumption that natural language expressions have a single and clearly identifiable interpretation in a given context is a convenient idealization, but it is far from reality, especially in highly subjective tasks such as sexism identification. The learning with disagreements paradigm aims to deal with this by letting systems learn from datasets where no gold annotations are provided; instead, the annotations from all annotators are made available, in an attempt to capture the diversity of views. Following methods proposed for training directly from data with disagreements, instead of using an aggregated label we will provide all annotations per instance for the 6 different strata of annotators.

More details about the dataset will be provided in the task overview (bias consideration, annotation process, quality experiments, inter-annotator agreement, etc.).

References

Rodríguez-Sánchez, F., Carrillo-de-Albornoz, J., Plaza, L., Automatic Classification of Sexism in Social Networks: An Empirical Study on Twitter Data. IEEE Access (2020).

Donoso-Vázquez, Trinidad; Rebollo-Catalán, Ángeles. (coordinadoras) (2018). Violencias de género en entornos virtuales. Ediciones Octaedro, S.L.

Manne, K., DOWN GIRL: The logic of misogyny. Oxford University Press (2018)

Miller, S., Language and Sexism. Cambridge University Press (2009)

Chiril, P., Moriceau, V., Benamara, F., He said “who’s gonna take care of your children when you are at ACL?”: Reported Sexist Acts are Not Sexist. In proceedings of the ACL (2020)

Waseem, Zeerak; Hovy, Dirk, Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter. In proceedings of the ACL (2016)

Anzovino, M., Fersini, E., Rosso, P., Automatic Identification and Classification of Misogynistic Language on Twitter, Springer. In Proceedings of NLDB (2018)

Frenda S., Ghanem B., Montes-y-Gómez M., Rosso P., Online Hate Speech against Women: Automatic Identification of Misogyny and Sexism on Twitter. In: Journal of Intelligent & Fuzzy Systems, vol. 36, num. 5, pp. 4743–4752 (2019)

Rodríguez-Sánchez, F., Carrillo-de-Albornoz, J., Plaza, L., Gonzalo, J., Rosso, P., Comet, M., Donoso, T., Overview of EXIST 2021: sEXism Identification in Social neTworks. Procesamiento del Lenguaje Natural, Vol 67, (2021)

Evaluation

From the point of view of evaluation metrics, our three tasks can be described as:

  • Task 1 (sexism identification): binary classification, mono label.
  • Task 2 (source intention): multiclass hierarchical classification, mono label. The hierarchy of classes has a first level with sexist/not sexist, and a second level for the sexist category with three mutually-exclusive subcategories: direct/reported/judgemental. A suitable evaluation metric must reflect the fact that a confusion between not sexist and a sexist category is more severe than a confusion between two sexist subcategories.
  • Task 3 (sexism categorization): multiclass hierarchical classification, multi label. Again the first level is a binary distinction between sexist/not sexist, and there is a second level for the sexist category that includes ideological & inequality, stereotyping and dominance, objectification, sexual violence, misogyny and non-sexual violence. These classes are not mutually exclusive: a tweet may belong to several subcategories at the same time.
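
To make the two-level hierarchy concrete, the following minimal sketch encodes the Task 2 classes and distinguishes the two kinds of confusion mentioned above. The dictionary layout, the label strings and the helper names are illustrative assumptions, not part of the official evaluation package.

```python
# Two-level label hierarchy for Task 2 (assumed encoding).
# First level: sexist / not sexist. Second level: subcategories of sexist.
TASK2_PARENT = {
    "NOT SEXIST": None,
    "SEXIST": None,
    "DIRECT": "SEXIST",
    "REPORTED": "SEXIST",
    "JUDGEMENTAL": "SEXIST",
}


def top_level(label):
    """Return the first-level class (sexist / not sexist) of a label."""
    parent = TASK2_PARENT[label]
    return label if parent is None else parent


def error_kind(gold, predicted):
    """Tell apart correct predictions, sibling confusions and cross-level confusions."""
    if gold == predicted:
        return "correct"
    if top_level(gold) == top_level(predicted):
        return "within-sexist confusion"  # e.g. DIRECT vs REPORTED
    return "sexist/not-sexist confusion"  # the more severe error


print(error_kind("DIRECT", "REPORTED"))
print(error_kind("DIRECT", "NOT SEXIST"))
```

A hierarchy-aware metric is expected to penalize the second kind of error more heavily than the first.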

The learning with disagreements paradigm can be considered on both sides of the evaluation process:

  • (i) The ground truth. In a “hard” setting, variability in the human annotations is reduced to a gold standard set of categories (hard labels) that are assigned to each item (e.g. using majority vote). In a “soft” setting, the gold standard is the full set of human annotations with their variability. Therefore, the evaluation metric incorporates the proportion of human annotators that have selected each category (soft labels). Note that in tasks 1 and 2, which are mono label problems, the probabilities of the classes must sum to one. In task 3, which is multi label, each annotator may select more than one category for a single item, so the sum of the class probabilities may be larger than one.
  • (ii) The system output. In a “hard”, traditional setting, the system predicts one or more categories for each item. In a “soft” setting, the system predicts a probability for each category, for each item. The evaluation score is maximized when the probabilities predicted match the actual probabilities in a soft ground truth. Again, note that in task 3, which is a multi label problem, the probabilities predicted by the system for each of the categories do not necessarily add up to one.
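
As an illustration of how soft labels follow from the individual annotations, the sketch below computes the proportion of annotators that selected each class. The function name, the input layout (one list of chosen classes per annotator) and the label strings in the example are assumptions made for this sketch.

```python
from collections import Counter


def soft_labels(annotations, classes):
    """Proportion of annotators that selected each class.

    `annotations` holds one entry per annotator; each entry is the list of
    classes that annotator chose (a single class in mono-label tasks).
    """
    n_annotators = len(annotations)
    counts = Counter(label for chosen in annotations for label in set(chosen))
    return {c: counts[c] / n_annotators for c in classes}


# Task 1 style (mono label): the proportions sum to one.
task1 = soft_labels([["YES"]] * 4 + [["NO"]] * 2, ["YES", "NO"])

# Task 3 style (multi label): the proportions may sum to more than one.
task3 = soft_labels(
    [["OBJECTIFICATION", "SEXUAL VIOLENCE"]] * 3 + [["OBJECTIFICATION"]] * 3,
    ["OBJECTIFICATION", "SEXUAL VIOLENCE"],
)
```

In the second example every annotator selected OBJECTIFICATION and half of them also selected SEXUAL VIOLENCE, so the two proportions add up to 1.5.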

For each of the tasks, three types of evaluation will be reported:

  1. Hard-hard: hard system output and hard ground truth.
  2. Hard-soft: hard system output and soft ground truth.
  3. Soft-soft: soft system output and soft ground truth.

For all tasks and all types of evaluation (hard-hard, hard-soft and soft-soft) we will use the same official metric: ICM (Information Contrast Measure) (Amigó and Delgado, 2022). ICM is a similarity function that generalizes Pointwise Mutual Information (PMI), and can be used to evaluate system outputs in classification problems by computing their similarity to the ground truth categories. As there is not, to the best of our knowledge, any current metric that fits hierarchical multi-label classification problems in a learning with disagreement scenario, we have defined an extension of ICM (ICM-soft) that accepts both soft system outputs and soft ground truth assignments.
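
The relation between ICM and PMI can be illustrated with a small sketch. It assumes the general form ICM(A, B) = α1·IC(A) + α2·IC(B) − β·IC(A ∪ B), with IC(X) = −log2 P(X), as described in Amigó and Delgado (2022). The function names and the default weights (α1 = α2 = 2, β = 3) are assumptions of this sketch and should be checked against the paper and the official evaluation script; the estimation of IC over a class hierarchy is not shown.

```python
import math


def information_content(probability):
    """IC(X) = -log2 P(X): rarer events carry more information."""
    return -math.log2(probability)


def icm(ic_a, ic_b, ic_union, alpha1=2.0, alpha2=2.0, beta=3.0):
    """Weighted contrast between the IC of two items and the IC of their union."""
    return alpha1 * ic_a + alpha2 * ic_b - beta * ic_union


# With all weights set to 1 the expression reduces to PMI:
# log2(P(a, b) / (P(a) * P(b))) = IC(a) + IC(b) - IC(a, b)
p_a, p_b, p_ab = 0.5, 0.25, 0.25
pmi = math.log2(p_ab / (p_a * p_b))
icm_as_pmi = icm(
    information_content(p_a),
    information_content(p_b),
    information_content(p_ab),
    alpha1=1.0, alpha2=1.0, beta=1.0,
)
```

With the assumed default weights, comparing an item with itself yields its own information content, since (α1 + α2 − β) = 1.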

For each of the tasks, evaluation will be performed in the three modes described above, as follows:

  • Hard-hard evaluation. For systems that provide a hard, conventional output, we will provide a hard-hard evaluation. To derive the hard labels in the ground truth from the different annotators’ labels, we use a probabilistic threshold computed for each task. As a result, for task 1, the class annotated by more than 3 annotators is selected; for task 2, the class annotated by more than 2 annotators is selected; and for task 3 (multi-label), the classes annotated by more than 1 annotator are selected. Items for which there is no majority class (i.e. no class receives more probability than the threshold) will be removed from this evaluation scheme. The official metric will be the original ICM (as defined in Amigó and Delgado, 2022). We will also report and compare systems with F1 (the harmonic mean of precision and recall). In task 1, we will use F1 for the positive class. In tasks 2 and 3, we will use the average of F1 for all classes. Note, however, that F1 is not ideal in our experimental setting: although it can handle multi-label situations, it does not take into account the relationships between classes. In particular, a mistake between not sexist and any of the sexist subclasses, and a mistake between two of the positive subclasses, are penalized equally, although the former is a more severe error.
  • Hard-soft evaluation. For systems that provide a hard output we will also provide a hard-soft evaluation, comparing the categories assigned by the system with the probabilities assigned to each category in the ground truth. We will use ICM-soft as the official evaluation metric in this variant. The probabilities of the classes for each instance are calculated according to the distribution of labels and the number of annotators for that instance. Note that some instances are labeled as “UNKNOWN”. In those cases, the number of annotators is decreased by the number of “UNKNOWN” labels found for that instance. As the soft evaluation context is less restrictive, all instances of the set are included. At this point only ICM-soft will be included in the evaluation script, although we may report additional metrics in the final report.
  • Soft-soft evaluation. For systems that provide probabilities for each category, we will provide a soft-soft evaluation that compares the probabilities assigned by the system with the probabilities assigned by the set of human annotators. As in the previous case, we will use ICM-soft as the official evaluation metric in this variant. We may also report additional metrics in the final report.
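
The derivation of the gold labels described in the list above can be sketched as follows. This is an illustration under stated assumptions rather than the official evaluation script: the function names and input layout are hypothetical, "UNKNOWN" votes are assumed never to become a hard label, and since the text does not say how two classes that both exceed the threshold are handled in a mono-label task, the sketch simply returns every class above the threshold.

```python
from collections import Counter

# A class must receive MORE votes than this value to become a hard label.
VOTE_THRESHOLD = {"task1": 3, "task2": 2, "task3": 1}


def hard_labels(annotations, task):
    """Classes whose vote count exceeds the task threshold.

    An empty list means there is no majority class, so the item would be
    removed from the hard-hard evaluation.
    """
    counts = Counter(label for chosen in annotations for label in set(chosen))
    threshold = VOTE_THRESHOLD[task]
    return sorted(
        label for label, votes in counts.items()
        if votes > threshold and label != "UNKNOWN"  # assumption: UNKNOWN is never gold
    )


def soft_labels_with_unknown(annotations, classes):
    """Soft labels where annotators who answered UNKNOWN are not counted."""
    valid = [chosen for chosen in annotations if "UNKNOWN" not in chosen]
    if not valid:
        return {c: 0.0 for c in classes}
    counts = Counter(label for chosen in valid for label in set(chosen))
    return {c: counts[c] / len(valid) for c in classes}


# Task 1: a 3-3 split has no majority, a 4-2 split does.
tie = hard_labels([["YES"]] * 3 + [["NO"]] * 3, "task1")
majority = hard_labels([["YES"]] * 4 + [["NO"]] * 2, "task1")

# Two UNKNOWN answers reduce the number of annotators from 6 to 4.
soft = soft_labels_with_unknown(
    [["YES"], ["YES"], ["NO"], ["UNKNOWN"], ["UNKNOWN"], ["YES"]],
    ["YES", "NO"],
)
```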

Enrique Amigó and Agustín Delgado. 2022. Evaluating Extreme Hierarchical Multi-label Classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5809–5819, Dublin, Ireland. Association for Computational Linguistics.

Results

Below are the official leaderboards for all participants and tasks in all evaluations contexts:

Link to Task 1 Leaderboard

Link to Task 2 Leaderboard

Link to Task 3 Leaderboard

Details:

  • Hard-hard: hard system output and hard ground truth.
    • Metrics:
      • ICM-Hard: ICM is the official metric for the ranking (as defined in Amigó and Delgado, 2022).
      • ICM-Hard Norm: ICM hard normalized.
      • F1: in Task 1, we provide results for F1 for the positive class, “YES”. In Tasks 2 and 3, we provide results for the average of F1 for all classes.
    • Baselines:
      • Majority class: non-informative baseline that classifies all instances as the majority class.
      • Minority class: non-informative baseline that classifies all instances as the minority class.
  • Soft-soft: soft system output and soft ground truth.
    • Metrics:
      • ICM-Soft: ICM soft is the official metric for the ranking (as adapted from Amigó and Delgado, 2022).
      • ICM-Soft Norm: ICM soft normalized.
      • Cross Entropy: in task 1 and task 2 we provide results for cross entropy measure.
    • Baselines:
      • Majority class: non-informative baseline that classifies all instances as the majority class. Note that the probability of the class has been set to 1.
      • Minority class: non-informative baseline that classifies all instances as the minority class. Note that the probability of the class has been set to 1.
  • Hard-soft: hard system output and soft ground truth.
    • Metrics:
      • ICM-Soft: ICM soft is the official metric for the ranking (as adapted from Amigó and Delgado, 2022).
      • ICM-Soft Norm: ICM soft normalized.
    • Baselines:
      • Majority class: non-informative baseline that classifies all instances as the majority class.
      • Minority class: non-informative baseline that classifies all instances as the minority class.
      • Oracle most voted: hard approach that selects the most voted label following the same procedure as the one used to generate the gold hard.
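
For readers who want to reproduce the non-informative baselines and the secondary metrics listed above, a minimal sketch is given below. The function names are hypothetical, and several details are assumptions not fixed by the text: the class frequencies are taken from the labels passed in, cross entropy uses the natural logarithm, and predicted probabilities are clipped at a small `eps` so that a baseline assigning probability 1 to a single class still yields a finite value.

```python
import math
from collections import Counter


def majority_class_baseline(gold_labels):
    """Predict the most frequent class for every instance."""
    majority = Counter(gold_labels).most_common(1)[0][0]
    return [majority] * len(gold_labels)


def minority_class_baseline(gold_labels):
    """Predict the least frequent class for every instance."""
    minority = min(Counter(gold_labels).items(), key=lambda kv: kv[1])[0]
    return [minority] * len(gold_labels)


def f1_for_class(gold_labels, predicted_labels, positive="YES"):
    """F1 (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(gold_labels, predicted_labels))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def cross_entropy(gold_probs, predicted_probs, eps=1e-12):
    """Cross entropy between a soft gold distribution and a soft prediction."""
    return -sum(
        p * math.log(max(predicted_probs.get(c, 0.0), eps))
        for c, p in gold_probs.items()
    )


gold = ["YES", "NO", "NO", "YES", "NO"]
f1_majority = f1_for_class(gold, majority_class_baseline(gold))
f1_minority = f1_for_class(gold, minority_class_baseline(gold))
```

In this toy example the majority-class baseline never predicts the positive class, so its F1 for "YES" is zero, while the minority-class baseline reaches full recall at low precision.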

EXIST 2023 Lab Program

EXIST 2023 is co-located with the CLEF Conference, and will be held face-to-face on Wednesday, September 20th 2023, and Thursday, September 21st 2023.

Wednesday September 20th:

  • 11:20 - 13:00: Overview of EXIST 2023 – Learning with Disagreement for Sexism Identification and Characterization. Laura Plaza, Jorge Carrillo-de-Albornoz, Roser Morante, Enrique Amigó, Julio Gonzalo, Damiano Spina, Paolo Rosso.
  • 16:00 - 17:40: EXIST 2023 Parallel Session:
    • 16:00 - 16:20: Welcome and Opening Remarks.
    • 16:20 - 16:40: Efficient Multilingual Sexism Detection via Large Language Model Cascades. Lin Tian, Nannan Huang, Xiuzhen Zhang.
    • 16:40 - 17:00: ROH_NEIL@EXIST2023: Detecting Sexism in Tweets using Multilingual Language Models. Rohit Koonireddy, Niloofar Adel.
    • 17:00 - 17:20: AI-UPV at EXIST 2023 – Sexism Characterization Using Large Language Models Under The Learning with Disagreement Regime. Angel Felipe Magnossão de Paula, Giulia Rizzi, Elisabetta Fersini, Damiano Spina.
    • 17:20 - 17:40: When Multiple Perspectives and an Optimization Process Lead to Better Performance, an Automatic Sexism Identification on Social Media With Pretrained Transformers in a Soft Label Context. Johan Erbani, Elöd Egyed-Zsigmond, Diana Nurbakova, Pierre-Edouard Portier.

Thursday September 21st:

  • 9:30 - 11:00: EXIST 2023 Parallel Session:
    • 9:30 - 10:10: Keynote Speaker: Evaluation in Learning with Disagreement. Enrique Amigó.
    • 10:10 - 10:30: AIT_FHSTP at EXIST 2023 Benchmark: Sexism Detection by Transfer Learning, Sentiment and Toxicity Embeddings and Hand-Crafted Features. Jaqueline Böck, Mina Schütz, Daria Liakhovets, Nathanya Queby Satriani, Andreas Babic, Djordje Slijepčević, Matthias Zeppelzauer, Alexander Schindler.
    • 10:30 - 10:50: IimasGIL_NLP@EXIST2023: Unveiling Sexism on Twitter with Fine-tuned Transformers. Andrea Sanchez-Urbina, Helena Gómez-Adorno, Gemma Bel-Enguix, Vianey Rodríguez-Figueroa, Angela Monge-Barrera
    • 10:50 - 11:00: Final discussion and suggestions.

EXIST 2023 Proceedings

Overview Paper:

Extended Overview Paper:

Working Notes:

Organizers

  • Damiano Spina, RMIT University, Senior Lecturer
  • Enrique Amigó, UNED, Associate Professor
  • Jorge Carrillo-de-Albornoz, UNED and RMIT University, Associate Professor
  • Julio Gonzalo, UNED, Full Professor
  • Laura Plaza, UNED and RMIT University, Associate Professor
  • Paolo Rosso, Universitat Politècnica de València, Full Professor
  • Roser Morante, UNED, Researcher in Computational Linguistics

Sponsors

  • FairTransNLP Project (PID2021-124361OB-C32), Spanish Ministry of Science and Innovation
  • Space for Observation of AI in Spanish, UNED and RED.ES, M.P., ref. C039/21- OT, Spanish Ministry of Economy and Competitiveness

Contact

If you have any specific question about EXIST 2023, please let us know through the Google Group EXIST 2023 at CLEF 2023.

For any other question that does not directly concern the shared task, please write to Jorge Carrillo-de-Albornoz.