First Shared Task at IberLEF 2021
Welcome to the website of EXIST, the first shared task on sEXism Identification in Social neTworks at IberLEF 2021.
The Oxford English Dictionary defines sexism as “prejudice, stereotyping or discrimination, typically against women, on the basis of sex”. The inequality and discrimination against women that remain embedded in society are increasingly being replicated online.
Detecting online sexism can be difficult, because it is expressed in very different forms. Sexism may sound “friendly”: the statement “Women must be loved and respected, always treat them like a fragile glass” may seem positive, but it actually implies that women are weaker than men. Sexism may sound “funny”, as is the case with sexist jokes or humour (“You have to love women… just that… You will never understand them.”). Sexism may sound “offensive” and “hateful”, as in “Humiliate, expose and degrade yourself as the fucking bitch you are if you want a real man to give you attention”. Our aim is the detection of sexism in a broad sense, from explicit misogyny to other subtle expressions that involve implicit sexist behaviours.
Even the most subtle forms of sexism can be as pernicious as the most violent ones and affect women in many facets of their lives, including domestic and parenting roles, career opportunities, sexual image and life expectations, to name a few. The automatic identification of sexism in a broad sense may help to create, design and determine the evolution of new equality policies, as well as encourage better behaviours in society.
Participants will be asked to classify “tweets” and “gab posts” (in English and Spanish) according to the following two tasks:
The first subtask is a binary classification. Systems have to decide whether or not a given text (tweet or gab) is sexist (i.e., it is sexist itself, describes a sexist situation or criticizes a sexist behaviour). The following tweets show examples of sexist and non-sexist messages.
Once a message has been classified as sexist, the second task aims to categorize it according to the type of sexism, following a categorization proposed by experts that takes into account the different facets of women that are undermined. In particular, we propose a five-class classification task:
IDEOLOGICAL AND INEQUALITY: The text discredits the feminist movement, denies that inequality between men and women exists, or presents men as victims of gender-based oppression.
STEREOTYPING AND DOMINANCE: The text expresses false ideas about women that suggest they are more suitable to fulfill certain roles (mother, wife, family caregiver, faithful, tender, loving, submissive, etc.), or inappropriate for certain tasks (driving, hard work, etc.), or claims that men are somehow superior to women.
OBJECTIFICATION: The text presents women as objects apart from their dignity and personal aspects, or assumes or describes certain physical qualities that women must have in order to fulfill traditional gender roles (compliance with beauty standards, hypersexualization of female attributes, women’s bodies at the disposal of men, etc.).
SEXUAL VIOLENCE: The text contains sexual suggestions, requests for sexual favors or harassment of a sexual nature (rape or sexual assault).
MISOGYNY AND NON-SEXUAL VIOLENCE: The text expresses hatred and violence towards women.
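The two-level label scheme described above (a binary decision, then one of five categories for sexist messages) could be represented roughly as follows. This is only an illustrative sketch: the names `SexismCategory` and `task_labels` and the string label values are assumptions made for this example, not the official label format of the EXIST dataset.

```python
from enum import Enum
from typing import Optional, Tuple


class SexismCategory(Enum):
    """The five Task 2 categories; the string values are illustrative."""
    IDEOLOGICAL_INEQUALITY = "ideological-inequality"
    STEREOTYPING_DOMINANCE = "stereotyping-dominance"
    OBJECTIFICATION = "objectification"
    SEXUAL_VIOLENCE = "sexual-violence"
    MISOGYNY_NON_SEXUAL_VIOLENCE = "misogyny-non-sexual-violence"


def task_labels(category: Optional[SexismCategory]) -> Tuple[str, str]:
    """Return (task1_label, task2_label) for one message.

    A message without a category is treated as non-sexist in both tasks;
    any category implies that the Task 1 label is "sexist".
    """
    if category is None:
        return ("non-sexist", "non-sexist")
    return ("sexist", category.value)
```

Deriving the Task 1 label from the Task 2 category keeps the two label sets consistent by construction, which is one possible design choice among several.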
If you want to participate in the EXIST@IberLEF-2021 shared task, please fill in this form. You will receive information about how to join our Google Group, where EXIST-Datasets, EXIST-Communications, EXIST-Questions/Answers, and EXIST-Guidelines will be provided to the participants.
Participants will be required to submit their runs and will have the opportunity to provide a technical report for publication in the Proceedings. The report should include a brief description of their approach, focusing on the adopted algorithms, models and resources, a summary of their experiments, and an analysis of the obtained results. Although we recommend participating in both tasks, participants are allowed to participate in just one of them (e.g. Task 1).
Technical reports will be published in IberLEF 2021 Proceedings at CEUR-WS.org.
Note: All deadlines are 11:59PM UTC-12:00 (“anywhere on Earth”).
Sexism comprises any form of oppression or prejudice against women because of their sex. As stated in Rodríguez-Sánchez et al. (2020), sexism is frequently found in many forms in social networks. It includes a wide range of behaviours, such as stereotyping, ideological issues and sexual violence (Donoso-Vázquez and Rebollo-Catalán, 2018; Manne, 2018), and may be expressed in different forms: direct, indirect, descriptive or reported (Miller, 2009; Chiril et al., 2020). While previous studies have focused on identifying explicit hatred or violence towards women (Zeerak and Dirk, 2016; Zeerak, 2016; Anzovino et al., 2018; Frenda et al., 2019), the aim of the EXIST dataset is to cover sexism in a broad sense, from explicit misogyny to other subtle expressions that involve implicit sexist behaviours. The EXIST dataset incorporates any type of sexist expression or related phenomena, including descriptive or reported assertions where the sexist message is a report or a description of a sexist behaviour.
To this aim, we have collected a number of popular expressions and terms, both in English and Spanish, commonly used to underestimate the role of women in our society. They were extracted from several Twitter accounts that collect phrases and expressions that women (Twitter users) have received on a day-to-day basis, as well as from terms used in previous state-of-the-art approaches. These terms were analyzed and filtered by two experts in gender issues, Trinidad Donoso and Miriam Comet, who examined examples of tweets extracted using these terms as seeds. The final set contains more than 200 expressions that can be used in sexist contexts.
The final set of sexism terms was used to extract tweets in both English and Spanish (more than 800,000 tweets were downloaded). Crawling was performed from 1st December 2020 to 28th February 2021. To ensure an appropriate balance between seeds, we removed those with fewer than 60 tweets. The final set of seeds contains 94 seeds for Spanish and 91 seeds for English.
For each seed, approximately 50 tweets were randomly selected from the period 1st to 31st December 2020 for the training set, and 22 tweets per seed from the period 1st to 28th February 2021 for the test set. This distribution was chosen to provide a temporal separation between the training and test data. As a result, we have 4,500 tweets per language for the training set and 2,000 tweets per language for the test set.
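The seed filtering and per-seed sampling described above could be sketched as follows. This is a minimal illustration, not the organizers' actual pipeline: the function name `sample_per_seed`, the `(seed, posted_on, text)` tuple layout, and the choice to count a seed's tweets over the whole crawl before filtering are all assumptions made for this example.

```python
import random
from collections import defaultdict
from datetime import date


def sample_per_seed(tweets, start, end, per_seed, min_seed_size=60, rng=None):
    """Randomly pick up to `per_seed` tweets per seed dated within [start, end].

    `tweets` is an iterable of (seed, posted_on, text) tuples (hypothetical
    layout). Seeds with fewer than `min_seed_size` tweets overall are dropped.
    """
    rng = rng or random.Random(0)  # fixed default seed for reproducibility
    tweets = list(tweets)

    # Count how many tweets each seed retrieved over the whole crawl.
    totals = defaultdict(int)
    for seed, _, _ in tweets:
        totals[seed] += 1

    # Keep only tweets from well-populated seeds that fall inside the window.
    by_seed = defaultdict(list)
    for seed, posted_on, text in tweets:
        if totals[seed] >= min_seed_size and start <= posted_on <= end:
            by_seed[seed].append(text)

    return {
        seed: rng.sample(texts, min(per_seed, len(texts)))
        for seed, texts in by_seed.items()
    }
```

Calling it once with the December 2020 window and `per_seed=50`, and once with the February 2021 window and `per_seed=22`, would yield training and test samples that cannot overlap in time.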
Each tweet was annotated by five crowdsourcing annotators, following the guidelines developed by Trinidad Donoso and Miriam Comet (different experiments were done to ensure quality), and an inter-annotator agreement test was carried out. Final labels were selected by majority vote among the crowdsourcing annotators, but tweets with a 3-to-2 vote split were manually reviewed by two people (a man and a woman) with more than two years of experience analyzing sexist content in social networks. The final EXIST dataset consists of 6,977 tweets for training and 3,386 tweets for testing; both sets were randomly selected from the 9,000 and 4,000 labeled tweets (training and test respectively) to ensure class balance for Task 1.
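The label aggregation rule described above (majority vote over five annotators, with 3-to-2 splits sent to manual review) can be sketched as below. The function name `aggregate_votes` and the label strings are hypothetical, and the sketch assumes binary Task 1 labels, so that the winning label always has three, four or five votes.

```python
from collections import Counter


def aggregate_votes(votes):
    """Return (majority_label, needs_review) for one tweet.

    `votes` holds the binary labels assigned by the five annotators.
    A 3-to-2 split is flagged so that it can be reviewed manually.
    """
    if len(votes) != 5:
        raise ValueError("expected exactly five annotator votes")
    top_label, top_count = Counter(votes).most_common(1)[0]
    needs_review = top_count == 3  # narrowest possible majority
    return top_label, needs_review
```

For instance, three "sexist" votes against two "non-sexist" votes would return the label "sexist" together with a review flag, while a 4-to-1 or unanimous vote would be accepted directly.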
In addition, we have collected 492 “gabs” in English and 490 in Spanish from the uncensored social network Gab.com, following a procedure similar to the one described above. This set will be included in the EXIST test set in order to measure the difference between social networks with and without “content control”, Twitter and Gab.com respectively.
More details about the dataset will be provided in the task overview (bias consideration, annotation process, quality experiments, inter-annotator agreement, etc.).
Rodríguez-Sánchez, F., Carrillo-de-Albornoz, J., Plaza, L., Automatic Classification of Sexism in Social Networks: An Empirical Study on Twitter Data. IEEE Access (2020).
Donoso-Vázquez, T., Rebollo-Catalán, Á. (coords.), Violencias de género en entornos virtuales. Ediciones Octaedro, S.L. (2018)
Manne, K., Down Girl: The Logic of Misogyny. Oxford University Press (2018)
Miller, S., Language and Sexism. Cambridge University Press (2009)
Chiril, P., Moriceau, V., Benamara, F., He said “who’s gonna take care of your children when you are at ACL?”: Reported Sexist Acts are Not Sexist. In proceedings of the ACL (2020)
Zeerak, W., Dirk, H., Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter. In proceedings of the ACL (2016)
Zeerak, W., Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter. In proceedings of the ACL (2016)
Anzovino, M., Fersini, E., Rosso, P., Automatic Identification and Classification of Misogynistic Language on Twitter, Springer (2018)
Frenda S., Ghanem B., Montes-y-Gómez M., Rosso P., Online Hate Speech against Women: Automatic Identification of Misogyny and Sexism on Twitter. In: Journal of Intelligent & Fuzzy Systems, vol. 36, num. 5, pp. 4743–4752 (2019)
In order to evaluate the performance of the different approaches proposed by the participants, we will use the EvALL evaluation framework, www.evall.uned.es (Amigó et al., 2017; Amigó et al., 2018; Amigó et al., 2020). Within this framework, we will evaluate the system outputs as classification tasks (binary and multiclass respectively) with the following measures: Accuracy, Precision, Recall and F-measure (using the macro average over all classes for the last three).
In the first task, Sexism Identification, participants' results will be ranked using Accuracy, as the distribution between sexist and non-sexist categories is balanced. Other measures, such as Precision, Recall and F1, will also be computed, and further analyses based on the two different social networks will be performed.
For the second task, Sexism Categorization, we will use macro-average F-measure to rank the system outputs, analyzing the results according to the different categories and distributions. Similarly, we will compute other measures such as Precision and Recall.
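For readers who want to check their runs locally, the measures named above can be approximated with a few lines of plain Python. This is an illustrative sketch and not the EvALL implementation: the function names are hypothetical, macro F1 is computed here as the mean of per-class F1 scores, and undefined ratios (a class that is never predicted or never occurs) are scored as 0.0. EvALL's exact conventions may differ.

```python
def accuracy(gold, pred):
    """Fraction of items whose predicted label equals the gold label."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_prf(gold, pred):
    """Macro-averaged (precision, recall, F1) over all classes seen."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Under the ranking described above, `accuracy` would be the primary figure for Task 1 and the third value returned by `macro_prf` the primary figure for Task 2.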
More details about the evaluation and additional experiments will be provided in the task overview.
Amigó, E., Carrillo-de-Albornoz, J., Almagro-Cádiz, M., Gonzalo, J., Rodríguez-Vidal, J., and Verdejo, F. (2017). EvALL: Open Access Evaluation for Information Access Systems. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017.
Amigó, E., Spina, D., and Carrillo-de-Albornoz, J. (2018). An Axiomatic Analysis of Diversity Evaluation Metrics: Introducing the Rank-Biased Utility Metric. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR ‘18). ACM, New York, NY, USA, 625-634.
Amigó, E., Gonzalo, J., Mizzaro, S., and Carrillo-de-Albornoz, J. (2020). An Effectiveness Metric for Ordinal Classification: Formal Properties and Experimental Results. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020).