Creating a New Dataset for Hate Speech Research

This project began with an initial preprint in 2018 and concluded in early 2022 with the publication of our work in Language Resources and Evaluation (LRE). I led the development of a new operationalization of hate speech, termed “hate-based rhetoric,” and of a focused coding guide for training annotators, and I supervised the annotation of more than 27,000 posts from the Gab social network.

Beyond creating this corpus, which has been cited more than 60 times in computational research, including in our own work (e.g., our paper “Contextualizing Hate Speech Classifiers with Post-hoc Explanation”), I conducted:

An analysis of inter-annotator agreement on all labels in the corpus, a valuable component of quality assurance during data collection
A classification analysis of each of the major labels in the corpus using several NLP methods, establishing prediction baselines for future work
An error analysis of common mistakes made by the best-performing classifiers
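
To make the inter-annotator agreement step concrete, here is a minimal sketch of one common agreement statistic, Fleiss' kappa, which handles more than two annotators per item. This summary does not state which statistic the paper reports, so treat the choice of measure as an assumption; the function name `fleiss_kappa` and the toy counts are hypothetical and are not drawn from the corpus.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table of per-item category counts.

    counts[i][j] is the number of annotators who assigned item i to
    category j. Every item must have the same total number of annotators.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two annotators per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of annotators")

    n_categories = len(counts[0])

    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]

    # Per-item agreement: fraction of annotator pairs that agree on the item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]

    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)

    if p_expected == 1.0:
        # All assignments fall in one category; kappa is undefined.
        raise ValueError("kappa is undefined when only one category is used")

    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Hypothetical toy data: 4 posts, 3 annotators, binary label
    # (columns: [not hate-based rhetoric, hate-based rhetoric]).
    toy_counts = [
        [3, 0],
        [2, 1],
        [0, 3],
        [1, 2],
    ]
    print(f"Fleiss' kappa: {fleiss_kappa(toy_counts):.3f}")
```

Values near 1 indicate agreement well above chance, and values near 0 indicate agreement no better than chance. In a multi-label corpus, the statistic would typically be computed separately for each label.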