Transfer Learning for Hate Speech Detection in Social Media

Marian-Andrei Rizoiu, Tianyu Wang, Gabriela Ferraro, Hanna Suominen

As more and more people connect to the Internet, its information and communication technologies have become an essential part of everyday life. Unfortunately, the flip side of this increased connectivity to social media and other online content is cyber-bullying and cyber-hatred, among other harmful and anti-social behaviors. Models based on machine learning and natural language processing provide a way to detect such hate speech in web text, making discussion forums and other media and platforms safer. The main difficulty, however, is annotating a sufficiently large number of examples to train these models. In this paper, we report on developing automated text analytics methods capable of jointly learning a single representation of hate from several smaller, unrelated data sets. We train and test our methods on a total of $37,520$ English tweets that have been annotated to differentiate harmless messages from racist or sexist content in the first detection task, and from hateful or offensive content in the second detection task. Our most sophisticated method combines a deep neural network architecture with transfer learning. It creates word and sentence embeddings that are specific to these tasks while also embedding the meaning of generic hate speech. It achieves a macro-averaged F1 of $78\%$ and $72\%$ in the first and second task, respectively. This method also enables generating an interpretable two-dimensional text visualization, called the Map of Hate, that is capable of separating different types of hate speech and explaining what makes text harmful. These methods and insights hold potential not only for safer social media, but also for reducing the need to expose human moderators and annotators to distressing online messaging.
