PROJECT DESCRIPTION

The collaboration between the University of Passau and the University of Belgrade will be realized through work on improving the lexical resources and tools that aid hate-speech detection in both German and Serbian (and closely related languages), and through argumentation mining efforts that detect and analyze the lines of argumentation characteristic of hate speech, in order to better prevent and “deflate” heated online discussions on hot topics (Mitrović et al., 2019a; Mitrović et al., 2017).

Discriminatory messages and exhortations to violence are closely related to hate speech, which has been gaining more attention due to the extensive use of online media and the Internet in general.

The research analyses the concept of hate speech, its content and forms of expression, trying to define its vocabulary, collocations, colloquial expressions, and context. Both Serbia and Germany have signed the international documents related to hate speech, so in both countries there is a need for, and a trend towards, recognizing, monitoring, and sanctioning hate speech.

Building a conceptual framework of hate speech, the research will focus on different media: Twitter, blogs, Facebook, news, tabloids … Hate-speech legislation in Serbia will be studied and taken into account when implementing the framework.

The need for cooperation became apparent during the work on the Human+ project, which aims at analyzing migration movements based on data collected during the 2015 migration crisis (Urchs et al., 2019). Many tweets in the Human+ project dataset were written in Serbian, which called for a collaboration between the University of Passau and the Natural Language Processing experts of the University of Belgrade.

As anti-migrant hate speech is omnipresent on social media, analyzing this kind of speech in more of the areas through which migrants pass on their way to Germany (such as Serbia) is an obvious next step for this research.

OBJECTIVES

The aim of the proposed collaboration with the University of Belgrade is to extend the knowledge gained through building the existing systems at the University of Passau and to achieve comparable results in hate-speech and offensive-speech detection for the Serbian language.

This work aims at containing and preventing the alarming spread of massive online hate campaigns on social networking sites (SNSs). It focuses on Serbian social media texts while also taking into account, if necessary, texts written in the other BCSM (Bosnian-Croatian-Serbian-Montenegrin) languages, which are historically, structurally and semantically similar to Serbian.

The objective of the Hate-Speech Detection (HaSDet) project is to deepen the investigation of methods and tools through a comparison of the partners' experiences with the increasing amount of online hate speech, which calls for appropriate detection methods.

Given that human moderators cannot monitor the huge volume of user-generated text on social networks, we believe this work provides a basis for tracking divergent states in online conversations in Serbian and possibly also in related languages.

The research profiles and research objects of the two partners are well matched, considering their respective specialties:

Serbian expertise in the development of NLP tools and resources for various kinds of text analysis, annotation and classification. A number of resources and tools have been developed for processing texts written in Serbian. Previous efforts of the Serbian team enabled the development of a complex system for sentiment analysis, named entity recognition and tagging, …. The Serbian system for text analysis represents the most complete approach currently available for the Serbian language. We have produced the first small version of a lexical resource consisting of words that could be used as triggers for the recognition of hate speech. Future improvements to the Serbian system for the recognition and normalization of hate-speech expressions will also take into consideration phrases, figurative speech and sarcasm, as well as some other rhetorical figures (such as litotes), as indicators. We will look at both explicit and implicit hate speech, as some hateful and offensive messages are not apparent at first glance.

German expertise in deep learning systems for Natural Language Processing in general, and in transfer learning tasks, will allow for integration with the many rule-based and statistical systems that have been built at the University of Belgrade for processing Serbian texts. In particular, the neural network language models that the University of Passau has built for German will be adjusted to the specificities of a morphologically rich language such as Serbian, with the aid of the specialized resources and tools available from the Serbian team.

Collaborating with a foreign partner who brings knowledge of a language from a different language group is of utmost importance for the project's aim of international standardization. In our opinion, such a standardization effort must be grounded in effective practical experience with annotation. This is why a secondary objective of the project will be to provide the scientific community with useful, interoperable Serbian and German resources.

METHODOLOGY

Starting from the definition of hate speech as ‘any communication that disparages a person or a group on the basis of some characteristic such as race, color, ethnicity, gender, sexual orientation, nationality, religion, or other characteristic’, it is clear that hate speech is a complex social and linguistic phenomenon. Computational processing of such language requires finely tuned, task-specific language tools and resources, especially for morphologically rich and low-resource languages such as Serbian. In this proposed joint work, we aim at improving the resources and systems for hate-speech detection in German and at building new ones for Serbian, based on the existing NLP resources built specifically for the Serbian language. This process will unfold in the following fashion:

1. Stage 1 will be realized by means of a visit of each partner to the other:

First, the evaluation of the resources, procedures and tools will be initiated with a presentation, by two emissaries, of their team's work while visiting the other team, in the presence of the whole hosting team. Both visits should follow the same agenda:

– presentation of the resources (German, Serbian lexicons and corpus)

– presentation of the tools (German, Serbian NLP tools)

– first steps toward establishing the annotation guidelines

– practical annotation workshop

– an in-depth exploration of the state of the art

=> Stage 1: 1 visit (2 researchers ger -> team srb) & 1 visit (2 researchers srb -> team ger)

2. Stage 2 will entail constructing the framework for gathering textual resources related to hate speech in the Serbian language from social media. The work on a common procedure can follow the first stage, since the emissaries can bring their own initial results and share them with the other team, allowing for the harmonization of resources and tools on both sides.

=> Stage 2: 1 visit (2 researchers ger -> team srb) & 1 visit (2 researchers srb -> team ger)

3. Stage 3: the procedure update can be carried out separately by each team, according to the agreement reached during the gathering of the teams. The annotators will assign one class to each post, where the classes span different levels of hate, all of which will be agreed upon in Stages 1 and 2. Explicit and implicit hate speech will also be annotated.

A one-day workshop will be held at the University of Belgrade, with attendees from academia and language institutes.

=> Stage 3: 2 visits (2 researchers ger or srb -> team srb or ger)

4. Stage 4 may be realized during one visit of one partner to the other: training

=> Stage 4: 1 visit (2 researchers ger or srb -> team srb or ger)

Final NN model

Stage 4 may be finalized by a gathering of the teams, completing the detailed description of the covered phenomenon and starting to measure the reliability of the resources.

=> finalization of the first part and start of the second part of stage 4: gathering of the two teams

Travel Plans

1 trip per year for Dr. Jelena Mitrović and Prof. Dr. Cvetana Krstev, including a workshop collaboration and an invited talk at the locally organized conference

2 trips per year for Prof. Dr. Ranka Stanković, including a workshop collaboration with the University of Passau and an invited talk at the locally organized conference

RESOURCES

Five researchers from the Data Science chair of the University of Passau will be involved in the project, namely Dr. Michael Granitzer, as the Head of the chair, and PhD students Lorenz Wendlinger, Stefanie Urchs, Torben Schnuchel and Christofer Fellicious, as well as a postdoctoral researcher already involved in the production of corpora and tools for the Serbian and German languages, Dr. Jelena Mitrović.

The Serbian scientific instrumentation available for project implementation consists of the Corpus of Contemporary Serbian (SrpKor), various domain-specific corpora, and a parallel German-Serbian corpus of literary text. Small parts of those resources are annotated, and the research within the proposed project could be used to complete the annotations. Within the project “Serbian language and its resources”, financed by the Serbian Ministry of Education, Science and Technological Development, the Serbian team aims to further develop language resources and tools, specifically a semantic lexicon and tools for hate-speech detection and the evaluation of annotations. One of our goals is to compile and use corpora of text derived from various sources (blogs, social networks such as Twitter and Facebook, daily papers, etc.) in order to investigate hate-speech-specific features and triggers.

The Serbian team consists of three researchers holding PhDs, namely Ranka Stanković, Cvetana Krstev and Jelena Jaćimović (from three faculties of the University of Belgrade), with extensive experience in the development of lexical resources and tools, corpus preparation, processing and automatic annotation, as well as PhD students Branislava Šandrih, Biljana Lazić and Mihailo Škorić, who are becoming acquainted with these topics.

The expected improvements of tools and resources do not require any specific instrumentation apart from expert analysis and the teams' usual equipment, including server storage space provided by the universities on both sides.

EXPECTED RESULTS

This collaboration will lead to three kinds of results: the development of lexical resources for hate-speech detection, annotated corpora, and a contribution towards an international concept of hate speech, its content and forms of expression, using machine learning tools.

1. Language resources development  

  • compiled corpora from various sources (crawled comments)
  • guidelines for annotation, with a proposed taxonomy of hate categories to distinguish the kind of hate
  • annotated corpora (by up to five distinct human annotators, according to the defined taxonomy)
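As an illustration only, the labels assigned to one post by several annotators could be combined and checked as sketched below. Majority voting and the pairwise agreement measure are assumptions made for this sketch, as are the function names and the example data; the actual procedure will follow the annotation guidelines agreed upon by the two teams. The label names follow the three categories considered in this proposal (hate, offensive, neither).

```python
from collections import Counter
from itertools import combinations


def majority_label(labels):
    """Return the most frequent label, or None when the top count is tied."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: the post would need adjudication
    return counts[0][0]


def pairwise_agreement(labels):
    """Fraction of annotator pairs that chose the same label for one post."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 1.0  # a single annotator trivially agrees with themselves
    return sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical annotations: post id -> labels from up to five annotators.
annotations = {
    "post_1": ["hate", "hate", "offensive", "hate", "hate"],
    "post_2": ["neither", "offensive", "neither", "offensive"],
}

for post_id, labels in annotations.items():
    print(post_id, majority_label(labels), round(pairwise_agreement(labels), 2))
```

Posts for which `majority_label` returns `None`, or for which the agreement is low, are natural candidates for discussion when the guidelines are revised.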

– adaptation, extension and publication of the German corpus. Its adaptation to a common German-Serbian annotation scheme will extend the impact of this resource. The hateful utterances will further be annotated for the type of hate speech they represent: insult, direct personal threat, death threat, abusive speech, etc. (Mitrović et al., 2019).

– adaptation, extension and publication of the Serbian corpus, which will be the first Serbian corpus annotated with different levels and different categories of hateful language.

These resources will be published under free licenses (CC-BY-SA). Resources of this kind are useful for the language industry and for any information technology involving information extraction and hate-speech recognition.

2. Lexical resources development

– specific vocabulary, collocations and colloquial expressions resources

– sentiment polarity lexicons (positive, negative and neutral sentiment)

– word similarity lexicons

– word embedding lexicons

3. Training a model for hate-speech detection

As neural network approaches have been shown to outperform earlier methods on many text classification problems, a deep learning model will be introduced. Building on the systems already developed for the German language, this classifier will assign each tweet (or post) in a Serbian Twitter dataset to one of the predefined categories. We will investigate classification into three categories (hate, offensive language, and neither), and we will aim at further classification of hate speech by type (insult, direct personal threat, death threat, abusive speech, etc.) or, even further, by target (migrants, women, members of national minorities, etc.).

Leveraging morpho-syntactic features, sentiment polarity and word embedding lexicons, we will design and implement two classifiers for the Serbian language, based on state-of-the-art machine (deep) learning algorithms. The features will be organised into three main categories: raw and lexical text features, morpho-syntactic and syntactic features, and lexicon features.
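The three feature categories could be represented along the following lines. This is a minimal sketch under stated assumptions: the function name, the individual features, the part-of-speech tag set and the placeholder lexicon entries are all hypothetical, and the tokens and tags are assumed to come from an existing tokenizer and tagger for Serbian.

```python
def extract_features(tokens, pos_tags, trigger_lexicon, polarity_lexicon):
    """Group illustrative features into the three categories named above."""
    n = len(tokens)

    # 1. Raw and lexical text features (hypothetical selection).
    raw = {
        "n_tokens": n,
        "n_uppercase": sum(t.isupper() for t in tokens),
        "n_exclamations": sum(t.count("!") for t in tokens),
    }

    # 2. Morpho-syntactic features: relative frequency of each POS tag.
    morphosyntactic = {}
    for tag in pos_tags:
        key = "pos_" + tag
        morphosyntactic[key] = morphosyntactic.get(key, 0) + 1
    if n:
        morphosyntactic = {k: v / n for k, v in morphosyntactic.items()}

    # 3. Lexicon features: trigger-word hits and summed sentiment polarity.
    lowered = [t.lower() for t in tokens]
    lexicon = {
        "n_triggers": sum(t in trigger_lexicon for t in lowered),
        "polarity_sum": sum(polarity_lexicon.get(t, 0) for t in lowered),
    }

    return {"raw": raw, "morphosyntactic": morphosyntactic, "lexicon": lexicon}


# Placeholder input: "trigger_a" stands in for an entry of the trigger lexicon.
features = extract_features(
    tokens=["OVO", "je", "TRIGGER_A", "!"],
    pos_tags=["PRON", "AUX", "NOUN", "PUNCT"],
    trigger_lexicon={"trigger_a"},
    polarity_lexicon={"trigger_a": -1},
)
print(features)
```

Keeping the three groups separate in the returned dictionary makes it straightforward to train a classifier on each group alone and compare their contributions.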

We will test these two learning algorithms in order to verify their classification performance on the task of hate-speech recognition. The performance of each model will be measured using accuracy, as well as precision, recall and F1-score.
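For reference, the evaluation measures mentioned above can be computed as in the following sketch. The function name and the example labels are hypothetical, and macro-averaging of the per-class F1-scores is an assumption of this sketch rather than a decision of the project.

```python
def evaluate(gold, predicted, labels):
    """Compute accuracy, per-class precision/recall/F1 and macro-averaged F1."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")

    pairs = list(zip(gold, predicted))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)

    per_class = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}

    macro_f1 = sum(m["f1"] for m in per_class.values()) / len(labels)
    return accuracy, per_class, macro_f1


# Hypothetical gold and predicted labels for six posts.
gold = ["hate", "hate", "offensive", "neither", "neither", "offensive"]
pred = ["hate", "offensive", "offensive", "neither", "hate", "offensive"]

accuracy, per_class, macro_f1 = evaluate(gold, pred, ["hate", "offensive", "neither"])
print(round(accuracy, 3), round(macro_f1, 3))
for label, scores in per_class.items():
    print(label, {name: round(value, 3) for name, value in scores.items()})
```

Reporting per-class scores alongside accuracy matters here because hate-speech datasets are typically imbalanced, so accuracy alone can hide poor performance on the rarer classes.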

This research will be promoted in several ways:

– open publication of common annotation guidelines as a technical report of the project

– presentation of results in 2 seminars and 1 conference

– publication of a research paper that will present the results extensively

COLLABORATION

Jelena Mitrović completed her PhD studies at the University of Belgrade, Serbia, where the initial collaboration with the Serbian team began. After continuing her academic career at the University of Passau, Germany, Dr. Mitrović has strived to continue this collaboration, thus aiding the academic transfer of competences between the two countries. This proposal is a step towards strengthening that collaboration in a more formal way.

  1. Jelena Mitrović, Electronic Lexical Resources and Tools for Natural Language Processing of Serbian and their Enrichment via Crowdsourcing, PhD thesis, University of Belgrade, defended on May 18th, 2018. (Professors Cvetana Krstev and Ranka Stanković served as advisors for this PhD work).
  2. Miljana Mladenović, Cvetana Krstev, Jelena Mitrović, Ranka Stanković, “Using Lexical Resources for Irony and Sarcasm Classification”. In Proceedings of the 8th Balkan Conference in Informatics (BCI ’17). ACM, New York, NY, USA, Article 13, 8 pages. 2017.
  3. Jelena Mitrović, Miljana Mladenović and Cvetana Krstev, “A Language-independent Model for Adding New Semantic Relations to a WordNet”, Global WordNet Conference, Bucharest, Romania, January, 2016.
  4. Miljana Mladenović, Jelena Mitrović, Cvetana Krstev, Dusko Vitas, “Hybrid Sentiment Analysis Framework for a Morphologically Rich Language”, Journal of Intelligent Information Systems (JIIS) 08/2015.
  5. Jelena Mitrović, Miljana Mladenović, Cvetana Krstev, “Adding MWEs to Serbian Lexical Resources Using Crowdsourcing”, poster presentation at the 5th General PARSEME meeting, September 23rd, 2015.
  6. Jelena Mitrović, “Electronic Tools and Resources for Multi-Word Unit Detection and Research in Serbian” poster presentation, 2nd General PARSEME Meeting, Athens, March 10th, 2014.
  7. Miljana Mladenović, Jelena Mitrović, Cvetana Krstev, “Developing and Maintaining a WordNet: Procedures and Tools”, In Proceedings of the Global WordNet Conference, Tartu, Estonia, January 2014, pp. 55-62.

BIBLIOGRAPHY

  1. Jelena Mitrović, Cliff O’Reilly, Miljana Mladenović, Siegfried Handschuh, Ontological representations of Rhetorical Figures for Argument Mining, Argument and Computation 8 (3), 2017, pp. 267–287.
  2. Stefanie Urchs, Lorenz Wendlinger, Jelena Mitrović, Michael Granitzer, MoveT15: A Twitter Dataset for Extracting and Analysing Migration-Movement Data of the European Migration Crisis 2015, 28th IEEE International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE-2019).
  3. Jelena Mitrović, Bastian Birkeneder, Michael Granitzer, nlpUP at SemEval-2019 Task 6: A Deep Neural Language Model for Offensive Language Detection, Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019).
  4. Miljana Mladenović, Cvetana Krstev, Jelena Mitrović, Ranka Stanković, “Using Lexical Resources for Irony and Sarcasm Classification”. In Proceedings of the 8th Balkan Conference in Informatics (BCI ’17). ACM, New York, NY, USA, Article 13, 8 pages. 2017.
  5. Cvetana Krstev, Branislava Sandrih, Ranka Stanković, “Using English Baits to Catch Serbian Multi-Word Terminology”, Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018, eds. Nicoletta Calzolari et al., ISBN 979-10-95546-00-9.