Michela Cameletti (a), Silvia Fabris (a), Stephan Schlosser (b), Daniele Toninelli (a,*)
a. University of Bergamo; via dei Caniana, 2, 24127, Bergamo, Italy.
b. University of Göttingen, Goßlerstraße 19, 37073 Göttingen, Germany.
Abstract: In the era of social media, the wide availability of big data in digital form (e.g. posts sent through social networks or unstructured data scraped from websites) makes it possible to develop new types of research in a wide range of fields. Such data are available at low cost and in almost real time. Nevertheless, their collection and analysis are challenging. This paper proposes an unsupervised dictionary-based method to filter tweets related to a specific topic, namely the environment. We start from the tweets sent by a selection of Official Social Accounts clearly linked with the subject of interest. Then, we identify a list of expressions (bigrams, trigrams and hashtags) that is used to build the topic-oriented dictionary. Our approach has some relevant advantages: it attempts to reduce as far as possible the interventions and decisions of the researcher, as well as the processing time; it is based mostly on combinations of words (instead of single words), which eases the identification of tweets concerning the topic of interest; and it does not rely on a pre-defined dictionary, so it can be customized and generalized to other topics. We test the performance of our method by applying the resulting dictionary to a sample of more than 3.5 million geolocated tweets posted in Great Britain between January and May 2019. All the evaluation criteria indicate very good performance. In particular, accuracy, sensitivity and the F1 score are equal to or higher than 98.4%; specificity and precision are also excellent (around 97.5%), higher than those of the most common selection methods currently in use.
Keywords: tweet filtering, big data analysis, dictionary-based selection, dictionary-based search, unsupervised algorithm, text analysis.
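The pipeline summarized in the abstract (build a dictionary of bigrams, trigrams and hashtags from the tweets of topic-related official accounts, use it to filter a larger sample, then score the filter) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The tokenizer, the `min_count` threshold for keeping an expression, and all function names are assumptions made for illustration; the abstract does not state how expressions are selected.

```python
import re
from collections import Counter

# Assumption: a token is a run of word characters, optionally led by '#'.
TOKEN_RE = re.compile(r"#?\w+")


def tokenize(text):
    """Lowercase a tweet and split it into words and hashtags."""
    return TOKEN_RE.findall(text.lower())


def expressions(tokens):
    """Return the set of hashtags, bigrams and trigrams in a token list."""
    found = {t for t in tokens if t.startswith("#")}
    for n in (2, 3):
        for i in range(len(tokens) - n + 1):
            found.add(" ".join(tokens[i:i + n]))
    return found


def build_dictionary(seed_tweets, min_count=2):
    """Keep expressions that occur in at least `min_count` seed tweets.

    `min_count` is a hypothetical selection rule, not the paper's.
    """
    counts = Counter()
    for tweet in seed_tweets:
        counts.update(expressions(tokenize(tweet)))
    return {expr for expr, c in counts.items() if c >= min_count}


def is_on_topic(tweet, dictionary):
    """Flag a tweet that contains at least one dictionary expression."""
    return not expressions(tokenize(tweet)).isdisjoint(dictionary)


def evaluate(predicted, actual):
    """Compute the five criteria named in the abstract from boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)

    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }
```

In use, `build_dictionary` would be called once on the tweets of the selected official accounts, `is_on_topic` applied to every tweet in the sample to be filtered, and `evaluate` run on a manually labelled subset. Because matching relies on word combinations and hashtags rather than single words, a tweet that merely mentions "climate" or "change" alone is not flagged.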
Pages: 13 – 32