• Non ci sono risultati.

TextWiller: Collection of functions for text mining, specially devoted to the Italian language

N/A
N/A
Protected

Academic year: 2021

Condividi "TextWiller: Collection of functions for text mining, specially devoted to the Italian language"

Copied!
2
0
0

Testo completo

(1)

TextWiller: Collection of functions for text mining,

specially devoted to the Italian language

Dario Solari

1

, Andrea Sciandra

2

, and Livio Finos

3

1 Bee Viva srl 2 Department of Communication and Economics, University of Modena and Reggio Emilia 3 Department of Developmental Psychology and Socialisation, University of Padova DOI:10.21105/joss.01256 Software • Review • Repository • Archive Submitted: 24 October 2018 Published: 08 September 2019 License

Authors of papers retain copyright and release the work under a Creative Commons Attribution 4.0 International License (CC-BY).

Summary

TextWiller is the development version of a R package that collects some text mining utilities intended for the Italian language. It’s available at https://github.com/livioivil/TextWiller. The aim of TextWiller is to help to deal with the pre-processing of a corpus and it also provides some functions about word classification and polarity.

This software is one of the few text mining R packages in the Italian language. The main differences compared to other popular packages like tm are that TextWiller allows sentiment analysis; it includes classification tools for Italian cities and names; and it can help social media researchers with some specific functions for the data extracted from a social networking site via APIs. In particular, TextWiller allows to normalize (Miner et al. (2012), Bolasco & De Mauro (2013)) Italian text, i.e. transforming a corpus in a canonical form, useful for text mining. Normalization includes several functions for removing punctuation, stopwords and for managing:

• upper-lower cases; • plurals;

• URLs and emoticons recognition;

• some slang expressions (which are brought back to the correct form in Italian). Specifically, other relevant TextWiller functions allow to:

• get the sentiment (Wilson, Wiebe, & Hoffmann (2005), Ceron, Curini, & Iacus (2014)) of each document in a corpus, based on an internal lexicon or a custom one;

• classify users’ gender by (Italian) names;

• classify Italian cities into 5 macro-areas (North East, North West, Centre, South, Is-lands);

• find re-tweets (RTHound function; Ferraccioli (2014)) by evaluation of texts similar-ity (and replace texts so that they become equals) through hierarchical clustering on Levenshtein distance (dissimilarity) matrix;

• extract short URLs and get the long ones;

• extract users’ communication pattern, with the patternExtract function, based on the ‘@’ sign followed by a username, as in several social media platforms and forums it’s used to denote a reply.

TextWiller is designed to be used by researchers (mainly statisticians and social scientists) and by students in courses on text mining (it has already been used in several Bachelor and Master’s degree theses).

Solari et al., (2019). TextWiller: Collection of functions for text mining, specially devoted to the Italian language. Journal of Open Source

Software, 4(41), 1256. https://doi.org/10.21105/joss.01256

(2)

References

Bolasco, S., & De Mauro, T. (2013). L’analisi automatica dei testi: Fare ricerca con il text

mining. Carocci Editore.

Ceron, A., Curini, L., & Iacus, S. M. (2014). Social media e sentiment analysis. Springer Milan. doi:10.1007/978-88-470-5532-2

Ferraccioli, F. (2014). Topic model workout: Un approccio per l’analisi di microblogging mass media e dintorni - m. Sc. Thesis.

Miner, G., Elder IV, J., Fast, A., Hill, T., Nisbet, R., & Delen, D. (2012). Practical text

mining and statistical analysis for non-structured text data applications. Elsevier. doi:10. 1016/c2010-0-66188-8

Wilson, T., Wiebe, J., & Hoffmann, P. (2005). Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the conference on human language technology and

empirical methods in natural language processing - HLT 05. Association for Computational

Linguistics. doi:10.3115/1220575.1220619

Solari et al., (2019). TextWiller: Collection of functions for text mining, specially devoted to the Italian language. Journal of Open Source

Software, 4(41), 1256. https://doi.org/10.21105/joss.01256

Riferimenti

Documenti correlati

It contains internai lexical procedures to build up new words of the verb. adjective and noun category. together with inf1ections for verb analysis. are organized in a morpho-

Work on mice null for IL-18 or its receptor subunit alpha is helping to decipher the action of this cytokine in the brain.. Finally, the recent discovery of novel IL-18

In this setting a typically tachycardia-dependent remodeling occurs: in a study of 101 patients 11 we identified loss of atrial refractoriness adaptation to heart rate, short

In conclusion, we have shown that nitrogen fertilization and post-anthesis temperature affected growth, accumulation and partitioning of dry matter and N in durum wheat which, in

The presence of gold on the surface is clearly observable along with cracks which allow the view of both a greyish layer underneath the gold (the silver layer) as well as a red

In the reading here provided, which complements the Arendtian matrix of Biesta with an argument inspired by Patočka, this understanding is rooted in the very beginning of West-

L’impianto complessivo che qui presentiamo, così come propone un Piano rego- latore sostanzialmente basato sulla fattibilità delle trasformazioni territoriali e sul confinamento

The Severe Aplastic Anaemia Working Party of the European Society of Blood and Bone Marrow Transplanta- tion (SAAWP-EBMT) aimed to evaluate the diseases charac-