Algorithms for Knowledge and Information Extraction in Text with Wikipedia

Report on the PhD Activities

Marco Ponza

February 19, 2019

Research Activities

Marco Ponza’s PhD thesis focuses on the design of algorithms for extracting knowledge (in terms of entities belonging to a knowledge graph) and information (in terms of open facts) from text, using Wikipedia as the main repository of world knowledge.

The first part of the dissertation focuses on research problems that lie specifically in the domain of knowledge and information extraction. In this context, Ponza contributes to the scientific literature with three achievements. First, he studies the problem of computing the relatedness between Wikipedia entities: he introduces a new dataset of human judgements, complements it with a study of the entity relatedness measures proposed in recent literature, and proposes a new computationally lightweight two-stage framework for relatedness computation. Second, he studies the problem of entity salience through the design and implementation of a new system that identifies the salient Wikipedia entities occurring in an input text and that improves the state of the art on different datasets. Third, he introduces a new research problem called fact salience, which addresses the task of detecting salient open facts extracted from an input text, and he proposes, designs and implements the first system that effectively solves it.

In the second part of the dissertation, Ponza studies an application of knowledge extraction tools in the domain of expert finding. He proposes a new system which hinges upon a novel profiling technique that models people (i.e., experts) through a small, labeled graph drawn from Wikipedia. This profiling technique is then used to design a novel suite of ranking algorithms for matching the user query against the expert profiles; their effectiveness is shown by improvements over state-of-the-art solutions.

Training Activities

Schools:

Bertinoro International Spring School 2016 (BISS 2016). The school was held at the University Residential Center of Bertinoro (FC), 6-11 March. The candidate attended the following three courses and passed the related exams:

1. Algorithmic methods for mining large graphs

Lecturer: Prof. Aristides Gionis (Aalto University, Finland)

2. Advanced Topics in Programming Languages

Lecturer: Prof. Giuseppe Castagna (Université Paris Diderot - Paris 7, France)

3. Models and Languages for Service-Oriented and Cloud Computing

Lecturer: Prof. Gianluigi Zavattaro (University of Bologna, Italy)

Courses:

• Course “Machine Learning Techniques and Selected Applications for Big Data”

Lecturer: Prof. Stan Matwin (Dalhousie University, Canada)

• Course “Searching by Similarity on a Very Large Scale”

Lecturer: Prof. Giuseppe Amato (CNR Pisa, Italy)

Seminar Cycles:

• Seminar at GATE Summer School (2016)

• PhD+ 2016

• Research, Innovation and Future of ICT (2018)

Period Abroad

• Max Planck Institute for Informatics, Saarbrücken (Germany)

From August 2017 to October 2017 and from November 2017 to February 2018

Publications

M. Ponza, F. Piccinno and P. Ferragina. Document Aboutness via Sophisticated Syntactic and Semantic Features. In Proceedings of the 2017 International Conference on Natural Language and Information Systems, NLDB 2017, pages 441–453, Lecture Notes in Computer Science, Springer.

M. Ponza, P. Ferragina and S. Chakrabarti. A Two-Stage Framework for Computing Entity Relatedness in Wikipedia. In Proceedings of the 2017 International Conference on Information and Knowledge Management, CIKM 2017, pages 1867–1876, ACM.

M. Ponza, L. Del Corro and G. Weikum. Facts That Matter. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, pages 1043–1048, ACL.

P. Cifariello, P. Ferragina and M. Ponza. WISER: A Semantic Approach for Expert Finding in Academia based on Entity Linking. Information Systems 2019, pages 1–16, Elsevier.

M. Ponza, F. Piccinno and P. Ferragina. SWAT: A System for Detecting Salient Wikipedia Entities in Texts. Under review at Computational Intelligence, Wiley.