A Tool for Researchers: Querying Big Scholarly Data through Graph Databases


Fabio Mercorio1, Mario Mezzanzanica1, Vincenzo Moscato2, Antonio Picariello2, and Giancarlo Sperlì2

1 Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Milan, Italy

2 Department of Electrical Engineering and Information Technology, University of Naples Federico II, Naples, Italy

Abstract. We demonstrate GraphDBLP, a tool that allows researchers to query the DBLP bibliography as a graph. The DBLP source data were enriched with semantic similarity relationships computed using word embeddings. A user can interact with the system either via a Web-based GUI or through a shell interface, both of which provide three parametric, pre-defined queries. GraphDBLP represents a first graph-database instance of the computer scientist network, which can be extended at any time with new relationships and node properties; enabling such extensions is the main purpose of the tool, which is freely available on GitHub. To date, GraphDBLP contains 5+ million nodes and 24+ million relationships.

Keywords: Graph Databases · Big Scholarly Data · Word Embeddings.

1 Introduction and Motivation

Nowadays, the number of scientific publications is increasing apace, making the network of collaborations, topics, papers, and venues more complex than ever. Not surprisingly, the term Big Scholarly Data has recently been coined to refer to the rapidly growing body of scholarly information (e.g., large collections of scholarly data with millions of authors, papers, citations, figures, and tables, as well as massive-scale related data such as scholarly networks) [8]. Analysing such data and networks helps researchers to identify colleagues working on similar topics, to profile a researcher and understand her/his research interests on the basis of her/his academic records and scores, and to identify experts in a specific research area. The idea behind GraphDBLP, first presented in [4], is to build a model of DBLP as a graph, exploiting word embeddings to discover similarities between researchers that, in turn, can be included as relationships within the graph (see Fig. 1a).


exploiting vector-space models to derive semantic similarities between publications. GraphDBLP has been deployed on top of the Neo4j graph database, acting as a tool that anyone can query and improve over time.

Fig. 1: (a) The DBLP graph model. Solid lines are relationships extracted from the DBLP XML file. Dotted lines are derived through Cypher queries on the graph. (b) Snapshot of the GraphDBLP Web app.

2 Approach

GraphDBLP is modelled as a multigraph, so that multiple edges can exist between two nodes. It contains 5+ million nodes and 24.7+ million relationships, enabling users to browse 3.3+ million publications authored by 1.7 million researchers in more than 5 thousand publication venues. Thanks to the use of word embeddings, more than 7.5 thousand keywords and related similarity values were collected. The GraphDBLP data model is shown in Fig. 1a. The similarity relation estimates the similarity between two venues based on the network of authors that publish in those venues (i.e., the contributed_to relation of Fig. 1a). The Jaccard index is then used to compute the similarity between two venues v1 and v2 on the basis of the authors that they have in common.

Since DBLP explicitly provides neither topics nor abstracts of the stored publications, we decided to exploit publication titles to extract research keywords. Keywords from the Faceted-DBLP project were used as ground truth, as it uses GrowBag graphs for identifying computer-science-specific keywords [2]. Then, for each title, we computed a list of the top-k most similar keywords through word2vec, using [5] to learn a sequence of 4-grams to identify Similar_To relations.
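As an illustration of the venue similarity described above, the following minimal sketch computes the Jaccard index between the author sets of two venues. The venue identifiers, author names, and the `jaccard` helper are hypothetical toy values chosen for this example; they are not taken from GraphDBLP, where the similarities are derived through Cypher queries on the graph.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; taken as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical toy data: venue id -> set of authors who contributed to it.
venue_authors = {
    "v1": {"alice", "bob", "carol"},
    "v2": {"bob", "carol", "dave"},
    "v3": {"erin"},
}

# v1 and v2 share two of four distinct authors; v1 and v3 share none.
sim_v1_v2 = jaccard(venue_authors["v1"], venue_authors["v2"])
sim_v1_v3 = jaccard(venue_authors["v1"], venue_authors["v3"])
print(sim_v1_v2, sim_v1_v3)
```

With this toy data the two venues that share most of their authors obtain a high similarity, while venues with disjoint author sets obtain zero.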


Table 1: Output of query Q1 - Knowledge Discovery for the keyword knowledge_management (top-5 items returned). Values are shown in %.

Author              Relevance  Score  Venues (DBLP id)
Murray E. Jennex    0.63       46.15  [ijkm, isf, ijiscram, joeuc, amcis, hicss]
Stefan Smolnik      0.42       15.84  [hicss, wm, icis]
David T. Croasdell  0.39       41.67  [hicss]
Petter Gottschalk   0.39       17.86  [hicss, jilt, kbs, es, ijitm, eswa, eg, isf, irmj, jkm, informingscij]
Henry Linger        0.34       38.24  [ijkm, sjis, ajis, itp, jds, isdevel, pacis, amcis, icis, ecis]

3 What Can You Do with GraphDBLP?

GraphDBLP is provided with four pre-defined queries, accessible through either a Python shell interface or a Web GUI. In addition, any graph query can be written directly in the Cypher query language. Here we show Q1 and Q2.

Q1: Keyword Discovery takes as input a keyword (i.e., research topic) and returns a list of authors working on that topic, along with venues where they have published their research (see Tab. 1). The relevance estimates the prolificacy of the author within the whole DBLP community that has been working on that topic, while the score estimates the weight of that keyword among all the author’s publication records. This query is useful to perform expert finding on a given research topic and similar research fields.
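The text does not give formulas for the two measures. One plausible reading, assumed here purely for illustration, is that relevance is the author's share of all publications on the keyword, and score is the share of the author's own publications that carry the keyword. The sketch below computes both as percentages over a hypothetical toy corpus; the `relevance_and_score` function and the data are assumptions, not GraphDBLP's implementation.

```python
from collections import Counter

# Hypothetical toy corpus: one (author, keyword) pair per publication-keyword link.
links = (
    [("ann", "knowledge_management")] * 2
    + [("ann", "planning")] * 2
    + [("ben", "knowledge_management")] * 6
    + [("ben", "data_quality")] * 2
)


def relevance_and_score(links, keyword):
    """Return (author, relevance %, score %) rows, sorted by relevance (descending).

    Assumed definitions (not stated in the paper):
      relevance = author's links on `keyword` / all links on `keyword`
      score     = author's links on `keyword` / all of the author's links
    """
    per_author_total = Counter(author for author, _ in links)
    per_author_kw = Counter(author for author, kw in links if kw == keyword)
    kw_total = sum(per_author_kw.values())
    rows = []
    for author, n in per_author_kw.items():
        relevance = 100.0 * n / kw_total
        score = 100.0 * n / per_author_total[author]
        rows.append((author, relevance, score))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


rows = relevance_and_score(links, "knowledge_management")
for author, relevance, score in rows:
    print(f"{author}: relevance={relevance:.2f}% score={score:.2f}%")
```

Under these assumed definitions, an author can have a low relevance (a small share of a large community) and still a high score (the topic dominates her/his own record), which matches the pattern visible in Tab. 1.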

Q2: Researcher Profiling takes as input the name of a researcher and extracts all the topics on which she/he has been working throughout her/his career (Tab. 2). This query is useful to profile researchers, and to discover other researchers working on similar or related topics. To this end, a list of similar keywords is returned for each topic, together with the similarity values.
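The list of keyword similarities can be pictured as ranking keyword vectors by similarity to the topic and keeping those above a threshold. The sketch below assumes cosine similarity, the usual choice for word2vec vectors, and a 0.6 cut-off; the two-dimensional vectors are hypothetical toy values, not embeddings learned by GraphDBLP.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy embeddings (real word2vec vectors have far more dimensions).
vectors = {
    "planning": (1.0, 0.0),
    "path planning": (4.0, 3.0),
    "planner": (1.0, 1.0),
    "data privacy": (0.0, 1.0),
}


def similar_keywords(keyword, vectors, threshold=0.6):
    """Keywords whose cosine similarity to `keyword` is >= threshold, best first."""
    base = vectors[keyword]
    scored = [
        (other, cosine(base, vec))
        for other, vec in vectors.items()
        if other != keyword
    ]
    kept = [(name, sim) for name, sim in scored if sim >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


similar = similar_keywords("planning", vectors)
for name, sim in similar:
    print(f"{name} ({sim:.2f})")
```

In this toy setting, "data privacy" is orthogonal to "planning" and is filtered out, while the two planning-related keywords are kept and ordered by similarity.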

Table 2: Output of Q2 - Researcher Profiling for Fabio Mercorio (top-3 keywords and researchers). Values in %.

Keyword         Fabio Mercorio   Suggested Author
                Rel    Score     Name                  Rel    Score
planning        0.05   18.52     Subbarao Kambhampati  0.64   26.85
                                 Eva Onaindia          0.41   52.11
                                 Dana S. Nau           0.37   27.42
data quality    0.42   14.81     Ismael Caballero      3.41   40.51
                                 Mario Piattini        2.98   4.28
                                 Angelica Caro         2.45   69.7
model checking  0.11   11.1      Edmund Clarke         1.33   17.11
                                 Moshe Vardi           1.16   10.99
                                 E. Allen Emerson      0.95   29.63

Keyword similarities (value ≥ 0.6):
planning: motion planning (0.6), optimal planning (0.6), planner (0.65), planning control (0.63), path planning (0.63)
data quality: software quality (0.74), information quality (0.84), service quality (0.73), public health (0.72), data privacy (0.71), business intelligence (0.71)
model checking: safety properties (0.77), state machines (0.8), abstraction refinement (0.83), reachability analysis (0.79), runtime verification (0.79), abstract interpretation (0.78), timed automata (0.78), formal verification (0.78), model checker (0.85)

Scalability. A performance test was executed on a GraphDBLP instance, measuring the running time of Q1 and Q2. Our tests selected a random set of k ∈ {10, 100, 1000, 5000} keywords for Q1 and authors for Q2. Results showed that the running time is acceptable even in the worst cases: on average, GraphDBLP requires 0.01 seconds to execute Q1 and 7.9 seconds to execute Q2, and it never needed more than 0.33 seconds for Q1 or 32.1 seconds for Q2.
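A test of this kind can be sketched as a small timing harness that draws a random sample of inputs, runs the query on each, and reports the average and worst-case running time. The `benchmark` function, the `fake_q1` stand-in, and the sampling without replacement are assumptions made for illustration; a real test would call the actual Q1 or Q2 against the graph database.

```python
import random
import statistics
import time


def benchmark(run_query, inputs, k, seed=0):
    """Time `run_query` on k randomly sampled inputs; return (mean, max) in seconds."""
    rng = random.Random(seed)          # fixed seed so the sample is reproducible
    sample = rng.sample(inputs, k)     # sampling without replacement (an assumption)
    timings = []
    for item in sample:
        start = time.perf_counter()
        run_query(item)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), max(timings)


def fake_q1(keyword):
    """Hypothetical stand-in for Q1; a real test would query the database here."""
    return [keyword.upper()]


keywords = [f"kw{i}" for i in range(5000)]
results = {k: benchmark(fake_q1, keywords, k) for k in (10, 100, 1000, 5000)}
for k, (mean_s, max_s) in results.items():
    print(f"k={k}: mean={mean_s:.6f}s max={max_s:.6f}s")
```

Reporting both the mean and the maximum mirrors the two figures quoted above for each query.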

4 Limitations and Future Work

Though we are continuously working to upgrade GraphDBLP, it already includes a range of features that motivate us to share our work with the community. With this work, we seek to encourage other researchers to use our tool, with the aim of building a shared and freely accessible network of computer scientists as a graph. To date, GraphDBLP does not take into account citations and research abstracts. We are currently working to improve the similarity relationships by using citations and texts from the AMiner project. We are also working to provide GraphDBLP as a service through REST APIs.

DEMO. The demo video is accessible at https://youtu.be/eoDX-782Z8M, while the source code is on GitHub.

References

1. Chikhaoui, B., Chiazzaro, M., Wang, S.: A new granger causal model for influence evolution in dynamic social networks: The case of DBLP. In: AAAI (2015)

2. Diederich, J., Balke, W.T., Thaden, U.: Demonstrating the semantic growbag: automatically creating topic facets for faceteddblp. In: ACM/IEEE-CS joint conference on Digital libraries. pp. 505–505. ACM (2007)

3. Durand, G.C., Janardhana, A., Pinnecke, M., Shakeel, Y., Krüger, J., Leich, T., Saake, G.: Exploring large scholarly networks with hermes. In: EDBT (2018)

4. Mezzanzanica, M., Mercorio, F., Cesarini, M., Moscato, V., Picariello, A.: GraphDBLP: a system for analysing networks of computer scientists through graph databases. Multimedia Tools and Applications 77(14), 18657–18688 (Jul 2018)

5. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems. pp. 3111–3119 (2013)

6. Moreira, C., Calado, P., Martins, B.: Learning to rank academic experts in the DBLP dataset. Expert Systems 32(4), 477–493 (2015)

7. Wu, P., Pan, L.: Mining application-aware community organization with expanded feature subspaces from concerned attributes in social networks. Knowledge-Based Systems 139, 1–12 (2018)

8. Xia, F., Wang, W., Bekele, T.M., Liu, H.: Big scholarly data: A survey. IEEE Transactions on Big Data 3(1), 18–35 (2017)
