
ITALIAN NATIONAL RESEARCH COUNCIL

“NELLO CARRARA” INSTITUTE FOR APPLIED PHYSICS

CNR FLORENCE RESEARCH AREA

Italy

__________________________

TECHNICAL, SCIENTIFIC AND RESEARCH REPORTS

__________________________

Vol. 2 - n. 65-8 (2010)

ISSN 2035-5831

Francesco Gabbanini

On a Java based implementation of ontology evolution processes based on Natural Language Processing

Commessa ICT e-Inclusion Lab ICT-P10-007

Consiglio Nazionale delle Ricerche
Istituto di Fisica Applicata “Nello Carrara” - CNR
Via Madonna del Piano 10 - 50019 Sesto Fiorentino - Italia

IFAC-TR-09/010

Francesco Gabbanini

On a Java based implementation of ontology evolution processes based on Natural Language Processing


On a Java based implementation of ontology evolution processes based

on Natural Language Processing

F. Gabbanini

Institute for Applied Physics, Italian National Research Council Via Madonna del Piano 10, 50019, Sesto Fiorentino, Firenze, Italy

f.gabbanini@ifac.cnr.it

1 Introduction

An architecture that can serve as a basis for the design of a Collective Knowledge Management System was described in Burzagli et al. (2010). The system can be used to exploit the strengths of collective intelligence and bridge the gap that exists between two expressions of web intelligence, i.e., the Semantic Web and Web 2.0. In the architecture, a key component is represented by the Ontology Evolution Manager, made up of an Annotation Engine and a Feed Adapter, which is able to interpret textual contributions that represent human intelligence (such as posts on social networking tools), using automatic learning techniques, and to insert the knowledge contained therein into a structure described by an ontology.

This opens up interesting scenarios for the collective knowledge management system, which could be used to provide up to date information describing a given domain of interest, to automatically augment it (thus coping with information evolution), and to make information available for browsing and searching by an ontology driven engine.

This report describes a Java based implementation of the Ontology Evolution Manager within the above outlined architecture.

2 Specification of building blocks

The Ontology Evolution Manager (sketched in Figure 1) is designed to take corpora of textual documents as input, produce a series of RDF statements and use them to enrich an ontology. In order to achieve these aims, the main issues to be faced in its implementation are:

1. extracting machine readable knowledge from text;

2. transforming the extracted information into a form that is suitable for insertion into an ontology (a minimal sketch of this flow is given below).
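To make this flow concrete, the following minimal sketch expresses the two responsibilities above as Java interfaces. The names (TextKnowledgeExtractor, OntologyFeeder, ExtractedStatement) are purely illustrative and are not part of the implementation described in the following sections.

import java.util.List;

// Hypothetical sketch of the two building blocks; names and signatures are
// illustrative only and do not belong to the implementation described below.
interface TextKnowledgeExtractor {
    // Issue 1: extract machine readable knowledge from text.
    List<ExtractedStatement> extract(String text);
}

interface OntologyFeeder {
    // Issue 2: transform and insert the extracted statements into an ontology.
    void enrich(List<ExtractedStatement> statements);
}

// A minimal subject-predicate-object triple used as the exchange format.
class ExtractedStatement {
    final String subject, predicate, object;
    ExtractedStatement(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}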

Figure 1. Ontology Evolution Manager general scheme

It is to be noted that knowledge to be extracted may in principle regard both assertions about individuals (as in “hotel X has a good cuisine”, where “hotel X” represents an individual of the “hotels” entity) and assertions about entities (as in “hotels have bedrooms”). In the first case the problem under study is named “ontology population”, while in the second it is named “ontology learning”, the two terms being synthesized by the more general “ontology evolution”.

In the following sections a description will be given about a Java based implementation of software components used for information extraction from texts (both in the sense of “ontology population” and “ontology learning”, see the Annotation Engine block in Figure 1) and for growing the knowledge base (see the Feed Adapter block in Figure 1).

3 Annotation Engine: implementation of Natural Language Processing techniques

In order to extract knowledge from texts, Natural Language Processing (NLP) techniques have to be employed (see Buitelaar and Cimiano, 2008).

Generally, these consist in a series of steps in which text is analysed, and include annotating text with a variety of information. These steps are typically represented by:

• splitting sentences;
• splitting text into words through tokenization;
• part-of-speech (POS) tagging, which consists in a form of grammatical tagging, marking up the words in a text as corresponding to a particular part of speech, based on both its definition, as well as its relationship with adjacent and related words in a phrase, sentence, or paragraph;
• term indexing using gazetteers (a particular sort of dictionaries);
• user-defined transduction processes to further analyse texts using finite state transducers that operate over annotations based on regular expressions, for pattern-matching, semantic extraction, and other operations over syntactic trees produced by the previous steps.

The Annotation Engine block is meant to implement NLP techniques so as to process and annotate textual contents, in order to provide coherent and structured inputs to the Feed Adapter block, which in turn uses them to enrich an ontology with new concepts and assertions.

Figure 2. UML class diagram of the Ontology Evolution Manager

3.1 A GATE based implementation of NLP

Although several algorithm implementations exist that perform some of the processing steps described in the previous section (see LExO (2010), Ontomat (2010), OpenNLP (2010), Text2Onto (2010)), the Annotation Engine block is based on the General Architecture for Text Engineering (GATE, see Cunningham et al. (2002), Maynard et al. (2008)).

GATE provides a modular object-oriented framework implemented in Java to embed language processing functionality in diverse applications. It can be extended and customised for different tasks by loading plugins, which can in turn contain a number of resources able to hold linguistic data and to process data.
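As an indication of how such a pipeline looks when assembled directly on the GATE embedded API (the GateManager façade introduced below wraps calls of this kind), the following sketch initializes GATE, loads GATE's built-in information extraction plugin (ANNIE, described below) and runs a few of the processing resources listed in Section 3 over a one-document corpus. The resource class names, the GATE home path and the plugin loading call are assumptions based on GATE releases of that period, not code taken from the implementation described in this report.

import java.io.File;
import gate.Corpus;
import gate.Document;
import gate.Factory;
import gate.Gate;
import gate.ProcessingResource;
import gate.creole.SerialAnalyserController;

public class PlainGatePipelineSketch {
    public static void main(String[] args) throws Exception {
        // Initialize GATE and load the ANNIE plugin (paths are assumptions).
        Gate.setGateHome(new File("/opt/gate"));
        Gate.init();
        Gate.getCreoleRegister().registerDirectories(
                new File(Gate.getPluginsHome(), "ANNIE").toURI().toURL());

        // Assemble a pipeline with some of the steps listed in Section 3.
        SerialAnalyserController pipeline = (SerialAnalyserController)
                Factory.createResource("gate.creole.SerialAnalyserController");
        pipeline.add((ProcessingResource) Factory.createResource("gate.creole.tokeniser.DefaultTokeniser"));
        pipeline.add((ProcessingResource) Factory.createResource("gate.creole.splitter.SentenceSplitter"));
        pipeline.add((ProcessingResource) Factory.createResource("gate.creole.POSTagger"));
        pipeline.add((ProcessingResource) Factory.createResource("gate.creole.gazetteer.DefaultGazetteer"));

        // Run the pipeline over a one-document corpus and dump the resulting annotations.
        Corpus corpus = Factory.newCorpus("sample");
        Document doc = Factory.newDocument("Hotel X has a good cuisine.");
        corpus.add(doc);
        pipeline.setCorpus(corpus);
        pipeline.execute();
        System.out.println(doc.getAnnotations());
    }
}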

GATE is distributed with an Information Extraction (IE) system called “A Nearly-New IE System” (ANNIE), which relies on finite state algorithms and the Java Annotation Patterns Engine (JAPE) to process text corpora and performs operations such as sentence detection, tokenization, POS-tagging, chunking and parsing, named-entity detection, and pronominal co-reference.

JAPE is a finite state transducer system that operates over annotations based on regular expressions. Thus it is useful for pattern-matching, semantic extraction and many other operations over syntactic trees such as those produced by natural language parsers. JAPE operations are described by grammars which get converted into finite state machines as soon as they are loaded into the GATE framework.

Functionalities offered by the GATE APIs were reorganized in order for them to be available and easily usable within the more general framework of the Collective Knowledge Management System.

For this purpose, the GateManager Java class, as sketched in the UML diagram of Figure 2, was designed to act as a façade to access various functionalities made available by the GATE API, such as initializing the GATE system, registering GATE plugins and resources, managing text corpora and parsing text corpora.

Figure 3. An example text processing pipeline

The GateManager class is responsible for managing the annotation process, which is based on a pipeline approach. A text document enters the pipeline and gets processed by the various registered plug-in resources, which, in turn, enrich it with a set of annotations. Each plug-in may contain a set of text annotation resources which can take advantage of annotations taken by means of resources that precede it in the pipeline. An example pipeline is given in Figure 3.

As the text document enters the pipeline it first gets cleaned up from previous annotations; then sentences (through a Sentence Splitter resource) and words (through a Sentence Tokenizer resource) are identified; then POS tagging is performed. These constitute fundamental steps on which further processing steps can be based and which may include applying gazetteers to recognize geographic entities, proper nouns or dates. Finally, user defined elaboration processes are performed at the end of the pipeline, by means of transducers that use custom JAPE grammars that allow identifying patterns that are relevant for a certain domain and annotating them. At the end of the text processing pipeline an annotated text document is obtained, and the GATE API includes Java classes that allow going through the annotations.
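For reference, walking the annotations of a processed document with the GATE API typically looks like the following sketch; the “Person” annotation type is a standard ANNIE one and is used here only as an example, as the report does not list the annotation types actually consumed.

import gate.Annotation;
import gate.AnnotationSet;
import gate.Document;
import gate.util.InvalidOffsetException;

public class AnnotationWalkSketch {
    // Prints every "Person" annotation produced by the pipeline together with its covered text.
    public static void printPersons(Document doc) throws InvalidOffsetException {
        AnnotationSet persons = doc.getAnnotations().get("Person");
        for (Annotation a : persons) {
            String covered = doc.getContent()
                    .getContent(a.getStartNode().getOffset(), a.getEndNode().getOffset())
                    .toString();
            System.out.println(a.getType() + " -> " + covered + " " + a.getFeatures());
        }
    }
}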

With reference to the UML diagram in Figure 2, showing the underlying architecture of the natural language processing infrastructure, the GateManager class is responsible for managing the annotation engine: for this purpose it needs plug-ins and resources to be registered for pipeline annotation, each resource implementing an annotation step (see blocks within the text processing pipeline in Figure 3). This process is implemented using a visitor pattern (see Gamma et al. (1995)).

The GateManager is first made aware of which plug-ins to use and of which resources (taken from previously set plug-ins) to use for the set-up of the pipeline. Each resource is modelled by a register class which implements a RegisterVisitor interface and is capable of performing self-initialization steps. Register classes may or may not require property maps for initialization purposes: in the latter case they consist in a generalization of the SimpleResourceRegister class. More complex annotation processes require custom register classes to be written. As an example, annotations based on custom JAPE grammars are performed using objects of class OWLIMTransducerRegister. The class can be initialized by specifying custom JAPE grammars to be used to identify patterns that are relevant for a certain domain and to annotate texts based on the occurrence of these patterns. Obviously, multiple instances of OWLIMTransducerRegister may be inserted into the pipeline.

Through the visit method of its interface, each register class is added to a SerialAnalyserController object, which is defined by a Java class in the GATE API and is used to manage the text processing pipeline (see code excerpt in Table 1).

Table 1. Code excerpt showing how to implement a text processing pipeline

public class GateManagerTest {

    protected GateManager gateManager;
    ...

    public void initializationTest() {
        gateManager = GateManager.getInstance();
        ...
        gateManager.registerPlugin("ANNIE");
        gateManager.registerPlugin("Tagger_OpenCalais");
        ...
        gateManager.registerResource(new AnnotationDeleteRegister());
        gateManager.registerResource(new SentenceSplitterRegister());
        gateManager.registerResource(new DefaultTokenizerRegister());
        gateManager.registerResource(new DefaultGazetteerRegister());
        gateManager.registerResource(new OpenCalaisRegister("..."));
        ...

        Corpus elaboratedCorpus = gateManager.elaborateCorpus();
        DefaultGazetteerParser gazetteerParser = new DefaultGazetteerParser();
        gateManager.parseAnnotation(gazetteerParser);
        gazetteerParser.getAnnotatedResources();
        ...
    }
}

public class GateManager {

    private static GateManager instance;
    private SerialAnalyserController serialController;

    public void registerResource(RegisterVisitor register) throws GateException {
        register.visit(this);
    }
    ...

    public void parseAnnotation(AnnotationParserVisitor parser) throws GateException {
        parser.visit(this);
    }
}

public class DefaultGazetteerRegister implements RegisterVisitor {

    private String resource = "...";

    public void visit(GateManager manager) throws GateException {
        new SimpleResourceRegister(resource).visit(manager);
    }
    ...
}

public class DefaultGazetteerParser implements AnnotationParserVisitor {

    public void visit(GateManager manager) throws GateException {
        // annotation parsing code
        ...
    }

    public List<AnnotatedResource> getAnnotatedResources() {
        return resources;
    }
}

After initialization, the GateManager is ready to perform text processing by running all the registered resources in cascade. Once text processing is done, the system ends up with a corpus of annotated documents: these can be parsed using an effective class infrastructure which was set up, again, using the visitor pattern. A parser interface was created, named AnnotationParserVisitor, to be implemented by annotation parser classes that have to be set up for each type of annotation. Annotations are retrieved from the documents by issuing a call to the parseAnnotation method of the GateManager, which takes a parser object as input.

This class structure efficiently encapsulates various GATE functionalities, allows plugin and resource initialization to be conveniently separated from their usage in the pipeline, and makes it possible to conveniently retrieve annotations from the corpus as objects of class AnnotatedResource.
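As a sketch of what such a parser may look like, the class below collects “Date” annotations following the AnnotationParserVisitor contract of Table 1. The getCorpus() accessor on GateManager and the AnnotatedResource constructor used here are assumptions, introduced only to make the example self-contained.

import java.util.ArrayList;
import java.util.List;
import gate.Annotation;
import gate.Document;
import gate.util.GateException;

// Sketch only: a hypothetical parser for "Date" annotations that follows the
// AnnotationParserVisitor contract of Table 1. The getCorpus() accessor and
// the AnnotatedResource constructor are assumptions.
public class DateAnnotationParser implements AnnotationParserVisitor {

    private final List<AnnotatedResource> resources = new ArrayList<AnnotatedResource>();

    public void visit(GateManager manager) throws GateException {
        for (Document doc : manager.getCorpus()) {
            for (Annotation a : doc.getAnnotations().get("Date")) {
                String covered = doc.getContent()
                        .getContent(a.getStartNode().getOffset(), a.getEndNode().getOffset())
                        .toString();
                // Wrap each annotation into the framework's exchange object.
                resources.add(new AnnotatedResource(a.getType(), covered));
            }
        }
    }

    public List<AnnotatedResource> getAnnotatedResources() {
        return resources;
    }
}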

Finally, it is to be noted that the whole annotation process can also be managed from a single entry point, i.e., the AnnotationEngine class, which can be configured using an AnnotationEngineConfig object and holds a static reference to the GateManager.


Annotations made available from the different resources constitute the basis on which ontology evolution is performed by the Feed Adapter, as they in principle contain new concepts, relations or individuals relevant to the domain under study.

4 Feed Adapter: the ontology evolution block

A number of Java based frameworks exist to create, alter and persist ontologies. As each one has different characteristics, each is best suited to different application scenarios.

Before designing the Feed Adapter block, the most popular semantic web frameworks were examined. Characteristics of interest that were considered are reported in Table 2.

Table 2. Characteristics of semantic web frameworks

Name | SPARQL support | OWL 2.0 support | Reasoning features | Persistence

Jena 2.6.2 | Yes | No | Unable to reason on data type restrictions (the API is not compatible with a version of Pellet that is capable of reasoning on data type restrictions) | file, database

Protegé OWL API | No | No | Unable to reason on data type restrictions (the API is not compatible with a version of Pellet that is capable of reasoning on data type restrictions) | file

OWL API 3.0.0 | No | Yes | Able to reason on data type restrictions, if the HermiT reasoner is used, as Pellet is still not compatible with OWL API 3 | file

AllegroGraph 3.3 | Yes | ? | Able to reason on data type restrictions. Uses a proprietary reasoner (RDFS++), which is an RDF reasoner and not an OWL reasoner | database

Sesame 2.3.1 | Yes | Yes | Unable to reason on data type restrictions. OWLIM is compatible with Sesame, but it is only an OWL Lite reasoner | file, database (MySql, Postgres), binary files

Desirable features for the Feed Adapter implementation are:

• support for OWL 2, the most recent W3C recommendation (dated 27th October 2009) that refines and extends OWL, the Web Ontology Language, see OWL Working Group at W3C (2009);
• ability to reason and make inferences over data type restrictions (see the sketch after this list);
• support for a variety of persistence methods;
• support for SPARQL Protocol and RDF Query Language (SPARQL, see SPARQL Working Group at W3C (2008)) queries.
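To make the second requirement concrete: a data type restriction constrains the literal values a data property may take (for example, a star rating between 1 and 5), and reasoning over it means detecting individuals that violate the constraint. The following sketch builds such a restriction with OWL API 3; the property and ontology IRIs are invented for illustration and are not part of the report's implementation.

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.vocab.OWLFacet;

public class DatatypeRestrictionSketch {
    public static void main(String[] args) throws OWLOntologyCreationException {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLDataFactory factory = manager.getOWLDataFactory();
        OWLOntology ontology =
                manager.createOntology(IRI.create("http://www.ifac.cnr.it/test"));

        // A data property whose range is restricted to integers between 1 and 5.
        OWLDataProperty starRating =
                factory.getOWLDataProperty(IRI.create("http://www.ifac.cnr.it/test#starRating"));
        OWLDataRange oneToFive = factory.getOWLDatatypeRestriction(
                factory.getIntegerOWLDatatype(),
                factory.getOWLFacetRestriction(OWLFacet.MIN_INCLUSIVE, factory.getOWLLiteral(1)),
                factory.getOWLFacetRestriction(OWLFacet.MAX_INCLUSIVE, factory.getOWLLiteral(5)));

        manager.addAxiom(ontology,
                factory.getOWLDataPropertyRangeAxiom(starRating, oneToFive));
        // A reasoner supporting datatype restrictions (e.g. HermiT) can now flag
        // individuals whose starRating lies outside [1, 5].
    }
}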

Unfortunately, as the table shows, among the most popular products in this sector, no “full featured” framework is available.

However, the most “promising” frameworks to be adopted within the Collective Knowledge Management System were identified to be OWL API 3 and Sesame 2.3.1 (see OWL API (2010) and Sesame (2010), respectively, both of them Open Source), also considering the fact that they are supported by an active community of developers.

In order not to be tied to a particular implementation and to a precise framework, the Feed Adapter was designed as a middleware block acting as an adapter between annotations, coming from parsers described in section 3, and an ontology. For the moment only the adapter for the


Sesame 2.3.1 framework was implemented, with the OWL API 3 implementation being in progress.

4.1 Sesame Adapter implementation details

The Sesame Adapter is designed around the SesameModelHandler and includes the SesameGazetteerParser and SesameOpenCalaisParser classes.

The SesameModelHandler maintains a reference to a Repository interface, which is part of the Sesame API and can be used to access various Repository implementations, such as the SailRepository (also part of the Sesame API), which defines a Sesame repository that contains RDF data that can be queried and updated and operates on a stack of Sail objects. Sail objects can store RDF statements and evaluate queries over them.

Through the SesameModelHandler it is therefore possible to get access to statements that are present in the repository and to modify the repository itself.
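As an indication of what this access looks like at the Sesame API level, the following minimal sketch adds a statement and reads it back; it uses an in-memory store rather than the OWLIM backed repository actually created by the SesameModelHandler, and the namespace is invented for illustration.

import org.openrdf.model.Statement;
import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.model.vocabulary.RDF;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.RepositoryResult;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.sail.memory.MemoryStore;

public class SesameAccessSketch {
    public static void main(String[] args) throws Exception {
        // A SailRepository stacked on a simple in-memory Sail.
        Repository repository = new SailRepository(new MemoryStore());
        repository.initialize();

        ValueFactory vf = repository.getValueFactory();
        URI hotelX = vf.createURI("http://www.ifac.cnr.it/test#", "hotelX");
        URI hotelClass = vf.createURI("http://www.ifac.cnr.it/test#", "Hotel");

        RepositoryConnection connection = repository.getConnection();
        try {
            // Assert "hotelX rdf:type Hotel" and read every statement about hotelX back.
            connection.add(hotelX, RDF.TYPE, hotelClass);
            RepositoryResult<Statement> statements =
                    connection.getStatements(hotelX, null, null, true);
            while (statements.hasNext()) {
                System.out.println(statements.next());
            }
            statements.close();
        } finally {
            connection.close();
        }
    }
}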

As for the SesameGazetteerParser and SesameOpenCalaisParser, these extend the DefaultGazetteerParser and the OpenCalaisParser, respectively, and override their visit method: this allows annotations coming from the NLP process to be mapped to assertions in the ontology. Although implemented only for the previously mentioned parsers, this construct may be generalised to any kind of parser.

A usage sample of the adapter is given in Table 3, which shows an excerpt from a JUnit test case and also serves as an example of how to use the framework for an ontology evolution process. It is to be noted that the natural language processing step is centrally managed through the AnnotationEngine class, whereas in Table 1 it was handled through the GateManager.

Table 3. Code excerpt showing how to implement the ontology evolution process

public class OntoEvolutionTest {

    private AnnotationEngine annotationEngine;
    private final String BASE_URI = "http://www.ifac.cnr.it/test#";
    private final String REPOSITORY_PATH = "nativeStore/owlim";

    @Before
    public void setUp() throws Exception {
        GateManager.getInstance();
        SesameModelHandler.getInstance().createOWLIMRepository(REPOSITORY_PATH);

        AnnotationEngineConfig config = new AnnotationEngineConfig.Builder()
            .pluginName("ANNIE")
            .pluginName("Tagger_OpenCalais")
            .resourceRegister(new AnnotationDeleteRegister())
            .resourceRegister(new SentenceSplitterRegister())
            .resourceRegister(new DefaultTokenizerRegister())
            .resourceRegister(new POSTaggerRegister())
            .resourceRegister(new DefaultGazetteerRegister())
            .resourceRegister(new OpenCalaisRegister(...))
            .resourceRegister(new OWLIMTransducerRegister())
            .annotationParser(new SesameGazetteerParser(BASE_URI))
            .annotationParser(new SesameOpenCalaisParser(BASE_URI))
            .textToAnnotate(text)
            .build();

        this.annotationEngine = new AnnotationEngine(config);
    }

    @Test
    public void testDoAnnotation() throws Exception {
        this.annotationEngine.doAnnotation();

        RepositoryConnection connection =
            SesameModelHandler.getInstance().getRepository().getConnection();
        ValueFactory valueFactory =
            SesameModelHandler.getInstance().getRepository().getValueFactory();

        try {
            URI aURI = valueFactory.createURI(BASE_URI, "...");
            RepositoryResult<Statement> statements =
                connection.getStatements(aURI, OWL.INDIVIDUAL, null, true);
            ...
        } finally {
            connection.close();
        }
    }
}
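Once the repository has been populated by the annotation parsers, its content can also be queried back with SPARQL, one of the requirements listed above. The following minimal sketch runs a query against the same SesameModelHandler repository; the query and variable names are illustrative only.

import org.openrdf.query.BindingSet;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQuery;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.RepositoryConnection;

public class SparqlQuerySketch {
    public static void listIndividuals() throws Exception {
        RepositoryConnection connection =
                SesameModelHandler.getInstance().getRepository().getConnection();
        try {
            // List every resource that has an rdf:type in the repository.
            TupleQuery query = connection.prepareTupleQuery(QueryLanguage.SPARQL,
                    "SELECT ?s ?type WHERE { ?s a ?type }");
            TupleQueryResult result = query.evaluate();
            while (result.hasNext()) {
                BindingSet row = result.next();
                System.out.println(row.getValue("s") + " is a " + row.getValue("type"));
            }
            result.close();
        } finally {
            connection.close();
        }
    }
}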

5 Conclusions and future developments

The report illustrates details regarding the Java implementation of an Ontology Evolution Manager, a software component that extracts structured information from natural language and uses it for “growing” ontologies.

As such, it aims at exploiting synergies between Web 2.0 and the Semantic Web, potentially acting as a bridge from user contributed (unstructured) text to information organized in ontologies.

Future implementation work related to the processing logic layer of the knowledge management system will regard enriching the framework with support for relation discovery using WordNet (see WordNet (2010)) and the Scarlet framework (see Scarlet (2010)). As for ontology management, it will be interesting to evaluate Empire (see Empire (2010)), which is an implementation of the Java Persistence API (JPA) for RDF and the Semantic Web. Adoption of JPA for persistence would represent a step forward towards integration of the collective knowledge management framework within Java Enterprise Edition applications.


6 References

Berners-Lee, T., Hendler, J., Lassila, O., 2001. The Semantic Web. Scientific American 284 (5), 34-43.

Buitelaar, P., Cimiano, P. (Eds.), 2008. Ontology Learning and Population: Bridging the Gap between Text and Knowledge. Vol. 167 of Frontiers in Artificial Intelligence and Applications. IOS Press, Amsterdam.

Burzagli, L., Como, A., Gabbanini, F., 2010. Towards the convergence of Web 2.0 and Semantic Web for e-Inclusion. In: Miesenberger, K., Klaus, J., Zagler, W., Karshmer, A. (Eds.), Computers Helping People with Special Needs. Vol. 6180 of Lecture Notes in Computer Science. Springer, pp. 343-350.

Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., 2002. GATE: A framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics.

Empire, 2010. Available at http://github.com/clarkparsia/Empire, last visited on 17/09/2010

Gamma, E., Helm, R., Johnson, R., Vlissides, J., 1995. Design patterns: elements of reusable object-oriented software. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.

LExO, 2010. Available at http://code.google.com/p/lexo/, last visited on 20/09/2010

Maynard, D., Li, Y., Peters, W., 2008. NLP techniques for term extraction and ontology population. In: Proceedings of the 2008 conference on Ontology Learning and Population: Bridging the Gap between Text and Knowledge. IOS Press, Amsterdam, The Netherlands, pp. 107-127.

Ontomat, 2010. Available at http://annotation.semanticweb.org/ontomat/index.html, last visited on 20/09/2010

OpenNLP, 2010. Available at http://opennlp.sourceforge.net/, last visited on 20/09/2010

OWL API, 2010. Available at http://owlapi.sourceforge.net/, last visited on 17/09/2010

OWL Working Group at W3C, 2009. http://www.w3.org/2007/OWL/wiki/OWL_Working_Group, last visited on 17/09/2010.

Scarlet, 2010. Available at http://scarlet.open.ac.uk/, last visited on 17/09/2010

Sesame, 2010. Available at http://www.openrdf.org/, last visited on 17/09/2010

SPARQL Working Group at W3C, 2008. SPARQL Query Language for RDF Recommendation. Available at http://www.w3.org/TR/rdf-sparql-query/, last visited on 17/09/2010

Text2Onto, 2010. Available at http://sourceforge.net/projects/texttoonto/, last visited on 20/09/2010

WordNet, 2010. Available at http://wordnet.princeton.edu/, last visited on 17/09/2010
