
NLP based Information Extraction methods for Patent Analysis


Academic year: 2021


UNIVERSITÀ DI PISA

DOTTORATO DI RICERCA IN INGEGNERIA DELL'INFORMAZIONE

INFORMATION EXTRACTION SYSTEMS FOR PATENT ANALYSIS

DOCTORAL THESIS

Author

Andrea Cimino

Tutor(s)

Prof. Francesco Marcelloni, PhD. Felice Dell’Orletta

Reviewer(s)

Prof. Fabio Tamburini, Prof. Alessandro Mazzei

The Coordinator of the PhD Program

Prof. Marco Luise

Pisa, January 2017

XXIX Cycle


Acknowledgments

This thesis would not have been possible without the support of many people.

I specially want to express my gratitude to my supervisor Felice Dell’Orletta for his support and patience during these years and for being my first real contact with computational linguistics and research in general.

Many thanks also to Prof. Gualtiero Fantoni and Filippo Chiarello of the Department of Civil and Industrial Engineering of Pisa for providing support in what goes on behind the scenes of the patent world. Without their support, this thesis would not have been possible. Thanks to Errequadro s.r.l., which provided the patent sets analyzed in the case studies discussed in this thesis.

Thanks to everybody who reviewed this thesis, in particular Prof. Mazzei who did a great job for improving the quality of this thesis.

I also express my gratitude to the Institute for Computational Linguistics (ILC-CNR) and to all the members of the ItaliaNLP Laboratory. I am really grateful to Giulia Venturi and Dominique Brunato for their constant motivation, inspiration and support during all these years.

Finally, many thanks to my family for their support, education and encouragement for this PhD journey.


Summary

The focus of this thesis is the analysis of patents through NLP-based extraction systems. State-of-the-art systems for automatic patent analysis are designed for engineers and attorneys, and they usually do not take into account that there is a variety of patent readers, such as marketers and designers, who are becoming more and more interested in this topic. This new audience is interested in automatic patent analysis because patents contain relevant information that anticipates the availability of products on the market. Managing such information can help them to identify new market trends and define successful strategies.

The main novelty of this work is that the entire information extraction pipeline has been designed to extract relevant information for this new audience. This work focuses on the extraction of the users that will possibly benefit from an invention, the advantages that an invention brings, and the drawbacks that an innovation solves.

The extraction problem is addressed by adapting existing tools originally designed to extract information from general–purpose texts.

The adaptation process introduces important novelties. First, a semi-automatic method is illustrated for the development of a domain-specific training set used to extract the relevant entities, which minimizes the human annotation effort.

Secondly, several learning algorithms and feature configurations were tested to improve the overall accuracy of the information extraction process.

Finally, a method was tested that combines the information extracted from patents with the analysis of social media texts, specifically conceived to extract advantages and drawbacks. This method relies on sentiment analysis of text extracted from social media, under the assumption that terms indicating advantages should generally be positively perceived by people, and the contrary for drawbacks.


List of publications

International Journals

1. Cimino A., Chiarello F., Dell'Orletta F., Fantoni G. (2016) "Automatic Advantages and Drawbacks Extraction From Patents". Scientometrics (under review), Springer.

2. Cimino A., Chiarello F., Dell'Orletta F., Fantoni G. (2016) "Automatic Users Extraction From Patents". World Patent Information (under review), Elsevier.

International Conferences/Workshops with Peer Review

1. Cimino A., Cresci S., Dell'Orletta F., Tesconi M. (2014) "Linguistically-motivated and Lexicon Features for Sentiment Analysis of Italian Tweets". In Proceedings of the 4th Evaluation of NLP and Speech Tools for Italian (EVALITA 2014), 11 December, Pisa, Italy.

2. Dell'Orletta F., Venturi G., Cimino A., Montemagni S. (2014) "T2K2: a System for Automatically Extracting and Organizing Knowledge from Texts". In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), 26-31 May, Reykjavik, Iceland.


Contents

1 Introduction
   1.1 Patent information extraction for marketers and designers
   1.2 Navigating this Dissertation

2 State of the art in patent analysis
   2.1 Approach based on patent metadata
      2.1.1 Patent indicators for the technology life cycle development
      2.1.2 Stochastic technology life cycle analysis using multiple patent indicators
   2.2 Approach based on keywords
      2.2.1 Measures for textual patent similarities: a guided way to select appropriate approaches
      2.2.2 An approach to discovering new technology opportunities: keyword-based patent map approach
      2.2.3 Novelty-focused patent mapping for technology opportunity analysis
   2.3 Approach based on Natural Language Processing
      2.3.1 Searching in Cooperative Patent Classification: comparison between keyword and concept-based search
      2.3.2 A new instrument for technology monitoring: novelty in patents measured by semantic patent analysis
      2.3.3 Identifying technological competition trends for R&D planning using dynamic patent maps: SAO-based content analysis

3 Linguistic annotation modules and Information Extraction tools for automatic patent text analysis
   3.1 Linguistic annotation tools
      3.1.1 Sentence splitter
      3.1.2 Tokenizer
      3.1.3 Part of speech tagger
   3.2 The T2K2 information extraction system
      3.2.1 Linguistic Pre-processing of Corpora
      3.2.2 Extraction of Domain-Specific Knowledge
      3.2.3 Knowledge Organization and Knowledge Graph Construction
      3.2.4 Output Examples
      3.2.5 Ongoing Applications
   3.3 ItaliaNLP Sentiment Classifier
      3.3.1 Lexicons
      3.3.2 Features
      3.3.3 Feature Selection Process
      3.3.4 Results and Discussion

4 Automatic Users Extraction From Patents
   4.1 Users: A key information hidden in patents
      4.1.1 Users from different perspectives
      4.1.2 Computer aided systems for textual patent analysis
   4.2 Approach for automatic users extraction from patents
      4.2.1 List of users generation
      4.2.2 Patent set selection
      4.2.3 Patent text analysis
      4.2.4 Manual review of the new list of users
   4.3 Results: case study
      4.3.1 Patent set
      4.3.2 Experimental Setup
      4.3.3 Output of the experiments and measurements
   4.4 Conclusions

5 Automatic Advantages and Drawbacks Extraction From Patents
   5.1 Advantages and Drawbacks of technologies: A key information hidden in patents
      5.1.1 Advantages and drawbacks from an engineering point of view
      5.1.2 The reasons why advantages and drawbacks are contained in patents
   5.2 Approach for automatic advantages and drawbacks extraction from patents
      5.2.1 Advantages and Drawbacks Clue Collection
      5.2.2 Domain-specific Clue Collection
      5.2.3 Clue validation
      5.2.4 Advantages and Drawbacks sentences extraction
   5.3 Results: case study
      5.3.1 Patent set
      5.3.2 Experimental Setup
      5.3.3 Results: the extracted clues
      5.3.4 Results: the extracted advantages and drawbacks

6 Conclusions

CHAPTER 1

Introduction

Today more than ever, innovation is a critical aspect for both small and big companies. High-tech companies in particular must seriously take into account existing technologies and related market trends in order to survive. Patents represent the legal right to exclude competitors from using an innovation, preventing very similar developments. Patent infringements can in fact lead to very expensive legal actions. A remarkable patent dispute is the one between Samsung and Apple, who have been battling since 2010; in August 2012 Apple won $1 billion in damages.

However, patents can be a powerful source of information to be used not only to prevent infringements but also to predict innovation trends and find interesting hotspots in the state of the art.

Extracting and mining relevant information from big amounts of textual data, such as patents, is a hard challenge, but it allows companies to elaborate better strategies and obtain competitive advantages. Manually performing these tasks is very expensive both in terms of time and human effort. Just to give an example, the average length of a document in the medical device field is 15 pages. Considering a sample of 3,000 patents, more than 45,000 pages need to be manually analyzed. Moreover, such activities require specific expertise in the field of investigation, since every field has its own terminology. In addition, it is very difficult for humans to detect hidden relations among large numbers of documents and to have a macroscopic view of the field.

Advances in hardware and software technologies have opened up new opportunities to solve this problem. Distributed computing and low GPU prices have made it possible and cheap to analyze big data in a reasonable time. On the other hand, by exploiting this hardware potential, existing data mining techniques have been improved and new ones have been developed. In many cases artificial intelligence systems can perform better than humans.

This is why engineering research has shown great interest in the development of automatic patent analysis systems to assist legal and R&D departments. Such systems require information sources related to the analyzed input data. This is the case of metadata such as title, creator, assignee and recommended citations, which are mandatory elements in patents. Although patent metadata are relevant, they do not capture the most valuable information, which is contained in the text. The automatic acquisition and organization of concepts contained in unstructured texts is nowadays possible thanks to advances in Natural Language Processing (NLP) and Information Extraction (IE) [1]. These advances have opened up the possibility of applying NLP-based methods for Information Extraction also to the domain of technical documents.

Information Extraction aims at locating a predefined set of entities in a specific domain, which is determined by the corpus of texts to be analyzed. A classic Information Extraction task for the patent domain is detecting chemical and biological entities, diseases, and trademarks [2]. Another important related task is Relation Extraction, which aims at detecting and classifying relationships between entities identified in text, such as “X uses Y” or “X owns Y”.
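As a toy illustration of the relation extraction task just described, a minimal pattern-based extractor for “X uses Y” / “X owns Y” relations could look as follows. The patterns and the example sentence are invented for this sketch; real systems rely on syntactic parsing and statistical classifiers rather than surface patterns:

```python
import re

# Hypothetical sketch: surface patterns for two relation types between a
# capitalized subject entity and a following object term.
RELATION_PATTERNS = [
    (re.compile(r"(?P<subj>[A-Z][\w-]*)\s+uses\s+(?P<obj>[\w-]+)"), "uses"),
    (re.compile(r"(?P<subj>[A-Z][\w-]*)\s+owns\s+(?P<obj>[\w-]+)"), "owns"),
]

def extract_relations(sentence):
    """Return (subject, relation, object) triples found in a sentence."""
    triples = []
    for pattern, label in RELATION_PATTERNS:
        for match in pattern.finditer(sentence):
            triples.append((match.group("subj"), label, match.group("obj")))
    return triples

print(extract_relations("Samsung uses OLED panels."))
# [('Samsung', 'uses', 'OLED')]
```

A real extractor would also delimit multi-word objects and resolve entity boundaries; the sketch only shows the shape of the task.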

These techniques were successfully applied in state-of-the-art tools for automatic patent analysis [3] [4] [5]. However, these systems analyze patents only from the perspective of extracting information useful for assisting R&D and legal departments: patent attorneys and intellectual property managers are interested in reading patents for legal and intellectual property reasons. Clearly, the focus of these patent readers is not primarily on the new product development area, which is instead the main interest of marketers and designers. The main role of marketers and designers is to make products available on the market in order to satisfy customers' needs, bringing them advantages through the proposed product. In order to achieve this goal they need to study customers' behavioral changes and needs, as well as market trends.

The focus of this dissertation is to propose a novel methodology based on NLP tools which can support marketers and designers by allowing them to thoroughly explore technology fields of interest from their point of view. This methodology can be exploited by high-level systems, such as tailored search engines or patent mappers, for many applications, such as the identification of hotspots or the transfer of marketing ideas between technological sectors.

1.1 Patent information extraction for marketers and designers

Nowadays patents are mostly seen as technology-oriented documents. In [6] the authors consider these documents useful for collecting information about technology and products, in contrast with handbooks and market reports, which also contain market information. On the other hand, in [7] the authors affirm that there exists an increasing variety of readers, such as marketers and designers, who are beginning to take an interest in patent analysis. These works give evidence that, even though the information contained in patents is clearly not only technical and legal, state-of-the-art approaches for patent analysis focus on this information.

This situation is due to two main reasons. First, because patents are produced to disclose and protect an invention, their content is mainly technical and legal. Second, 80% of technical information is not available elsewhere [8] [9], so patents are the most comprehensive resource for technical analysis.

It is a fair assumption that a fraction of all the other kinds of information is likewise not contained elsewhere, or will only later appear in public documents. This is another reason why marketers and designers are increasingly interested in patent analysis.

This interest is also growing because the information that today is contained in patents will in the future be contained in other documents, such as handbooks and market reports, to which marketers and designers are more accustomed. The great advantage of patents is that the information they contain anticipates the availability of products on the market by 6 to 18 months [10].

Unfortunately, four aspects reduce the readers' ability to analyze patents efficiently. First of all, due to the increase in patent publications, there exists a massive information overflow [11]; secondly, analyzing patents requires skilled personnel and a long time [12]; thirdly, the quality of the patent assessment process is decreasing [13] [14] because of the reduced assessment time available to patent examiners; finally, activities like patent hiding, proliferation and bombing contribute to generating confusion and to the loss of time in the research and analysis phases [15]. The new patent readers are afflicted by these problems just like the typical ones, and it makes sense to think that the impact on them is even stronger.

The main difference between the typical and the new patent readers is the information they focus on. Patent attorneys and Intellectual Property managers are interested in reading patents for legal reasons and to orient the Intellectual Property (IP) direction. It is important to underline that analyzing patents is the core of their work, so they are experts in finding the information they need. Furthermore, they can spend a high fraction of their work time on this activity. On the other hand, marketers and designers search for outlier information in patents, such as users' behavioral changes and needs, market trends, designers' vision, R&D trends and competitors' strategies. In addition, they rarely work with patents, so they do not clearly know what to search for or how to search for it. Moreover, they have little time to spend on information search, and they waste a high fraction of this time in understanding the legal jargon used in patents.

Since marketers and designers focus their work on users' behavioral changes and on the advantages that inventions can bring, a specific solution for patent information extraction is required to fully satisfy their specific needs. The presented work proposes a methodology for marketers and designers to explore particular fields of interest.

By limiting the manual effort required of marketers and designers, several potential benefits arise for companies: (a) significant economic savings, since the knowledge extraction process is performed automatically just by analyzing text, and the only effort required of marketers and designers is the creative process; (b) the amount of knowledge acquired by automatic extraction systems is higher than the amount that can be acquired by humans, due to the size of the corpora that automatic systems can analyze; (c) by speeding up the creative process, companies can increase their competitiveness over other companies belonging to similar technology fields.



Figure 1.1: Example of patent mapping in 2-dimensional space. Each patent is represented by a dot labeled with the corresponding identifier.

Figure 1.1 reports an example of a patent map where each patent is projected into a 2-dimensional space according to some relevant aspects characterizing the analyzed patents. In the example it is clear that patents 1, 2, 3, 4 and 5 constitute a cluster, indicating a strong correlation between these patents and consequently a low degree of freedom to operate in the considered technological field. On the other hand, patent 6 forms its own cluster due to its high degree of novelty with respect to the patents belonging to the other cluster.

The relevant aspect in a patent map depends on the input data of the patent mapping system. In order to adapt patent mapping systems for a specific audience, such systems have to be fed with data which is relevant and interesting for the target audience. Patent mappings are constructed by analyzing a vector representation of patents: each vector encodes patent-specific information. The patent mapping systems found in the literature are typically fed with two types of information: (a) information obtained by extracting the most relevant keywords in patents, where each patent is represented by a vector whose components correspond to some numerical value related to a keyword (e.g. frequency, TF-IDF). This approach does not require the use of NLP tools, but it still requires that the user of the patent mapping system has deep knowledge of the patent domain concepts, making it not particularly suitable for marketers and designers; (b) information obtained by extracting relations between entities in the text. This approach requires the application of a battery of NLP tools specifically designed for the extraction task. A common use of NLP tools for extracting relevant information is to identify subject-action-object relations (SAO), structures that encode the subject (S), the action (A) and the object (O) of patent sentences, extracted using syntactic parsers.
SAO structures of a patent encode the key findings of the invention and the expertise of its inventors [3]. Even though SAO structures make it possible to identify relations between key findings in patents, they are still not very useful for the marketer and designer audience, since a deep knowledge of the analyzed patent domain is still required: the key findings are represented by domain-specific terms.
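The keyword-based representation described in (a) can be sketched in a few lines. The toy patents and function names below are invented for illustration and are not the vectors used by any of the cited mapping systems:

```python
import math
from collections import Counter

def tfidf_vectors(patents):
    """Map each patent id to a sparse {term: tf-idf} vector.

    patents: dict mapping patent id -> list of tokens.
    Terms occurring in every patent get zero idf weight.
    """
    n = len(patents)
    doc_freq = Counter()
    for tokens in patents.values():
        doc_freq.update(set(tokens))
    return {
        pid: {term: (count / len(tokens)) * math.log(n / doc_freq[term])
              for term, count in Counter(tokens).items()}
        for pid, tokens in patents.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Patents sharing vocabulary end up close, and would cluster in a map.
docs = {
    "p1": ["ultrasound", "horn", "shield"],
    "p2": ["ultrasound", "horn", "probe"],
    "p3": ["battery", "anode", "cathode"],
}
vecs = tfidf_vectors(docs)
```

With such vectors, a 2-dimensional map like Figure 1.1 can be obtained by projecting the pairwise similarities, for instance with multidimensional scaling.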

In order to give marketers and designers access to these visualization tools, a specific layer of knowledge, tailored to their specific needs, has to be acquired from patents. In this dissertation the problem of identifying relevant information is addressed by taking into account that patents are documents whose aims include providing a detailed public disclosure of an invention [16]. An invention is a new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. When the author of a patent describes the invention, sometimes they also specify the users: entities on which the invention has or will have a positive or negative effect. In addition, the author of a patent sometimes reports the advantages that the invention brings or the issues (drawbacks) that the innovation solves. As an example, let us consider the following sentence2:

The shield serves to protect patient from unwanted contact with portions of the ultrasound horn not being utilized for therapeutic effect.

From the marketer's point of view, neither shield nor ultrasound is an interesting entity. In addition, as said previously, understanding these concepts would require a deep knowledge of the technical domain language, which marketers and designers often lack. More useful and informative for marketers and designers is the patient entity which, in this particular case, is the beneficiary of the invention described in the granted patent.

Another important aspect to be taken into account is that, during the development of a new product or the redesign of an old one, engineers and product designers look for the advantages and drawbacks of the technologies they exploit in their products. The advantages of a technology help designers to position the technology within the product; conversely, drawbacks are fundamental warnings for the design and engineering phases.

Such information is quite hard for designers and engineers to obtain and consolidate. However, strong evidence that such information is valuable for companies is the success of the tools developed to manage it [17] [18]. Companies frequently make use of Quality Functional Deployment and requisite lists, users' needs, and users' requirements for managing advantages [19]. On the other hand, companies use FMEA/FMECA design tools to gather and study drawbacks, failure modes and their effects and causes [20]. However, such data are not publicly disclosed, since they constitute part of the company know-how.

Furthermore, we can consider the advantages and drawbacks of technologies particularly useful information for describing an invention; it is clear that having a description of an invention in terms of its advantages and drawbacks strongly improves the design process. This assumption is evidenced by the fact that descriptions of the advantages and the drawbacks of the invention that the applicant3 wants to patent are also frequently found in patents. This follows directly from the policies and the guidelines on writing patents given by WIPO (World Intellectual Property Organization) [21].

2 Sentence extracted from patent US8376970_B2, belonging to the A61H patent class.
3 The person or company that applies for the patent.

As stated by WIPO, an invention is a solution to a specific technological problem [21]. The problem that an invention solves in a technological field is something negative that state-of-the-art technologies cannot overcome; on the other hand, a solution is the way to solve this problem. A solution can lead to some advantages with respect to the known art. Thus, starting from the definition of invention, it is clear how an invention can be characterized by its advantages and the problems it solves. It is also reasonable to assume that having a clear picture of both the positive side (advantages) and the negative side (drawbacks) of a technology is important for an effective design process. These considerations lead to choosing patents as a textual database from which to extract information about advantages and drawbacks. Consequently, it is worth revising how such concepts are used in various definitions, in particular concerning different patent laws.

Given the importance of the described entities, and since there are no studies on the development of automatic systems for the extraction of this useful information, this work proposes to fill this gap. This work defines the relevant entities which are helpful for the development of new products. These are categorized as: (a) users of an invention; (b) advantages contributed by an invention; (c) drawbacks solved by an invention. Automatically detecting this information and the corresponding relations in patents makes the process of analyzing product and market trends easier for marketers and designers. In order to extract this useful information, this work proposes a novel NLP-based methodology for the automatic extraction of users, advantages and drawbacks and the corresponding relations. The proposed methodology addresses the following needs and issues:

1. Minimization of the human effort required to extract knowledge from patents. There are several methods and algorithms to deal with the entity extraction task, but the most effective are those based on supervised methods. Supervised methods tackle this task by extracting relevant statistics from an annotated training corpus and using a machine learning classification algorithm to learn a statistical model. Once the training phase is completed, the trained model can be employed to extract information from unseen documents. Since patents are very heterogeneous documents, each focusing on a specific field of invention, manually building a training corpus is not particularly effective: a learned statistical model might not be able to identify entities in documents which are lexically distant from the training documents. In addition, manually annotating a corpus from a big collection of documents is very expensive in terms of human effort. For all these reasons an automatic annotation approach is proposed.

2. Maximization of the automatically extracted knowledge. The employment of supervised methods for extracting knowledge from unseen documents does not necessarily guarantee the acquisition of new knowledge. This phenomenon is commonly referred to as overfitting: it happens when a statistical model has "memorized" the training data rather than "learned" to generalize. This problem is even more significant when a learning corpus is generated automatically, since the employed machine learning algorithm can easily learn the automatic annotation method rather than the behavior of the entities in the text. By employing delexicalized models, the proposed methodology allows previously unknown entities to be identified and added to the knowledge base, overcoming the overfitting problem.

3. Independence from the analyzed collection of texts. Recent NLP tools and techniques have proved adequate for the extraction of specific information from texts [22]. Unfortunately, NLP tools suffer a dramatic drop in accuracy when tested on domain-specific texts [23] such as patents. The proposed approach tackles this problem by resorting to recent advances in the distributional semantics research area, which studies methods for categorizing semantic similarities between words based on their distributional properties in large corpora. By exploiting semantic similarities between words, machine learning algorithms can be better adapted to linguistic domains which are very different from the training ones.
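A toy sketch of the idea behind point 3: semantic similarity between words is measured as the cosine similarity of their distributional vectors, so a word unseen in the training domain can be related to known ones. The three-dimensional vectors below are made up for illustration; real embeddings have hundreds of dimensions and are induced from large corpora:

```python
import math

# Invented toy embeddings; in practice these come from distributional models.
EMBEDDINGS = {
    "physician": [0.9, 0.1, 0.0],
    "patient":   [0.8, 0.3, 0.1],
    "shield":    [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word_vector, vocabulary):
    """Known word whose embedding is most similar to word_vector."""
    return max(vocabulary, key=lambda w: cosine(word_vector, vocabulary[w]))

# A domain word unseen at training time, whose (assumed) vector lies
# close to "patient", inherits its neighbour's behavior:
print(nearest([0.85, 0.25, 0.05], EMBEDDINGS))  # patient
```

In this way a classifier that has seen "patient" in training can treat a lexically unknown but distributionally similar word in the same way.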

In order to solve the discussed issues, several layers of linguistic information are needed. To capture this information, state-of-the-art automatic linguistic annotation tools are used in this thesis. These tools are composed of systems able to structure and enrich raw text with linguistic information. Even though these tools perform well on general-purpose text, they suffer a drop in accuracy when tested on technical documents. This is because they were trained on journalistic texts, which have different characteristics from technical domain documents. To improve the performance of these tools, specific manual training data provided by domain experts were added to the original training sets of the sentence splitter module and the part-of-speech tagger module. In addition, several tokenization rules were added (e.g. for units of measurement and chemical formulas) to improve the tokenization process.
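The kind of tokenization rule mentioned above can be illustrated with a small regex-based sketch. The unit list and formula pattern are simplified examples, not the actual rules added to the pipeline:

```python
import re

# Order matters: try the domain-specific patterns before the generic ones.
TOKEN_RE = re.compile(
    r"\d+(?:\.\d+)?\s?(?:mm|cm|kHz|MHz|V|mA)"  # measurement, e.g. "3.5 MHz"
    r"|(?:[A-Z][a-z]?\d*){2,}"                  # simple chemical formula, e.g. "H2O"
    r"|\w+|[^\w\s]"                             # fallback: words and punctuation
)

def tokenize(text):
    """Split text into tokens, keeping measurements and formulas whole."""
    return TOKEN_RE.findall(text)

print(tokenize("The 3.5 MHz transducer uses H2O cooling."))
# ['The', '3.5 MHz', 'transducer', 'uses', 'H2O', 'cooling', '.']
```

Without the first two alternatives, a generic tokenizer would split "3.5 MHz" and "H2O" into several meaningless fragments.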

The information extracted by the linguistic annotation tools is used by information extraction tools. To tackle the problem of extracting useful information for marketers and designers, this thesis employs an information extraction system (Text-To-Knowledge, T2K2) that was adapted for this work to extract the relevant entities for marketers and designers.

T2K2 is a suite of tools for automatically extracting domain-specific knowledge from collections of Italian and English texts. It relies on statistical text analysis and machine learning algorithms which are dynamically integrated to provide an accurate and incremental representation of the content of vast repositories of unstructured documents. T2K2 includes a multi-lingual information extraction module that extracts Named Entities. This module is a statistical classifier that assigns a named entity tag to a token or a sequence of tokens, covering four types of named entities: Persons, Locations, Organizations and Miscellaneous entities, i.e. entities that do not belong to the previous three groups. T2K2 also includes a relation extraction module which detects relations between entities or domain-specific terms in order to build a knowledge graph that can be visualized and mined. Currently, two different types of relations can be extracted in T2K2: co-occurrence and similarity relations.

Even though the T2K2 system provides advanced information and relation extraction capabilities, it is designed as a general-purpose information extraction tool. In particular, its Named Entity Tagger module is trained on classes of entities and on texts typically found in newspapers, and is thus out of scope for users, advantages and drawbacks extraction. For this reason, specialized users, advantages and drawbacks extraction processes were designed and integrated into the T2K2 system.


These designed processes bring three important novelties concerning information extraction in the patent domain:

1. A semi-automatic method for the development of a training corpus used to collect statistics in the classifier's training phase. The proposed method minimizes the human annotation effort while, at the same time, achieving good performance in the automatic tagging of unseen patent sentences.

2. The employment of different learning algorithms and different configurations of the user classifier. By merging the results obtained using different models and learning algorithms, it is shown that the amount of new knowledge extracted exceeds that obtained using single models.

3. An automatic validation method for the extracted advantages and drawbacks. Social media platforms provide powerful venues for consumers to interact not only with brands but also with other consumers as they engage in the processes of curation, creation, and collaboration [24]. Such virtual platforms are places where users discuss products, their great features, but also the problems and failures they have experienced with them. The way they discuss or describe products or services is often unambiguous and highly polarized. By employing a sentiment polarity classifier on social media texts regarding the extracted entities, an average polarity score is calculated for each entity. This score is used to validate the correctness of the extracted entities.
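A minimal sketch of that validation step follows. The keyword lists are a stand-in for the actual sentiment classifier described in Chapter 3, and the posts are invented:

```python
# Tiny stand-in polarity lexicons (illustrative only).
POSITIVE = {"great", "love", "reliable"}
NEGATIVE = {"broken", "hate", "noisy"}

def polarity(text):
    """Crude polarity score: positive minus negative keyword counts."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def average_polarity(entity, posts):
    """Mean polarity over posts mentioning the entity; 0.0 if none do."""
    scores = [polarity(p) for p in posts if entity in p.lower()]
    return sum(scores) / len(scores) if scores else 0.0

posts = ["I love how reliable the battery is", "the fan is noisy and broken"]
print(average_polarity("battery", posts))  # 2.0
print(average_polarity("fan", posts))      # -2.0
```

Under the assumption stated above, a term extracted as an advantage would be expected to have a positive average score, and a drawback a negative one.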

Finally, the T2K2 relation extraction module is employed on the extracted entities, and navigating the graph built by T2K2 allows the knowledge acquired from the analyzed texts to be easily mined. Mining this graph offers many benefits to marketers and designers. For example, missing links between entities in the graph may indicate a relation (e.g. between a user and an advantage) that was not properly taken into account. Patents that refer to these relations can later be easily retrieved using the T2K2 indexing module, making T2K2 an all-in-one mining solution for marketers and designers.
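Enumerating missing links in an entity graph amounts to listing node pairs with no extracted relation. A minimal sketch follows; the entity names and the toy graph are hypothetical, not output of the T2K2 system:

```python
from itertools import combinations

def missing_links(nodes, edges):
    """Return entity pairs with no extracted relation between them:
    candidate gaps for marketers and designers to inspect."""
    present = {frozenset(e) for e in edges}
    return [(a, b) for a, b in combinations(sorted(nodes), 2)
            if frozenset((a, b)) not in present]

# Toy graph: one user and two advantages extracted from a patent set
nodes = {"nurse", "comfort", "low weight"}
edges = [("nurse", "comfort")]
gaps = missing_links(nodes, edges)  # pairs worth investigating
```

Pairs returned here, such as (nurse, low weight), are relations the extraction never observed and may point to an overlooked design opportunity.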

1.2 Navigating this Dissertation

The following list summarizes the chapters in this dissertation to help guide the reader through this document:

• Chapter 2 describes existing work on patent information extraction and on visualization tools using different approaches for extracting data from patents. In addition, this chapter discusses how the research presented in this dissertation relates to previous works.

• Chapter 3 presents a detailed description of the tools used and adapted for this work. First, the linguistic annotation pipeline is presented. Subsequently, a detailed description of the T2K2 information extraction system and of its functionalities is reported. Finally, a detailed description of the tweet sentiment analyzer used for the advantages and drawbacks validation process is presented. A part of the results in this chapter was published in: Dell'Orletta F., Venturi G., Cimino A., Montemagni S. (2014) "T2K2: a System for Automatically Extracting and Organizing Knowledge from Texts". In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), 26-31 May, Reykjavik, Iceland; and Cimino A., Cresci S., Dell'Orletta F., Tesconi M. (2014) "Linguistically-motivated and Lexicon Features for Sentiment Analysis of Italian Tweets". In Proceedings of the 4th Evaluation of NLP and Speech Tools for Italian (EVALITA 2014), 11 December, Pisa, Italy.

• Chapter 4 presents a detailed description of the user extraction process. First, the user entity is introduced by explaining its importance in the marketing field. Then the user extraction process is described, focusing on the machine learning techniques used and on the automatic annotation procedure of the analyzed patent sets. Finally, a case study of the process for automatic user extraction from patents is conducted on two real patent sets, showing the suitability of the proposed approach. A part of the results presented in this chapter was submitted in: Cimino A., Chiarello F., Dell'Orletta F., Fantoni G. (2016) "Automatic Users Extraction From Patents". World Patent Information, Elsevier.

• Chapter 5 presents the advantages and drawbacks extraction process. First, a definition and an explanation of the importance of these entities is provided. Then a machine learning based extraction process and an automatic validation process based on the sentiment analysis of social media texts are described. Moreover, a method to build a taxonomy from the extracted entities and to graphically represent them is reported. Finally, a case study of the overall advantages and drawbacks extraction process is conducted on several patent sets. A part of the results presented in this chapter was submitted in: Cimino A., Chiarello F., Dell'Orletta F., Fantoni G. (2016) "Automatic Advantages and Drawbacks Extraction From Patents". Scientometrics, Springer.

• Chapter 6 presents a summary of conclusions from this dissertation research, recounts the list of scientific contributions of the research, and finally discusses potential directions for future research.


CHAPTER 2

State of the art in patent analysis

In the present chapter we will introduce state-of-the-art approaches for automatic patent analysis. The main approaches in the literature can be classified into:

• Approach based on patent metadata, in which sources of information embedded in patents, such as claims structure or bibliographic information, are considered. This approach is described in section 2.1.

• Approach based on keywords, in which vectors based on keywords found in patents are calculated. Each vector characterizes a patent of the analyzed patent set collection. This approach is described in section 2.2.

• Approach based on Natural Language Processing, which exploits structured data computed by NLP tools, such as part of speech tags and syntactic structures of phrases. This approach is described in section 2.3.

Each approach captures different types of information from patents and builds a knowledge base which can be exploited by patent analysis tools. For this reason, the right approach for developing a patent analysis system depends on the task to be solved, on the information to be analyzed and on the computational resources available to solve the task. Choosing a good trade-off between these factors is a strict requirement, particularly when analyzing big patent sets.

All the patent analysis tools based on these approaches share a common workflow (Fig. 2.1) that includes: (a) patent retrieval from a database; (b) transformation of patents from unstructured to structured data; (c) information extraction from the structured data; (d) data mining on the extracted information for the purpose of the patent tool. Steps (b) and (c) are approach dependent: the data analyzed in the subsequent steps of the workflow depend on the approach adopted to transform the unstructured patent data into structured data.



A particular focus of this thesis is improving the approach-dependent steps. The state of the art for this specific part of the workflow is therefore analyzed in the following sections.

Figure 2.1: Common workflow used by patent analysis tools. The diagram shows the flow from the patent database through patent retrieval, patent structured data extraction, information extraction from structured data, and analysis of the extracted data, feeding patent analysis tools such as trend analysis, novelty detection and patent mapping; the extraction steps are approach dependent.

2.1 Approach based on patent metadata

The approaches not relying on patent text analysis exploit mainly three types of information: (a) bibliometric information; (b) patent structure information; (c) patent review process information.

Statistical analysis and citation analysis allow performing statistical calculations on the chosen patent set and analyzing the development and distribution of patented technology. In this approach both patent and non-patent literature are considered: patents with a high number of citations usually indicate a strong correlation with the foundation of a technology. Patent structure (e.g. number of claims, number of dependent claims) and examination periods are also taken into account in the patent analysis literature [25], since they provide important information about a technology, such as its current maturity level and growth/maturity trends across different time frames.

2.1.1 Patent indicators for the technology life cycle development

The technology life cycle stage is an important aspect for those who decide to invest. Since the life cycle of a product is clearly exposed by the evolution of patent grants and their correlations [26], research has investigated patent indices that can be considered appropriate life cycle stage indicators. In this work [25], Haupt et al. conducted an investigation in this direction with the main aim of detecting three different technology life cycle stages: introduction, growth and maturity.

The authors took into account that several studies have shown that an S-shaped evolution of the number of patent applications, or even a double-S shape, is typical [27]. Consequently, the authors considered a patent activity index an appropriate life cycle indicator only if its mean value differs significantly between the life cycle stages.

In this work, by considering just bibliometric information, the following hypotheses, each tied to a patent index, were tested:

1. Backward literature citations increase significantly only at the transition from introduction to growth;

2. Backward patent citations increase significantly at both stage transitions;

3. The number of forward citations decreases significantly at the transition from introduction to growth;

4. The number of dependent claims is significantly higher at later technology life cycle stages than in earlier ones;

5. The number of priorities referred to in a patent application is significantly higher at later technology life cycle stages than in earlier ones;

6. Examination processes take longer in the phases of introduction and maturity than at the growth stage.

The test was conducted by examining patent applications in pacemaker technology. The authors identified 387 granted patents (48 during the introduction phase, 282 during growth, 57 during maturity), and conducted analyses of variance, including Scheffé tests, over the values of the defined indicators on these patents. The analyses showed that almost every index is able to identify stage transitions in the life cycle of pacemaker technology. Only forward citations were not identified as an appropriate indicator for the recent evolution of the technology.

2.1.2 Stochastic technology life cycle analysis using multiple patent indicators

In this work [28] Lee et al. introduced a novel approach for automatic technology life cycle analysis. Previous methods [25] rely on curve fitting techniques to observe technological performance changes over time with predefined indicators. The main limit of these methods is the need for assumptions about the shapes of the stage curves. For this reason, the authors introduced an unsupervised method able to automatically detect the number of life cycle stages and the transition times of the technology of interest.

The authors took into account seven time-series patent indicators for their research: (1) patent activity, which models the evolution of filing patterns; increasing and decreasing patterns signal changes in research and development activity; (2) the number of technology developers in the analyzed time series: it has been shown that a great number of competitors enter in the initial stages of a technology's life cycle, but this number decreases in the maturity stage [29]; (3) the number of different patent application areas in the considered time series: this is an important indicator since it has been shown that the number of technology application areas is small in the first life cycle stages and increases in the later ones [30]; (4) the number of backward citations: it has been shown that patents with a high number of backward citations have less relevance than patents with a lower number of citations [31]; (5) the number of forward citations, which expresses the technological value of a patent in the analyzed temporal period [32]; (6) the duration of examination processes, measured as the average time between the filing and granting dates: studies [25] have shown that this duration is related to the novelty of the patented technology; (7) the number of claims in the patent: the more claims a patent reports, the higher the correlation with novelty and financial value [33].

The authors employed the Hidden Markov Model (HMM) algorithm and the Bayesian Information Criterion (BIC) on the patent indicators extracted from a specific patent set to detect the number of technology life cycle stages that best fits the considered data. In addition, the procedure yields a state transition probability matrix, an observation probability matrix and an initial state probability vector.

The authors analyzed patents belonging to 19 technologies concerning molecular amplification diagnosis. Applying the proposed method, the system detected 4 life cycle stages (using the BIC criterion) modelling the progress of this technology. In addition, by applying a hierarchical clustering algorithm over the extracted time series, technologies with similar life cycles were grouped together.
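The BIC-based selection of the number of HMM states can be illustrated with a minimal sketch. The log-likelihood values below are hypothetical, not the data of the cited study, and the parameter count assumes a fully connected discrete HMM:

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: lower values indicate a better
    trade-off between goodness of fit and model complexity."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def hmm_free_params(n_states, n_symbols=7):
    # transition matrix + emission matrix + initial distribution;
    # each row of a stochastic matrix loses one free parameter
    return n_states * (n_states - 1) + n_states * (n_symbols - 1) + (n_states - 1)

# Hypothetical fitted log-likelihoods for HMMs with 2..5 states
log_liks = {2: -500.0, 3: -460.0, 4: -420.0, 5: -400.0}
n_obs = 120
best_n_states = min(log_liks,
                    key=lambda s: bic(log_liks[s], hmm_free_params(s), n_obs))
```

With these numbers the likelihood gain from 4 to 5 states is too small to justify the extra parameters, so the 4-state model minimizes BIC.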

2.2 Approach based on keywords

In the keyword-based approach each patent is expressed as a fixed-length vector where each component is a numerical value expressing the importance of a specific keyword (e.g. frequency in the document, tf-idf). The keywords to be taken into account depend on the purpose of the research and on the patent set analyzed. In this approach keywords can be selected with three different criteria: (a) automatically extracted by a text mining module; (b) manually selected by domain experts; (c) a combination of the first two criteria, where domain experts judge the relevance and the quality of the extracted keywords in order to limit the results to the most important keywords. Once keyword vectors are obtained, patent similarity can be easily computed using standard distance measures like cosine similarity. In addition, keyword extraction allows defining more complex patent similarity measures [34] that can be exploited for the development of patent analysis tools [35], [36] such as mappers or patent search engines. The following subsections report some state-of-the-art works concerning keyword-based similarity measures between patents. Moreover, two patent mapping systems that make use of the keyword-based approach are described.
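The keyword-vector comparison just described can be sketched with tf-idf weighting and cosine similarity. This is a minimal illustration: the toy "patents" are plain token lists, whereas a real system would use curated keywords:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> list of {term: tf-idf weight} dicts."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse keyword vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical patents reduced to keyword lists
docs = [["battery", "cell", "charge"],
        ["battery", "cell", "anode"],
        ["gear", "shaft", "torque"]]
vecs = tfidf_vectors(docs)
```

Here the two battery-related patents score a positive similarity, while patents with disjoint keyword sets score zero.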



2.2.1 Measures for textual patent similarities: a guided way to select appropriate approaches

In this work [34] Moehrle conducted a study on textual patent similarity measures. Since each patent-related application (e.g. patent mapping, patent search) has its own requirements concerning the concept of patent similarity, a definition of similarity for each application was proposed.

The similarity definitions are based on patent "concepts", which are the textual elements found in patents. Concepts are classified as solitary or combined. Solitary concepts are singular terms, sometimes augmented with important attributes, while combined concepts can be multi-word terms or phrases.

The proposed model is based on set theory where the set elements are the concepts extracted from the analyzed patents. Given two patents i and j, the following subsets are considered: (a) concepts that can be extracted from patent i, but not from patent j; (b) concepts that can be extracted from patent j, but not from patent i; (c) concepts that can be extracted from both patents i and j.

Furthermore, counting variable definitions are introduced. These variables are the numeric values, extracted from the previously defined subsets, on which the similarity of patents is calculated: (a) ci is the number of concepts that can be extracted from patent i; (b) cj is the number of concepts that can be extracted from patent j; (c) ci(j) is the number of concepts in i found also in j; (d) cj(i) is the number of concepts in j found also in i. Since ci(j) and cj(i) can differ depending on how they are calculated, (e) the variables cij and cji are introduced.

In addition, measurement definitions for the variables are provided by the author. Some examples are: (a) complete linkage, where concepts are used without modifications and a link between identical concepts of patents i and j is established; (b) reduced linkage, similar to (a), except that the same concept found multiple times in a single patent is counted just once; (c) wedding linkage, where concepts are used without modifications and only mutually exclusive links between identical concepts found in patents i and j are established.

By adapting standard set similarity coefficients found in the literature (i.e. Jaccard, Sørensen, cosine), patent similarity can be computed from the previously defined measurements.
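Under the reduced-linkage convention (each concept counted once per patent), the Jaccard coefficient becomes a plain set operation. A sketch with hypothetical single-word concepts follows; a real system would extract solitary and combined (multi-word) concepts with an NLP pipeline:

```python
def concepts(patent_text):
    """Toy concept extraction: lowercase unigram terms, each counted
    once per patent (reduced linkage)."""
    return set(patent_text.lower().split())

def jaccard(c_i, c_j):
    """|shared concepts| / |all concepts| for patents i and j."""
    union = c_i | c_j
    return len(c_i & c_j) / len(union) if union else 0.0

s = jaccard(concepts("rotor blade assembly for turbine"),
            concepts("turbine rotor blade with cooling channel"))
```

The two toy patents share three of eight distinct terms, giving a Jaccard similarity of 0.375.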

2.2.2 An approach to discovering new technology opportunities: keyword-based patent map approach

In this work [35] Lee et al. developed a system for building keyword-based patent maps to be used for new technology creation activities. The system is composed of a text mining module, a patent mapping module and a patent vacancy identification module. Once a specific technology field is selected for analysis and a related patent set is extracted, the modules of the system are executed sequentially.

The text mining module automatically identifies relevant keywords in each patent of the considered patent set. Once all the keywords are extracted, only the ones with the highest relevance are selected for further screening by domain experts. The final set of keywords resulting from the screening process is then used to build the patent keyword vectors on the considered patent set. Specifically, each component of the patent vector holds the frequency of the corresponding keyword in the considered patent.

Once all the keyword vectors are computed, the patent mapping module is executed to generate the patent map. The mapping is calculated by executing the Principal Component Analysis (PCA) [37] algorithm on all the vectors. The PCA method maps the n-dimensional vectors onto a rectangular planar surface in order to generate the patent map. Intuitively, this method finds the most meaningful 2-dimensional projection, filtering out the correlated components of the n-dimensional vectors. Applying this method to the patent keyword vectors yields a meaningful patent mapping, in which each patent is mapped onto a 2-dimensional surface.

Once the patent map is computed, a vacancy detection module is executed on it. The vacancy detection module identifies sparse areas which can be considered good candidates for research investigation. For each interesting vacancy, a list of related patents is obtained by selecting the ones located on the region boundaries. For each patent in the list, a set of information is computed to capture its importance within the list. Features considered strong indicators of the relevance of each patent are the number of citations [38] and the average number of citations by patents in the list. Finally, emerging and declining keywords are computed through a time series analysis of the considered keywords in the patent list. This allows identifying promising technology trends that can be considered for further investigation.
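The emerging/declining keyword analysis can be approximated with a least-squares slope over yearly keyword frequencies. This is a simplification of the time-series analysis described above, and the counts below are hypothetical:

```python
def trend_slope(yearly_counts):
    """Least-squares slope of a keyword's yearly frequency:
    positive -> emerging keyword, negative -> declining."""
    n = len(yearly_counts)
    mx = (n - 1) / 2.0                    # mean of the year indices 0..n-1
    my = sum(yearly_counts) / n           # mean frequency
    num = sum((x - mx) * (y - my) for x, y in enumerate(yearly_counts))
    den = sum((x - mx) ** 2 for x in range(n))
    return num / den

emerging = trend_slope([1, 3, 4, 9])    # counts over four consecutive years
declining = trend_slope([9, 4, 3, 1])
```

A keyword whose slope is clearly positive is a candidate emerging topic; a clearly negative slope suggests a declining one.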

2.2.3 Novelty-focused patent mapping for technology opportunity analysis

In this work [36] Lee et al. proposed a method for discovering potential technology opportunities by detecting patent vacancies and analyzing the behavior of novel patents in the generated patent mapping. More precisely, for each novel patent some quantitative measures are computed to estimate the patent's influence in terms of novelty impact.

As in [35], the approach is composed of several dependent steps. In the first step the patent set is collected according to the research purpose. Structured (e.g. citations) and unstructured data (e.g. the text) are collected from the analyzed patents.

In the second step, exploiting the data collected in the previous step, the morphological patent contexts [39], [40] are built. Morphological patent contexts are defined to exhibit the properties of a technology or a system. Generally, the morphology matrix is constructed by breaking down a system into several mutually exclusive dimensions and shapes. Dimensions and related shapes are chosen with respect to the extracted keywords obtained by applying text mining techniques and considering expert judgments. Each morphological patent context consists of three parts: the issue date, the patent number and the keyword vector, referring to the dimensions and shapes previously defined.

In the third step the local outlier factor (LOF) [41] of each patent is computed by exploiting the morphological contexts collected in the second step. The LOF expresses the property of being an outlier (the degree of novelty, in this particular case) in an n-dimensional space. LOF values are computed for each patent, year by year, to measure the novelty degree at a specific time. In order to standardize LOF values over the years, a kernel density estimation function is applied to the dataset.
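A simplified pure-Python sketch of the LOF computation follows. It uses the standard k-nearest-neighbour formulation; the morphological context vectors are replaced here by toy 2-D points:

```python
import math

def kdist_and_neighbors(points, i, k):
    """k-distance of point i and its k-distance neighbourhood."""
    dists = sorted((math.dist(points[i], p), j)
                   for j, p in enumerate(points) if j != i)
    kdist = dists[k - 1][0]
    return kdist, [j for d, j in dists if d <= kdist]

def lof(points, i, k=2):
    """Local Outlier Factor: values well above 1 suggest an outlier."""
    def lrd(idx):  # local reachability density
        _, neigh = kdist_and_neighbors(points, idx, k)
        reach = [max(kdist_and_neighbors(points, j, k)[0],
                     math.dist(points[idx], points[j])) for j in neigh]
        return len(neigh) / sum(reach)
    _, neigh = kdist_and_neighbors(points, i, k)
    return sum(lrd(j) for j in neigh) / (len(neigh) * lrd(i))

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]  # last point is "novel"
```

The isolated point receives a LOF far above 1, while points inside the dense cluster stay close to 1, mirroring how novel patents stand out from the morphological context cloud.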



In the last step the novelty-focused patent identification map is built by projecting the selected novel patents onto a two-dimensional space according to the number of citations (dimension 1) and the number of claims (dimension 2). It has been shown that more frequently cited patents have higher technological and economic impacts [42]. The number of claims has also been found to affect profitability and value, since a larger number of claims lowers the probability that others may imitate the patent.

2.3 Approach based on Natural Language Processing

In the Natural Language Processing approach, state-of-the-art tools developed by the NLP community are used to extract semantic information that simple keywords cannot identify. The NLP approach usually involves the execution of a software pipeline composed of dependent steps used to extract layers of linguistic information suited to tackling a specific information extraction task. A standard NLP pipeline suited for information extraction tasks is reported in Figure 2.2 and includes the following steps:

• Sentence Splitting and Tokenization: these steps split the raw text into sentences and then segment each sentence in orthographic units called tokens;

• Part Of Speech Tagging: is the step in which unambiguous grammatical categories are assigned to tokens;

• Syntactic Parsing: is the step which computes the parse tree of sentences and the syntactic relations between tokens in a sentence.
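The first two steps above can be sketched with simple regular expressions. This is a naive baseline for illustration only; production tools handle abbreviations, decimals and domain-specific tokens:

```python
import re

def split_sentences(text):
    """Naive splitter: break on ., ! or ? followed by whitespace
    and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

def tokenize(sentence):
    """Split a sentence into word tokens and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sents = split_sentences("The device comprises a rotor. It reduces vibration.")
```

Each downstream module (part of speech tagging, parsing) then operates on these token sequences.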

Figure 2.2: A standard NLP pipeline for information extraction tasks. Raw text flows through sentence splitting, tokenization, part of speech tagging and syntactic parsing; the parsed output feeds term extraction, entity extraction and relation extraction.


By exploiting the information obtained by these steps, several information extraction tasks can be solved by other NLP tools such as:

• Term extraction: the task of automatically extracting relevant terms from a given corpus. Part of speech tags are typically used by term extractors to narrow the search to predefined term structures;

• Named entity recognition: the task of automatically identifying and classifying named entities in text, such as persons, organizations and locations. Named entity recognizers usually use part of speech tags to disambiguate the morphosyntactic role of tokens in a phrase, improving extraction performance;

• Relation extraction: the task of automatically building relations among entities in the analyzed text. In this context, entities can be named entities or extracted terms. In addition, the syntactic role of the entities can be exploited to better categorize the relation type (e.g. subject, object).

Technical domain language, like other linguistic domains, suffers from linguistic ambiguities. For instance, the word "support" can have two totally different meanings when used as a noun or as a verb. By using part of speech taggers, which disambiguate the morphological role of each word in a sentence, more precise information extraction is possible and can be used in several applications (e.g. patent search engines). In addition, part of speech taggers allow performing textual lemmatization, which can further improve the performance of automatic patent analysis tools. Another key NLP tool used by several automatic patent analysis systems is the syntactic parser: by identifying the syntactic role of each word in document sentences, several patent analysis applications become possible. A common use of syntactic parsers in automatic patent analysis tools is the extraction of Subject-Action-Object (SAO) structures. Each SAO structure represents the subject (S), the action (A) and the object (O) of a patent sentence. By automatically extracting SAO structures from patents, relationships between key technological components can be easily represented. The following subsections present a brief survey of state-of-the-art systems using this approach.
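Given a dependency parse, SAO extraction amounts to matching subject and object relations attached to the same verb. A sketch over (head, relation, dependent) triples follows; the relation labels mirror common dependency schemes, and the simplification of one subject and one object per verb is an assumption of this example:

```python
def extract_sao(triples):
    """triples: (head, relation, dependent) tuples from a dependency parser.
    Returns (subject, action, object) for each verb that has both a
    nominal subject and a direct object (one of each assumed)."""
    subjects = {h: d for h, r, d in triples if r in ("nsubj", "subj")}
    objects = {h: d for h, r, d in triples if r in ("dobj", "obj")}
    return [(subjects[v], v, objects[v]) for v in subjects if v in objects]

# Parse of "The device comprises a rechargeable battery" (simplified)
parse = [("comprises", "nsubj", "device"),
         ("comprises", "dobj", "battery"),
         ("battery", "amod", "rechargeable")]
saos = extract_sao(parse)
```

The extracted triple (device, comprises, battery) is exactly the kind of component relationship the SAO-based systems described below operate on.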

2.3.1 Searching in Cooperative Patent Classification: comparison between keyword and concept-based search

In [5] Montecchi et al. tackled the problem of identifying the most relevant Patent Classes (PCs) for a specific query. Most patent search strategies use PCs as filters for query results, therefore the selection of relevant PCs is often a primary and crucial activity. This task is considered particularly challenging and only few tools have been specifically developed for this purpose. The most efficient tools are provided by the EPO (European Patent Office) and WIPO (World Intellectual Property Organization) patent offices. The paper analyzes their PC search strategies (mainly based on keyword-based engines) in order to identify the main limitations in terms of missing relevant PCs (recall) and non-relevant results (precision).

The system developed by the authors used KOM (Knowledge Organizing Module), a concept-based patent search engine that can extract any patent classification contained in patent documents, such as IPC (International Patent Classification), ECLA (European Classification System), FI (File Index) and USPC (United States Patent Classification). The module is composed of four steps: (a) query semantic expansion, (b) query execution, (c) tagging, (d) parsing.

The semantic expansion submodule expands the initial user query with terms that express the same concepts as the input query. The aim of this step is to increase the search recall, sacrificing precision, which will be improved by the techniques applied in the following steps. The queries are expanded exploiting lists of synonyms, correlated terms, morphological variants and syntactic variants.

The tagger submodule processes the list of patents obtained in the previous steps in order to detect the different grammatical categories of words. Patents containing keywords which do not belong to predefined grammatical categories are discarded, improving the overall precision of the final result.

Finally, the parser module processes the previously filtered patents to obtain the syntactic analysis of the text. By recognizing the role of each word and the corresponding relationships, patents without any relationship between the query keywords are discarded. Relationships considered strong indicators of patent relevance with respect to the input query are subject-verb and subject-verb-object relations.

The authors conducted a case study showing that their system performed better than the ones used at WIPO and EPO. In particular, they showed that irrelevant classes detected by the WIPO and EPO systems were not extracted by their system.

2.3.2 A new instrument for technology monitoring: novelty in patents measured by semantic patent analysis

In this work [4] Gerken et al. focused on measuring the novelty of patents to assess technologies. The authors took into account existing works that focus on the same task, but exploiting just patent citations and patent structure [43], [44]. The method proposed mixes the previous approaches based on bibliographic citations and the information obtained by analyzing the patent text with NLP tools.

The method consists of four successive steps: (a) semantic structures identification; (b) linguistic analysis for structures refinement; (c) patent similarity measurement; (d) calculation of patent novelties.

In the first step, once the research patent set is collected, the semantic structures are extracted by the Knowledgist™ commercial tool developed by Invention Machine.

In the second step, a linguistic analysis of the patent set is performed. This analysis employs filtering modules in order to mitigate the synonymy and stopword problems. The synonymy problem arises from the fact that different words may represent the same concept in a specific domain. On the other side, the stopword problem is that extremely common words contribute poorly to evaluating the similarity of the analyzed patents [45]. For this reason both generic and domain-specific stopwords are filtered out.
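The synonym and stopword filtering can be sketched as a normalization pass over the extracted structures. The stopword set and synonym map below stand in for the domain resources the method assumes:

```python
def normalize(structures, stopwords, synonyms):
    """Map synonym variants to a canonical term and drop stopwords;
    structures that become empty are discarded."""
    out = []
    for struct in structures:
        terms = tuple(synonyms.get(t, t) for t in struct if t not in stopwords)
        if terms:
            out.append(terms)
    return out

cleaned = normalize([("the", "automobile", "engine"), ("the", "said")],
                    stopwords={"the", "said"},
                    synonyms={"automobile": "car"})
```

After this pass, "automobile engine" and "car engine" map to the same structure, so the similarity measurement in the next step is not diluted by surface variation.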

In the third step the similarity measurement is carried out. The similarity measurement is performed to highlight the aspects of the patent's invention that were already described in an older patent. This measurement is calculated as the ratio between the semantic structures of the newer patent that are also found in the older patent and the total number of semantic structures of the newer patent.

Finally, in the fourth step, the novelty of each patent is calculated. This is expressed in the paper as:

Ni = 1 − max(si(n)) ∀n < i

where Ni is the novelty of patent i and si(n) is the similarity of patent i to each patent n filed before i.

The approach was tested on two different patent sets in the mechanical field, and the results provided by the authors showed that the novelty calculated by means of semantic patent analysis correlated strongly with the experts' judgments.
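The similarity ratio and the novelty formula above can be sketched directly. Semantic structures are represented here as plain sets of strings, and the example patents are hypothetical:

```python
def similarity(newer, older):
    """Share of the newer patent's semantic structures already present
    in an older patent: s_i(n)."""
    newer, older = set(newer), set(older)
    return len(newer & older) / len(newer) if newer else 0.0

def novelty(patents, i):
    """N_i = 1 - max over n < i of s_i(n); patents ordered by filing date.
    The first patent has no predecessors, so its novelty is 1."""
    sims = [similarity(patents[i], patents[n]) for n in range(i)]
    return 1.0 - max(sims, default=0.0)

patents = [{"gear-transmits-torque", "shaft-supports-gear"},
           {"gear-transmits-torque", "bearing-reduces-friction"},
           {"sensor-detects-vibration"}]
```

The second patent shares half of its structures with the first, so its novelty is 0.5, while the third, sharing nothing with its predecessors, scores 1.0.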

2.3.3 Identifying technological competition trends for R&D planning using dynamic patent maps: SAO-based content analysis

In this work [3] Yoon et al. developed a system for dynamic patent mapping based on subject-action-object (SAO) structures. The SAO structures are extracted from syntactically parsed sentences exploiting syntactic parsers. In addition, the system combines patent bibliographic information and clustering techniques to generate the resulting patent maps. The main goal of this work was to provide R&D with valuable input for the development strategies of a company.

Identifying “patent vacuums” or “technological hot spots” is a great source of information for checking areas in which patents have not been granted and for overlaps with other technological competitors. This paper differs from other works [46], [47], [48] since the developed system constructs SAO-based patent maps from a “dynamic” perspective and presents practical analyses of the dynamic maps to address several research and development aspects.

The system is composed of 4 steps: (a) patent data collection; (b) SAO structure extraction; (c) dissimilarity matrix generation; (d) dynamic patent map generation. After a specific patent set is collected depending on the conducted research (first step), in the second step SAO structures are collected from patents by extracting the syntactic structures obtained from the output of the Stanford [49] and MiniPar [50] parsers. In a first stage SAO structures are filtered using a set of stopwords, and subsequently by human experts. In the third step, the dissimilarity matrix between patents is built: the SAO structures extracted from the patents are compared for each patent pair. The dissimilarity between patents is computed exploiting the WordNet [51] concept categories, which include words and synonyms. Once the dissimilarity matrix is built, dynamic patent maps are visualized considering the corresponding values of the matrix. The projection of the n-dimensional matrix into a two-dimensional space is achieved by using Multi-Dimensional Scaling algorithms such as REFSCAL, PROXCAL and ALSCAL. In addition, the patent maps are dynamic since they are combined with additional information such as application dates, applicants and patent clusters. This allows navigating the generated patent map, for example along the temporal dimension, to check how a specific research topic evolves over time.


CHAPTER 3

Linguistic annotation modules and Information Extraction tools for automatic patent text analysis

As knowledge in technical documents is mostly conveyed through text, content access requires a detailed understanding of the linguistic structures representing content in text. Thanks to recent advances in Natural Language Processing and Artificial Intelligence, this requirement is satisfied by linguistic tools which offer the starting point for an incremental process of annotation–acquisition–annotation, whereby domain–specific knowledge is acquired from linguistically–annotated texts and then projected back onto texts, so that extra linguistic information can be annotated and further knowledge layers extracted.

In this chapter we will describe the linguistic tools used for the automatic text analysis of patent documents. The chapter is divided into the following sections:

• In section 3.1 a description of the linguistic annotation tools used for this work will be provided. These tools annotate the patent text with linguistic information which is exploited by the information extraction tools for patent documents;

• In section 3.2 the information extraction system used in this thesis and adapted for the specific task of patent information extraction will be presented. This system will be adapted and employed to extract Users in chapter 4 and Advantages and Drawbacks in chapter 5;

• In section 3.3 the sentiment classifier employed to analyze text extracted from social media will be presented. In chapter 5 this system will be adapted to validate the Advantages and Drawbacks extracted from patents by the information extraction system.


[Figure: linguistic pipeline — Raw text → Sentence Splitter → Tokenizer → Part of Speech Tagger → Syntactic Parser → {Relation Extractor, Sentiment Analyzer, Term Extractor, Entity Extractor}]

Figure 3.1: An overview of the linguistic pipeline employed for the automatic patent text analysis tools.

3.1 Linguistic annotation tools

3.1.1 Sentence splitter

Natural language processing tools require that the input text is properly segmented into sentences. Since documents do not encode this information unambiguously, due to common abbreviations (e.g. “Mr., Dr.”), a sentence splitting process is required. In the technical document domain this issue is even more problematic due to the presence of chemical entity names and bibliographic references. Sentence splitting errors in the early stages of the linguistic analysis pipeline are propagated to the following steps, causing a strong decrease in their accuracy. For this reason the sentence splitter developed by the ItaliaNLP Lab was employed in the presented pipeline. This sentence splitter is based on machine learning techniques: given a training corpus of properly segmented sentences and a learning algorithm, a statistical model is built. Using this statistical model, the sentence splitter is able to segment texts not used in the training phase.
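The splitter itself is statistical, but the candidate-plus-features formulation it relies on can be sketched as follows; the abbreviation list and the hand-written decision rule below are invented stand-ins for the learned model.

```python
import re

# Illustrative abbreviation list; the real model learns such cues from data.
ABBREVIATIONS = {"mr.", "dr.", "prof.", "fig.", "e.g.", "i.e."}

def candidate_features(text, i):
    """Features for the '.'/'!'/'?' at position i: the token before,
    the token after, and the casing of what follows."""
    before = text[:i + 1].split()[-1].lower() if text[:i + 1].split() else ""
    after = text[i + 1:].split()
    nxt = after[0] if after else ""
    return {"prev": before, "next": nxt, "next_upper": nxt[:1].isupper()}

def split_sentences(text):
    """Hand-written stand-in for the classifier: a candidate is a boundary
    unless the previous token is a known abbreviation or the next token
    does not start with an uppercase letter."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]", text):
        f = candidate_features(text, m.start())
        if f["prev"] in ABBREVIATIONS or not f["next_upper"]:
            continue
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

out = split_sentences("Dr. Smith filed the patent. It covers valves.")
# out: ['Dr. Smith filed the patent.', 'It covers valves.']
```

Note how the period after “Dr.” is correctly rejected as a boundary; in the trained splitter the same decision is made by the statistical model rather than by a fixed list.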

3.1.2 Tokenizer

Before text processing can be performed, the analyzed text needs to be divided into linguistic units such as words, punctuation and numbers. This process is defined as tokenization. In this patent linguistic annotation pipeline the tokenizer developed by


the ItaliaNLP Lab was integrated. This tokenizer is based on regular expressions: each token must match one of the regular expressions defined in a configuration file. For example, the regular expression:

http:\/\/(\w+[\.\/]?)+

is used to tokenize URLs. A rule for URLs must be defined since the “:” in URLs has a different meaning from its ordinary “an explanation or a list follows” use: in this sense each single token must be linguistically significant [52]. Among others, rules are defined to tokenize words, acronyms, numbers, dates and equations.
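A minimal version of such a rule-driven tokenizer can be sketched as follows; the URL pattern is the one above, while the other rules and their names are illustrative stand-ins for the ones defined in the configuration file.

```python
import re

# Rules are tried in order, so the URL rule wins over the generic WORD rule.
TOKEN_RULES = [
    ("URL", r"http:\/\/(?:\w+[\.\/]?)+"),
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("WORD", r"\w+"),
    ("PUNCT", r"[^\w\s]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_RULES))

def tokenize(text):
    """Return (token, rule-name) pairs; the first matching rule wins."""
    return [(m.group(), m.lastgroup) for m in MASTER.finditer(text)]

toks = tokenize("See http://example.com/docs: figure 2.1")
# toks: [('See', 'WORD'), ('http://example.com/docs', 'URL'),
#        (':', 'PUNCT'), ('figure', 'WORD'), ('2.1', 'NUMBER')]
```

The colon after the URL is emitted as a separate punctuation token, while the “.” and “/” inside the URL stay attached to it, which is exactly the disambiguation the URL rule exists for.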

3.1.3 Part of speech tagger

The part of speech plays an important role in this linguistic annotation pipeline since it provides very useful information concerning the morphological role of a word and its morphosyntactic context: if a token is a determiner, the next token should be a noun or an adjective with very high confidence. Part of speech tags are used by many information extraction tools, such as named entity taggers, to identify named entities like people and locations, since the tokens representing named entities follow common morphological patterns. In addition, part of speech tags can be used to mitigate problems related to polysemy, since words often have different meanings depending on their part of speech (e.g. “track”, “guide”). This information is extensively used in the proposed methodologies for information extraction from patents, as will be shown in chapters 4 and 5.

In this thesis the ILC postagger [53] was employed for the development of the patent annotation pipeline. This postagger uses a supervised training algorithm: given a set of features and a training corpus, the classifier creates a statistical model using the feature statistics extracted from the training corpus. The postagger was trained on the Wall Street Journal section of the Penn Treebank corpus [54] using the SVM learning algorithm and was configured to use the set of features reported in table 3.1. The postagger reached an accuracy of 97.19% in a 10-fold cross-validation on the training data, in line with state-of-the-art postaggers.

FORM                       W−2, W−1, W0, W1
FORM LENGTH                W0
FORM FORMAT                W0
FORM PREFIXES (up to 5)    W0
FORM SUFFIXES (up to 5)    W0
FORM SHAPE                 W0
POS                        W−1
BIGRAMS                    (P−1 W0), (W−1 W0), (W0 W1), (W1 W2)
TRIGRAMS                   (P−2 P−1 W0), (W−1 W0 W1), (W−2 W−1 W0), (W0 W1 W2)

Table 3.1: Features used by the postagger to create the statistical model based on the training data. Wi denotes the token at position i relative to the target token W0, and Pi the part of speech tag assigned at position i.
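A sketch of how form-based features of this kind might be computed for one target token is shown below; the feature names, the padding convention and the shape encoding are invented for illustration, and the FORM FORMAT and n-gram features are omitted for brevity.

```python
import re

def shape(tok):
    """Collapsed character shape, e.g. 'Valve-3' -> 'Aa-0'."""
    s = re.sub(r"[A-Z]+", "A", tok)
    s = re.sub(r"[a-z]+", "a", s)
    return re.sub(r"\d+", "0", s)

def pos_features(words, prev_pos, i):
    """Local features for target words[i]: surrounding forms, length,
    shape, affixes up to length 5, and the previously predicted tag."""
    w = lambda j: words[i + j] if 0 <= i + j < len(words) else "<PAD>"
    w0 = words[i]
    feats = {"w-2": w(-2), "w-1": w(-1), "w0": w0, "w+1": w(1),
             "len": len(w0), "shape": shape(w0), "pos-1": prev_pos}
    for k in range(1, 6):          # prefixes and suffixes up to length 5
        feats[f"pre{k}"] = w0[:k]
        feats[f"suf{k}"] = w0[-k:]
    return feats

feats = pos_features(["The", "valve", "regulates", "flow"], prev_pos="DT", i=1)
# e.g. feats["w-1"] == "The", feats["suf3"] == "lve", feats["pos-1"] == "DT"
```

In a supervised tagger these feature dictionaries, paired with the gold tags of the training corpus, are what the learning algorithm (SVM, in the configuration above) consumes to build the statistical model.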


3.1.4 Syntactic Parser

The aim of a syntactic parser is to analyse a sentence in order to determine its grammatical structure, and to transform the input into a representation, usually a parse tree, that is manageable in later processing steps. As shown in chapter 2, many state-of-the-art systems for automatic patent analysis [3], [5] rely on syntactic information in order to extract Subject-Action-Object relations which are used, for instance, to compute patent similarity.

For this thesis the DeSR syntactic parser [55] was employed to extract the syntactic information used by the sentiment classifier in the advantage and drawback validation process shown in chapter 5. DeSR is a state-of-the-art parser which implements an incremental deterministic shift/reduce parsing algorithm, using specific rules to handle non-projective dependencies.
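The shift/reduce mechanics such a parser builds on can be illustrated with a toy replay of a transition sequence; the actions below are hand-written for the example, whereas in the real parser a classifier predicts each action from stack and buffer features, and the resulting dependencies are labeled.

```python
def parse(tokens, actions):
    """Replay arc-standard shift/reduce transitions, collecting
    (head, dependent) arcs over token indices."""
    stack, buf, arcs = [], list(range(len(tokens))), []
    for act in actions:
        if act == "SHIFT":
            stack.append(buf.pop(0))
        elif act == "LEFT":                # attach stack[-2] under stack[-1]
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif act == "RIGHT":               # attach stack[-1] under stack[-2]
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

tokens = ["The", "valve", "regulates", "flow"]
arcs = parse(tokens, ["SHIFT", "SHIFT", "LEFT", "SHIFT", "LEFT", "SHIFT", "RIGHT"])
tree = [(tokens[h], tokens[d]) for h, d in arcs]
# tree: [('valve', 'The'), ('regulates', 'valve'), ('regulates', 'flow')]
```

The resulting head-dependent pairs are exactly the raw material from which Subject-Action-Object triples such as (valve, regulates, flow) can be read off.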

3.2 The T2K² information extraction system

Understanding and synthesizing, on a daily basis, texts belonging to large corpora such as patent sets is a task that is significantly costly for humans both in terms of time and effort. In addition, manually extracting relevant information from texts requires the knowledge and the expertise of domain experts, such as legal experts or engineers in the patent domain. What counts as “relevant information” depends on the needs of the user who has to acquire new knowledge by analyzing texts belonging to a specific domain.

Marketers and designers are interested in identifying potential users for new products to be launched on the market, in order to satisfy customers’ needs and bring them advantages through the proposed product. Finding hotspots in the market is not an easy task without an organized knowledge of market trends.

All these needs lead research to find solutions to automatically perform the extraction and the organization of domain specific texts. Such solutions are provided by Information Extraction (IE) systems aimed at extracting domain–specific information from texts and organizing the extracted knowledge so that it can be easily accessed. In order to support the needs of marketers and designers, this thesis introduces T2K², an information extraction system which allows to automatically extract domain–specific knowledge from collections of Italian and English texts. While in chapters 4 and 5 T2K² will be adapted for the specific task of extracting relevant entities for marketers and designers, here T2K² is presented in its original form.

T2K² is a pipeline of Natural Language Processing (NLP) tools, statistical text analysis and machine learning algorithms which are dynamically integrated to provide an accurate representation of the domain–specific content of text corpora in different domains. Extracted knowledge ranges from domain–specific entities and named entities to the relations connecting them, and can be used for indexing document collections with respect to different information types. T2K² also includes “linguistic profiling” functionalities aimed at supporting the user in constructing the acquisition corpus, e.g. in selecting texts belonging to the same genre or characterized by the same degree of specialization, or in monitoring the “added value” of newly inserted documents.
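The core idea behind this kind of domain-term indexing can be sketched as a contrastive frequency ranking: candidate terms frequent in the domain corpus but rare in a general reference corpus are promoted. The scoring formula, the smoothing and the toy corpora below are invented for the example and are not the actual T2K² statistics.

```python
from collections import Counter

def domain_terms(domain_docs, reference_docs, top=5):
    """Rank words by the ratio between their relative frequency in the
    domain corpus and in the reference corpus (add-one smoothed)."""
    dom = Counter(w.lower() for doc in domain_docs for w in doc.split())
    ref = Counter(w.lower() for doc in reference_docs for w in doc.split())
    n_dom, n_ref = sum(dom.values()), sum(ref.values())
    score = {w: (c / n_dom) / ((ref[w] + 1) / (n_ref + 1))
             for w, c in dom.items()}
    return sorted(score, key=score.get, reverse=True)[:top]

domain = ["the valve actuator regulates pressure", "the actuator valve closes"]
general = ["the cat sat on the mat", "the weather is nice"]
terms = domain_terms(domain, general, top=3)
```

On these toy corpora the ranking puts “valve” and “actuator” on top while function words such as “the” sink, which is the behaviour a domain-term extractor needs before any further filtering on part-of-speech patterns.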

T2K² originates from the ontology learning system named T2K (Text–to–Knowledge,
