University of Pisa Department of Energy, Systems, Territory and Construction Engineering
Ph.D. Dissertation
Mining Technical Knowledge
Natural Language Processing Techniques and Engineering Management Methods
Filippo Chiarello
January 2019
Supervisors: Prof. Andrea Bonaccorsi Prof. Gualtiero Fantoni
Mining Technical Knowledge
Filippo Chiarello, 2019.
Supervisors:
Prof. Andrea Bonaccorsi University of Pisa, Dept. of Energy, Systems, Territory, and Construction Engineering
Prof. Gualtiero Fantoni University of Pisa, Dept. of Industrial Engineering
Reviewed by:
Dr. Alessio Ferrari National Council of Research, Institute of Information Science and Technology
Prof. Riccardo Fini University of Bologna, Dept. of Management
Department of Energy, Systems, Territory, and Construction Engineering University of Pisa
Largo Lucio Lazzarino, 1 IT-56122 Pisa, Italy
Typeset in Markdown Pisa, Italy 2019
“The larger grows the area of knowledge, so too grows our perimeter of ignorance. It may be that for all we know we could be stepped in the center of infinite ignorance. Which then provides job security for ever for scientists.”
UNIVERSITY OF PISA
Abstract
Faculty of Engineering Management School of Engineering
Doctor of Philosophy
Mining Technical Knowledge
Natural Language Processing Techniques and Engineering Management Methods
by Filippo CHIARELLO
The information environment in which companies operate has changed dramatically in recent years, bringing a new challenge for management engineers. This discipline applies engineering methodologies to inherent systems but, nowadays, the activities with the greatest added value for companies are hard to standardize and non-repetitive. The enormous amount of information that is changing the environment of companies has a decisive impact on Research and Development, Design, Marketing, and Human Resources Management: all functions with high strategic content, and therefore high knowledge content. Since documents written in natural language contain knowledge by design, management engineers now have a great opportunity to exploit the technical knowledge hidden in these unstructured sources to generate value.
The aim of this thesis is to design methods and processes for the analysis of technical documents in order to extract valuable knowledge for companies. The methods are ensembles of Natural Language Processing and Management Engineering techniques. Their goal is to provide a correct exchange of knowledge between humans and machines, so that the knowledge of experts is incorporated into machine-learning systems and experts can use the knowledge that machines generate inductively in their decision-making processes.
Acknowledgements
I want to thank my parents for being my greatest teachers in these years of study. I hope to give you back at least a fraction of the things you have given me.
I want to thank my supervisors, Prof. Gualtiero Fantoni and Prof. Andrea Bonaccorsi, for believing in me before I did and for giving me the opportunity to start this career that I love more every day. The best is yet to come.
I want to thank my best friends and band, Alessandro, Ranieri and Paolo, for reminding me that my work is not the only thing I love to do in my life.
I want to thank my colleagues Leonello, Simona, Silvia and Elena for showing me new ways in which things can be done.
I want to thank Prof. Antonella Martini for her patience and passion in showing me how academia actually works.
Finally, I want to thank Arianna for giving me the inspiration and the energy that I needed to conclude my PhD the way I wanted to.
[Figure: information volume in zettabytes (ZB) by year, 2005 to 2020.]
[Figure: workflow for extracting a list of users from patents. Phases shown: Patent set Selection, Patent text preprocessing, List of Users generation, Automatic patent set annotation 1, Relevant sentences extraction, Automatic patent set annotation 2, Difference computation, Manual review. Documents shown: Patent set, Morphosyntactically Analyzed Patent set, List of Users, Automatically Annotated Patent Set 1, Sentences Training Set, Automatically Annotated Patent Set 2, List of new Users.]
[Figure: workflow for extracting advantages and drawbacks from patents, in four phases: (1) Advantages & Drawbacks Clues Collection, (2) Domain Clues Extraction, (3) Domain Clues Validation, (4) Advantages & Drawbacks Extraction.]
[Figure: Advantages & Drawbacks Clues Collection phase. Elements shown: Patent set Selection, Patent set, Patent text preprocessing, Morphosyntactically Analyzed Patent set, List of Advantages & Drawbacks clues generation, List of Advantages & Drawbacks Clues, Automatic patent set annotation 1, Automatically Annotated Patent Set 1.]
[Figure: Domain Clues Extraction phase. Elements shown: Automatically Annotated Patent Set 1, Relevant sentences extraction, Sentences Training Set, Automatic patent set annotation 2, Automatically Annotated Patent Set 2, Difference computation, List of domain Clues.]
[Figure: Domain Clues Validation phase. Elements shown: List of domain Clues, Tweets Collection, Tweets Corpus, Sentiment Analysis, W2V Features, Clues Sentiment Polarization, List Cleaning, Validated Domain Clues.]
[Figure: Advantages & Drawbacks Extraction phase. Elements shown: Non-domain Clues, Validated Domain Clues, Merge, List of Clues, Morphosyntactically Analyzed Patent set, Regular Expression Rules, Advantages & Drawbacks Sentences Extraction, Advantages & Drawbacks.]
[Figure: metrics for selecting the number of topics (2 to 20), scaled from 0 to 1. Panels: minimize, maximize. Metrics: Griffiths2004, CaoJuan2009, Arun2010.]
[Figure: terms grouped by topic for 12 topics. Terms shown include cloud manufacturing, internet of things, green manufacturing, additive manufacturing, remanufacturing, renewable energy, lean manufacturing and supply chain management.]
[Figure: number of words and number of sentences.]
[Figure: polarity value by year, 2013 to 2017.]
[Figure: metrics for selecting the number of topics (2 to 30), scaled from 0 to 1. Panels: minimize, maximize. Metrics: Griffiths2004, CaoJuan2009, Arun2010, Deveaud2014.]