Natural Language Processing / 1
Representation – Pre-processing – Tasks – Applications
Laurea Magistrale in Computer Science
ARTIFICIAL INTELLIGENCE course
Stefano Ferilli
These slides were prepared for teaching purposes. They contain original material owned by the Università degli Studi di Bari and/or figures owned by other authors, companies and organizations, whose references are reported. All or part of the material may be photocopied for personal or teaching use, but may not be distributed for commercial use. Any other use requires specific authorization from the Università degli Studi di Bari and the other authors involved.
“They are flying planes”
Unknown Author
Motivations
●
Language is central in human life and development
●
Purposely developed by humans for communications
– Fundamental for evolution and society
●
Language is pervasive
●
Vast majority of information conveyed by natural language
– Written
– Spoken
●
Language is an expression of intelligence
●
Conceptual level
Introduction
●
Natural Language Processing (NLP)
●
One of the initial objectives of AI
●
Aims:
– Understanding natural language texts
– Generating natural language sentences
●
Need for automatic processing of natural language
Problems
●
Huge amount of data to be handled
●
Manual processing infeasible
●
Semantics
●
Computers mainly concerned with syntax
●
Ambiguity
●
Large number of different languages
●
Need for specific research on each
●
Languages variable and dynamic
Representation
●
(Natural) Language(s) expressed as text
●
Sequence of characters
– (printable) Graphical symbols (for humans)
– (numeric) Codes (for computers)
●
Need for an agreement about code-symbol mapping/encoding
●
Many characters used for different languages/purposes
– Continuously growing set
●
E.g., emoticons
Representation
●
Standard text formats (.txt)
●
(EBCDIC)
●
ASCII
– American Standard Code for Information Interchange
●
7 bit = 128 characters
●
1 byte = 256 characters
●
IANA: Internet Assigned Numbers Authority
– ISO Latin
– ISO-8859-1
●
UNICODE
Representation
●
UNICODE 3.0
●
4 bytes
– Currently 57700 code points
●
Segment-based handling
– More common subsets
– Less waste of storage
●
A plain text editor may display additional, non-significant characters
●
UTF-8
●
8-bit segments
●
Equivalent to ASCII
– Most significant bit in the first byte at 1 indicates extension of the segment
●
UTF-16
– 16-bit segments
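The segment scheme can be observed directly in Python, where `str.encode` exposes the raw UTF-8 bytes: ASCII characters occupy a single byte, while others use 2-4 byte segments whose first byte has the most significant bit set.

```python
# UTF-8 sketch: print the byte length and bit patterns of a few characters.
for ch in ("A", "è", "€", "😀"):
    b = ch.encode("utf-8")
    print(repr(ch), len(b), "bytes:", [f"{x:08b}" for x in b])
```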
Linguistic Levels
●
Morphological
●
Characters
●
Lexical
●
Words
●
Syntactic
●
Sentences
●
Semantic
●
Concepts
●
Pragmatic
●
Application
Processing Phases
●
[Tokenization]
●
Language Identification
●
Stopword Removal
●
Normalization
●
Stemming
●
Lemmatization
●
Part-of-Speech Tagging
●
Parsing
●
Understanding
Processing Phases
[Pipeline diagram, repeated across several slides: Language Identification → Stopword Removal → Normalization → PoS-Tagging → Parsing → Text Categorization → Text Understanding → Indexing & Retrieval. Each repetition highlights the knowledge level involved: Pre-processing, Morphology, Lexicon, Grammar, Semantics.]
Tokenization
●
Splitting the sequence of characters into meaningful aggregates
●
Words
●
Punctuation
●
Symbols
●
Values
– Dates, Monetary, ...
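A minimal tokenizer can be sketched with a regular expression; the pattern below is illustrative only, since production tokenizers need many more rules (abbreviations, clitics, URLs, ...).

```python
import re

# Toy tokenizer: dates/numbers first, then words, then single punctuation marks.
TOKEN = re.compile(r"\d+(?:[/.\-]\d+)*|\w+|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("On 12/05/2023, prices rose by 3.5%!"))
# → ['On', '12/05/2023', ',', 'prices', 'rose', 'by', '3.5', '%', '!']
```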
Language Identification
●
Determining the language(s) in which a text is written
●
Preliminary and necessary to the application of all subsequent phases
– Different languages → different approaches & resources
●
Morphological/Lexical
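The letter-n-gram approach listed later under Resources can be sketched as follows. The per-language bigram profiles here are hypothetical miniatures built from two short sentences; real profiles are trained on large corpora.

```python
from collections import Counter

def bigrams(text):
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

# Hypothetical mini-profiles (illustrative only)
PROFILES = {
    "en": bigrams("the quick brown fox jumps over the lazy dog"),
    "it": bigrams("la rapida volpe marrone salta sopra il cane pigro"),
}

def identify(text):
    grams = bigrams(text)
    def overlap(lang):  # how many of the text's bigrams the profile covers
        return sum(min(c, PROFILES[lang][g]) for g, c in grams.items())
    return max(PROFILES, key=overlap)

print(identify("the dog is lazy"))               # expected: en
print(identify("il cane salta sopra la volpe"))  # expected: it
```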
Stopword Removal
●
Removing terms that are irrelevant for understanding or distinguishing a specific text
●
Very common and widespread terms
– Generic
– Domain-specific
●
Lexical
– Stopword: specific term
●
Grammatical
– Function Word: grammatical role
●
Articles, Prepositions, Conjunctions, ...
●
Adjectives? Adverbs? → relevant for sentiment analysis
●
Used for lexical processing
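A sketch with a hypothetical mini stopword list; real lists are language-specific and much longer, and adjectives/adverbs may be kept when sentiment matters.

```python
# Illustrative stopword list (hypothetical; real resources are far larger)
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "planes", "are", "flying", "in", "the", "sky"]))
# → ['planes', 'flying', 'sky']
```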
Normalization
●
Reducing (inflected) terms to a standard form
●
Morphological
– Stemming: root
●
Suffix stripping [Porter]
●
Cannot distinguish grammatical roles
●
May merge different terms
●
Grammatical
– Lemmatization: basic form
●
More complex
●
Requires grammatical knowledge
●
Needed for some tasks
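A toy suffix-stripping stemmer in the spirit of (but far simpler than) Porter's algorithm; the suffix list and the minimum stem length are illustrative assumptions, not Porter's actual rules.

```python
# Hypothetical suffix list; Porter's real algorithm uses measure
# conditions over several rule phases.
SUFFIXES = ("ing", "edly", "ed", "ly", "es", "s")

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

print([stem(w) for w in ["flying", "planes", "talked", "quickly"]])
# → ['fly', 'plan', 'talk', 'quick']
```

Note how the result cannot distinguish grammatical roles ("fly" the verb vs. the noun), as the slide points out.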
Indexing & Retrieval
●
Organizing the documents so as to retrieve (all and only) those that are relevant for a given information need
●
Usually expressed as a query
– Set/Bag of terms
●
Lexical
Part-of-Speech (PoS) Tagging
●
Identification of the grammatical role of a word
●
Part of Speech = grammatical role
●
Indicated by
– Suffix
– Position in sentence
●
Grammatical
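The suffix cue can be sketched with a few hypothetical rules; real taggers combine lexicons, sentence position and statistical models, since suffixes alone cannot resolve ambiguities like "flying planes".

```python
# Toy suffix-based PoS rules (hypothetical, English-oriented)
RULES = [("ly", "ADV"), ("ing", "VERB"), ("ed", "VERB"), ("s", "NOUN")]

def tag(word):
    for suf, pos in RULES:
        if word.endswith(suf):
            return pos
    return "NOUN"  # default guess

print([(w, tag(w)) for w in ["quickly", "flying", "planes", "talked"]])
```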
Parsing
●
Determining sentence structure
●
Several representations
– Tree (Parse Tree)
– Graph
●
Grammatical
Text Categorization
●
Determining the subject of a text
●
In a pre-defined set or taxonomy
– Classification task
●
Latent
●
Lexical
Text Understanding
●
Understanding something about the content and/or meaning of a text
●
Word Sense Disambiguation
●
Named Entity Recognition
●
Anaphora / Co-reference resolution
●
Semantic
Text Understanding
●
Understanding something about the content and/or meaning of a text
●
Keyword/Keyphrase Extraction
– (Sequences of) words representative of the text content
●
Information Extraction
– Filling the fields of a record about the text content
●
Sentiment Analysis (Opinion Mining)
●
Emotion Analysis
●
Semantic
Sentiment/Emotion Analysis
●
Determining the emotional content of a text
●
Sentiment Analysis (aka Opinion Mining)
– Positive / (Neutral) / Negative
●
Emotion Analysis
– Specific emotions [Ekman]
●
Joy, happiness, sadness, anger, fear, ...
●
Uses
●
Industrial products
●
Political polls
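A minimal lexicon-based sketch; the mini-lexicon is hypothetical, while real systems use resources such as SentiWordNet and handle negation, intensifiers, irony, etc.

```python
# Hypothetical sentiment lexicon (word → polarity weight)
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "awful": -2, "hate": -2}

def sentiment(tokens):
    score = sum(LEXICON.get(t.lower(), 0) for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this great product".split()))  # positive
print(sentiment("the service was awful".split()))      # negative
```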
Resources
●
Language Identification
●
Language models
– Letter n-grams
●
n = 2, 3
– Stopwords
– ...suffixes?
●
Stopword Removal
●
Stopword Lists
●
PoS tags
●
Normalization
●
Suffix Lists
●
Inflection Rules
Resources
●
PoS-tagging
●
Suffix Lists
●
Inflection Rules
●
Parsing
●
Grammars
●
Understanding
●
(Lexical) Taxonomies
●
Ontologies
Applications
●
Text Summarization
●
Producing a shorter text preserving its main content
– Extractive
– Abstractive
●
Conversation (chatbots)
●
...
Indexing
●
Vector Spaces [Salton, Wong, Yang, 1975]
●
Document = Bag-of-Words (BoW)
●
Term-Document Matrix
– Columns = Documents (Vectors)
– Rows = Terms (Vector Dimensions)
– Entry (i,j) = relevance of i-th term to j-th document
●
Weighting schemes
– Local
●
Binary
●
TF
– Global
●
TF*IDF
●
Log*Entropy
●
Usually 0 if term does not appear in document
●
Query = new vector (point in the space)
– Similarity to documents = distance (Cosine Distance)
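The scheme can be sketched end-to-end on a hypothetical three-document corpus: Bag-of-Words vectors, TF*IDF weighting, and cosine similarity between a query and each document.

```python
import math
from collections import Counter

docs = ["they are flying planes", "planes fly high", "they are dancing"]
bows = [Counter(d.split()) for d in docs]        # document = Bag-of-Words
vocab = sorted(set().union(*bows))               # vector dimensions
N = len(docs)

def idf(t):
    df = sum(1 for b in bows if t in b)          # document frequency
    return math.log(N / df)

def vec(bow):                                    # TF*IDF weighting
    return [bow[t] * idf(t) for t in vocab]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

query = vec(Counter("flying planes".split()))    # query = new vector
sims = [cosine(query, vec(b)) for b in bows]
print(sims)  # the first document should rank highest
```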
Latent Semantic Analysis (LSA)
●
[Deerwester & Dumais]
●
Terms in a document are just surface clues for the underlying concepts
●
Concepts may be extracted from the vector space
– Latent = not named
– Singular Value Decomposition (SVD)
●
A matrix factorization approach
– Term-Document matrix A = U × W × Vᵀ
●
U = Term-Concept matrix (n×r)
●
W = Concept matrix (diagonal, r×r)
●
V = Document-Concept matrix (m×r)
– r = rank of A = number of concepts
Latent Semantic Indexing
●
LSA + Dimensionality Reduction
●
Select k (< r) top values in the diagonal of W
– Most important concepts
●
U’, V’ = corresponding columns of U and rows of V
●
A’ = U’ × W’ × V’ᵀ
– Document vectors reweighted based on most important latent concepts only
– Terms not appearing in a document may have non-0 relevance to it
●
[Incremental (approximated) approaches available]
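A minimal LSI sketch with NumPy, assuming a small hypothetical term-document matrix: the SVD is truncated to the k strongest latent concepts, A ≈ A_k = U_k W_k V_kᵀ.

```python
import numpy as np

# Hypothetical term-document matrix (rows = terms, columns = documents)
A = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)

U, w, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(w) Vt
k = 2
A_k = U[:, :k] @ np.diag(w[:k]) @ Vt[:k, :]       # keep k top concepts
print(np.round(A_k, 2))  # zero entries may become non-zero: latent relevance
```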
Formal Concept Analysis
●
Task: find the class lattice
●
Matrix Objects x Features
●
Object Definition = features of object
– E.g., binary
●
Subclass = Subset of features
– Expanding features → restricting objects
– Restricting features → expanding objects
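The extent/intent duality can be sketched on a hypothetical objects × features context, enumerating all formal concepts by closing every feature subset.

```python
from itertools import combinations

# Hypothetical binary context (object → its features)
objects = {
    "duck":  {"flies", "swims"},
    "eagle": {"flies", "hunts"},
    "shark": {"swims", "hunts"},
}
features = {"flies", "swims", "hunts"}

def extent(fs):    # all objects possessing every feature in fs
    return {o for o, f in objects.items() if fs <= f}

def intent(objs):  # all features shared by every object in objs
    return set.intersection(*(objects[o] for o in objs)) if objs else set(features)

concepts = set()
for r in range(len(features) + 1):
    for fs in combinations(sorted(features), r):
        e = extent(set(fs))
        concepts.add((frozenset(e), frozenset(intent(e))))

# Larger intent (more features) ↔ smaller extent (fewer objects)
for e, i in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(e), "<->", sorted(i))
```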
Topic Modeling
●
Latent topics
●
Document topic = mix of topics
●
Latent Dirichlet Allocation (LDA)
●
Topic = mix of words
●
Clusters of similar words
WordNet
●
Lexical Taxonomy (Ontology?) [Miller]
●
Synset (Synonymous Set)
●
212,558 concepts + glosses (v3)
●
20 relationships
– Lexical, Semantic
●
Prolog version (v3)
●
Extensions
– MultiWordNet
– (ItalWordNet)
– SentiWordNet
– WordNet Domains
More...
●
Concept Indexing
ML x NLP
●
Automatic learning of linguistic resources
●
Impossible or costly manually building them for all languages
– Non-widespread languages, dialects, jargons, ...
●
Unsupervised vs Supervised Learning ?
BLA-BLA
●
Broad-spectrum Language Analysis-Based Learning Application
●
Morphological statistics
●
Stopword lists
●
Suffix lists
●
PoS
●
Conceptual Taxonomies
– Extracted structures: (subject, verb, complement), (subject, complement)
ConNeKTion
●
Goal
●
Study, understanding and exploitation of the content of a collection / corpus / digital library
●
Easy exploration of its semantic content
●
Representation
●
Graph (Lexical taxonomy/ Ontology)
– Nodes = Terms / Concepts
– Arcs = Verbs (weighted)
●
Frequency (positive/negative)
ConNeKTion
●
(Core) pre-processing
●
Anaphora resolution
●
Syntactic analysis
●
Normalization
●
Tasks
●
Ontology Learning and Refinement
●
Association-based reasoning
●
Keyword Extraction
●
Information Retrieval
●
Author identification
●
...
ConNeKTion
Reasoning by association (BFS)
A Breadth-First Search starts from both nodes: each searches the other's frontier, until the two frontiers meet.
The ratio of positive/negative instances over the total represents gradations of the action.
“the young person watches television, which talks about (and criticizes) facebook, because it typically does not help (rather, distracts from) schoolwork”.
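The bidirectional search can be sketched on a toy concept graph built from the example sentence; the nodes and edges below are hypothetical.

```python
# Toy undirected concept graph (hypothetical adjacency)
graph = {
    "young": {"television"},
    "television": {"young", "facebook"},
    "facebook": {"television", "schoolwork"},
    "schoolwork": {"facebook"},
}

def connected_via(start, goal):
    sa, sb = {start}, {goal}   # nodes seen from each side
    fa, fb = {start}, {goal}   # current frontiers
    while fa and fb:
        if sa & sb:            # the two searches have met
            return sa & sb
        fa = {n for u in fa for n in graph[u]} - sa
        sa |= fa
        fb = {n for u in fb for n in graph[u]} - sb
        sb |= fb
    return sa & sb

print(connected_via("young", "schoolwork"))
```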
Automatic Inductive and Analogical Reasoning on Concept Networks: a Forensic Application
Candidate: Fabio Leuzzi – Tutor: Prof. Stefano Ferilli 43
ConNeKTion
Reasoning by association (prob)
Defined a formalism based on ProbLog language: p
i:: f
i●
f
i: ground literal of the form link (subject, verb, complement)
●
p
i: ratio between the sum of all examples for which f
iholds and the sum of all possible links between subject and complement Real world data are typically noisy and uncertain → need for strategies that soften the classical rigid logical reasoning
ConNeKTion GUI
[Screenshots of the ConNeKTion graphical user interface]
References
●
S. Ferilli