Natural Language Processing / 1
Representation – Pre-processing – Tasks – Applications
Laurea Magistrale in Computer Science
ARTIFICIAL INTELLIGENCE course
Stefano Ferilli
These slides were prepared for teaching purposes. They contain original material owned by the Università degli Studi di Bari and/or figures owned by other authors, companies and organizations, whose references are reported. All or part of the material may be photocopied for personal or teaching use, but may not be distributed for commercial use. Any other use requires specific authorization from the Università degli Studi di Bari and the other authors involved.
“They are flying planes”
Unknown Author
Motivations
●
Language is central in human life and development
●
Purposely developed by humans for communications
– Fundamental for evolution and society
●
Language is pervasive
●
Vast majority of information conveyed by natural language
– Written
– Spoken
●
Language is an expression of intelligence
●
Conceptual level
Introduction
●
Natural Language Processing (NLP)
●
One of the initial objectives of AI
●
Aims:
– Understanding natural language texts
– Generating natural language sentences
●
Need for automatic processing of natural language
Problems
●
Huge amount of data to be handled
●
Manual processing infeasible
●
Semantics
●
Computers mainly concerned with syntax
●
Ambiguity
●
Large number of different languages
●
Need for specific research on each
●
Languages variable and dynamic
Representation
●
(Natural) Language(s) expressed as text
●
Sequence of characters
– (printable) Graphical symbols (for humans)
– (numeric) Codes (for computers)
●
Need for an agreement about code-symbol mapping/encoding
●
Many characters used for different languages/purposes
– Continuously growing set
●
E.g., emoticons
Representation
●
Standard text formats (.txt)
●
(EBCDIC)
●
ASCII
– American Standard Code for Information Interchange
●
7 bit = 128 characters
●
1 byte = 256 characters
●
IANA: Internet Assigned Numbers Authority
– ISO Latin
– ISO-8859-1
●
UNICODE
Representation
●
UNICODE 3.0
●
4 bytes
– Currently 57700 code points
●
Segment-based handling
– More common subsets
– Less waste of storage
●
A plain text editor may display additional, non-significant characters
●
UTF-8
●
8-bit segments
●
Equivalent to ASCII
– Most significant bit in the first byte at 1 indicates extension of the segment
●
UTF-16
– 16-bit segments
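The segment scheme can be observed directly in Python, where `str.encode` exposes the raw UTF-8 bytes: ASCII characters occupy a single byte, while others use 2-4 byte segments whose first byte has the most significant bit set.

```python
# UTF-8 sketch: print the byte length and bit patterns of a few characters.
for ch in ("A", "è", "€", "😀"):
    b = ch.encode("utf-8")
    print(repr(ch), len(b), "bytes:", [f"{x:08b}" for x in b])
```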
Linguistic Levels
●
Morphological
●
Characters
●
Lexical
●
Words
●
Syntactic
●
Sentences
●
Semantic
●
Concepts
●
Pragmatic
●
Application
Processing Phases
●
[Tokenization]
●
Language Identification
●
Stopword Removal
●
Normalization
●
Stemming
●
Lemmatization
●
Part-of-Speech Tagging
●
Parsing
●
Understanding
Processing Phases
[Pipeline diagram, repeated across several slides: Language Identification → Stopword Removal → Normalization → PoS-Tagging → Parsing → Text Categorization → Text Understanding → Indexing & Retrieval. Each repetition highlights the knowledge level involved: Pre-processing, Morphology, Lexicon, Grammar, Semantics.]
Tokenization
●
Splitting the sequence of characters into meaningful aggregates
●
Words
●
Punctuation
●
Symbols
●
Values
– Dates, Monetary, ...
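A minimal tokenizer can be sketched with a regular expression; the pattern below is illustrative only, since production tokenizers need many more rules (abbreviations, clitics, URLs, ...).

```python
import re

# Toy tokenizer: dates/numbers first, then words, then single punctuation marks.
TOKEN = re.compile(r"\d+(?:[/.\-]\d+)*|\w+|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("On 12/05/2023, prices rose by 3.5%!"))
# → ['On', '12/05/2023', ',', 'prices', 'rose', 'by', '3.5', '%', '!']
```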
Language Identification
●
Determining the language(s) in which a text is written
●
Preliminary and necessary to the application of all subsequent phases
– Different languages → different approaches & resources
●
Morphological/Lexical
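The letter-n-gram approach listed later under Resources can be sketched as follows. The per-language bigram profiles here are hypothetical miniatures built from two short sentences; real profiles are trained on large corpora.

```python
from collections import Counter

def bigrams(text):
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

# Hypothetical mini-profiles (illustrative only)
PROFILES = {
    "en": bigrams("the quick brown fox jumps over the lazy dog"),
    "it": bigrams("la rapida volpe marrone salta sopra il cane pigro"),
}

def identify(text):
    grams = bigrams(text)
    def overlap(lang):  # how many of the text's bigrams the profile covers
        return sum(min(c, PROFILES[lang][g]) for g, c in grams.items())
    return max(PROFILES, key=overlap)

print(identify("the dog is lazy"))               # expected: en
print(identify("il cane salta sopra la volpe"))  # expected: it
```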
Stopword Removal
●
Removing terms that are irrelevant for understanding or distinguishing a specific text
●
Very common and widespread terms
– Generic
– Domain-specific
●
Lexical
– Stopword: specific term
●
Grammatical
– Function Word: grammatical role
●
Articles, Prepositions, Conjunctions, ...
●
Adjectives? Adverbs? → relevant for sentiment analysis
●
Used for lexical processing
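A sketch with a hypothetical mini stopword list; real lists are language-specific and much longer, and adjectives/adverbs may be kept when sentiment matters.

```python
# Illustrative stopword list (hypothetical; real resources are far larger)
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "planes", "are", "flying", "in", "the", "sky"]))
# → ['planes', 'flying', 'sky']
```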
Normalization
●
Reducing (inflected) terms to a standard form
●
Morphological
– Stemming: root
●
Suffix stripping [Porter]
●
Cannot distinguish grammatical roles
●
May merge different terms
●
Grammatical
– Lemmatization: basic form
●
More complex
●
Requires grammatical knowledge
●
Needed for some tasks
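A toy suffix-stripping stemmer in the spirit of (but far simpler than) Porter's algorithm; the suffix list and the minimum stem length are illustrative assumptions, not Porter's actual rules.

```python
# Hypothetical suffix list; Porter's real algorithm uses measure
# conditions over several rule phases.
SUFFIXES = ("ing", "edly", "ed", "ly", "es", "s")

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

print([stem(w) for w in ["flying", "planes", "talked", "quickly"]])
# → ['fly', 'plan', 'talk', 'quick']
```

Note how the result cannot distinguish grammatical roles ("fly" the verb vs. the noun), as the slide points out.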
Indexing & Retrieval
●
Organizing the documents so as to retrieve (all and only) those that are relevant for a given information need
●
Usually expressed as a query
– Set/Bag of terms
●
Lexical
Part-of-Speech (PoS) Tagging
●
Identification of the grammatical role of a word
●
Part of Speech = grammatical role
●
Indicated by
– Suffix
– Position in sentence
●
Grammatical
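The suffix cue can be sketched with a few hypothetical rules; real taggers combine lexicons, sentence position and statistical models, since suffixes alone cannot resolve ambiguities like "flying planes".

```python
# Toy suffix-based PoS rules (hypothetical, English-oriented)
RULES = [("ly", "ADV"), ("ing", "VERB"), ("ed", "VERB"), ("s", "NOUN")]

def tag(word):
    for suf, pos in RULES:
        if word.endswith(suf):
            return pos
    return "NOUN"  # default guess

print([(w, tag(w)) for w in ["quickly", "flying", "planes", "talked"]])
```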
Parsing
●
Determining sentence structure
●
Several representations
– Tree (Parse Tree)
– Graph
●
Grammatical
Text Categorization
●
Determining the subject of a text
●
In a pre-defined set or taxonomy
– Classification task
●
Latent
●
Lexical
Text Understanding
●
Understanding something about the content and/or meaning of a text
●
Word Sense Disambiguation
●
Named Entity Recognition
●
Anaphora / Co-reference resolution
●
Semantic
Text Understanding
●
Understanding something about the content and/or meaning of a text
●
Keyword/Keyphrase Extraction
– (Sequences of) words representative of the text content
●
Information Extraction
– Filling the fields of a record about the text content
●
Sentiment Analysis (Opinion Mining)
●
Emotion Analysis
●
Semantic
Sentiment/Emotion Analysis
●
Determining the emotional content of a text
●
Sentiment Analysis (aka Opinion Mining)
– Positive / (Neutral) / Negative
●
Emotion Analysis
– Specific emotions [Ekman]
●
Joy, happiness, sadness, anger, fear, ...
●
Uses
●
Industrial products
●
Political polls
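A minimal lexicon-based sketch; the mini-lexicon is hypothetical, while real systems use resources such as SentiWordNet and handle negation, intensifiers, irony, etc.

```python
# Hypothetical sentiment lexicon (word → polarity weight)
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "awful": -2, "hate": -2}

def sentiment(tokens):
    score = sum(LEXICON.get(t.lower(), 0) for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this great product".split()))  # positive
print(sentiment("the service was awful".split()))      # negative
```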
Resources
●
Language Identification
●
Language models
– Letter n-grams
●
n = 2, 3
– Stopwords
– ...suffixes?
●
Stopword Removal
●
Stopword Lists
●
PoS tags
●
Normalization
●
Suffix Lists
●
Inflection Rules
Resources
●
PoS-tagging
●
Suffix Lists
●
Inflection Rules
●
Parsing
●
Grammars
●
Understanding
●
(Lexical) Taxonomies
●
Ontologies
Applications
●
Text Summarization
●
Producing a shorter text preserving its main content
– Extractive
– Abstractive
●
Conversation (chatbots)
●
...
Indexing
●
Vector Spaces [Salton, Wong, Yang, 1975]
●
Document = Bag-of-Words (BoW)
●
Term-Document Matrix
– Columns = Documents (Vectors)
– Rows = Terms (Vector Dimensions)
– Entry (i,j) = relevance of i-th term to j-th document
●
Weighting schemes
– Local
●
Binary
●
TF
– Global
●
TF*IDF
●
Log*Entropy
●
Usually 0 if term does not appear in document
●
Query = new vector (point in the space)
– Similarity to documents = distance (Cosine Distance)
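The scheme can be sketched end-to-end on a hypothetical three-document corpus: Bag-of-Words vectors, TF*IDF weighting, and cosine similarity between a query and each document.

```python
import math
from collections import Counter

docs = ["they are flying planes", "planes fly high", "they are dancing"]
bows = [Counter(d.split()) for d in docs]        # document = Bag-of-Words
vocab = sorted(set().union(*bows))               # vector dimensions
N = len(docs)

def idf(t):
    df = sum(1 for b in bows if t in b)          # document frequency
    return math.log(N / df)

def vec(bow):                                    # TF*IDF weighting
    return [bow[t] * idf(t) for t in vocab]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

query = vec(Counter("flying planes".split()))    # query = new vector
sims = [cosine(query, vec(b)) for b in bows]
print(sims)  # the first document should rank highest
```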
Latent Semantic Analysis (LSA)
●
[Deerwester & Dumais]
●
Terms in a document are just surface clues for the underlying concepts
●
Concepts may be extracted from the vector space
– Latent = not named
– Singular Value Decomposition (SVD)
●
A matrix factorization approach
– Term-Document matrix A = U × W × Vᵀ
●
U = Term-Concept matrix (n×r)
●
W = Concept matrix (diagonal, r×r)
●
V = Document-Concept matrix (m×r)
– r = rank of A = number of concepts
Latent Semantic Indexing
●
LSA + Dimensionality Reduction
●
Select k (< r) top values in the diagonal of W
– Most important concepts
●
U’, V’ = corresponding columns of U and rows of V
●
A’ = U’ × W’ × V’ᵀ
– Document vectors reweighted based on most important latent concepts only
– Terms not appearing in a document may have non-0 relevance to it
●
[Incremental (approximated) approaches available]
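A minimal LSI sketch with NumPy, assuming a small hypothetical term-document matrix: the SVD is truncated to the k strongest latent concepts, A ≈ A_k = U_k W_k V_kᵀ.

```python
import numpy as np

# Hypothetical term-document matrix (rows = terms, columns = documents)
A = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)

U, w, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(w) Vt
k = 2
A_k = U[:, :k] @ np.diag(w[:k]) @ Vt[:k, :]       # keep k top concepts
print(np.round(A_k, 2))  # zero entries may become non-zero: latent relevance
```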
Formal Concept Analysis
●
Task: find the class lattice
●
Matrix Objects x Features
●
Object Definition = features of object
– E.g., binary
●
Subclass = Subset of features
– Expanding features → restricting objects
– Restricting features → expanding objects
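The extent/intent duality can be sketched on a hypothetical objects × features context, enumerating all formal concepts by closing every feature subset.

```python
from itertools import combinations

# Hypothetical binary context (object → its features)
objects = {
    "duck":  {"flies", "swims"},
    "eagle": {"flies", "hunts"},
    "shark": {"swims", "hunts"},
}
features = {"flies", "swims", "hunts"}

def extent(fs):    # all objects possessing every feature in fs
    return {o for o, f in objects.items() if fs <= f}

def intent(objs):  # all features shared by every object in objs
    return set.intersection(*(objects[o] for o in objs)) if objs else set(features)

concepts = set()
for r in range(len(features) + 1):
    for fs in combinations(sorted(features), r):
        e = extent(set(fs))
        concepts.add((frozenset(e), frozenset(intent(e))))

# Larger intent (more features) ↔ smaller extent (fewer objects)
for e, i in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(e), "<->", sorted(i))
```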
Topic Modeling
●
Latent topics
●
Document topic = mix of topics
●
Latent Dirichlet Allocation (LDA)
●
Topic = mix of words
●
Clusters of similar words
WordNet
●
Lexical Taxonomy (Ontology?) [Miller]
●
Synset (Synonymous Set)
●
212,558 concepts + glosses (v3)
●
20 relationships
– Lexical, Semantic
●
Prolog version (v3)
●
Extensions
– MultiWordNet
– (ItalWordNet)
– SentiWordNet
– WordNet Domains
More...
●
Concept Indexing
ML x NLP
●
Automatic learning of linguistic resources
●
Impossible or costly manually building them for all languages
– Non-widespread languages, dialects, jargons, ...
●
Unsupervised vs Supervised Learning ?
BLA-BLA
●
Broad-spectrum Language Analysis-Based Learning Application
●
Morphological statistics
●
Stopword lists
●
Suffix lists
●
PoS
●
Conceptual Taxonomies
– Extracted structures: (subject, verb, complement), (subject, complement)
ConNeKTion
●
Goal
●
Study, understanding and exploitation of the content of a collection / corpus / digital library
●
Easy exploration of its semantic content
●
Representation
●
Graph (Lexical taxonomy/ Ontology)
– Nodes = Terms / Concepts
– Arcs = Verbs (weighted)
●
Frequency (positive/negative)
ConNeKTion
●
(Core) pre-processing
●
Anaphora resolution
●
Syntactic analysis
●
Normalization
●
Tasks
●
Ontology Learning and Refinement
●
Association-based reasoning
●
Keyword Extraction
●
Information Retrieval
●
Author identification
●
...
ConNeKTion
Reasoning by association (BFS)
A Breadth-First Search starts from both nodes: each searches the other's frontier, until the two frontiers meet.
The ratio of positive/negative instances over the total represents gradations of the action.
“the young person watches television, which talks about (and criticizes) facebook, because it typically does not help (rather, distracts from) schoolwork”.
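The bidirectional search can be sketched on a toy concept graph built from the example sentence; the nodes and edges below are hypothetical.

```python
# Toy undirected concept graph (hypothetical adjacency)
graph = {
    "young": {"television"},
    "television": {"young", "facebook"},
    "facebook": {"television", "schoolwork"},
    "schoolwork": {"facebook"},
}

def connected_via(start, goal):
    sa, sb = {start}, {goal}   # nodes seen from each side
    fa, fb = {start}, {goal}   # current frontiers
    while fa and fb:
        if sa & sb:            # the two searches have met
            return sa & sb
        fa = {n for u in fa for n in graph[u]} - sa
        sa |= fa
        fb = {n for u in fb for n in graph[u]} - sb
        sb |= fb
    return sa & sb

print(connected_via("young", "schoolwork"))
```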
Automatic Inductive and Analogical Reasoning on Concept Networks: a Forensic Application
Candidate: Fabio Leuzzi – Tutor: Prof. Stefano Ferilli 43
ConNeKTion
Reasoning by association (prob)
Defined a formalism based on ProbLog language: p
i:: f
i●
f
i: ground literal of the form link (subject, verb, complement)
●
p
i: ratio between the sum of all examples for which f
iholds and the sum of all possible links between subject and complement Real world data are typically noisy and uncertain → need for strategies that soften the classical rigid logical reasoning
ConNeKTion GUI
[Screenshots of the ConNeKTion graphical user interface]
References
●
S. Ferilli