
Natural Language Processing / 1

Representation – Pre-processing – Tasks – Applications

Laurea Magistrale in Computer Science

Course in ARTIFICIAL INTELLIGENCE

Stefano Ferilli

These slides were prepared for teaching purposes. They contain original material owned by the Università degli Studi di Bari and/or figures owned by other authors, companies, and organizations, for which the reference is reported. All or part of this material may be photocopied for personal or teaching use, but may not be distributed for commercial use. Any other use requires specific authorization from the Università degli Studi di Bari and the other authors involved.

“They are flying planes”

Unknown Author

Motivations

Language is central in human life and development

Purposely developed by humans for communication

– Fundamental for evolution and society

Language is pervasive

Vast majority of information conveyed by natural language

– Written

– Spoken

Language is an expression of intelligence

Conceptual level

Introduction

Natural Language Processing (NLP)

One of the initial objectives of AI

Aims:

– Understanding natural language texts

– Generating natural language sentences

Need for automatic processing of natural language

Problems

Huge amount of data to be handled

Manual processing infeasible

Semantics

Computers mainly concerned with syntax

Ambiguity

Large number of different languages

Need for specific research on each

Languages variable and dynamic

Representation

(Natural) Language(s) expressed as text

Sequence of characters

– (printable) Graphical symbols (for humans)

– (numeric) Codes (for computers)

Need for an agreement about code-symbol mapping/encoding

Many characters used for different languages/purposes

– Continuously growing set

E.g., emoticons


Representation

Standard text formats (.txt)

(EBCDIC)

ASCII

– American Standard Code for Information Interchange

7 bit = 128 characters

1 byte = 256 characters

IANA: Internet Assigned Numbers Authority

– ISO Latin

– ISO-8859-1

UNICODE

Representation

UNICODE 3.0

4 bytes

– Currently 57700 code points

Segment-based handling

– More common subsets

– Less waste of storage

Otherwise a normal text editor would see many insignificant characters

UTF-8

8-bit segments

Equivalent to ASCII for the first 128 code points

– A most significant bit set to 1 in the first byte indicates that the character extends over further segments

UTF-16

– 16-bit segments

Linguistic Levels

Morphological – Characters

Lexical – Words

Syntactic – Sentences

Semantic – Concepts

Pragmatic – Application

Processing Phases

[Tokenization]

Language Identification

Stopword Removal

Normalization

Stemming

Lemmatization

Part-of-Speech Tagging

Parsing

Understanding

Processing Phases

Language Identification – Stopword Removal – PoS-Tagging – Normalization – Parsing – Text Categorization – Text Understanding – Indexing & Retrieval

[Pipeline diagram, repeated to highlight in turn the phases covered by Pre-processing, Morphology, Lexicon, Grammar, and Semantics]

Tokenization

Splitting the sequence of characters into meaningful aggregates

Words

Punctuation

Symbols

Values

– Dates, Monetary, ...
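A minimal sketch of a regular-expression tokenizer in Python (the pattern is an illustrative assumption; real tokenizers cover many more cases, e.g. abbreviations, URLs, dates):

import re

# Illustrative pattern: numbers (possibly with decimal/thousands separators),
# words (possibly with internal apostrophes), and any other non-space symbol.
TOKEN_PATTERN = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("They are flying planes, aren't they? It costs $3,000.50!"))
# ['They', 'are', 'flying', 'planes', ',', "aren't", 'they', '?',
#  'It', 'costs', '$', '3,000.50', '!']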

Language Identification

Determining the language(s) in which a text is written

Preliminary and necessary to the application of all subsequent phases

– Different languages → different approaches & resources

Morphological/Lexical
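A minimal sketch of language identification by letter n-gram profiles (the two tiny profiles below are hypothetical stand-ins; real language models are built from large corpora):

from collections import Counter

def char_ngrams(text, n=3):
    text = "_" + text.lower().replace(" ", "_") + "_"
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Hypothetical reference profiles; in practice they are learned from corpora.
PROFILES = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "it": char_ngrams("la volpe veloce salta sopra il cane pigro e il gatto"),
}

def identify(text):
    probe = char_ngrams(text)
    def overlap(profile):
        return sum(min(count, profile[gram]) for gram, count in probe.items())
    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))

print(identify("the dog and the fox"))   # 'en'
print(identify("il gatto e il cane"))    # 'it'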


Stopword Removal

Removing terms that are irrelevant for understanding or distinguishing a specific text

Very common and widespread terms

– Generic

– Domain-specific

Lexical

– Stopword: specific term

Grammatical

– Function Word: grammatical role

Articles, Prepositions, Conjunctions, ...

Adjectives? Adverbs? → relevant for sentiment analysis

Used for lexical processing
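A minimal sketch of list-based stopword removal (the stopword set is a tiny illustrative sample; real systems use full generic or domain-specific lists, such as those distributed with NLTK):

# Tiny illustrative stopword list; real lists contain hundreds of function words.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "they"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["They", "are", "flying", "planes"]))
# ['flying', 'planes']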

Normalization

Reducing (inflected) terms to a standard form

Morphological

– Stemming: root

Suffix stripping [Porter]

Cannot distinguish grammatical roles

May merge different terms

Grammatical

– Lemmatization: basic form

More complex

Requires grammatical knowledge

Needed for some tasks
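A minimal sketch contrasting stemming and lemmatization with NLTK, assuming NLTK is installed and the WordNet data has been downloaded (newer NLTK versions may also need the 'omw-1.4' resource):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # the lemmatizer needs the WordNet data

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["flying", "planes", "studies", "better"]:
    print(word,
          "-> stem:", stemmer.stem(word),
          "| lemma:", lemmatizer.lemmatize(word, pos="v" if word == "flying" else "n"))

# Stemming only strips suffixes ('studies' -> 'studi'), so different terms may be
# merged and grammatical roles are lost; lemmatization maps to a proper basic
# form ('studies' -> 'study'), but needs grammatical knowledge (the PoS).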

Indexing & Retrieval

Organizing the documents so as to retrieve (all and only) those that are relevant for a given information need

Usually expressed as a query

– Set/Bag of terms

Lexical

Part-of-Speech (PoS) Tagging

Identification of the grammatical role of a word

Part of Speech = grammatical role

Indicated by

– Suffix

– Position in sentence

Grammatical
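A minimal sketch with NLTK's default tagger (assumes NLTK with its tokenizer and tagger resources; resource names can vary slightly across NLTK versions):

import nltk

# One-time resource downloads; exact resource names may differ across versions.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("They are flying planes")
print(nltk.pos_tag(tokens))
# Typically [('They', 'PRP'), ('are', 'VBP'), ('flying', 'VBG'), ('planes', 'NNS')]:
# the tagger commits to the verb reading of 'flying'; the adjectival reading
# (JJ modifying 'planes') is the other interpretation of the ambiguous sentence.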

Parsing

Determining sentence structure

Several representations

– Tree (Parse Tree)

– Graph

Grammatical
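A minimal sketch with a toy context-free grammar that yields the two parse trees of the opening example "They are flying planes" (the grammar is an illustrative assumption, not a realistic grammar of English):

import nltk

grammar = nltk.CFG.fromstring("""
  S   -> NP VP
  NP  -> PRP | JJ NNS | NNS
  VP  -> AUX VBG NP | COP NP
  PRP -> 'they'
  AUX -> 'are'
  COP -> 'are'
  VBG -> 'flying'
  JJ  -> 'flying'
  NNS -> 'planes'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("they are flying planes".split()):
    print(tree)
# Two parses (order may vary):
# (S (NP (PRP they)) (VP (AUX are) (VBG flying) (NP (NNS planes))))
# (S (NP (PRP they)) (VP (COP are) (NP (JJ flying) (NNS planes))))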

Text Categorization

Determining the subject of a text

In a pre-defined set or taxonomy

– Classification task

Latent

Lexical
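A minimal sketch of categorization as supervised classification over a lexical (bag-of-words) representation, using scikit-learn; the tiny training set and the two categories are purely illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Purely illustrative training data: two categories, two documents each.
docs = ["the striker scored a late goal", "the team won the match",
        "parliament approved the new law", "the minister resigned after the vote"]
labels = ["sport", "sport", "politics", "politics"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["the goalkeeper saved the match"]))   # likely ['sport']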


Text Understanding

Understanding something about the content and/or meaning of a text

Word Sense Disambiguation

Named Entity Recognition

Anaphora / Co-reference resolution

Semantic

Text Understanding

Understanding something about the content and/or meaning of a text

Keyword/Keyphrase Extraction

– (Sequences of) words representative of the text content

Information Extraction

– Filling the fields of a record about the text content

Sentiment Analysis (Opinion Mining)

Emotion Analysis

Semantic

Sentiment/Emotion Analysis

Determining the emotional content of a text

Sentiment Analysis (aka Opinion Mining)

– Positive / (Neutral) / Negative

Emotion Analysis

– Specific emotions [Ekman]

Joy, happiness, sadness, anger, fear, ...

Uses

Industrial products

Political polls
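A minimal sketch of lexicon-based polarity scoring (the word lists are illustrative assumptions; resources such as SentiWordNet provide real polarity scores, and current systems rely on supervised or neural models):

# Tiny illustrative polarity lexicon.
POSITIVE = {"good", "great", "love", "helps"}
NEGATIVE = {"bad", "distracts", "criticizes", "hate"}

def polarity(tokens):
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity("facebook distracts schoolwork".split()))   # 'negative'
print(polarity("i love this great product".split()))       # 'positive'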

Resources

Language Identification

Language models

– Letter n-grams (n = 2, 3)

– Stopwords

– Suffixes?

Stopword Removal

Stopword Lists

PoS tags

Normalization

Suffix Lists

Inflection Rules

Resources

PoS-tagging

Suffix Lists

Inflection Rules

Parsing

Grammars

Understanding

(Lexical) Taxonomies

Ontologies

Applications

Text Summarization

Producing a shorter text preserving its main content

– Extractive

– Abstractive

Conversation (chatbots)

...
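A minimal sketch of the extractive approach: score each sentence by the frequency of its words in the whole text and keep the top-ranked ones (the scoring scheme is an illustrative assumption; abstractive summarization instead generates new sentences and typically requires language generation models):

import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    def score(sentence):
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)
    best = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in best)   # keep original order

text = ("Language is central in human life. It is pervasive, written and spoken. "
        "Most information is conveyed by natural language.")
print(extractive_summary(text))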


Indexing

Vector Spaces [Salton, Wong & Yang, 1975]

Document = Bag-of-Words (BoW)

Term-Document Matrix

– Columns = Documents (Vectors)

– Rows = Terms (Vector Dimensions)

– Entry (i,j) = relevance of i-th term to j-th document

Weighting schemes

– Local: Binary, TF

– Global: TF*IDF, Log*Entropy

– Usually 0 if the term does not appear in the document

Query = new vector (point in the space)

– Similarity to documents = distance (Cosine Distance)
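A minimal sketch of vector-space retrieval with scikit-learn: a TF*IDF-weighted matrix and cosine similarity between the query vector and the document vectors (note that scikit-learn builds the transpose of the slide's term-document matrix, i.e. rows are documents):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["pilots fly planes", "planes and airports",
        "dogs chase cats", "cats sleep all day"]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)        # rows = documents, columns = terms

query_vec = vectorizer.transform(["fly planes"])   # query = a new vector in the same space
scores = cosine_similarity(query_vec, doc_matrix)[0]

for doc, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {doc}")                   # documents ranked by similarity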

Latent Semantic Analysis (LSA)

[Deerwester & Dumais]

Terms in a document are just surface clues for the underlying concepts

Concepts may be extracted from the vector space

– Latent = not named

– Singular Value Decomposition (SVD)

A matrix factorization approach

Term-Document matrix A = U x W x Vᵀ

– U = Term-Concept matrix (n×r)

– W = Concept matrix (diagonal, r×r)

– V = Document-Concept matrix (m×r)

– r = rank of A = number of concepts

Latent Semantic Indexing

LSA + Dimensionality Reduction

Select the k (< r) largest values in the diagonal of W

– Most important concepts

U’, V’ = corresponding columns of U and of V; W’ = corresponding k×k diagonal block of W

A’ = U’ x W’ x V’ᵀ

– Document vectors reweighted based on the most important latent concepts only

– Terms not appearing in a document may have non-zero relevance to it

[Incremental (approximated) approaches available]
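A minimal numpy sketch of LSA/LSI: singular value decomposition of a toy term-document matrix, truncation to the k largest singular values, and reconstruction of A' (the matrix entries are illustrative binary weights):

import numpy as np

# Toy term-document matrix A (rows = terms, columns = documents), binary weights.
A = np.array([[1, 1, 0, 0],    # plane
              [1, 0, 1, 0],    # pilot
              [0, 1, 1, 0],    # flight
              [0, 0, 0, 1],    # dog
              [0, 0, 1, 1]],   # cat
             dtype=float)

U, w, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(w) @ Vt

k = 2                                              # keep only the k strongest concepts
A_k = U[:, :k] @ np.diag(w[:k]) @ Vt[:k, :]        # A' = U' x W' x V'ᵀ

np.set_printoptions(precision=2, suppress=True)
print(A_k)
# Terms that do not occur in a document can now receive a non-zero weight for it,
# because they share latent concepts with terms that do occur there.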

Formal Concept Analysis

Task: find the concept lattice

Matrix Objects x Features

Object Definition = features of object

– E.g., binary

Subclass = Subset of features

– Expanding the features → restricting the objects

– Restricting the features → expanding the objects

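A minimal sketch of the two FCA derivation operators over a toy objects × features context (an illustrative assumption): expanding the feature set restricts the matching objects, and vice versa.

# Toy formal context: object -> set of features it has.
CONTEXT = {
    "sparrow": {"flies", "has_feathers"},
    "penguin": {"has_feathers", "swims"},
    "plane":   {"flies", "has_engine"},
}

def objects_with(features):
    """All objects having every feature in the given set (extent)."""
    return {o for o, fs in CONTEXT.items() if features <= fs}

def features_of(objects):
    """All features shared by every object in the given set (intent)."""
    return set.intersection(*(CONTEXT[o] for o in objects)) if objects else set()

print(objects_with({"flies"}))                    # {'sparrow', 'plane'}
print(objects_with({"flies", "has_feathers"}))    # expanding features -> {'sparrow'}
print(features_of({"sparrow", "penguin"}))        # {'has_feathers'}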
Topic Modeling

Latent topics

Document = mix of topics

Latent Dirichlet Allocation (LDA)

Topic = mix of words

Clusters of similar words
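A minimal sketch of LDA with scikit-learn (the toy corpus and the choice of two topics are illustrative assumptions):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["planes pilots flights airports",
        "pilots planes engines flights",
        "dogs cats pets animals",
        "cats dogs animals food"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

words = vectorizer.get_feature_names_out()         # get_feature_names() in older versions
for t, topic in enumerate(lda.components_):        # each topic = a mixture of words
    top = [words[i] for i in topic.argsort()[-3:][::-1]]
    print(f"topic {t}: {top}")

print(lda.transform(X[:1]))                        # each document = a mixture of topics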

WordNet

Lexical Taxonomy (Ontology?) [Miller]

Synset (Synonymous Set)

212,558 concepts + glosses (v3)

20 relationships

– Syntactic, Semantic

Prolog version (v3)

Extensions

– MultiWordNet

– (ItalWordNet)

– SentiWordNet

– WordNet Domains
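A minimal sketch of querying WordNet through NLTK (assumes the WordNet corpus has been downloaded):

import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)               # one-time download of the WordNet data

for synset in wn.synsets("plane")[:3]:             # synsets = sets of synonymous lemmas
    print(synset.name(), "-", synset.definition()) # the gloss
    print("   lemmas:   ", synset.lemma_names())
    print("   hypernyms:", [h.name() for h in synset.hypernyms()])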


More...

Concept Indexing

ML x NLP

Automatic learning of linguistic resources

Impossible or costly manually building them for all languages

– Non-widespread languages, dialects, jargons, ...

Unsupervised vs Supervised Learning ?

BLA-BLA

Broad-spectrum Language Analysis-Based Learning Application

Morphological statistics

Stopword lists

Suffix lists

PoS

Conceptual Taxonomies

subject, verb, complement

subject, complement

ConNeKTion

Goal

Study, understanding and exploitation of the content of a collection / corpus / digital library

Easy exploration of its semantic content

Representation

Graph (Lexical taxonomy/ Ontology)

– Nodes = Terms / Concepts

– Arcs = Verbs (weighted)

Frequency (positive/negative)

ConNeKTion

(Core) pre-processing

Anaphora resolution

Syntactic analysis

Normalization

Tasks

Ontology Learning and Refinement

Association-based reasoning

Keyword Extraction

Information Retrieval

Author identification

...

ConNeKTion

Reasoning by association (BFS)

A breadth-first search is started from both nodes; each search probes the other's frontier, until the two frontiers meet.

The ratios of positive/negative instances over the total represent gradations of the action.

“the young looks television that talks about (and criticizes) facebook, because it typically does not help (rather distracts) schoolwork”.
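A minimal sketch of such a bidirectional breadth-first search over a toy concept graph built from the example above (the graph and its arcs are illustrative assumptions):

# Toy concept graph: node -> list of (neighbour, verb) arcs.
GRAPH = {
    "young":      [("television", "looks")],
    "television": [("young", "looks"), ("facebook", "talks_about")],
    "facebook":   [("television", "talks_about"), ("schoolwork", "distracts")],
    "schoolwork": [("facebook", "distracts")],
}

def meet(start, goal):
    """Grow one BFS frontier from each node until the two frontiers touch."""
    frontiers = {start: {start}, goal: {goal}}
    visited = {start: {start}, goal: {goal}}
    while frontiers[start] and frontiers[goal]:
        for side in (start, goal):
            nxt = set()
            for node in frontiers[side]:
                for neigh, _verb in GRAPH.get(node, []):
                    if neigh not in visited[side]:
                        visited[side].add(neigh)
                        nxt.add(neigh)
            frontiers[side] = nxt
            common = visited[start] & visited[goal]
            if common:
                return common.pop()    # a node where the two frontiers meet
    return None

print(meet("young", "schoolwork"))     # 'facebook'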

[Source: "Automatic Inductive and Analogical Reasoning on Concept Networks: a Forensic Application", candidate Fabio Leuzzi, tutor Prof. Stefano Ferilli]


ConNeKTion

Reasoning by association (prob)

Defined a formalism based on the ProbLog language: pᵢ :: fᵢ

– fᵢ: ground literal of the form link(subject, verb, complement)

– pᵢ: ratio between the sum of all examples for which fᵢ holds and the sum of all possible links between subject and complement

Real-world data are typically noisy and uncertain → need for strategies that soften the classical rigid logical reasoning


ConNeKTion GUI

[Screenshots of the ConNeKTion graphical user interface]


References

S. Ferilli, "Automatic Digital Document Processing and Management - Problems, Algorithms and Techniques", Springer, 2011
