(1)

Summarization

Karola Klarl

Seminar in the course

Elaborazione di Linguaggio Naturale (Natural Language Processing)

(2)

Overview

• Introduction

• Steps of summarization

Extraction

Interpretation

Generation

• Evaluation

• Future

(3)

Why summarization?

• Informing

• Decision making

• Time saving

Introduction

(4)

What is a summary?

• A text produced from one or more texts

• that contains a significant portion of the information in the original text(s)

• and that is no longer than half of the original text

Introduction

(5)

Types of summaries

• Indicative vs. Informative

• Extract vs. Abstract

• Generic vs. Query-oriented

• Single-Doc vs. Multi-Doc

Introduction

(6)

Paradigms

Information Extraction / NLP

Top-down approach

Query-driven focus

Query-oriented summaries

Tries to «understand» the text

Needs rules for text analysis at all levels

Information Retrieval / Statistics

Bottom-up approach

Text-driven focus

Generic summaries

Operates at the lexical level

Needs large amounts of text

Introduction

(7)

Paradigms

Information Extraction / NLP

+ Higher quality

+ Supports abstracting

- Slow

- Needs to scale up to robust open-domain summarization

Information Retrieval / Statistics

+ Robust

- Lower quality

- Cannot manipulate information at abstract levels

Introduction

⇒ Combine the strengths of both paradigms

(8)

Overview

• Introduction

• Steps of summarization

Extraction

Interpretation

Generation

• Evaluation

• Future

(9)

Steps of summarization

• Extraction

⇒ Extracts

• (Filtering)

Only for Multi-Docs

• Interpretation

⇒ Templates (unreadable abstract representations)

• Generation

⇒ Abstracts

(10)

Extraction

• General procedure (a minimal sketch follows below)

Several independent modules

Each module assigns a score to each unit of the input

A combination module merges the scores into a single score

The system returns the n highest-scoring units

Extraction
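A minimal Python sketch of this procedure, assuming sentence-level units; the two scoring modules and their weights are illustrative stand-ins, not methods prescribed by the slides:

# Sketch of the generic extraction pipeline: independent modules each
# score every sentence, a combination module merges the scores, and
# the n highest-scoring sentences are returned in document order.

def position_score(sentences):
    # Illustrative module: earlier sentences score higher.
    n = len(sentences)
    return [(n - i) / n for i in range(n)]

def length_score(sentences):
    # Second illustrative module: mildly prefer longer sentences.
    longest = max(len(s.split()) for s in sentences)
    return [len(s.split()) / longest for s in sentences]

def combine(score_lists, weights):
    # Combination module: weighted sum of per-module scores per unit.
    return [sum(w * scores[i] for w, scores in zip(weights, score_lists))
            for i in range(len(score_lists[0]))]

def extract(sentences, n=2, weights=(0.7, 0.3)):
    combined = combine([position_score(sentences),
                        length_score(sentences)], weights)
    top = sorted(range(len(sentences)), key=lambda i: -combined[i])[:n]
    return [sentences[i] for i in sorted(top)]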

(11)

Methods

• Position‐based

• Cue‐Phrase

• Word‐Frequency

• Cohesion‐based

• Discourse‐based

• And many more …

Extraction

(12)

Position‐based methods

Lead method

Claim: important sentences occur at the beginning (or end) of texts

OPP (Optimum Position Policy)

Claim: important sentences are located at genre-dependent positions

These positions can be determined through training

Title-based method

Claim: words in titles are relevant to summarization (sketched below)

Extraction
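A sketch of the title-based variant in Python; the regex tokenizer and the toy stopword list are assumptions:

import re

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "why"}

def content_terms(text):
    # Lowercased word tokens minus stopwords.
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def title_score(sentence, title):
    # Fraction of the title's content terms that the sentence shares.
    title_terms = content_terms(title)
    if not title_terms:
        return 0.0
    return len(title_terms & content_terms(sentence)) / len(title_terms)

print(title_score("Summarization saves reading time.", "Why summarization?"))
# 1.0: the title's only content term appears in the sentence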

(13)

Cue‐Phrase method

• Claim 1: important sentences contain «bonus phrases»

(in this paper we show, significantly, in conclusion)

while unimportant sentences contain «stigma phrases»

(hardly, impossible)

• Claim 2: phrases can be detected automatically (a scoring sketch follows below)

Extraction
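A minimal cue-phrase scorer, using the slide's example phrases as the lists:

BONUS_PHRASES = ("in this paper we show", "significantly", "in conclusion")
STIGMA_PHRASES = ("hardly", "impossible")

def cue_score(sentence):
    # Bonus phrases raise the score, stigma phrases lower it.
    s = sentence.lower()
    return (sum(p in s for p in BONUS_PHRASES)
            - sum(p in s for p in STIGMA_PHRASES))

print(cue_score("In conclusion, the method works."))  # 1
print(cue_score("It is hardly ever applicable."))     # -1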

(14)

Word frequency‐based method

• Claim: important sentences contain words that occur frequently

⇒ Zipf's law distribution

• Its generality makes it attractive for further study (sketched below)

Extraction
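A Luhn-style sketch of frequency-based scoring; the tokenizer and the stopword list are toy assumptions:

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "that"}

def word_frequency_scores(sentences):
    def tokenize(s):
        return [w for w in re.findall(r"[a-z]+", s.lower())
                if w not in STOPWORDS]
    # Document-wide frequencies of content words.
    freq = Counter(w for s in sentences for w in tokenize(s))
    scores = []
    for s in sentences:
        words = tokenize(s)
        # Average frequency of the sentence's content words.
        scores.append(sum(freq[w] for w in words) / len(words) if words else 0.0)
    return scores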

(15)

Cohesion‐based methods

• Claim: important sentences are the most highly connected entities in semantic structures

• Classes of approaches (the first is sketched below)

Word co-occurrences

Local salience and grammatical relations

Co-reference

Lexical chains

Combinations

Extraction
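A sketch of the word co-occurrence variant: sentences that share content words are linked, and a sentence's score is its connectivity. The length-4 content-word proxy is an assumption:

import re

def cohesion_scores(sentences):
    # Crude content-word proxy: lowercase words of length >= 4.
    terms = [set(re.findall(r"[a-z]{4,}", s.lower())) for s in sentences]
    # Score = number of other sentences sharing at least one term.
    return [sum(1 for j, other in enumerate(terms)
                if j != i and terms[i] & other)
            for i in range(len(terms))]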

(16)

Discourse‐based method

• Claim: the coherence structure of a text can be constructed, and the «centrality» of units reflects their importance

• Coherence structure = tree-like representation (a toy sketch follows below)

Extraction
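A toy sketch of depth-based centrality on such a tree; the nested-list encoding and the 1/(depth+1) weighting are assumptions for illustration only:

def centrality(tree, depth=0, scores=None):
    # Leaves are integer text-unit ids; shallower units are more central.
    scores = {} if scores is None else scores
    if isinstance(tree, int):
        scores[tree] = 1 / (depth + 1)
    else:
        for child in tree:
            centrality(child, depth + 1, scores)
    return scores

# Unit 0 hangs directly under the root; units 2 and 3 are nested deeper.
print(centrality([0, [1, [2, 3]]]))  # {0: 0.5, 1: 0.33..., 2: 0.25, 3: 0.25}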

(17)

Extraction

• All methods seem to work

• No method performs as well as humans

• No obvious best strategy

Extraction

(18)

Interpretation

• Occurs at conceptual level

• Result = something new, not contained in  input

• Needs «world knowledge», separate from the input

Domain knowledge is really difficult to build

Little work so far

Interpretation

(19)

Interpretation

• Methods

Condensation operators

Topic signatures

And others …

Interpretation

(20)

Condensation operators

• Parse text

• Build a terminological representation

• Apply condensation operators

• Build hierarchy of topic descriptions

• So far, no parser/generator has been built!

Interpretation

(21)

Topic signatures

• Claim: topic identification can be approximated at the lexical level using automatically acquired «word families» (sketched below)

• A topic signature is defined by the frequency distribution of words related to a concept

• Inverse of query expansion in Information  Retrieval

Interpretation
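A sketch of building and applying a topic signature; the frequency-ratio weighting is a simplification (real systems use stronger statistics, e.g. likelihood ratios):

import re
from collections import Counter

def build_signature(on_topic_texts, off_topic_texts, size=10):
    def count(texts):
        return Counter(re.findall(r"[a-z]{4,}", " ".join(texts).lower()))
    on, off = count(on_topic_texts), count(off_topic_texts)
    # Keep words much more frequent on-topic than off-topic.
    weights = {w: c / (off[w] + 1) for w, c in on.items()}
    return dict(sorted(weights.items(), key=lambda kv: -kv[1])[:size])

def signature_score(sentence, signature):
    # A sentence's topic score is the sum of its signature-word weights.
    return sum(signature.get(w, 0.0)
               for w in re.findall(r"[a-z]{4,}", sentence.lower()))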

(22)

Generation

• Level 1: no separate generation

Produce extracts from input text

• Level 2: simple sentences

Assemble portions of extracted clauses together

• Level 3: full NL Generation

Sentence planner: plans content, length, theme, order, words, …

Surface realizer: linearizes the input grammatically

Generation

(23)

Overview

• Introduction

• Steps of summarization

Extraction

Interpretation

Generation

• Evaluation

• Future

(24)

Evaluation

• If you already have a summary

Compare the new one to it

Choose a granularity (clause, sentence, paragraph)

Measure the similarity of each unit in the new summary to the most similar units in the «gold standard»

Measure Precision and Recall (sketched below)

Evaluation
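A sketch of this comparison at sentence granularity, assuming the system and gold extracts are given as sets of sentence ids:

def precision_recall(system_ids, gold_ids):
    # Precision: fraction of selected units that are in the gold standard.
    # Recall: fraction of gold-standard units that were selected.
    system, gold = set(system_ids), set(gold_ids)
    hits = len(system & gold)
    precision = hits / len(system) if system else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

print(precision_recall({0, 3, 5}, {0, 5, 7, 9}))  # (0.666..., 0.5)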

(25)

Evaluation

• If you don’t have a summary

Compression ratio: CR = length(S) / length(T)

Retention ratio: RR = info(S) / info(T)

• RR is measured through Q&A games (both ratios are sketched below)

Shannon game: quantifies information content

Question game: tests the reader's understanding

Evaluation
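A trivial sketch of both ratios; the "information" counts for RR would come from the Q&A games above (e.g. how many questions remain answerable):

def compression_ratio(summary_text, source_text):
    # Length measured in words here; characters work as well.
    return len(summary_text.split()) / len(source_text.split())

def retention_ratio(info_in_summary, info_in_source):
    # info_* = e.g. number of questions answerable from each text.
    return info_in_summary / info_in_source

print(compression_ratio("A short summary.", "A much longer original text here."))  # 0.5
print(retention_ratio(4, 8))  # 0.5: half the questions are still answerable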

(26)

Overview

• Introduction

• Steps of summarization

Extraction

Interpretation

Generation

• Evaluation

• Future

(27)

What has to be done …

• Data preparation

Collect sets of texts and abstracts

Build corpora of <text, abstract, extract> triples

• Types of summaries

Determine characteristics for each type

• Extraction

New extraction methods

Heuristics for method combination

Future

(28)

What has to be done …

• Interpretation

Investigate types of fusion

Create collections of knowledge

Study the incorporation of the user's knowledge into interpretation

• Generation

Develop sentence-planner rules for dense packing of content into sentences

• Evaluation

Better evaluation metrics

Future

(29)

Thank you for your attention!

I hope my Italian was understandable!

(30)

References

Hovy, E. H. (2005). Automated Text Summarization. In R. Mitkov (Ed.), The Oxford Handbook of Computational Linguistics (pp. 583-598). Oxford: Oxford University Press.

Hovy, E. and Marcu, D. (1998). Automated Text Summarization Tutorial, COLING/ACL '98.
http://www.isi.edu/~marcu/acl-tutorial.ppt

Radev, D. R. (2004). Text Summarization Tutorial, ACM SIGIR.
http://www.summarization.com/sigirtutorial2004.ppt

http://en.wikipedia.org/wiki/Automatic_summarization
