Laurea in Computer Science Laurea in Computer Science MAGISTRALE MAGISTRALE

(1)

1

Image Processing (for documents) Image Processing (for documents)

Image Representation – Pre-processing – Tasks – Image Representation – Pre-processing – Tasks –

Applications Applications

Laurea in Computer Science Laurea in Computer Science

MAGISTRALE MAGISTRALE

Corso di

ARTIFICIAL INTELLIGENCE Stefano Ferilli

Questi lucidi sono stati preparati per uso didattico. Essi contengono materiale originale di proprietà dell'Università degli Studi di Bari e/o figure di proprietà di altri autori, società e organizzazioni di cui e' riportato il riferimento. Tutto o parte del materiale può essere fotocopiato per uso personale o didattico ma non può essere distribuito per uso commerciale. Qualunque altro uso richiede una specifica autorizzazione da parte dell'Università degli Studi di Bari e degli altri autori coinvolti.

“A picture is worth a thousand words”

Unknown Author

Motivations

●

Images widespread and central in everyday life

●

Pictures

●

Videos

●

(Documents)

–

Documents as images

–

Images in documents

●

Images extremely information-dense objects

●

Human processing of images expression of intelligence

●

Perceptual level

Introduction

●

Image Processing

●

A traditional objective of AI

●

Aims:

–

Identifying and recognizing objects

–

Understanding scenes

●

Need for automatic processing of images

●

Problems

●

Huge amount of data to be handled

–

Manual processing infeasible

●

Semantics

–

Objects not explicit

–

3D world into 2D images

Types

●

2 approaches

●

Vector graphics

–

Based on geometric-mathematical graphic primitives

●

+ Scalability

●

+ Storage occupation

–

Used for ‘artificial’ images

●

Raster graphics

–

Description based on a grid of colored dots

●

+ representation of halftones

–

Used for ‘natural’ images

Formats

●

Vector graphics

●

SVG (Scalable Vector Graphics)

●

Raster graphics

●

BMP (BitMaP)

●

GIF (Graphics Interchange Format)

●

TIFF (Tagged Image File Format)

●

JPEG (Joint Picture Experts Group)

●

PNG (Portable Network Graphics)

●

FITS (Flexible Image Transport System)

●

...

(2)

Raster Images

●

Sampling grid

●

Bitmap

●

Pixel (Picture Element)

●

Density or Depth

●

bit/pixel

●

n bits  2

ⁿ

colors

●

Resolution

●

width * height * depth / 8

Color Spaces

●

Mathematical model for color representation

●

3 channels

–

(RYB)

–

Tricromy (RGB)

●

Additive

–

Quadricromy (CMYK)

●

Subtractive

–

YUV

●

Gray-levels

–

HSV/HSB (Hue, Saturation, Value/Brightness)

●

Linear

–

...

●

Transparency

●

Alpha channel

Color Representation

●

Color

●

1 byte/channel  256 levels/channel

●

3 bytes/pixel  17 777 216 colors (true color)

●

Gray Level

●

1 byte/pixel  256 levels

●

Average of R, G, B

●

Black&White

●

1 bit/pixel

●

(Alpha channel)

●

1 byte  256 levels

Our focus

●

Documents

●

Documents as images

●

Images in documents

●

2 types

●

Born-digital

–

Layout and content information partly explicit

●

Digitized

–

Documents themselves are images

●

All information must be extracted from images

Documents

●

Examples

Documents

●

Examples

(3)

Documents

●

Examples

Documents

●

Examples

Documents

●

Examples

Document Layout

●

Visual appearance of documents

●

Organization of document components

●

Very regular in most document types

●

Articles, Books, Bills, Letters, ...

–

Useful to approach automatic processing

●

Content-based

–

Indication for semantics

●

Not explicit

–

Processing needed to extract it

Layout Structure

●

Geometric/perceptive structure in which components are visually organized

●

Characterizes each document

●

Fundamental for comprehension

●

Defines a hierarchy of abstract representations of a document’s geometric organization

–

Leaves = basic blocks identifiable in the document

–

Internal nodes = various livels of compound components

–

Root = whole document

●

Sequence of pages

Layout Hierarchy

●

(multi-page) Document

●

Page

–

Frame

●

Block

–

Text line

●

Word

●

Basic text block

–

Graphical line

–

Image

–

Graphic

(4)

Layout Structure

●

Not all pages interesting for one’s goals

●

Document processing by single pages

–

Scope restriction

–

Computational savings

●

A page may group different visual components

●

Frames

–

Rectangular areas of interest

–

Each associated to a specific logical function

Layout Analysis

●

Process for identifying a document’s components and their hierarchy

●

Should result in a set of frames, each of which can be associated to a specific logical component

●

Layout types

●

Manhattan

●

non-Manhattan

Layout Analysis

●

Problems in digitized documents (images)

●

Rotation

●

Specks

●

Deformations

●

Unstructured

●

Computational complexity

●

(Black & White)

Pre-processing Tasks

●

Binarization

●

Skew correction

●

Dewarping

●

Color separation

●

Line identification

●

Object Identification

●

Optical Character Recognition (OCR)

Binarization

●

Steps

●

Transformation into gray scale

–

Luminance = average of chromatic components

●

Thresholding

–

Based on distribution of gray levels

●

Global: computed on the whole image

–

E.g. Otsu

●

Local: based on a neighborhood of each pixel

–

E.g. Niblack

Scaling

●

Scale reduction

●

Squares of nxn pixels collapsed in a single pixel

–

Black&White: majority of collapsed pixels

–

(Gray level / color: average of collapsed pixels)

●

Advantages

●

Removes insignificant details

–

Small specks

●

Reduces computational complexity

(5)

Deskewing

●

Skew angle

●

Non-horizontal orientation of text lines

–

Positive: clockwise

–

Negative: counterclockwise

●

Estimated through horizontal projection

–

Try various positive and negative angles

●

OK high peaks having similar size as characters

●

KO smooth shape with low peaks

–

Possibly on selected

areas of the document

●

Correction:

opposite rotation

Deskewing

●

Problem: multiple orientations

●

Choice of dominant orientation

●

Separation of different zones

Dewarping

●

Some documents digitized through photography

–

Historical documents

●

Introduction of geometric distorsions

–

Waving

–

Perspective

●

2 approaches

●

Specialized hardware (expensive)

●

Generic hardware

–

Manipulation software

Dewarping

●

Grid straightening

●

Geometric approach

Dewarping

●

Grid straightening

●

Image

–

Original

–

Dewarped

●

OCR

–

Original

–

Dewarped

Spread Factor

●

Indicator of layout complexity

●

Average distance between regioni / average height of regions

–

>1 : simple structure

●

A few sparse regions

–

<1 : complex strutture

●

Many dense regions

0 5 10 15 20 25 30 35 ...

60 50

0 0,5 1 1,5 2 2,5 3

Spread Factor Th

res ho

ld

s Ch

C_v C_a

(6)

Layout Analysis

●

Identification of frames

●

Rectangles enclosing portions of contents

●

Classification of methodologies

●

Direction in tree building

–

top-down

●

More specific

–

bottom-up

●

More generic

–

hybrid

Layout Analysis

●

Geometric Structure Analysis

●

Input

–

Vector representation of blocks that make up a document

●

Graphic or text elements

●

Output

–

High-level geometric structure

●

frames

–

Boxes that should represent consistent and significant block aggregations

XY-cut

●

Alternate and recursive Horizontal / Vertical cut of portions of a document

●

Based on projections

–

Horizontal + Cut

–

Vertical + Cut

XY-cut

●

Results

Run Length Smoothing

●

Run

●

Sequence of pixels of the same type delimited by pixels of opposite type

–

White/Black

●

Run Length

●

Number of pixels that make up a run

●

Run Length Smoothing

●

Removal of some runs

–

Shorter than a pre-defined threshold by transforming pixels in the opposite type

t = 3

Run Length Smoothing

●

RLSA

●

Horizontal smoothing with threshold t

_h

●

Vertical smoothing with threshold t

_v

●

AND

●

Horizontal smoothing

with threshold t

_a

(7)

Run Length Smoothing

●

RLSO (Run Length Smoothing with OR)

●

Horizontal smoothing with threshold t

_h

●

Vertical smoothing with threshold t

_v

●

OR

RLSA RLSO

Run Length Smoothing

●

RLSO

●

Advantages

–

Handles non- Manhattan layout

–

Less parameters

●

Lower thresholds

–

Less steps

●

Efficiency

●

Can be applied to connected components

–

PDF

–

Different definition of run length

Run Length Smoothing

●

RLSO

●

Automatic assessment of thresholds

–

Histograms of runs

●

Cumulative

–

Graphic slope

●

Derivative

–

Problem

●

Documents with characters in very different sizes

Docstrum

●

Assessment of the document spectrum

●

Bottom-up

●

Works on connected components

–

Clustering

●

Based on k-Nearest Neighbors

–

Coordinates: Euclidean distance + Angle

●

Not sensitive to skew

–

Useful to assess it

CLiDE

●

Chemical Literature Data Extraction

●

Bottom-up

●

Works on connected components

–

Progressive grouping in a tree

●

Sensitive to skew

●

Some levels:

References

●

S. Ferilli

●