1
Image Processing (for documents) Image Processing (for documents)
Image Representation – Pre-processing – Tasks – Image Representation – Pre-processing – Tasks –
Applications Applications
Laurea in Computer Science Laurea in Computer Science
MAGISTRALE MAGISTRALE
Corso di
ARTIFICIAL INTELLIGENCE Stefano Ferilli
Questi lucidi sono stati preparati per uso didattico. Essi contengono materiale originale di proprietà dell'Università degli Studi di Bari e/o figure di proprietà di altri autori, società e organizzazioni di cui e' riportato il riferimento. Tutto o parte del materiale può essere fotocopiato per uso personale o didattico ma non può essere distribuito per uso commerciale. Qualunque altro uso richiede una specifica autorizzazione da parte dell'Università degli Studi di Bari e degli altri autori coinvolti.
“A picture is worth a thousand words”
Unknown Author
Motivations
●
Images widespread and central in everyday life
●
Pictures
●
Videos
●
(Documents)
–
Documents as images
–
Images in documents
●
Images extremely information-dense objects
●
Human processing of images expression of intelligence
●
Perceptual level
Introduction
●
Image Processing
●
A traditional objective of AI
●
Aims:
–
Identifying and recognizing objects
–
Understanding scenes
●
Need for automatic processing of images
●
Problems
●
Huge amount of data to be handled
–
Manual processing infeasible
●
Semantics
–
Objects not explicit
–
3D world into 2D images
Types
●
2 approaches
●
Vector graphics
–
Based on geometric-mathematical graphic primitives
●
+ Scalability
●
+ Storage occupation
–Used for ‘artificial’ images
●
Raster graphics
–
Description based on a grid of colored dots
●
+ representation of halftones
–Used for ‘natural’ images
Formats
●
Vector graphics
●
SVG (Scalable Vector Graphics)
●
Raster graphics
●
BMP (BitMaP)
●
GIF (Graphics Interchange Format)
●
TIFF (Tagged Image File Format)
●
JPEG (Joint Picture Experts Group)
●
PNG (Portable Network Graphics)
●
FITS (Flexible Image Transport System)
●
...
Raster Images
●
Sampling grid
●
Bitmap
●
Pixel (Picture Element)
●
Density or Depth
●
bit/pixel
●
n bits 2
ncolors
●
Resolution
●
width * height * depth / 8
Color Spaces
●
Mathematical model for color representation
●
3 channels
–
(RYB)
–
Tricromy (RGB)
●
Additive
–
Quadricromy (CMYK)
●
Subtractive
–YUV
●
Gray-levels
–
HSV/HSB (Hue, Saturation, Value/Brightness)
●
Linear
–...
●
Transparency
●
Alpha channel
Color Representation
●
Color
●
1 byte/channel 256 levels/channel
●
3 bytes/pixel 17 777 216 colors (true color)
●
Gray Level
●
1 byte/pixel 256 levels
●
Average of R, G, B
●
Black&White
●
1 bit/pixel
●
(Alpha channel)
●
1 byte 256 levels
Our focus
●
Documents
●
Documents as images
●
Images in documents
●
2 types
●
Born-digital
–
Layout and content information partly explicit
●
Digitized
–
Documents themselves are images
●
All information must be extracted from images
Documents
●
Examples
Documents
●
Examples
Documents
●
Examples
Documents
●
Examples
Documents
●
Examples
Document Layout
●
Visual appearance of documents
●
Organization of document components
●
Very regular in most document types
●
Articles, Books, Bills, Letters, ...
–
Useful to approach automatic processing
●
Content-based
–
Indication for semantics
●
Not explicit
–
Processing needed to extract it
Layout Structure
●
Geometric/perceptive structure in which components are visually organized
●
Characterizes each document
●
Fundamental for comprehension
●
Defines a hierarchy of abstract representations of a document’s geometric organization
–
Leaves = basic blocks identifiable in the document
–
Internal nodes = various livels of compound components
–
Root = whole document
●
Sequence of pages
Layout Hierarchy
●
(multi-page) Document
●
Page
–
Frame
●
Block
–Text line
●
Word
●
Basic text block
–Graphical line
–Image
–Graphic
Layout Structure
●
Not all pages interesting for one’s goals
●
Document processing by single pages
–
Scope restriction
–
Computational savings
●
A page may group different visual components
●
Frames
–
Rectangular areas of interest
–
Each associated to a specific logical function
Layout Analysis
●
Process for identifying a document’s components and their hierarchy
●
Should result in a set of frames, each of which can be associated to a specific logical component
●
Layout types
●
Manhattan
●
non-Manhattan
Layout Analysis
●
Problems in digitized documents (images)
●
Rotation
●
Specks
●
Deformations
●
Unstructured
●
Computational complexity
●
(Black & White)
Pre-processing Tasks
●
Binarization
●
Skew correction
●
Dewarping
●
Color separation
●
Line identification
●
Object Identification
●
Optical Character Recognition (OCR)
Binarization
●
Steps
●
Transformation into gray scale
–
Luminance = average of chromatic components
●
Thresholding
–
Based on distribution of gray levels
●
Global: computed on the whole image
–E.g. Otsu
●
Local: based on a neighborhood of each pixel
–E.g. Niblack
Scaling
●
Scale reduction
●
Squares of nxn pixels collapsed in a single pixel
–
Black&White: majority of collapsed pixels
–
(Gray level / color: average of collapsed pixels)
●
Advantages
●
Removes insignificant details
–
Small specks
●
Reduces computational complexity
Deskewing
●
Skew angle
●
Non-horizontal orientation of text lines
–
Positive: clockwise
–
Negative: counterclockwise
●
Estimated through horizontal projection
–
Try various positive and negative angles
●
OK high peaks having similar size as characters
●
KO smooth shape with low peaks
–Possibly on selected
areas of the document
●
Correction:
opposite rotation
Deskewing
●
Problem: multiple orientations
●
Choice of dominant orientation
●
Separation of different zones
Dewarping
●
Some documents digitized through photography
–
Historical documents
●
Introduction of geometric distorsions
–
Waving
–
Perspective
●
2 approaches
●
Specialized hardware (expensive)
●
Generic hardware
–
Manipulation software
Dewarping
●
Grid straightening
●
Geometric approach
Dewarping
●
Grid straightening
●
Image
–
Original
–
Dewarped
●
OCR
–
Original
–
Dewarped
Spread Factor
●
Indicator of layout complexity
●
Average distance between regioni / average height of regions
–
>1 : simple structure
●
A few sparse regions
–<1 : complex strutture
●
Many dense regions
0 5 10 15 20 25 30 35 ...
60 50
0 0,5 1 1,5 2 2,5 3
Spread Factor Th
res ho
ld
s Ch
Cv Ca
Layout Analysis
●
Identification of frames
●
Rectangles enclosing portions of contents
●
Classification of methodologies
●
Direction in tree building
–
top-down
●
More specific
–bottom-up
●
More generic
–hybrid
Layout Analysis
●
Geometric Structure Analysis
●
Input
–
Vector representation of blocks that make up a document
●
Graphic or text elements
●
Output
–
High-level geometric structure
●
frames
–
Boxes that should represent consistent and significant block aggregations
XY-cut
●
Alternate and recursive Horizontal / Vertical cut of portions of a document
●
Based on projections
–
Horizontal + Cut
–
Vertical + Cut
XY-cut
●
Results
Run Length Smoothing
●
Run
●
Sequence of pixels of the same type delimited by pixels of opposite type
–
White/Black
●
Run Length
●
Number of pixels that make up a run
●
Run Length Smoothing
●
Removal of some runs
–
Shorter than a pre-defined threshold by transforming pixels in the opposite type
t = 3
Run Length Smoothing
●
RLSA
●
Horizontal smoothing with threshold t
h●
Vertical smoothing with threshold t
v●
AND
●
Horizontal smoothing
with threshold t
aRun Length Smoothing
●
RLSO (Run Length Smoothing with OR)
●
Horizontal smoothing with threshold t
h●
Vertical smoothing with threshold t
v●
OR
RLSA RLSO
Run Length Smoothing
●
RLSO
●
Advantages
–
Handles non- Manhattan layout
–
Less parameters
●
Lower thresholds
–Less steps
●
Efficiency
●
Can be applied to connected components
–
–
Different definition of run length
Run Length Smoothing
●
RLSO
●
Automatic assessment of thresholds
–
Histograms of runs
●
Cumulative
–Graphic slope
●
Derivative
–Problem
●
Documents with characters in very different sizes
Docstrum
●
Assessment of the document spectrum
●
Bottom-up
●
Works on connected components
–
Clustering
●
Based on k-Nearest Neighbors
–
Coordinates: Euclidean distance + Angle
●
Not sensitive to skew
–
Useful to assess it
CLiDE
●
Chemical Literature Data Extraction
●
Bottom-up
●
Works on connected components
–
Progressive grouping in a tree
●
Sensitive to skew
●
Some levels:
References
●
S. Ferilli
●