On an average application 30% of CPU time is spent on sorting data

(1)

Classification

Paolo Camurati and Stefano Quer

Dipartimento di Automatica e Informatica Politecnico di Torino

(2)

On the importance of sorting



On an average application 30% of CPU time is spent on sorting data



Example

 CPU scheduling

 Processes pi with duration

● p₀ 21

● p₁ 3

● p₂ 1

● p₃ 2

 Impact of sorting on average wait time

(3)

On the importance of sorting

 average wait time (0+21+24+25)/4 =17.5

0 3 5 26 27

p₀

p₁ p₃ p₂

0 3 6 27

p₀ p₁

p₂p₃

1

0 21 24 25 27

p₁ p₂ p₃ p₀

 p₀-p₁-p₂-p₃

 p₁-p₃-p₀-p₂

 p₂-p₃-p₁-p₀

Sorted

(4)

Sorting applications



Trivial applications

 Sorting a list of names, organizing an MP3 library, displaying Google PageRank results, etc.



Simple problems if data are sorted

 Find the median, binary search in a database, find duplicates in a mailing list, etc.



Non trivial applications

 Data compression, computer graphics (e.g., convex hull), computational biology, etc.

(5)

Definitions



Sorting

 Input

 Symbols belonging to a set having an order relation

 a₁, a₂, …, a_n

 Output

 Permutation of the input symbols

● a₁’, a₂’, …, a_n’

 Such that the order relation

● a₁’ ≤ a₂’ ≤ …≤ a_n’

holds

(6)

Definitions



Order relation

≤

 Binary relation between elements of a set A satisfying the following properties

 Reflexivity

● ∀ x ∈ A  x ≤ x

 Antisymmetry

● ∀ x, y ∈ A  x ≤ y ∧ y ≤ x  x = y

 Transitivity

● ∀ x, y, z ∈ A  x ≤ y ∧ y ≤ z  x ≤ z



A is a partially ordered set (poset)



If relation

≤

holds

∀

x, y

∈

A, A is totally ordered

set

(7)

Classification



Internal sorting

 Data are in main memory

 Direct access to elements



External sorting

 Data are on mass memory

 Completely or at least partially

 Sequential access to elements

(8)

Classification



In place sorting

 n data in array plus a constant number of auxiliary memory locations



Stable sorting

 For data with duplicated keys the relative ordering is unchanged

 Example

 Record with 2 keys

● Name (key is a string)

● Group (key is an integer)

(9)

Stability: An example

Chiara 3 Barbara 4 Andrea 3 Roberto 2 Giada 4 Franco 1 Lucia 3 Fabio 3

Unsorted data

Second sorting according to group

NON stable algorithm

Second sorting according to

group

stable algorithm First sorting

according to string name

Andrea 3 Barbara 4 Chiara 3 Fabio 3 Franco 1 Giada 4 Lucia 3 Roberto 2

Franco 1 Roberto 2 Chiara 3 Fabio 3 Andrea 3 Lucia 3 Giada 4 Barbara 4

Franco 1 Roberto 2 Andrea 3 Chiara 3 Fabio 3 Lucia 3 Barbara 4 Giada 4

(10)

Classification



Complexity

 Complexity can be computed in terms of overall number of steps, each one with a constant cost

 A more detailed analysis is also possible, computing the total number of

 Comparisons

 Exchange operations

 When data is large, exchanging them may be expensive and may be better to have more comparisons and less exchange operations

 Asymptotic complexity however does not change

(11)

Classification

 O(n²)

 Simple, iterative, based on comparison

 Insertion sort, Selection sort, Exchange (Bubble) sort

 O(n^3/2)

 Shellsort (with certain sequences)

 O(n · log n)

 More complex, recursive, based on comparison

 Merge sort, Quicksort, Heapsort

 O(n)

 Applicable with restrictions on data, based on computation

 Counting sort, Radix sort, Bin/Bucket sort

(12)

A lower bound for the complexity



Algorithms

based on comparison

 Elementary operation

 Comparison

● a_i : a_j

 Outcome

 Decision

● a_i>a_j or a_i≤a_j

 Decisions organized as a decision tree

(13)

 Sort array of 3 distinct elements a

₁

, a

₂

, a

₃

a₁:a₂

a₂:a₃ a₁:a₃

a₁<a₂<a₃ a₁:a₃ a₂:a₃

a₃<a₁<a₂ a₃<a₂<a₁

a₂<a₁<a₃

a₁<a₃<a₂ a₂<a₃<a₁

>

<

A lower bound for the complexity

n = 3

for the sake of simplicity

(14)

 For n distinct integers

 The number of possible sortings equals the number of permutations, i.e., is n!

 Each solution

 Sits on a tree leaf

 Complexity

 Number h of comparisons, that is, the tree height h

 For a complete tree

 The number of leaves is 2^h

a₁:a₂

a₂:a₃ a₁:a₃

a₁<a₂<a₃ a₁:a₃ a₂:a₃

a₃<a₁<a₂ a₃<a₂<a₁ a₂<a₁<a₃

a₁<a₃<a₂ a₂<a₃<a₁

>

<

A lower bound for the complexity

(15)

 Then we must have

 2^h ≥ n!

The tree has to include (generate) all possibile sorting

on the leaves

a₁:a₂

a₂:a₃ a₁:a₃

a₁<a₂<a₃ a₁:a₃ a₂:a₃

a₃<a₁<a₂ a₃<a₂<a₁ a₂<a₁<a₃

a₁<a₃<a₂ a₂<a₃<a₁

>

<

A lower bound for the complexity

(16)

 Then we have

 2^h ≥ n!

 2^h ≥ n! > (n/e)ⁿ

 2^h > (n/e)ⁿ

 lg 2^h>lg (n/e)ⁿ

 h > n · lg (n/e)

 h > n (lg n - lg e) = Ω(n lg n)

a₁:a₂

a₂:a₃ a₁:a₃

a₁<a₂<a₃ a₁:a₃ a₂:a₃

a₃<a₁<a₂ a₃<a₂<a₁ a₂<a₁<a₃

a₁<a₃<a₂ a₂<a₃<a₁

>

<

Stirling’s approximation

n! > (n/e)n

We compute the log of both members

A lower bound for the complexity

Log property