Architectures

Paolo Campegiani
Contents

1 Introduction
2 Virtualization techniques
2.1 A general definition of virtualization
2.2 Virtualization at the operating system level
2.3 Virtualization techniques
2.3.1 Binary translation
2.3.2 Para-virtualization
2.3.3 Hardware assisted virtualization
2.3.4 Lightweight virtualization
2.4 VMM implementations
2.4.1 QEMU
2.4.2 VMWare
2.4.3 Xen
2.4.4 Hardware assisted virtualization
2.4.5 Lightweight virtualization
2.4.6 Other VMMs
2.5 Hardware virtualization
2.5.1 Processor
2.5.2 Memory and DMA
2.5.3 Storage
2.5.4 Network
2.6 Concluding remarks
3 Virtualization architectures
3.1 Reference architecture
3.1.1 Modeling of multi-tier systems
3.2 Virtualization performances and measurement
3.3 Autonomic computing
3.3.1 Self-optimization
3.3.2 Proposed extension to the model
4 The mapping problem
4.1 Problem formalization
4.1.1 Discussion of formalization
4.3 Computational complexity of the mapping problem
4.4 Optimal solution of the mapping problem
4.5 Approximate solutions for the mapping problem
4.5.1 A packing oriented heuristic
4.5.2 A genetic algorithm
5 Simulations results
5.1 Implementation of the bin packing heuristics
5.2 Implementation of the genetic algorithm
5.3 Models dataset
List of Figures

2.1 VMM architecture.
2.2 Xen architecture.
2.3 Memory virtualization data path.
2.4 Xen architecture for shared network devices.
2.5 CDNA architecture.
3.1 A virtualization architecture.
3.2 A multi-tier distributed system.
3.3 The QoS Controller.
4.1 A partial decision tree for a MMMKP problem. In the double checked leaf, the set decisional variables are x^1_11 = 2, x^1_21 = 1.
4.2 An individual for a problem with 3 groups. Each X marks a variable set to 1.
4.3 Fixing the first group with different individuals.
4.4 Fixing the second group by generating two different individuals.
4.5 Fixing the second and third group by generating all possible feasible individuals.
5.1 Average fitness of population for the third model.
List of Tables

3.1 Measurements for estimation of VMM overhead.
5.1 First model SLAs and profits.
5.2 Second model SLAs and profits.
5.3 Physical hosts characterizations for all models.
5.4 Comparisons of profits for the optimal solution and the approximate solution for the first model, for different values of C.
5.5 Second model: range of profit and approximate solution profit, for different values of C.
5.6 Third model SLAs and profits.
5.8 Initial and final fitness of best individual for the third model.
I'd like to thank the many people that helped me all along the pursuing of my Ph.D.

Professor Salvatore Tucci was my tutor during these years, and allowed me to freely explore my interests. He says that the best way to do research is by finding a topic of interest and developing it, and I could say the best way to understand this vision is to apply it to everyday's work.

Professor Franco Maceri allowed me to run the simulations in the scientific lab of the Dipartimento di Ingegneria Civile, which it happens I also manage as system administrator, hoping not to cause too many long downtimes. Facing high computational problems from the system's point of view has developed in me some [...] me in charge of designing and administering their network infrastructure, an exciting challenge that he generously allows me to carry on without interrupting my Ph.D. work.

Professor Francesco Lo Presti is the one who most encouraged me and followed my progress (or lack thereof), really helping me in finalizing this work.

I'd also like to thank the people and staff of the Dipartimento di Ingegneria Civile and of Scuola IaD for their kindness, and especially Eusebio Giandomenico and Marco Orazi for their support.
Introduction

As many other technologies and paradigms in the computer science field, virtualization has a long history, with periods of great momentum and periods when it has been put in the background. The advent of massive and economic computer power, as predicted by Moore's Law, has finally resulted in the availability of system level virtualization technologies on commodity hardware. This will lead to a completely new class of problems, from provisioning to deployment, that arise when virtualization is intended as an architectural asset that could bring value to services.

In this dissertation, I investigate a problem that could be expressed as: given a number of virtual machines and some physical machines, each described respectively by a demand vector and a resource vector, which is the best allocation of the former to the latter, for a given metric?

For a large data center, an example of metric would be to minimize the number of physical machines devoted to host virtual machines, while giving sufficient resources to each virtual machine. By minimizing this number, the data center could increase the efficiency of the physical machines, and the underlying virtualization technology will prevent each virtual machine from interfering with the others, from both a performance and a security perspective.

It will be shown that this problem, stated in its most general form, is NP-hard (it's a generalization of the classical 0/1 knapsack problem), and its complexity is daunting, requiring to define an heuristic to find an approximate solution. We will also present a genetic algorithm that appears promising in tackling this problem.
The dissertation is organized as follows.

In Chapter 2, virtualization techniques are presented, broadly classifying them in two categories: techniques that do not rely on hardware features to support virtualization, and techniques that leverage on them.

In Chapter 3, we move from a technology point of view towards an architecture centric one, analyzing the performance problems that virtualization faces. We put virtualization as an asset of multi-tier distributed systems, and we describe it as a fundamental block for autonomic computing. Current works in this field lack some degree of generality, and we extend the currently available frameworks.

In Chapter 4, the mapping problem is formally defined; we analyze its computational complexity, and develop some heuristics [44] to solve it, comparing them to a genetic algorithm [45] we also propose. In this chapter, we see that the mapping problem is a generalization of the knapsack problem, and we briefly analyze the scarce literature on generalizations of knapsack problems.

In Chapter 5, we show simulation results for some interesting datasets.

Chapter 6 ends this dissertation, briefly recalling the results.
Virtualization techniques

2.1 A general definition of virtualization
Virtualization could be defined as a two phase process. In the first phase, some resources of the same kind will be grouped together, hiding physical boundaries; in the second phase, a portion of the aggregated resources is presented to an user. There are many types of virtualization, depending on the type of the aggregated resource:

• Virtual LAN (VLAN): a VLAN [23] is a set of hosts that communicate as if they were on the same wire, regardless of their physical location. Even when the hosts are on different physical segments of the same LAN, the configuration made on network devices like switches and routers allows the hosts to share the same virtual segment, so broadcast packets are forwarded only on the VLAN. This will increase security, by preventing unauthorized hosts from connecting to the virtual segment, and allows for the definition of per-segment Quality of Service policies;

• Storage Virtualization: a bunch of storage resources (disks or tapes) are grouped together, and the access to them is selectively defined by a management function. In a Network Attached Storage (NAS) environment, and more so in a Storage Area Network (SAN) [55], it is possible to carve out some resources and allow one or more hosts to access them. As a result, the hosts are computing nodes that are attached to the data. This allows for better and cheaper data consolidation, backup and security;
• ... applications, running on Java or Flash. As an example, when the user downloads a Java applet via the browser, the applet is executed in the context of a Java Runtime Environment (JRE) [74]. The JRE virtualizes the computing resources to the applet, in the sense that the applet is written in a so called bytecode, a machine language that the JRE translates into real operations for the underlying target processor. As a result, the same applet could be executed over different processor architectures, as long as a JRE is provided. Besides this, the JRE defines a sandbox that enforces some security constraints, like the fact that an applet cannot access system files on the target machine.

All these examples, in no way exhaustive, show some of the benefits of virtualization. By adding an intermediate layer between physical resource and resource demand, it's possible to multiplex, demultiplex and route requests to a single management point, achieving better scalability, manageability and performances.
2.2 Virtualization at the operating system level

Virtualization at the operating system level has been implemented for the first time on the IBM S/360 system [8]. In a fundamental article on virtualization, Popek and Goldberg [86] defined the formal requirements for a virtualization architecture. We will base our exploration and taxonomy of virtualization techniques on that paper, so it's worthy to recap it.

First, the concept of Virtual Machine Monitor (VMM) is defined as a layer that separates the Virtual Machine (VM) - that is, the operating system to be executed - from the underlying hardware, as shown in figure 2.1.

Some of the instructions of the VM could trap, which in the original article is defined as saving the program counter to a specified location and then jumping to the address contained in another location, where a trap routine is to be executed, with the machine registers saved. The trap routine will do its own job, then it restores the registers and returns control to the address saved in the first place. It's possible to define non blocking trap routines. This mechanism is the precursor of today's system calls.
Figure 2.1: VMM architecture (virtual machines 1 to N running side by side on top of the VMM, which runs on the hardware).
Traps are instrumental to classify instructions in three different groups:

• privileged instructions: the instructions that cause a trap;

• sensitive instructions: they come in two different types, control sensitive instructions and behavior sensitive instructions. To define them in terms of current architectures, we define these instructions as the ones which change the processor mode (or return it) or whose execution depends on the real memory address of their operands;

• innocuous instructions: the instructions that are neither privileged nor sensitive.

The VMM should have some properties to allow for the execution of a VM on top of it:

• Efficiency: every instruction that is innocuous is executed directly by the underlying hardware, with no intervention of the VMM;

• Resource control: the VM cannot change its resources quota: every request for more resources is mediated by the VMM;

• Equivalence: every program executed in the context of a VM performs in an almost indistinguishable manner, as if it were executed without a VMM interposing between the VM and the hardware. In this context, almost indistinguishable means that a certain degree of deviation is allowed, as performances may be a bit worse and resources availability could be not identical (because the VM cannot access directly the hardware).

The work of Popek and Goldberg is fundamental as they prove the following theorem: a VMM may be constructed if the set of sensitive instructions is a subset of the privileged instructions.
The theorem still holds for current architectures, and we use it as a criterion to discriminate between virtualizable processor architectures (so called virtualization friendly) and the not virtualizable ones: a processor architecture is virtualizable if and only if the execution of every sensitive instruction eventually results in a trap, as the trap routine could be implemented by the VMM.

It will be shown later that, surprisingly, the Intel x86 architecture is not virtualization friendly.
2.3 Virtualization techniques

In spite of the unifying definition stated above, there are some different ways to virtualize an operating system. Broadly speaking, there is a tradeoff between the resulting performances and the spectrum of processor architectures that could be virtualized: to achieve speed it's usually necessary to focus on a specific instruction set, presenting the virtual machine a more generalized and less customizable abstraction of the physical hardware. We identify four different virtualization techniques.
2.3.1 Binary translation

In this approach, a software layer translates operations from the virtual machine set to the physical machine set, allowing for code optimization and translation cache efficiency. The virtualization layer could do a so called cross virtualization, where the virtual machine instruction set and the physical machine instruction set are completely different - requiring to completely translate the former into the latter - or a partial virtualization, where innocuous instructions are executed directly by the hardware (in a context set up by the VMM) and critical ones are translated by the VMM, that operates as a resources' broker.
2.3.2 Para-virtualization

The operating system of the virtual machine is modified in such a way that every system call that should have accessed the hardware is instead mapped into a system call executed by and in the context of the VMM. The modification of the to-be-virtualized operating system could be unfeasible when it's released only in binary form.
2.3.3 Hardware assisted virtualization

The instruction set has been augmented with operations that encompass portions of machine code. These sections are executed in a virtual machine context, which is different from the physical machine context. The VMM has some degree of control over the operations made by a specific virtual machine, ranging from a no trust relationship (every I/O operation performed by the virtual machine is trapped and results in the execution of the VMM, that operates as a control interface) to a total trust relationship, where the virtual machine could directly access every hardware in the system. The latter results in increased speed and diminished security.
2.3.4 Lightweight virtualization

The operating system of the physical machine is changed to allow different and not-communicating namespaces for the different resource classes. As a result, there are some zones (to use a typical terminology) and each one has its own file system, users, processes namespace and hardware view. It could be argued that this approach is not a virtualization proper; still, it solves some problems that otherwise require a traditional virtualization technique, while experiencing nearly no performance penalties.
2.4 VMM implementations

A number of competing products, both open and closed source, are available as VMMs. In this section we see the most representative of them, focusing on the adopted virtualization technique.
2.4.1 QEMU

QEMU [16], written by Fabrice Bellard, is an open source machine emulator and virtualizer. It could operate as a virtualizer, when the virtual machine instruction set and the physical machine instruction set are the same, or as an emulator, capable of translating the instruction sets of seven different processor architectures to some target architecture, plus virtualizing system hardware to allow for a complete operating system virtualization.

QEMU is a dynamic translator, i.e. the translated code is stored in a translation cache where it could be reused to increase performance. The following example shows the general approach for binary translation.
Consider the following PowerPC instruction:

    addi r1,r1,-16    # r1 = r1 - 16

that must be translated into Intel x86 code. First, some micro operations will be generated, that are independent of the final target:

    movl_T0_r1        # T0 = r1
    addl_T0_im -16    # T0 = T0 - 16
    movl_r1_T0        # r1 = T0

The T0 and T1 registers are typically stored in host registers due to the optimization made by the GCC compiler. The first of the micro operations is typically coded as:

    void op_movl_T0_r1(void) {
        T0 = env->regs[1];
    }

where env is the structure containing the CPU state of the virtual machine.

The generated code is then translated into physical machine code by the GCC compiler, and the result will be (for an Intel x86 target):

    # movl_T0_r1 : ebx = env->regs[1]
    mov 0x4(%ebp), %ebx
    # addl_T0_im -16 : ebx = ebx - 16
    add $0xfffffff0, %ebx
    # movl_r1_T0 : env->regs[1] = ebx
    mov %ebx, 0x4(%ebp)

QEMU is a dynamic translator as it uses a 16 MByte cache that holds the most recently used translation blocks (TB). After the execution of every TB, the next instruction to be executed will be determined by examining the state of the emulated CPU; if the jump point is in the cache, the code is executed directly, otherwise the translation process takes place. A TB could be patched directly to the logical following one when the jump destination is known.
More complex problems arise with self-modifying code, as the applications written for the Intel x86 architecture do not signal the cache invalidation that could trigger the removal of a stale TB.

Running a single application written for a different processor architecture is one thing, but an entire operating system requires the virtualization of the hardware. QEMU allows for a limited set of virtualized hardware. It's possible to have up to two EIDE hard disks, a basic VGA video card, one or more Fast Ethernet NICs; it's also possible to connect directly the USB subsystem of the virtual machine to the physical USB subsystem.
The virtualized hard disks are mapped as files on the physical machine. This will result in a significant performance loss, as every I/O request made from the virtualized machine will traverse the virtualized operating system stack, resulting in a sequence of I/O operations intertwined with virtualized OS operations, and each I/O operation will ultimately result in an I/O operation made on the image file on the physical system, which requires traversing again the stack of an operating system, in this case the physical one. The final result is that the data path is doubled. QEMU has some flexibility in the image file format, and it's possible to have a copy-on-write format file, but this architecture won't help for performances.

The network card emulation has some interesting features. Each virtual machine could have one or more NICs, and these could be completely hidden from the rest of the world, bridged on the physical LAN, or even put on a UDP multicast network that could span several physical machines.
2.4.2 VMWare

VMWare [24] is the market leader in the virtualization field, thanks to its performances and management tools. Products from VMWare range from VMWare Player, that is only capable of running a virtual machine, to the VMWare Infrastructure suite, that has the ability to manage resources allocation, performing live backup of running virtual machines and moving them from a physical machine to another with very little service interruption.

VMWare reacted to the introduction of the open source Xen hypervisor (discussed below) by releasing its VMWare Server free but closed source, to gain and maintain market share at the expenses of the newcomer. Unfortunately, the license of VMWare Server dictates that benchmarks are possible only when the methodology has been approved by VMWare Inc., and as a result of this there are very few scientific papers on the internals of this VMM.

[...] has overall better performance.
To achieve maximum speed, it's imperative that, as stated in [86], most of the code is executed directly by the underlying physical processor, but this is impossible with the Intel x86 processor architecture, as there are instructions that are sensitive but not trappable. As an example, the Current Privilege Level (CPL) could be obtained by reading the low two bits of the code segment selector register (%cs), and the popf instruction (pop flags) executed by a privileged process could modify the IF flag that governs the interrupt delivery, an operation that an unprivileged guest cannot do [92]. As a result, it's necessary to have a binary translator that, for such virtualization unfriendly operations, simulates their execution in a virtual context. The translator adopted by VMWare is:

• Binary: its input is Intel x86 code;

• Dynamic and on demand: translation happens at runtime, and only when code is about to be executed;

• System level: there are no assumptions about the guest code; the ABI is the x86 Instruction Set Architecture (ISA);

• Subsetting: its output is a safe subset of its input (mostly user-mode instructions);

• Adaptive: translated code is adjusted in response to guest behavior changes, to improve efficiency.
The last property is worth noting. When a CPU encounters a trap for a privileged instruction, it has to jump to a trap routine (typically an operating system entry point) to deal with it, and this could be expensive. A binary translator could avoid it, by replacing the original code with a routine (that, being executed by a program, is in user mode and not in kernel mode). As an example, the rdtsc instruction for the Intel Pentium architecture takes 2030 cycles for a classical trap and emulate execution, and only 216 for the binary translation. This leaves a minor part of the sensitive instructions to deal with, as loads and stores could access sensitive data such as page tables. The adopted approach is that such an instruction is translated identically (i.e., not translated) and executed by the physical processor. If a trap happens, next time the same instruction will be re-translated to avoid the trap, maybe invoking an interpreter.
VMWare has put a lot of effort in the management and configuration tools, both for a single system and for an entire data center. It's possible to define an arbitrary number of virtualized peripherals, including storage systems, network cards, video cards and USB devices. Hard disks can be mapped into a file image, a disk partition or an iSCSI target [13] to achieve maximum performances. The network could be configured to give a virtual machine a host-only network (i.e., it communicates only with the physical machine it's instantiated on), a NAT network (where the physical machine acts as a Network Address Translator), or a unique, externally accessible IP address.

The real value of the VMWare suites comes with the VMWare Infrastructure, that allows for a central administration of hundreds of virtual machines, over dozens of different physical machines, allowing for load balancing, high availability and live migration (moving a virtual machine from one physical node to another [82]) with little service disruption.
2.4.3 Xen

Xen [25] was originally developed at the University of Cambridge Computer Lab [26] as a framework [...]
Figure 2.2: Xen architecture (dom0, with its control interface and Xen-aware device drivers, runs alongside the guest OSes and their user software; Xen exposes virtual CPU, virtual physical memory, virtual network and virtual block device interfaces over the hardware).
A company (XenSource) was founded to gain paying customers for the management tools (the hypervisor itself is released under the GNU General Public License); later the company has been acquired by Citrix.
In the Xen language, both physical and virtual operating systems are called domains, with dom0 indicating the privileged domain that hosts the control software, and domU the unprivileged domains, i.e. the virtual machines. Figure 2.2 shows the Xen architecture [36].

Xen adopts the para-virtualization approach, borrowed from the Denali system [102]: the application ABI remains unchanged, while the guest operating system is ported to an interface made of hypercalls.
An hypercall is essentially a way to control interactions between a virtual machine operating system and the physical machine operating system. The hypercall interface allows domains to perform a synchronous software trap to perform a privileged operation, analogous to the system calls found in the operating system. Data transfers are managed via I/O rings, essentially a producer-consumer buffer of I/O descriptors, with a general interface that could be used for almost every kind of I/O device interaction.
CPU scheduling between different domains is made with three different schedulers as of Xen 3.0: the Borrowed Virtual Time (BVT) scheduling algorithm [59], that is work-conserving and capable of a low latency wake up when a domain receives an event; the Simple Earliest Deadline First (sEDF), that could be both work-conserving and not work-conserving, but lacks a global load balancing between different CPUs; the Credit Scheduler, that also does global load balancing although it is not preemptive, and has a scheduling period hard-coded at 30 ms [47].
Network interfaces are quite complex [28]: the foundation of the architecture is a Virtual Firewall Router (VFR), with each [...] NIC, but the administration of the VFR could be challenging.

Storage systems for the domU are modeled as Virtual Block Devices (VBD): the dom0 could map them into files, partitions or LUNs. It's also possible to blacklist a PCI device for the dom0, leaving it in the exclusive access of one or more domU.
Xen has the ability to perform a live migration, with very little QoS loss [49]. In [97] an architecture is exposed that allows for migration over a MAN/WAN, at the expense of having a dedicated communication circuit. In [43] an extension is shown that also allows for migration of the local file system (hypervisors assume that the local file system could also be accessed from the destination physical machine, requiring a NAS or SAN infrastructure).

Checkpointing, as the ability to save and restore often from a saved image that contains also the persistent state, is under development, allowing for a global checkpointing of an entire cluster of virtualized machines [54].
Xen performances are of the utmost interest, as the paravirtualization has a very low impact, at the cost of requiring to change both the dom0 and the domU operating systems. This is infeasible for operating systems released only in binary form.

Xen has found its way into the mainline Linux kernel, after some time when its integration with the operating system was the premiere feature of enterprise oriented Linux distributions such as Red Hat RHEL and Novell SuSE server.
2.4.4 Hardware assisted virtualization

The Intel x86 architecture is not a virtualization-friendly one. As a result, until some years ago the only available hypervisors were binary translators (as VMWare) or para-virtualizers (as Xen). In 2006, Intel has announced the VT-x architecture for hardware assisted virtualization for the x86 processor family, and the VT-i for the Itanium family [12].

With the VT-x extension, two new CPU operations are available, the VMX root operation and the VMX non-root operation.

The VMX root operation is intended for a VMM, and it's very similar to a traditional IA-32 operation. The VMX non-root operation is intended to support and isolate the execution of a virtual machine, allowing the VMM to define a degree of trust for the virtual machine, granting some direct interactions with the underlying hardware. The transition from the VMX root operation to the VMX non-root operation is a VM entry; the opposite transition is a VM exit. The Virtual Machine Control Structure (VMCS) manages these transitions, being composed of a guest state area and a host state area. Processor state is loaded from the guest-state area on every VM entry, while it's restored from the host-state area on every VM exit. Exits always happen for some instructions; for others it depends on some variables and flags in the VMCS, that could be set only in the VMX root operation mode. As an example, the VMCS could define how to deliver interrupts (every interrupt results in a VM exit with no mask, or the guest is able to receive interrupts), choose to allow the guest to directly access some special registers (that define paging or floating point operation modes), which exceptions cause a VM exit, and which I/O operations are allowed (by defining an acceptable I/O port range).
This flexibility allows for a finer grain of control, because a VMM could choose to give a specific virtual machine more privileges, resulting in fewer VM exits and entries. As noted in [30], each entry or exit is analogous to a context switch, resulting in some performance losses. The exact penalty varies a lot, because it depends on the number of privileged operations performed (that is, on the application behavior). Nevertheless, the performance problem must be addressed.
As a result of the growing concerns, the second generation of virtualization capable processors has some new features. AMD, that developed a similar architecture called Pacifica, presented the Barcelona processor, that has a third level cache and a virtualized address translation instead of shadow paging, which should substantially reduce the memory performance loss. Intel has instead developed the Virtualization Technology for Directed I/O [10], that allows for a direct remapping of DMA transfers and device generated interrupts.

It must be noted that hardware assisted virtualization is the only way to virtualize an operating system that's available only in closed source form (like the Microsoft Windows series), but to get the best performance it could be required to use specific drivers in the guest kernel. The so called para-virtualized (PV) drivers are drivers engineered to work optimally in a guest environment, where there's no need to access directly the hardware (and, in fact, trying to do that will usually cause a VM exit).
2.4.5 Lightweight virtualization

The VMMs seen so far allow for multiple and different operating systems hosted on the same physical machine, giving a high degree of flexibility. In some scenarios there is no need for using different operating systems (or even different versions of the same one): it's sufficient to have multiple views of the same system. This approach is the generalization of the jail or chroot security feature found on Unix systems: a process is restricted to interact with a subset of the system files, so a compromission of it wouldn't allow the attacker to manipulate other programs' files and resources. From the system point of view, the file namespace has been split, as two different processes may refer to different files even when they use the same (local) name. If this splitting is extended to all the system's resources, we have a lightweight (or container based) virtualization [96].

Examples of this are the OpenVZ extension to the Linux kernel [18], the Linux-VServer project [15] and the Solaris 10 operating system [20].
OpenVZ calls each autonomous namespace a virtual environment (VE); they are called zones in Solaris. With container based virtualization, each container is limited in its use of system resources. OpenVZ defines these resource limits as bean counters, and they are in place for each possible resource type. In fact, resources management within containers is far more simple, as there's only one operating system that must be enhanced to govern that, making it also possible to change these limits even at run-time. Overhead is also negligible [95], allowing for instantiating even hundreds of containers on the same physical machine, making this solution particularly appealing for Internet Service Providers, where each hosted site could coincide with a virtual container.
2.4.6 Other VMMs

There are many VMM solutions today, from research prototypes to production ready infrastructures. We cite here some of them presenting interesting features:

Terra: Terra [62] is a VMM that allows for Trusted Computing. A virtual machine could be instantiated as an open box, allowing for data access and modification from the administrator of the physical machine, or as a closed box, where these operations are prohibited. Also, the Terra VMM could attest that a closed box has not been tampered. This experimental approach allows for highly sensitive secure virtual machines (e.g. voting machines) to be allocated over commodity hardware;
P.R.O.S.E.: the Partitioned Reliable Operating Systems [63], based on the Logical Partitioning (LPAR). The hypervisor, rHype, is a para-virtualization engine that uses a round robin fixed slot CPU scheduler. This simple scheduler reduces the OS interference [64], which happens when there's some jitter in the execution sequence of different virtual machines, a plague that is more evident on general purpose VMMs like Xen or VMWare, as the VMM is a component of a general purpose operating system. This lack of strict timing coordination could easily destroy aggregated performances in a High Performance Computing scenario;

Virtual Box: it's a GPL released binary translator made by Innotek and now developed by Sun;

KVM: it's a Linux kernel module that offers hardware assisted virtualization. Due to its integration with the kernel and [...];
Lguest: it's a para-virtualizer for the Linux kernel, made in less than 5000 lines of code [14];

Hyper-V: it's the virtualization technology made by Microsoft, available for Windows Server 2008 and Windows Vista. It leverages on hardware support for virtualization.
2.5 Hardware virtualization

In this section, we discuss in detail how a computer component could be virtualized, i.e. how it could be abstracted and presented to one or more virtual machines, preventing each one of them from accessing or interfering with the others' data.
2.5.1 Processor

Processor virtualization is usually a simple topic. For the Popek and Goldberg principle stated above, most of the instructions are executed directly by the processor itself, for accuracy and performances. Only sensitive instructions require to be intercepted and somehow managed by the VMM. When this happens, there's a process analogous to a context switch.

Some level of protection for critical structures stored in memory is also required, and this is usually done by leveraging on the processor's access control mechanisms. In the Intel x86 architecture, each process could run in one of four privilege levels, the least privileged numbered 3 and the most privileged numbered 0. In a non virtualized scenario, the operating system runs at level 0, and applications run at level 3, leaving levels 1 and 2 unused. With a hypervisor like Xen or VMWare, the hypervisor and its operating system still run at level 0, meaning full access to memory and devices, and the virtualized machines run at an intermediate level. This is the main reason why it is difficult to virtualize an hypervisor.
2.5.2 Memory and DMA

In a modern architecture, each process has associated its own unique address space, and instructions and data are stored in a virtual address space. The virtual address space is implemented by the Memory Management Unit (MMU), that gets the virtual address and returns the physical address. This conversion is per process, meaning that two different processes will usually have different mappings (although they could share some portions of physical memory; also, two threads of the same process will usually share memory).
This translation is made up by organizing the memory space of a process in a hierarchical structure, the page directory, the root of which is a part of the process context (on the x86 architecture it is a CPU register). A virtual address is composed of two parts, the directory part and the offset. The directory part will be combined with the page directory to determine the physical page, which is added to the offset to get the physical address [58].
Thisoperation, alledpagetreewalking,willrequire
travers-ing the multi-level tree page table. Ea h Page Table Entry
(PTE) has the same size of a page table, whi h is 4 KiB or
4MiBonIntelx86ar hite ture(otherar hite tures ouldhave
dierentpagesize oexisting inthesystem): astherearemany
of them, the PTEs are stored in memory. So, everytime the
dire tory part of the virtual address hanges, it ould be
re-quired toa ess somePTEsin memory,resultingin verypoor
performan es. Toavoidthis,aMMUisequippedwitha
Trans-lationLook-AsideBuer(TLB),whi hisaspe ialized a hefor
virtualmemory onversionlookups.
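The walk just described can be illustrated with a toy model. The following sketch (hypothetical structures, not any real MMU layout) translates a 32-bit virtual address through a two-level page table with 4 KiB pages:

```python
# Toy two-level page-table walk: 32-bit virtual address split into a
# 10-bit directory index, a 10-bit table index and a 12-bit offset.
PAGE_SHIFT = 12
PAGE_MASK = (1 << PAGE_SHIFT) - 1

def translate(page_directory, vaddr):
    """Return the physical address for vaddr, or raise on a page fault."""
    dir_index = vaddr >> 22                      # top 10 bits
    table_index = (vaddr >> PAGE_SHIFT) & 0x3FF  # middle 10 bits
    offset = vaddr & PAGE_MASK                   # low 12 bits

    page_table = page_directory.get(dir_index)   # first memory access
    if page_table is None:
        raise LookupError("page fault: directory entry not present")
    frame = page_table.get(table_index)          # second memory access
    if frame is None:
        raise LookupError("page fault: page table entry not present")
    return (frame << PAGE_SHIFT) | offset

# A process context with one mapping: directory 1, table 0 -> frame 0x12345.
pd = {1: {0: 0x12345}}
paddr = translate(pd, 0x00400123)
print(hex(paddr))    # 0x12345123: frame 0x12345, offset 0x123
```

Every dictionary lookup in `translate` stands for a memory access, which is exactly the cost the TLB is there to avoid.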
When a context switch occurs, or when there is a transition between kernel mode and user mode (to adopt the Intel x86 nomenclature), the TLB refers to a page table that is no longer the current page table, so it must be invalidated. As a result, the incoming memory accesses will require a page tree walk, until the TLB gets refilled.

A typical pattern on a modern system is a process (running in user mode) requesting an operation from the operating system by issuing a system call: the processor switches to kernel mode, the operating system will hopefully honor the request, then the processor goes back to user mode and the process execution resumes. This flow has two transitions in it (the first from user mode to kernel mode, the second from kernel mode to user mode), which in a naive TLB implementation would require two TLB invalidations. This is usually a waste of resources, because a better approach is to selectively invalidate only some of the TLB lines. If the kernel computation is small (as it usually is), the number of referenced memory addresses is also small, so only some of the TLB lines must be invalidated.

This optimization requires that each TLB line is tagged, associating with it the page table it refers to. Tagging also makes it possible to tell which lines belong to the current context, so that no full invalidation is needed.
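A tagged TLB can be sketched as a cache keyed by (tag, virtual page), so invalidation can be limited to one context; a minimal model with hypothetical tags:

```python
# Toy tagged TLB: each cached translation carries the tag of the page
# table it belongs to, so switching context does not require flushing
# the whole cache, only the lines of the context being discarded.
class TaggedTLB:
    def __init__(self):
        self.entries = {}            # (tag, virtual_page) -> physical_frame

    def insert(self, tag, vpage, frame):
        self.entries[(tag, vpage)] = frame

    def lookup(self, tag, vpage):
        return self.entries.get((tag, vpage))    # None means a TLB miss

    def invalidate_tag(self, tag):
        # Selective invalidation: drop only the lines of one context.
        self.entries = {k: v for k, v in self.entries.items() if k[0] != tag}

tlb = TaggedTLB()
tlb.insert("process_A", 0x10, 0x111)
tlb.insert("kernel", 0x10, 0x222)
# After a user -> kernel transition, process_A's line is still valid:
assert tlb.lookup("process_A", 0x10) == 0x111
tlb.invalidate_tag("process_A")              # e.g. process_A is torn down
assert tlb.lookup("process_A", 0x10) is None
assert tlb.lookup("kernel", 0x10) == 0x222
```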
Figure 2.3 (adapted from [57]) shows the general scheme for memory virtualization, stressing that the datapath required for converting a virtual memory address of a virtual machine to a physical address is almost doubled with respect to the non-virtualized scenario: first, a virtual machine virtual memory address is translated into a physical machine guest address, and then the latter is translated into a physical machine real address.
Software memory virtualization

The VMware hypervisor assumes that the hardware has not been enhanced for virtualization (although, when this is the case, it uses some of the available hardware features), so it works by deriving shadow structures from guest-level primary structures.

Some of these structures can be mapped into the state of a virtual machine (i.e. the processor state), while others, such as the page table directory, necessarily reside in memory. These structures are also privileged, so the VMM must protect them from unauthorized access, with the complication that modifications to them must be tracked wherever they are mapped.

Figure 2.3: Memory virtualization datapath.
VMware uses the hardware protection mechanism to protect and trace modifications to the shadow structures [30]. If the PTEs are protected, every access to them will be trapped (the virtual machines are running de-privileged) and control is transferred to the VMM. The VMM decodes the faulting instruction, emulates its effect on the primary structure, and then propagates the modification to the shadow structure. The VMM must distinguish between true page faults, caused by the violation of the policy encoded in the guest PTEs (this happens when a virtualized process tries to access another virtualized process's memory space), and hidden page faults, caused by misses in the page table. True page faults are forwarded to the guest (which may fault and kill the offending process), whilst hidden page faults cause the VMM to construct an appropriate shadow PTE and then resume guest execution. The traces are used to keep the shadow PTEs and the primary PTEs in sync.
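The trap-and-propagate cycle can be sketched as follows (hypothetical names and flat dictionaries; the real mechanism operates on hardware page tables and decoded instructions):

```python
# Toy shadow paging: the guest writes its primary page table, the write
# is trapped, and the VMM re-derives the shadow entry the hardware uses.
GUEST_TO_HOST = {0x100: 0x900, 0x101: 0x901}   # guest frame -> host frame

primary = {}   # guest's page table: virtual page -> guest frame
shadow = {}    # VMM-maintained shadow: virtual page -> host frame

def trapped_pte_write(vpage, guest_frame):
    """Called by the VMM when a protected write to a guest PTE traps."""
    primary[vpage] = guest_frame                # emulate the guest's write
    shadow[vpage] = GUEST_TO_HOST[guest_frame]  # propagate to the shadow

trapped_pte_write(0x1, 0x100)
trapped_pte_write(0x2, 0x101)
# The hardware walks `shadow`, never `primary`, so both stay in sync.
assert shadow == {0x1: 0x900, 0x2: 0x901}
```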
Hardware memory virtualization

In the para-virtualized approach, the virtual machine operating system is aware of the hypervisor: the privileged dom0 and the less privileged domU domains don't have unrestricted access to physical memory. The VMM creates its own page table for each domain, and the virtual machines construct their page tables in a way that is similar for the para-virtualized and hardware-virtualized cases. Every time the virtual machine operating system modifies its page table, the VMM is invoked, and it will update its shadow page table.

This approach is quite expensive, both for the TLB invalidations it requires and for the creation and maintenance of a shadow page table structure.
Intel has defined the Extended Page Tables (EPT) [9], and AMD the Nested Page Tables (NPT) for the Barcelona processor [100, 3]: both allow the virtual machine operating system to produce host virtual addresses from guest virtual addresses. The host virtual address is then translated into a host physical address by using a per-virtual-machine page tree, with a very small performance penalty, as this second step is done at processor speed without external memory accesses. At the time of this writing, these extensions are not generally available, but benchmarks appear promising [68, 5, 35].

Also, the result of this complex address translation is stored in the TLB, where AMD tags each entry with an Address Space ID (ASID) [42]. This one-bit tag can distinguish between the VMM's address space and the guests' address space, allowing the operating system to avoid flushing the entire TLB every time the VMM is entered or exited. Intel has Virtual Processor IDs (VPIDs) for the same purpose.
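The two-stage translation that EPT/NPT perform in hardware can be sketched in a few lines (flat dictionaries stand in for the two page trees; illustrative values):

```python
# Toy nested paging: guest-virtual -> guest-physical through the guest's
# own page table, then guest-physical -> host-physical through a per-VM
# table maintained by the VMM (the EPT/NPT role).
PAGE = 4096
guest_page_table = {0x1: 0x10}   # guest virtual page -> guest physical page
nested_table = {0x10: 0x77}      # guest physical page -> host physical page

def nested_translate(gva):
    gpa_page = guest_page_table[gva // PAGE]   # first stage (guest-controlled)
    hpa_page = nested_table[gpa_page]          # second stage (VMM-controlled)
    return hpa_page * PAGE + gva % PAGE

assert nested_translate(0x1ABC) == 0x77ABC
```

The point of the hardware support is that the second lookup is composed with the first on every walk, instead of requiring a VMM trap.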
Even with hardware support, the entire memory address conversion process is quite complex, as it requires that two different memory schedulers (one for the virtual machine and one used by the VMM) cooperate. As the memory scheduler is the most complex and tuned component of the operating system, this effort is daunting, and it is for such reasons that the Linux KVM VMM [17] is gaining in popularity: there is only one scheduler, enhanced with virtualization-oriented features that also leverage massively on hardware features, resulting in one single implementation to be maintained (if the running virtual machine is Linux) instead of two.
DMA memory pinning

DMA-capable devices usually side-step the CPU while transferring large amounts of data. To keep the device's implementation simple, they usually have no notion of virtual memory, so the memory pages involved in a transfer must be locked in place for its duration (pinned). This approach must be extended when an I/O operation is issued by a virtual machine, and the common approach is the use of a locking mechanism. The VMM should manage the locks to avoid conflicts and deadlocks between virtual machines.

To help the virtualization of DMA functions, Intel has developed Intel Virtualization Technology for Directed I/O [10] and AMD has introduced the IOMMU unit [4].
2.5.3 Storage

In contrast to other peripherals, the virtualization of storage is much simpler. We stress that in this paragraph, by storage virtualization we mean the reservation of a specific portion of a system's storage space (made up of local and remote disks, tapes and whatever else) for the exclusive use of one or more virtual machines. The most common case is when a portion of storage space is reserved for use by a single virtual machine, analogously to the non-virtualized scenario, although it's possible that two or more virtual machines share a data storage area (as an example, a quorum disk).
All VMMs have some degree of flexibility in choosing how to provide storage to the virtual machines. A common choice is a file on the host file system that the virtual machine accesses as an entire disk, with the VMM mapping the read and write requests of the virtual machine to read and write requests on the file. This approach offers a great degree of flexibility (all the storage of the virtual machine is contained in a file, which can easily be backed up and restored) and it's possible to define snapshots of the file, which are coherent point-in-time copies of the virtual machine storage, allowing for quick restoration of the virtual machine's status. The main drawback of this approach is the performance penalty: the datapath required for an I/O request is doubled. On the other side, it's possible to assign an entire disk (or a partition on it) to a virtual machine, at the expense of some management and flexibility issues, gaining in performance as the datapath is reduced (the file system layer of the physical machine is skipped).
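The file-backed approach amounts to mapping guest sector numbers to file offsets. A minimal sketch, assuming a flat image layout of fixed 512-byte sectors (real VMMs also support richer image formats with their own metadata):

```python
import io

SECTOR = 512

# Toy file-backed virtual disk: the VMM maps the guest's sector reads
# and writes onto offsets in a single backing file (in-memory here).
class FileBackedDisk:
    def __init__(self, backing, sectors):
        self.backing = backing
        # Pre-size the backing store so unwritten sectors read as zeros.
        self.backing.seek(sectors * SECTOR - 1)
        self.backing.write(b"\x00")

    def write_sector(self, lba, data):
        assert len(data) == SECTOR
        self.backing.seek(lba * SECTOR)   # guest LBA -> file offset
        self.backing.write(data)

    def read_sector(self, lba):
        self.backing.seek(lba * SECTOR)
        return self.backing.read(SECTOR)

disk = FileBackedDisk(io.BytesIO(), sectors=1024)
disk.write_sector(3, b"x" * SECTOR)
assert disk.read_sector(3) == b"x" * SECTOR
assert disk.read_sector(0) == b"\x00" * SECTOR   # untouched sector
```

Snapshots, in this picture, reduce to copying (or copy-on-write sharing) the backing file at a quiescent point.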
It has to be noted that storage availability nearly determines the availability of the entire system, as storage is by far the most common cause of system outages. A careful planning of a virtual machine installation should try to balance ease of management, backup, migration and performance, avoiding unnecessary duplication of effort, the most typical being redundancy implemented inside the virtual machine: the virtual machine should instead ignore every problem related to redundancy, seeing only a simple storage system, which the VMM could better map to a redundant data storage area.
On the same note, today's storage for servers is usually remote, using NAS or SAN infrastructures. None of them are virtualization-aware. As an example, a SAN server can be configured to selectively present LUNs to a physical server, identified by a physical connection (zoning). If such a LUN contains the data storage for a virtual machine running on that physical server, a migration of this virtual machine will require reconfiguring the SAN server: once the migration has completed, the LUN containing the virtual machine data should be accessed only from the new server, whilst the old server must be disallowed from accessing the data. This requires coordination between the VMM and the SAN, and the SAN must trust the VMM, which in this scenario is usually deployed and distributed among different servers. In [76] the N_Port Identifier Virtualization extension for the Xen VMM is presented to solve this. Other high availability solutions program the SAN switch to selectively allow or forbid data access as the migration proceeds.
2.5.4 Network

Network virtualization refers to the ability to offer each virtual machine a NIC interface, allowing it to send and receive network traffic without interference, snooping or service degradation caused by the other virtual machines.

The network interface is complex because network traffic is unsolicited, requiring the VMM to be prepared to receive and respond to traffic that could arrive at any time.
Private devices

The first approach, adopted by the IBM S/360, consisted of assigning a physical network interface to each virtual machine (this was also done for other devices such as terminals, disks and so on). Transfers to and from the network card were made by channel programs, doing programmed I/O to transfer data from and to memory.
Modern virtualization systems also allow for private device I/O, as Xen does with the pciback module. This approach has a limited degree of flexibility, as it requires that the VMM boot with a configuration that prevents some PCI devices from being claimed by the driver domain, allowing a single virtual machine to interact directly with the PCI device by configuring its description file [27]. Although it's possible to reassign a PCI device to another virtual machine, the set of directly accessed devices can be changed only by a VMM reboot.
The same approach is also used by the IBM Logical Partitioning (LPAR) architecture for the Power4 processor, relying on specific processor features.

More recent LPAR-style approaches allow for isolated access at the PCI level, leveraging an IOMMU unit that creates an I/O page table for each device, with memory mappings from the pages owned by the virtual machine to the assigned device. As a result, for each DMA operation the processor consults the IOMMU, denying I/O access to devices not owned by the virtual machine.
The private device approach has a clear advantage, performance maximization, but at the expense of a possible under-utilization (or over-provisioning) of physical resources. Also, the DMA memory pinning problem discussed in section 2.5.2 could severely restrict the feasibility of this approach for a given system.
Shared devices

In the Xen architecture, shared access to the network is provided by a virtualized spool-file interface, called an I/O domain. The VMM interprets reads on this buffer as receiving a network packet, and writes to it as sending a network packet. As figure 2.4 shows, the Xen VMM can be decomposed in two elements, the hypervisor and the driver domain (the Xen architecture is the common approach for shared device virtualization). The hypervisor is the abstraction layer between the virtual machines and the real hardware, and each I/O device is managed by an I/O domain, which runs a Linux kernel. Each virtual machine communicates with a device by using a front-end driver, which then connects to a back-end driver.
As an example, when a packet is transmitted from a virtual machine, it's copied (or remapped) from the front-end driver to the back-end driver, and then queued for transmission on the NIC. An interrupt is generated when a packet is received, triggering the copy (or remap) of the packet from the back-end driver to the specific front-end driver: the back-end driver inspects the packet's destination address and then routes it accordingly to the destination front-end driver. After the copy of the network data, a virtual interrupt is sent to the virtual machine, which will in turn wake up the front-end driver and process the packet.

Figure 2.4: Xen architecture for shared network devices.

Data protection and isolation between the different virtual machines is ensured by the driver domain. This approach results in some overhead, as it's possible that data must be copied from and to memory, and the number of interrupts required to process a network packet is doubled. A specialized driver can result in a substantial increase in performance [29], by doing memory mapping instead of memory copying and by avoiding checks for transmission errors, as this control is already made by the back-end driver. Another problem is that the I/O domain must be scheduled to allow for packet processing, and the only way to avoid this is to move the back-end driver code into the hypervisor, resulting in a bigger hypervisor, more exposed to flaws and security-related problems.
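The split-driver data path can be sketched with plain queues standing in for the shared rings (hypothetical structures; real Xen uses grant tables and event channels):

```python
from collections import deque

# Toy split driver: each VM's front end shares a ring with the back end
# in the driver domain, which demultiplexes on the destination address.
class BackEnd:
    def __init__(self):
        self.tx_queue = deque()          # packets headed for the NIC
        self.front_ends = {}             # address -> front-end rx ring

    def attach(self, addr):
        self.front_ends[addr] = deque()
        return self.front_ends[addr]

    def transmit(self, packet):
        # Copied/remapped from a front end, then queued for the NIC.
        self.tx_queue.append(packet)

    def receive(self, packet):
        # Route on the destination address to the right front end; a
        # virtual interrupt would then wake that virtual machine up.
        self.front_ends[packet["dst"]].append(packet)

backend = BackEnd()
vm1_rx = backend.attach("vm1")
vm2_rx = backend.attach("vm2")
backend.receive({"dst": "vm2", "payload": b"hello"})
assert len(vm2_rx) == 1 and len(vm1_rx) == 0   # only vm2 sees the packet
backend.transmit({"dst": "outside", "payload": b"out"})
assert len(backend.tx_queue) == 1
```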
Concurrent Direct Network Access

A modern NIC is usually organized with more than one queue for packet transmission and reception. This is done to support the aggregation of several cards into one logical network device, configured with two or more IP addresses: if a network card fails, the others will continue to work, with minimal service disruption (also, on multi-core machines, this avoids global locking on network resources, as it's possible to assign a queue to one specific core).
This hardware feature is employed in Concurrent Direct Network Access (CDNA), where each of these queues can be assigned to a specific virtual machine, as figure 2.5 shows.

The hypervisor treats each queue as if it were a physical network card, assigning ownership of it to a virtual machine, without the need to define an I/O domain, resulting in near-zero overhead: interrupts are routed directly to the virtual machine owning the queue, and the virtual machine reads and writes directly on the queue.
Memory protection is a bit more complex, as there is no longer a driver domain that can validate memory accesses to the device. The problem is exacerbated in the Intel x86 architecture, where I/O devices see only physical addresses. On this architecture, the hypervisor must validate each buffer, ensuring that a virtual machine does not add packets to, or remove packets from, a queue it does not own.

Figure 2.5: CDNA architecture.

These checks are analogous to those made by an MMU with respect to memory accesses, so the general availability of an IOMMU will eliminate these burdens from the hypervisor. The performance of CDNA is such that the transmission throughput grows linearly with the increasing number of virtual machines, while the shared device approach à la Xen decreases exponentially. The performance gap for receiving is smaller, as Xen works better when demultiplexing received packets [91]. Intel has developed the Virtual Machine Device Queues to effectively implement CDNA-style network virtualization [11].
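The per-buffer validation that CDNA shifts onto the hypervisor is essentially an ownership check; a minimal sketch with hypothetical ownership tables:

```python
# Toy CDNA-style check: before a DMA descriptor is enqueued, the
# hypervisor verifies that the queue and the buffer's physical frame
# belong to the same virtual machine.
queue_owner = {0: "vm1", 1: "vm2"}           # NIC queue -> owning VM
frame_owner = {0x10: "vm1", 0x20: "vm2"}     # physical frame -> owning VM

def enqueue(queue, frame, vm):
    if queue_owner[queue] != vm:
        raise PermissionError("queue not owned by this VM")
    if frame_owner[frame] != vm:
        raise PermissionError("buffer frame not owned by this VM")
    return True   # accepted; with an IOMMU this check moves to hardware

assert enqueue(0, 0x10, "vm1")
try:
    enqueue(0, 0x20, "vm1")      # vm1 tries to DMA from vm2's memory
except PermissionError:
    pass                         # rejected, as an MMU would reject it
```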
2.6 Concluding remarks

Virtualization dates back in computer history, and comes in many different forms. Providing and leveraging hardware features to get the best from this approach to computation is a complex process, as there are many inter-dependencies and many different approaches, which must be evaluated against each specific scenario.

3 Virtualization architectures
The different hypervisors that have been presented throughout Chapter 2 are merely techniques that can be used when virtualization has to be put in place in a system. In this chapter we analyze the next logical step, where virtualization is an architectural asset that brings value to a distributed system, and not only an available feature. It has to be noted that this goes beyond the most common scenario, where some (and possibly many) legacy systems are virtualized. That is a cost-savvy strategy, but virtualization has a lot more to offer when it's an integral part of a distributed system.
A distributed system is usually designed and built up (often with a trial-and-error approach) with a list of desired features, both measurable and non-measurable, that drive the design process. Some of the features that could take great benefit from virtualization are:
• efficiency: the average server usually works at 10-15% of its capacity, with some temporary surges. By packing some (virtual) servers into a physical one, it's possible to cut down electricity and maintenance costs;

• availability: by leveraging live migration, it's possible to migrate a running virtual machine from one physical host to another, in a proactive way (allowing for ordinary maintenance) or a reactive one (as in a disaster recovery scenario). Recovery Oriented Computing [21] is a remarkable approach to reach this goal;

• load distribution: virtualized machines could be the building block of a cluster that spans different underlying physical machines, hiding the heterogeneous underlying hardware.
Virtualization can be effective in achieving these sometimes conflicting goals, as it defines a central management function that is implemented by another intermediate layer. Having another layer also means adding complexity, as complex interactions with the rest of the stack result from it. As virtualization becomes pervasive, the fighting arena for manufacturers will progressively shift from performance to management tools: new generations of computer processors, operating systems and peripherals are designed with virtualization in mind, and the burden of hypervisors will move from making a computing platform virtualization-friendly (i.e. virtualizing a legacy server) to managing hundreds and thousands of different virtualized systems that offer different services to different users.
This will be a common scenario for a large datacenter. Today's offerings include shared hosting for low-traffic sites and dedicated hosting, but both of them are far from ideal. Shared hosting is acceptable only when both the traffic and the requested level of service are low, whilst dedicated hosting is rigid (changing the configuration requires a new installation, especially when the customer wants to downgrade) and it will usually result in a waste of money from the customer's point of view, as the average server is usually under-utilized. For this very reason a modern datacenter should use virtualization in its core architecture, to rapidly adapt to its customers' needs. One of the pioneers of this approach is Amazon, with the Amazon Elastic Compute Cloud (EC2) service [1].
With Amazon EC2, a customer can lease a virtual server and pay only for the time the server is up and running. When the server is offline, the server's image is stored offline (a collateral service of Amazon, the Amazon Simple Storage Service (S3) [2], can take care of that). Amazon has a large pool of machines, so it can instantiate even thousands of servers for a customer within minutes of the request. These machines can be used for the time required to perform their job, such as doing a number-crunching computation (that was the case when the New York Times needed to reprocess and convert its entire archive to electronic format [19]), acting as a backup system, or giving some extra capacity to offload some computations for a site that is experiencing a surge in traffic. A start-up company could lease its servers and expand its pool when needed, avoiding long-term investments: when all cost factors are taken into account, this approach is usually cheaper in the short term than the traditional one.

Figure 3.1: A virtualization architecture.
This unifying approach could be worth using even when the customer and the provider (of computing machines) are the same organization, i.e. the IT department of a corporation. Instead of having many different clusters, each one devoted to a specific business function, there could be only one cluster, with some isles on top of it, each one for a high-level task. These isles could be expanded or reduced in size (associated resources) according to the evolution of the business. Figure 3.1 shows an example where the different area sizes of the isles reflect the different amounts of resources assigned to them.
At the price of a more complex setup and planning, this architecture will bring benefits to the organization that uses it. Each computer is a computing block that is globally managed and assigned. As long as the hypervisor remains the same, it's possible to mix and match different hardware solutions, maximizing the efficiency of the infrastructure and obtaining, with less effort, high-profile features like high availability, disaster recovery, rapid deployment and so on.
As a result, virtualization must be included in the design process of a distributed system. In this chapter, we present some standard techniques to develop a modern distributed system and show how virtualization can be integrated into it since the design process. We consider this in the more general context of autonomic computing, which will provide a useful framework.
3.1 Reference architecture

The reference architecture we consider is a multi-tier distributed system, shown in figure 3.2.

Each tier is functionally distinct from the others. Each tier is made up of nodes of the same type and with the same associated resources (CPU, disks, memory, ...).

Figure 3.2: A multi-tier distributed system.

In this architecture, tier N receives the incoming requests and, to serve them, requires services from the next tier, N − 1, which in turn relays to tier N − 2, and so on. It is assumed that a request flows only from one tier to the next one to be served, although it could be possible that it doesn't need to traverse all the tiers. After reaching the last tier, the computed result flows upwards, is aggregated by the different traversed tiers, and is finally sent to the client. Requests are grouped in classes, distinguished by the amount of requests and the type of resources they require from each tier.
In the context of web services, distributed systems are usually made up of three tiers:

• front-end tier, which is the connection point for external clients accessing the web service, with its nodes serving only static content. This tier usually does not suffer from scalability problems, as a modern web server can serve static content fast enough to saturate the available Internet connection;

• application tier, where the application logic resides. This is made up of programs running in the context of an HTTP request, written in languages like PHP, Java and so on. Each customer interacting with the web service usually produces more than one service request, and these are all interdependent (as an example, a buy order follows a search catalog function), so requests are grouped into sessions. To avoid replicating session stateful information along all the nodes constituting the application tier, the load distribution function must be session-aware, and this introduces some limits to the scalability of this tier;

• database tier, which handles all queries to the database. This tier holds the persistent, shared state of the application, which makes it difficult to realize concurrent write access to the data.
More tiers could be added to this model: as an example, a front-end tier could manage web authorization and access, and an inner tier could model and interact with legacy mainframe systems. In each case, although this architecture is the de facto standard for web services, its scalability must be carefully addressed: adding some extra nodes to a specific tier wouldn't necessarily lead to a general performance improvement, as some common capacity planning problems could arise:

• the extra nodes are not added to the tier that is the bottleneck of the distributed system;

• the extra nodes are added to a tier that has almost no need of them;

• the speed-up of the tier being increased is such that there's a waste of processing power.
The last point is critical. Each tier has an associated load distribution function, which dispatches the incoming requests from the upper tier to a specific node. This dispatching first requires retrieving some state information (such as the load on each node), and it could be more or less efficient for the specific semantics of the operations performed by the tier. As a result, adding nodes to a tier that is already experiencing the saturation of its distribution function will have no effect. In some cases it will be required to change the strategy used to distribute the load between the nodes of the same tier, which could in turn require changing the implementation of the software running on the nodes.
3.1.1 Modeling of multi-tier systems

Modeling of multi-tier systems has attracted a lot of interest in the last several years, as pervasive web services are usually best implemented in this framework.

As for each performance model, there's a trade-off between its accuracy and the feasibility of the implementation: a more precise model that requires too much instrumentation of the real system to feed its model solver, or that is so complex that solving it requires a lot of processing time, is of no more than theoretical interest, as the predictions originating from it cannot be applied in an on-line system; so the level of detail in the model must be carefully chosen. Each server is comprised of physical resources like CPU, memory and disks, and could interact with other servers to accomplish the requested functions.
In [98] the developed model is session-based and takes into account caching effects and concurrency limits on the tiers. In this model, each tier is modeled via a queue, and the system is solved via the Mean Value Analysis algorithm ([90]). This model allows for performance prediction and dynamic capacity provisioning, and it can handle multi-class models.
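The core of such a solver is the classical exact MVA recursion. A minimal single-class sketch (illustrative service demands; the model in [98] is multi-class and adds caching and concurrency limits on top of this):

```python
# Exact single-class Mean Value Analysis for a closed queueing network:
# each tier i has a total service demand demands[i] (seconds/request),
# and clients wait think_time seconds between requests.
def mva(demands, n_customers, think_time=0.0):
    q = [0.0] * len(demands)      # mean queue length per tier, 0 customers
    x, r_total = 0.0, 0.0
    for n in range(1, n_customers + 1):
        # Arrival theorem: an arriving request sees each queue as it is
        # with one customer fewer in the network.
        r = [d * (1.0 + qi) for d, qi in zip(demands, q)]
        r_total = sum(r)                   # total response time
        x = n / (think_time + r_total)     # system throughput
        q = [x * ri for ri in r]           # Little's law, per tier
    return x, r_total

# Three tiers (say front end, application, database), 50 sessions:
throughput, response = mva([0.005, 0.020, 0.010], 50, think_time=1.0)
print(f"X = {throughput:.1f} req/s, R = {response * 1000:.1f} ms")
```

Throughput is bounded by the tier with the largest demand (here 1/0.020 = 50 req/s), which is exactly the bottleneck argument made above about adding nodes to the wrong tier.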
In [105] the most important input for the MVA model, the average service time for the CPU at each tier, is estimated via a regression technique, minimizing the quadratic difference between the observed and the estimated utilization rates.
In [99] the focus is on provisioning the application tier. First, it is shown that over limited timescales a tier can be modeled as an M/G/1/PS queue, i.e. the arrival rate of requests at the application tier is described by a Poisson distribution (this of course is not true on large and different timescales, as self-similarity appears due to a high degree of correlation between arrivals; the timescales over which correlations exist are delimited by an upper bound, called the Critical Time Scale [93]). Then the calculated allocation is evaluated against three different approximations. All of these approximation methods achieve allocations with costs near the minimum possible, while simpler heuristics incur significantly higher costs. This model can also be used to determine the optimal number of servers to be deployed for the application tier, ignoring the provisioning of the other tiers.
The assumption that all nodes are work-conserving and that the discipline is processor sharing (PS) is generally applicable, while it may not be the case that the arrivals are well modeled by a Poisson distribution. To have the most general multi-tier model, we assume that each server is modeled via a G/G/1 queue. The behavior of this server is described by the following equation from queuing theory [72]:
λ_i ≥ [ s_i + (σ_a² + σ_b²) / (2 (d_i − s_i)) ]^(−1)    (3.1)

where:

• λ_i is the arrival rate for tier i;

• s_i is the average service time for requests on tier i;

• d_i is the mean response time for requests on tier i;

• σ_a² is the variance of the inter-arrival times for requests on tier i;

• σ_b² is the variance of the service times for requests on tier i.
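Equation (3.1) is straightforward to evaluate from measured statistics; a small numeric sketch with illustrative values:

```python
# Evaluate the lower bound (3.1) on the request rate lambda_i that a
# G/G/1 server can sustain while keeping the mean response time d_i.
def capacity_lower_bound(s_i, d_i, var_arrival, var_service):
    assert d_i > s_i, "target response time must exceed mean service time"
    return 1.0 / (s_i + (var_arrival + var_service) / (2.0 * (d_i - s_i)))

# Illustrative tier: 10 ms mean service time, 50 ms response-time target.
lam = capacity_lower_bound(s_i=0.010, d_i=0.050,
                           var_arrival=0.0004, var_service=0.0001)
print(f"lambda_i >= {lam:.1f} requests/s")   # about 61.5 requests/s
```

Note how the bound tightens as the variances grow or as the response-time target d_i approaches the bare service time s_i.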
Quantities such as d_i and s_i and their variances are known or can be estimated on-line (instrumenting the distributed system to record, for each transaction, the request time and the completion time), so a lower bound on λ_i can be evaluated for each server. If each session has a think time of Z, by Little's Law a session arrival rate of λ can be translated into a request arrival rate of λτ / Z, where τ is the average session duration. So, when the capacity λ_i of a single server is computed, the number n_i of servers required for tier i is simply defined as:

n_i = β_i λτ / (λ_i Z)    (3.2)
where β_i is a correction factor specific for each tier, which takes into account competing effects such as caching, load distribution and speed-up. If the speed-up behavior is not constant but a function of