Architectures

Paolo Campegiani
Contents

1 Introduction
2 Virtualization techniques
2.1 A general definition of virtualization
2.2 Virtualization at the operating system level
2.3 Virtualization techniques
2.3.1 Binary translation
2.3.2 Para-virtualization
2.3.3 Hardware assisted virtualization
2.3.4 Lightweight virtualization
2.4 VMM implementations
2.4.1 QEMU
2.4.2 VMWare
2.4.3 Xen
2.4.4 Hardware assisted virtualization
2.4.5 Lightweight virtualization
2.4.6 Other VMMs
2.5 Hardware virtualization
2.5.1 Processor
2.5.2 Memory and DMA
2.5.3 Storage
2.5.4 Network
2.6 Concluding remarks
3 Virtualization architectures
3.1 Reference architecture
3.1.1 Modeling of multi-tier systems
3.2 Virtualization performances and measurement
3.3 Autonomic computing
3.3.1 Self-optimization
3.3.2 Proposed extension to the model
4 The mapping problem
4.1 Problem formalization
4.1.1 Discussion of formalization
4.3 Computational complexity of the mapping problem
4.4 Optimal solution of the mapping problem
4.5 Approximate solutions for the mapping problem
4.5.1 A packing oriented heuristic
4.5.2 A genetic algorithm
5 Simulations results
5.1 Implementation of the bin packing heuristics
5.2 Implementation of the genetic algorithm
5.3 Models dataset
List of Figures

2.1 VMM architecture.
2.2 Xen architecture.
2.3 Memory virtualization data path.
2.4 Xen architecture for shared network devices.
2.5 CDNA architecture.
3.1 A virtualization architecture.
3.2 A multi-tier distributed system.
3.3 The QoS Controller.
4.1 A partial decision tree for a MMMKP problem. In the double checked leaf, the set decisional variables are x^1_11 = 2, x^1_21 = 1.
4.2 An individual for a problem with 3 groups. Each X marks a variable set to 1.
4.3 Fixing the first group with different individuals.
4.4 Fixing the second group by generating two different individuals.
4.5 Fixing the second and third group by generating all possible feasible individuals.
5.1 Average fitness of population for the third model.
List of Tables

3.1 Measurements for estimation of VMM overhead.
5.1 First model SLAs and profits.
5.2 Second model SLAs and profits.
5.3 Physical hosts characterizations for all models.
5.4 Comparisons of profits for the optimal solution and the approximate solution for the first model, for different values of C.
5.5 Second model: range of profit and approximate solution profit, for different values of C.
5.6 Third model SLAs and profits.
5.8 Initial and final fitness of best individual for the third model.
I'd like to thank the many people that helped me all along the pursuing of my Ph.D.

Professor Salvatore Tucci was my tutor during these years, and allowed me to freely explore my interests. He says that the best way to do research is by finding a topic of interest and developing it, and I could say the best way to understand this vision is to apply it to everyday's work.

Professor Franco Maceri allowed me to run the simulations in the scientific lab of the Dipartimento di Ingegneria Civile, which it happens I also manage as system administrator, hoping not to cause too many long downtimes. Facing high computational problems from the system's point of view has developed in me some [...] me in charge of designing and administering their network infrastructure, an exciting challenge that he generously allows me to carry on without interrupting my Ph.D. work.

Professor Francesco Lo Presti is the one who most encouraged me and followed my progress (or lack thereof), really helping me in finalizing this work.

I'd also like to thank the people and staff of the Dipartimento di Ingegneria Civile and of Scuola IaD for their kindness, and especially Eusebio Giandomenico and Marco Orazi for their support.
Introduction

As many other technologies and paradigms in the computer science field, virtualization has a long history, with periods of great momentum and periods when it has been put in the background. The advent of massive and economic computer power, as predicted by Moore's Law, has finally resulted in the availability of system level virtualization technologies on commodity hardware. This will lead to a completely new class of problems, from provisioning to deployment, that arise when virtualization is intended as an architectural asset that could bring value to services.

In this dissertation, I investigate a problem that could be expressed as: given a number of virtual machines and some physical machines, each described respectively by a demand vector and a resource vector, which is the best allocation of the former to the latter, for a given metric?

For a large data center, an example of metric would be to minimize the number of physical machines devoted to host virtual machines, while giving sufficient resources to each virtual machine. By minimizing this number, the data center could increase the efficiency of the physical machines, and the underlying virtualization technology will prevent each virtual machine from interfering with the others, from both a performance and a security perspective.

It will be shown that this problem, stated in its most general form, is NP-hard (it's a generalization of the classical 0/1 knapsack problem), and its complexity is daunting, requiring to define an heuristic to find an approximate solution. We will also present a genetic algorithm that appears promising in tackling this problem.
The dissertation is organized as follows.

In Chapter 2, virtualization techniques are presented, broadly classifying them in two categories: techniques that do not rely on hardware features to support virtualization, and techniques that leverage on them.

In Chapter 3, we move from a technology point of view towards an architecture centric one, analyzing the performance problems that virtualization faces. We put virtualization as an asset of multi-tier distributed systems, and we describe it as a fundamental block for autonomic computing. Current works in this field lack some degree of generality, and we extend the currently available frameworks.

In Chapter 4, the mapping problem is formally defined; we analyze its computational complexity, and develop some heuristics [44] to solve it, comparing them to a genetic algorithm [45] we also propose. In this chapter, we see that the mapping problem is a generalization of the knapsack problem, and we briefly analyze the scarce literature on generalizations of knapsack problems.

In Chapter 5, we show simulation results for some interesting datasets.

Chapter 6 ends this dissertation, briefly recalling the results.
Virtualization techniques

2.1 A general definition of virtualization
Virtualization could be defined as a two phase process. In the first phase, some resources of the same kind will be grouped together, hiding physical boundaries; in the second phase, a portion of the aggregated resources is presented to an user. There are many types of virtualization, depending on the type of the aggregated resource:

• Virtual LAN (VLAN): a VLAN [23] is a set of hosts that communicate as if they were on the same wire, regardless of their physical location. Even when the hosts are on different physical segments of the same LAN, the configuration made on network devices like switches and routers allows the hosts to share the same virtual segment, so broadcast packets are forwarded only on the VLAN. This will increase security, by preventing unauthorized hosts from connecting to the virtual segment, and allows for the definition of per-segment Quality of Service policies;

• Storage Virtualization: a bunch of storage resources (disks or tapes) are grouped together, and the access to them is selectively defined by a management function. In a Network Attached Storage (NAS) environment, and more so in a Storage Area Network (SAN) [55], it is possible to carve out some resources and allow one or more hosts to access them. As a result, the hosts are computing nodes that are attached to the data. This allows for better and cheaper data consolidation, backup and security;
• ... applications, running on Java or Flash. As an example, when the user downloads a Java applet via the browser, the applet is executed in the context of a Java Runtime Environment (JRE) [74]. The JRE virtualizes the computing resources to the applet, in the sense that the applet is written in a so called bytecode, a machine language that the JRE translates into real operations for the underlying target processor. As a result, the same applet could be executed over different processor architectures, as long as a JRE is provided. Besides this, the JRE defines a sandbox that enforces some security constraints, like the fact that an applet cannot access system files on the target machine.

All these examples, in no way exhaustive, show some of the benefits of virtualization. By adding an intermediate layer between physical resource and resource demand, it's possible to multiplex, demultiplex and route requests to a single management point, achieving better scalability, manageability and performances.
2.2 Virtualization at the operating system level

Virtualization at the operating system level has been implemented for the first time on the IBM S/360 system [8]. In a fundamental article on virtualization, Popek and Goldberg [86] defined the formal requirements for a virtualization architecture. We will base our exploration and taxonomy of virtualization techniques on that paper, so it's worthy to recap it.

First, the concept of Virtual Machine Monitor (VMM) is defined as a layer that separates the Virtual Machine (VM) - that is, the operating system to be executed - from the underlying hardware, as shown in figure 2.1.

Some of the instructions of the VM could trap, which in the original article is defined as saving the program counter to a specified location and then jumping to the address contained in another location, where a trap routine is to be executed, with the machine registers saved. The trap routine will do its own job, then it restores the registers and returns control to the address saved in the first place. It's possible to define non blocking trap routines. This mechanism is the precursor of today's system calls.
Figure 2.1: VMM architecture (virtual machines 1 to N running side by side on top of the VMM, which runs on the hardware).
Traps are instrumental to classify instructions in three different groups:

• privileged instructions: the instructions that cause a trap;

• sensitive instructions: they come in two different types, control sensitive instructions and behavior sensitive instructions. To define them in terms of current architectures, we define these instructions as the ones which change the processor mode (or return it) or whose execution depends on the real memory address of their operands;

• innocuous instructions: the instructions that are neither privileged nor sensitive.

The VMM should have some properties to allow for the execution of a VM on top of it:

• Efficiency: every instruction that is innocuous is executed directly by the underlying hardware, with no intervention of the VMM;

• Resource control: the VM cannot change its resources quota: every request for more resources is mediated by the VMM;

• Equivalence: every program executed in the context of a VM performs in an almost indistinguishable manner, as if it were executed without a VMM interposing between the VM and the hardware. In this context, almost indistinguishable means that a certain degree of deviation is allowed, as performances may be a bit worse and resources availability could be not identical (because the VM cannot access directly the hardware).

The work of Popek and Goldberg is fundamental as they prove the following theorem: a VMM may be constructed if the set of sensitive instructions is a subset of the privileged instructions.
The theorem still holds for current architectures, and we use it as a criterion to discriminate between virtualizable processor architectures (so called virtualization friendly) and the not virtualizable ones: a processor architecture is virtualizable if and only if the execution of every sensitive instruction eventually results in a trap, as the trap routine could be implemented by the VMM.

It will be shown later that, surprisingly, the Intel x86 architecture is not virtualization friendly.
2.3 Virtualization techniques

In spite of the unifying definition stated above, there are some different ways to virtualize an operating system. Broadly speaking, there is a tradeoff between the resulting performances and the spectrum of processor architectures that could be virtualized: to achieve speed it's usually necessary to focus on a specific instruction set, presenting the virtual machine a more generalized and less customizable abstraction of the physical hardware. We identify four different virtualization techniques.
2.3.1 Binary translation

In this approach, a software layer translates operations from the virtual machine set to the physical machine set, allowing for code optimization and translation cache efficiency. The virtualization layer could do a so called cross virtualization, where the virtual machine instruction set and the physical machine instruction set are completely different - requiring to completely translate the former into the latter - or a partial virtualization, where innocuous instructions are executed directly by the hardware (in a context set up by the VMM) and critical ones are translated by the VMM, that operates as a resources' broker.
2.3.2 Para-virtualization

The operating system of the virtual machine is modified in such a way that every system call that should have accessed the hardware is instead mapped into a system call executed by and in the context of the VMM. The modification of the to-be-virtualized operating system could be unfeasible when it's released only in binary form.
2.3.3 Hardware assisted virtualization

The instruction set has been augmented with operations that encompass portions of machine code. These sections are executed in a virtual machine context, which is different from the physical machine context. The VMM has some degree of control over the operations made by a specific virtual machine, ranging from a no trust relationship (every I/O operation performed by the virtual machine is trapped and results in the execution of the VMM, that operates as a control interface) to a total trust relationship, where the virtual machine could directly access every hardware in the system. The latter results in increased speed and diminished security.
2.3.4 Lightweight virtualization

The operating system of the physical machine is changed to allow different and not-communicating namespaces for the different resource classes. As a result, there are some zones (to use a typical terminology) and each one has its own file system, users, processes namespace and hardware view. It could be argued that this approach is not a virtualization proper; still, it solves some problems that otherwise require a traditional virtualization technique, while experiencing nearly no performance penalties.
2.4 VMM implementations

A number of competing products, both open and closed source, are available as VMMs. In this section we see the most representative of them, focusing on the adopted virtualization technique.
2.4.1 QEMU

QEMU [16], written by Fabrice Bellard, is an open source machine emulator and virtualizer. It could operate as a virtualizer, when the virtual machine instruction set and the physical machine instruction set are the same, or as an emulator, capable of translating the instruction sets of seven different processor architectures to some target architecture, plus virtualizing system hardware to allow for a complete operating system virtualization.

QEMU is a dynamic translator, i.e. the translated code is stored in a translation cache where it could be reused to increase performance. The following example shows the general approach for binary translation.
Consider the following PowerPC instruction:

    addi r1,r1,-16    # r1 = r1 - 16

that must be translated into Intel x86 code. First, some micro operations will be generated, that are independent of the final target:

    movl_T0_r1        # T0 = r1
    addl_T0_im -16    # T0 = T0 - 16
    movl_r1_T0        # r1 = T0

The T0 and T1 registers are typically stored in host registers due to the optimization made by the GCC compiler. The first of the micro operations is typically coded as:

    void op_movl_T0_r1(void) {
        T0 = env->regs[1];
    }

where env is the structure containing the CPU state of the virtual machine.

The generated code is then translated into physical machine code by the GCC compiler, and the result will be (for an Intel x86 target):

    # movl_T0_r1 : ebx = env->regs[1]
    mov 0x4(%ebp), %ebx
    # addl_T0_im -16 : ebx = ebx - 16
    add $0xfffffff0, %ebx
    # movl_r1_T0 : env->regs[1] = ebx
    mov %ebx, 0x4(%ebp)

QEMU is a dynamic translator as it uses a 16 MByte cache that holds the most recently used translation blocks (TB). After the execution of every TB, the next instruction to be executed will be determined by examining the state of the emulated CPU; if the jump point is in the cache, the code is executed directly, otherwise the translation process takes place. A TB could be patched directly to the logical following one when the jump destination is known.
More complex problems arise with self-modifying code, as the applications written for the Intel x86 architecture do not signal the cache invalidation that could trigger the removal of a stale TB.

Running a single application written for a different processor architecture is one thing, but an entire operating system requires the virtualization of the hardware. QEMU allows for a limited set of virtualized hardware. It's possible to have up to two EIDE hard disks, a basic VGA video card, one or more Fast Ethernet NICs; it's also possible to connect directly the USB subsystem of the virtual machine to the physical USB subsystem.
The virtualized hard disks are mapped as files on the physical machine. This will result in a significant performance loss, as every I/O request made from the virtualized machine will traverse the virtualized operating system stack, resulting in a sequence of I/O operations intertwined with virtualized OS operations, and each I/O operation will ultimately result in an I/O operation made on the image file on the physical system, which requires traversing again the stack of an operating system, in this case the physical one. The final result is that the data path is doubled. QEMU has some flexibility in the image file format, and it's possible to have a copy-on-write format file, but this architecture won't help for performances.

The network card emulation has some interesting features. Each virtual machine could have one or more NICs, and these could be completely hidden from the rest of the world, bridged on the physical LAN, or even put on a UDP multicast network that could span several physical machines.
2.4.2 VMWare

VMWare [24] is the market leader in the virtualization field, thanks to its performances and management tools. Products from VMWare range from VMWare Player, that is only capable of running a virtual machine, to the VMWare Infrastructure suite, that has the ability to manage resources allocation, performing live backup of running virtual machines and moving them from a physical machine to another with very little service interruption.

VMWare reacted to the introduction of the open source Xen hypervisor (discussed below) by releasing its VMWare Server free but closed source, to gain and maintain market share at the expenses of the newcomer. Unfortunately, the license of VMWare Server dictates that benchmarks are possible only when the methodology has been approved by VMWare Inc., and as a result of this there are very few scientific papers on the internals of this VMM.

[...] has overall better performance.
To achieve maximum speed, it's imperative that, as stated in [86], most of the code is executed directly by the underlying physical processor, but this is impossible with the Intel x86 processor architecture, as there are instructions that are sensitive but not trappable. As an example, the Current Privilege Level (CPL) could be obtained by reading the low two bits of the code segment selector register (%cs), and the popf instruction (pop flags) executed by a privileged process could modify the IF flag that governs the interrupt delivery, an operation that an unprivileged guest cannot do [92]. As a result, it's necessary to have a binary translator that, for such virtualization unfriendly operations, simulates their execution in a virtual context. The translator adopted by VMWare is:

• Binary: its input is Intel x86 code;

• Dynamic and on demand: translation happens at runtime, and only when code is about to be executed;

• System level: there are no assumptions about the guest code; the ABI is the x86 Instruction Set Architecture (ISA);

• Subsetting: its output is a safe subset of its input (mostly user-mode instructions);

• Adaptive: translated code is adjusted in response to guest behavior changes, to improve efficiency.
The last property is worth noting. When a CPU encounters a trap for a privileged instruction, it has to jump to a trap routine (typically an operating system entry point) to deal with it, and this could be expensive. A binary translator could avoid it, by replacing the original code with a routine (that, being executed by a program, is in user mode and not in kernel mode). As an example, the rdtsc instruction for the Intel Pentium architecture takes 2030 cycles for a classical trap and emulate execution, and only 216 for the binary translation. This leaves a minor part of the sensitive instructions to deal with, as loads and stores could access sensitive data such as page tables. The adopted approach is that such an instruction is translated identically (i.e., not translated) and executed by the physical processor. If a trap happens, next time the same instruction will be re-translated to avoid the trap, maybe invoking an interpreter.
VMWare has put a lot of effort in the management and configuration tools, both for a single system and for an entire data center. It's possible to define an arbitrary number of virtualized peripherals, including storage systems, network cards, video cards and USB devices. Hard disks can be mapped into a file image, a disk partition or an iSCSI target [13] to achieve maximum performances. The network could be configured to give a virtual machine a host-only network (i.e., it communicates only with the physical machine it's instantiated on), a NAT network (where the physical machine acts as a Network Address Translator), or a unique, externally accessible IP address.

The real value of the VMWare suites comes with the VMWare Infrastructure, that allows for a central administration of hundreds of virtual machines, over dozens of different physical machines, allowing for load balancing, high availability and live migration (moving a virtual machine from one physical node to another [82]) with little service disruption.
2.4.3 Xen

Xen [25] was originally developed at the University of Cambridge Computer Lab [26] as a framework [...]
Figure 2.2: Xen architecture (dom0, with its control interface and Xen-aware device drivers, runs alongside the guest OSes and their user software; Xen exposes virtual CPU, virtual physical memory, virtual network and virtual block device interfaces over the hardware).
A company (XenSource) was founded to gain paying customers for the management tools (the hypervisor itself is released under the GNU General Public License); later the company has been acquired by Citrix.
In the Xen language, both physical and virtual operating systems are called domains, with dom0 indicating the privileged domain that hosts the control software, and domU the unprivileged domains, i.e. the virtual machines. Figure 2.2 shows the Xen architecture [36].

Xen adopts the para-virtualization approach, borrowed from the Denali system [102]: the application ABI remains unchanged, while the guest operating system is ported to an interface made of hypercalls.
An hypercall is essentially a way to control interactions between a virtual machine operating system and the physical machine operating system. The hypercall interface allows domains to perform a synchronous software trap to perform a privileged operation, analogous to the system calls found in the operating system. Data transfers are managed via I/O rings, essentially a producer-consumer buffer of I/O descriptors, with a general interface that could be used for almost every kind of I/O device interaction.
CPU scheduling between different domains is made with three different schedulers as of Xen 3.0: the Borrowed Virtual Time (BVT) scheduling algorithm [59], that is work-conserving and capable of a low latency wake up when a domain receives an event; the Simple Earliest Deadline First (sEDF), that could be both work-conserving and not work-conserving, but lacks a global load balancing between different CPUs; the Credit Scheduler, that also does global load balancing although it is not preemptive, and has a scheduling period hard-coded at 30 ms [47].
Network interfaces are quite complex [28]: the foundation of the architecture is a Virtual Firewall Router (VFR), with each [...] NIC, but the administration of the VFR could be challenging.

Storage systems for the domU are modeled as Virtual Block Devices (VBD): the dom0 could map them into files, partitions or LUNs. It's also possible to blacklist a PCI device for the dom0, leaving it in the exclusive access of one or more domU.
Xen has the ability to perform a live migration, with very little QoS loss [49]. In [97] an architecture is exposed that allows for migration over a MAN/WAN, at the expense of having a dedicated communication circuit. In [43] an extension is shown that also allows for migration of the local file system (hypervisors assume that the local file system could also be accessed from the destination physical machine, requiring a NAS or SAN infrastructure).

Checkpointing, as the ability to save and restore often from a saved image that contains also the persistent state, is under development, allowing for a global checkpointing of an entire cluster of virtualized machines [54].
Xen performances are of the utmost interest, as the paravirtualization has a very low impact, at the cost of requiring to change both the dom0 and the domU operating systems. This is infeasible for operating systems released only in binary form.

Xen has found its way into the mainline Linux kernel, after some time when its integration with the operating system was the premiere feature of enterprise oriented Linux distributions such as Red Hat RHEL and Novell SuSE server.
2.4.4 Hardware assisted virtualization

The Intel x86 architecture is not a virtualization-friendly one. As a result, until some years ago the only available hypervisors were binary translators (as VMWare) or para-virtualizers (as Xen). In 2006, Intel has announced the VT-x architecture for hardware assisted virtualization for the x86 processor family, and the VT-i for the Itanium family [12].

With the VT-x extension, two new CPU operations are available, the VMX root operation and the VMX non-root operation.

The VMX root operation is intended for a VMM, and it's very similar to a traditional IA-32 operation. The VMX non-root operation is intended to support and isolate the execution of a virtual machine, allowing the VMM to define a degree of trust for the virtual machine, granting some direct interactions with the underlying hardware. The transition from the VMX root operation to the VMX non-root operation is a VM entry; the opposite transition is a VM exit. The Virtual Machine Control Structure (VMCS) manages these transitions, being composed of a guest state area and a host state area. Processor state is loaded from the guest-state area on every VM entry, while it's restored from the host-state area on every VM exit. Exits always happen for some instructions; for others it depends on some variables and flags in the VMCS, that could be set only in the VMX root operation mode. As an example, the VMCS could define how to deliver interrupts (every interrupt results in a VM exit with no mask, or the guest is able to receive interrupts), choose to allow the guest to directly access some special registers (that define paging or floating point operation modes), which exceptions cause a VM exit, and which I/O operations are allowed (by defining an acceptable I/O port range).
This flexibility allows for a finer grain of control, because a VMM could choose to give a specific virtual machine more privileges, resulting in fewer VM exits and entries. As noted in [30], each entry or exit is analogous to a context switch, resulting in some performance losses. The exact penalty varies a lot, because it depends on the number of privileged operations performed (that is, on the application behavior). Nevertheless, the performance problem must be addressed.
As a result of the growing concerns, the second generation of virtualization capable processors has some new features. AMD, that developed a similar architecture called Pacifica, presented the Barcelona processor, that has a third level cache and a virtualized address translation instead of shadow paging, which should substantially reduce the memory performance loss. Intel has instead developed the Virtualization Technology for Directed I/O [10], that allows for a direct remapping of DMA transfers and device generated interrupts.

It must be noted that hardware assisted virtualization is the only way to virtualize an operating system that's available only in closed source form (like the Microsoft Windows series), but to get the best performance it could be required to use specific drivers in the guest kernel. The so called para-virtualized (PV) drivers are drivers engineered to work optimally in a guest environment, where there's no need to access directly the hardware (and, in fact, trying to do that will usually cause a VM exit).
2.4.5 Lightweight virtualization

The VMMs seen so far allow for multiple and different operating systems hosted on the same physical machine, giving a high degree of flexibility. In some scenarios there is no need for using different operating systems (or even different versions of the same one): it's sufficient to have multiple views of the same system. This approach is the generalization of the jail or chroot security feature found on Unix systems: a process is restricted to interact with a subset of the system files, so a compromission of it wouldn't allow the attacker to manipulate other programs' files and resources. From the system point of view, the file namespace has been split, as two different processes may refer to different files even when they use the same (local) name. If this splitting is extended to all the system's resources, we have a lightweight (or container based) virtualization [96].

Examples of this are the OpenVZ extension to the Linux kernel [18], the Linux-VServer project [15] and the Solaris 10 operating system [20].
OpenVZ calls each autonomous namespace a virtual environment (VE); they are called zones in Solaris. With container based virtualization, each container is limited in its use of system resources. OpenVZ defines these resource limits as bean counters, and they are in place for each possible resource type. In fact, resources management within containers is far more simple, as there's only one operating system that must be enhanced to govern that, making it also possible to change these limits even at run-time. Overhead is also negligible [95], allowing for instantiating even hundreds of containers on the same physical machine, making this solution particularly appealing for Internet Service Providers, where each hosted site could coincide with a virtual container.
2.4.6 Other VMMs

There are many VMM solutions today, from research prototypes to production ready infrastructures. We cite here some of them presenting interesting features:

Terra: Terra [62] is a VMM that allows for Trusted Computing. A virtual machine could be instantiated as an open box, allowing for data access and modification from the administrator of the physical machine, or as a closed box, where these operations are prohibited. Also, the Terra VMM could attest that a closed box has not been tampered. This experimental approach allows for highly sensitive secure virtual machines (e.g. voting machines) to be allocated over commodity hardware;
P.R.O.S.E.: the Partitioned Reliable Operating Systems [63], based on the Logical Partitioning (LPAR). The hypervisor, rHype, is a para-virtualization engine that uses a round robin fixed slot CPU scheduler. This simple scheduler reduces the OS interference [64], which happens when there's some jitter in the execution sequence of different virtual machines, a plague that is more evident on general purpose VMMs like Xen or VMWare, as the VMM is a component of a general purpose operating system. This lack of strict timing coordination could easily destroy aggregated performances in a High Performance Computing scenario;

Virtual Box: it's a GPL released binary translator made by Innotek and now developed by Sun;

KVM: it's a Linux kernel module that offers hardware assisted virtualization. Due to its integration with the kernel and [...];
Lguest: it's a para-virtualizer for the Linux kernel, made in less than 5000 lines of code [14];

Hyper-V: it's the virtualization technology made by Microsoft, available for Windows Server 2008 and Windows Vista. It leverages on hardware support for virtualization.
2.5 Hardware virtualization

In this section, we discuss in detail how a computer component could be virtualized, i.e. how it could be abstracted and presented to one or more virtual machines, preventing each one of them from accessing or interfering with the others' data.
2.5.1 Processor

Processor virtualization is usually a simple topic. For the Popek and Goldberg principle stated above, most of the instructions are executed directly by the processor itself, for accuracy and performances. Only sensitive instructions require to be intercepted and somehow managed by the VMM. When this happens, there's a process analogous to a context switch.

Some level of protection for critical structures stored in memory is also required, and this is usually done by leveraging on the processor's access control mechanisms. In the Intel x86 architecture, each process could run in one of four privilege levels, the least privileged numbered 3 and the most privileged numbered 0. In a non virtualized scenario, the operating system runs at level 0, and applications run at level 3, leaving levels 1 and 2 unused. With a hypervisor like Xen or VMWare, the hypervisor and its operating system still run at level 0, meaning full access to memory and devices, and the virtualized machines run at an intermediate level. This is the main reason why it is difficult to virtualize an hypervisor.
2.5.2 Memory and DMA

In a modern architecture, each process has associated its own unique address space, and instructions and data are stored in a virtual address space. The virtual address space is implemented by the Memory Management Unit (MMU), that gets the virtual address and returns the physical address. This conversion is per process, meaning that two different processes will usually have different mappings (although they could share some portions of physical memory; also, two threads of the same process will usually share memory).
This translation is made up by organizing the memory space of a process in a hierarchical structure, the page directory, the root of which is a part of the process context (on the x86 architecture it is a CPU register). A virtual address is composed of two parts, the directory part and the offset. The directory part will be combined with the page directory to determine the physical page, which is added to the offset to get the physical address [58].
Thisoperation, alledpagetreewalking,willrequire
travers-ing the multi-level tree page table. Ea h Page Table Entry
(PTE) has the same size of a page table, whi h is 4 KiB or
4MiBonIntelx86ar hite ture(otherar hite tures ouldhave
dierentpagesize oexisting inthesystem): astherearemany
of them, the PTEs are stored in memory. So, everytime the
dire tory part of the virtual address hanges, it ould be
re-quired toa ess somePTEsin memory,resultingin verypoor
performan es. Toavoidthis,aMMUisequippedwitha
Trans-lationLook-AsideBuer(TLB),whi hisaspe ialized a hefor
virtualmemory onversionlookups.
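The walk just described can be illustrated with a toy model. The following sketch (hypothetical structures, not any real MMU layout) translates a 32-bit virtual address through a two-level page table with 4 KiB pages:

```python
# Toy two-level page-table walk: 32-bit virtual address split into a
# 10-bit directory index, a 10-bit table index and a 12-bit offset.
PAGE_SHIFT = 12
PAGE_MASK = (1 << PAGE_SHIFT) - 1

def translate(page_directory, vaddr):
    """Return the physical address for vaddr, or raise on a page fault."""
    dir_index = vaddr >> 22                      # top 10 bits
    table_index = (vaddr >> PAGE_SHIFT) & 0x3FF  # middle 10 bits
    offset = vaddr & PAGE_MASK                   # low 12 bits

    page_table = page_directory.get(dir_index)   # first memory access
    if page_table is None:
        raise LookupError("page fault: directory entry not present")
    frame = page_table.get(table_index)          # second memory access
    if frame is None:
        raise LookupError("page fault: page table entry not present")
    return (frame << PAGE_SHIFT) | offset

# A process context with one mapping: directory 1, table 0 -> frame 0x12345.
pd = {1: {0: 0x12345}}
paddr = translate(pd, 0x00400123)
print(hex(paddr))    # 0x12345123: frame 0x12345, offset 0x123
```

Every dictionary lookup in `translate` stands for a memory access, which is exactly the cost the TLB is there to avoid.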
When a context switch occurs, or when there is a transition between kernel mode and user mode (to adopt the Intel x86 nomenclature), the TLB refers to a page table that is no longer the current page table, so it must be invalidated. As a result, the incoming memory accesses will require a page tree walk, until the TLB gets refilled.

A typical pattern on a modern system is a process (running in user mode) requesting an operation from the operating system by issuing a system call: the processor switches to kernel mode, the operating system will hopefully honor the request, then the processor goes back to user mode and the process execution resumes. This flow has two transitions in it (the first from user mode to kernel mode, the second from kernel mode to user mode), which in a naive TLB implementation would require two TLB invalidations. This is usually a waste of resources, because a better approach is to selectively invalidate only some of the TLB lines. If the kernel computation is small (as it usually is), the number of referenced memory addresses is also small, so only some of the TLB lines must be invalidated.

This optimization requires that each TLB line is tagged, associating with it the page table it refers to. Tagging also makes it possible to tell which lines belong to the current context, so that no full invalidation is needed.
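A tagged TLB can be sketched as a cache keyed by (tag, virtual page), so invalidation can be limited to one context; a minimal model with hypothetical tags:

```python
# Toy tagged TLB: each cached translation carries the tag of the page
# table it belongs to, so switching context does not require flushing
# the whole cache, only the lines of the context being discarded.
class TaggedTLB:
    def __init__(self):
        self.entries = {}            # (tag, virtual_page) -> physical_frame

    def insert(self, tag, vpage, frame):
        self.entries[(tag, vpage)] = frame

    def lookup(self, tag, vpage):
        return self.entries.get((tag, vpage))    # None means a TLB miss

    def invalidate_tag(self, tag):
        # Selective invalidation: drop only the lines of one context.
        self.entries = {k: v for k, v in self.entries.items() if k[0] != tag}

tlb = TaggedTLB()
tlb.insert("process_A", 0x10, 0x111)
tlb.insert("kernel", 0x10, 0x222)
# After a user -> kernel transition, process_A's line is still valid:
assert tlb.lookup("process_A", 0x10) == 0x111
tlb.invalidate_tag("process_A")              # e.g. process_A is torn down
assert tlb.lookup("process_A", 0x10) is None
assert tlb.lookup("kernel", 0x10) == 0x222
```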
Figure 2.3 (adapted from [57]) shows the general scheme for memory virtualization, stressing that the datapath required for converting a virtual memory address of a virtual machine to a physical address is almost doubled with respect to the non-virtualized scenario: first, a virtual machine virtual memory address is translated into a physical machine guest address, and then the latter is translated into a physical machine real address.
Software memory virtualization

The VMware hypervisor assumes that the hardware has not been enhanced for virtualization (although, when this is the case, it uses some of the available hardware features), so it works by deriving shadow structures from guest-level primary structures.

Some of these structures can be mapped into the state of a virtual machine (i.e. the processor state), while others, such as the page table directory, necessarily reside in memory. These structures are also privileged, so the VMM must protect them from unauthorized access, with the complication that modifications to them must be tracked wherever they are mapped.

Figure 2.3: Memory virtualization datapath.
VMware uses the hardware protection mechanism to protect and trace modifications to the shadow structures [30]. If the PTEs are protected, every access to them will be trapped (the virtual machines are running de-privileged) and control is transferred to the VMM. The VMM decodes the faulting instruction, emulates its effect on the primary structure, and then propagates the modification to the shadow structure. The VMM must distinguish between true page faults, caused by the violation of the policy encoded in the guest PTEs (this happens when a virtualized process tries to access another virtualized process's memory space), and hidden page faults, caused by misses in the page table. True page faults are forwarded to the guest (which may fault and kill the offending process), whilst hidden page faults cause the VMM to construct an appropriate shadow PTE and then resume guest execution. The traces are used to keep the shadow PTEs and the primary PTEs in sync.
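The trap-and-propagate cycle can be sketched as follows (hypothetical names and flat dictionaries; the real mechanism operates on hardware page tables and decoded instructions):

```python
# Toy shadow paging: the guest writes its primary page table, the write
# is trapped, and the VMM re-derives the shadow entry the hardware uses.
GUEST_TO_HOST = {0x100: 0x900, 0x101: 0x901}   # guest frame -> host frame

primary = {}   # guest's page table: virtual page -> guest frame
shadow = {}    # VMM-maintained shadow: virtual page -> host frame

def trapped_pte_write(vpage, guest_frame):
    """Called by the VMM when a protected write to a guest PTE traps."""
    primary[vpage] = guest_frame                # emulate the guest's write
    shadow[vpage] = GUEST_TO_HOST[guest_frame]  # propagate to the shadow

trapped_pte_write(0x1, 0x100)
trapped_pte_write(0x2, 0x101)
# The hardware walks `shadow`, never `primary`, so both stay in sync.
assert shadow == {0x1: 0x900, 0x2: 0x901}
```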
Hardware memory virtualization

In the para-virtualized approach, the virtual machine operating system is aware of the hypervisor: the privileged dom0 and the less privileged domU domains don't have unrestricted access to physical memory. The VMM creates its own page table for each domain, and the virtual machines construct their page tables in a way that is similar for the para-virtualized and hardware-virtualized cases. Every time the virtual machine operating system modifies its page table, the VMM is invoked, and it will update its shadow page table.

This approach is quite expensive, both for the TLB invalidations it requires and for the creation and maintenance of a shadow page table structure.
Intel has defined the Extended Page Tables (EPT) [9], and AMD the Nested Page Tables (NPT) for the Barcelona processor [100, 3]: both allow the virtual machine operating system to produce host virtual addresses from guest virtual addresses. The host virtual address is then translated into a host physical address by using a per-virtual-machine page tree, with a very small performance penalty, as this second step is done at processor speed without external memory accesses. At the time of this writing, these extensions are not generally available, but benchmarks appear promising [68, 5, 35].

Also, the result of this complex address translation is stored in the TLB, where AMD tags each entry with an Address Space ID (ASID) [42]. This one-bit tag can distinguish between the VMM's address space and the guests' address space, allowing the operating system to avoid flushing the entire TLB every time the VMM is entered or exited. Intel has Virtual Processor IDs (VPIDs) for the same purpose.
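The two-stage translation that EPT/NPT perform in hardware can be sketched in a few lines (flat dictionaries stand in for the two page trees; illustrative values):

```python
# Toy nested paging: guest-virtual -> guest-physical through the guest's
# own page table, then guest-physical -> host-physical through a per-VM
# table maintained by the VMM (the EPT/NPT role).
PAGE = 4096
guest_page_table = {0x1: 0x10}   # guest virtual page -> guest physical page
nested_table = {0x10: 0x77}      # guest physical page -> host physical page

def nested_translate(gva):
    gpa_page = guest_page_table[gva // PAGE]   # first stage (guest-controlled)
    hpa_page = nested_table[gpa_page]          # second stage (VMM-controlled)
    return hpa_page * PAGE + gva % PAGE

assert nested_translate(0x1ABC) == 0x77ABC
```

The point of the hardware support is that the second lookup is composed with the first on every walk, instead of requiring a VMM trap.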
Even with hardware support, the entire memory address conversion process is quite complex, as it requires that two different memory schedulers (one for the virtual machine and one used by the VMM) cooperate. As the memory scheduler is the most complex and tuned component of the operating system, this effort is daunting, and it is for such reasons that the Linux KVM VMM [17] is gaining in popularity: there is only one scheduler, enhanced with virtualization-oriented features that also leverage massively on hardware features, resulting in one single implementation to be maintained (if the running virtual machine is Linux) instead of two.
DMA memory pinning

DMA-capable devices usually side-step the CPU while transferring large amounts of data. To keep the device's implementation simple, they usually have no notion of virtual memory, so the memory pages involved in a transfer must be locked in place for its duration (pinned). This approach must be extended when an I/O operation is issued by a virtual machine, and the common approach is the use of a locking mechanism. The VMM should manage the locks to avoid conflicts and deadlocks between virtual machines.

To help the virtualization of DMA functions, Intel has developed Intel Virtualization Technology for Directed I/O [10] and AMD has introduced the IOMMU unit [4].
2.5.3 Storage

In contrast to other peripherals, the virtualization of storage is much simpler. We stress that in this paragraph, by storage virtualization we mean the reservation of a specific portion of a system's storage space (made up of local and remote disks, tapes and whatever else) for the exclusive use of one or more virtual machines. The most common case is when a portion of storage space is reserved for use by a single virtual machine, analogously to the non-virtualized scenario, although it's possible that two or more virtual machines share a data storage area (as an example, a quorum disk).
All VMMs have some degree of flexibility in choosing how to provide storage to the virtual machines. A common choice is a file on the host file system that the virtual machine accesses as an entire disk, with the VMM mapping the read and write requests of the virtual machine to read and write requests on the file. This approach offers a great degree of flexibility (all the storage of the virtual machine is contained in a file, which can easily be backed up and restored) and it's possible to define snapshots of the file, which are coherent point-in-time copies of the virtual machine storage, allowing for quick restoration of the virtual machine's status. The main drawback of this approach is the performance penalty: the datapath required for an I/O request is doubled. On the other side, it's possible to assign an entire disk (or a partition on it) to a virtual machine, at the expense of some management and flexibility issues, gaining in performance as the datapath is reduced (the file system layer of the physical machine is skipped).
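The file-backed approach amounts to mapping guest sector numbers to file offsets. A minimal sketch, assuming a flat image layout of fixed 512-byte sectors (real VMMs also support richer image formats with their own metadata):

```python
import io

SECTOR = 512

# Toy file-backed virtual disk: the VMM maps the guest's sector reads
# and writes onto offsets in a single backing file (in-memory here).
class FileBackedDisk:
    def __init__(self, backing, sectors):
        self.backing = backing
        # Pre-size the backing store so unwritten sectors read as zeros.
        self.backing.seek(sectors * SECTOR - 1)
        self.backing.write(b"\x00")

    def write_sector(self, lba, data):
        assert len(data) == SECTOR
        self.backing.seek(lba * SECTOR)   # guest LBA -> file offset
        self.backing.write(data)

    def read_sector(self, lba):
        self.backing.seek(lba * SECTOR)
        return self.backing.read(SECTOR)

disk = FileBackedDisk(io.BytesIO(), sectors=1024)
disk.write_sector(3, b"x" * SECTOR)
assert disk.read_sector(3) == b"x" * SECTOR
assert disk.read_sector(0) == b"\x00" * SECTOR   # untouched sector
```

Snapshots, in this picture, reduce to copying (or copy-on-write sharing) the backing file at a quiescent point.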
It has to be noted that storage availability nearly determines the availability of the entire system, as storage is by far the most common cause of system outages. A careful planning of a virtual machine installation should try to balance ease of management, backup, migration and performance, avoiding unnecessary duplication of effort, the most typical being redundancy implemented inside the virtual machine: the virtual machine should instead ignore every problem related to redundancy, seeing only a simple storage system, which the VMM could better map to a redundant data storage area.
On the same note, today's storage for servers is usually remote, using NAS or SAN infrastructures. None of them are virtualization-aware. As an example, a SAN server can be configured to selectively present LUNs to a physical server, identified by a physical connection (zoning). If such a LUN contains the data storage for a virtual machine running on that physical server, a migration of this virtual machine will require reconfiguring the SAN server: once the migration has completed, the LUN containing the virtual machine data should be accessed only from the new server, whilst the old server must be disallowed from accessing the data. This requires coordination between the VMM and the SAN, and the SAN must trust the VMM, which in this scenario is usually deployed and distributed among different servers. In [76] the N_Port Identifier Virtualization extension for the Xen VMM is presented to solve this. Other high availability solutions program the SAN switch to selectively allow or forbid data access as the migration proceeds.
2.5.4 Network

Network virtualization refers to the ability to offer each virtual machine a NIC interface, allowing it to send and receive network traffic without interference, snooping or service degradation caused by the other virtual machines.

The network interface is complex because network traffic is unsolicited, requiring the VMM to be prepared to receive and respond to traffic that could arrive at any time.
Private devices

The first approach, adopted by the IBM S/360, consisted of assigning a physical network interface to each virtual machine (this was also done for other devices such as terminals, disks and so on). Transfers to and from the network card were made by channel programs, doing programmed I/O to transfer data from and to memory.
Modern virtualization systems also allow for private device I/O, as Xen does with the pciback module. This approach has a limited degree of flexibility, as it requires that the VMM boot with a configuration that prevents some PCI devices from being claimed by the driver domain, allowing a single virtual machine to interact directly with the PCI device by configuring its description file [27]. Although it's possible to reassign a PCI device to another virtual machine, the set of directly accessed devices can be changed only by a VMM reboot.
The same approach is also used by the IBM Logical Partitioning (LPAR) architecture for the Power4 processor, relying on specific processor features.

More recent LPAR-style approaches allow for isolated access at the PCI level, leveraging an IOMMU unit that creates an I/O page table for each device, with memory mappings from the pages owned by the virtual machine to the assigned device. As a result, for each DMA operation the processor consults the IOMMU, denying I/O access to devices not owned by the virtual machine.
The private device approach has a clear advantage, performance maximization, but at the expense of a possible under-utilization (or over-provisioning) of physical resources. Also, the DMA memory pinning problem discussed in section 2.5.2 could severely restrict the feasibility of this approach for a given system.
Shared devices

In the Xen architecture, shared access to the network is provided by a virtualized spool-file interface, called an I/O domain. The VMM interprets reads on this buffer as receiving a network packet, and writes to it as sending a network packet. As figure 2.4 shows, the Xen VMM can be decomposed in two elements, the hypervisor and the driver domain (the Xen architecture is the common approach for shared device virtualization). The hypervisor is the abstraction layer between the virtual machines and the real hardware, and each I/O device is managed by an I/O domain, which runs a Linux kernel. Each virtual machine communicates with a device by using a front-end driver, which then connects to a back-end driver.
As an example, when a packet is transmitted from a virtual machine, it's copied (or remapped) from the front-end driver to the back-end driver, and then queued for transmission on the NIC. An interrupt is generated when a packet is received, triggering the copy (or remap) of the packet from the back-end driver to the specific front-end driver: the back-end driver inspects the packet's destination address and then routes it accordingly to the destination front-end driver. After the copy of the network data, a virtual interrupt is sent to the virtual machine, which will in turn wake up the front-end driver and process the packet.

Figure 2.4: Xen architecture for shared network devices.

Data protection and isolation between the different virtual machines is ensured by the driver domain. This approach results in some overhead, as it's possible that data must be copied from and to memory, and the number of interrupts required to process a network packet is doubled. A specialized driver can result in a substantial increase in performance [29], by doing memory mapping instead of memory copying and by avoiding checks for transmission errors, as this control is already made by the back-end driver. Another problem is that the I/O domain must be scheduled to allow for packet processing, and the only way to avoid this is to move the back-end driver code into the hypervisor, resulting in a bigger hypervisor, more exposed to flaws and security-related problems.
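The split-driver data path can be sketched with plain queues standing in for the shared rings (hypothetical structures; real Xen uses grant tables and event channels):

```python
from collections import deque

# Toy split driver: each VM's front end shares a ring with the back end
# in the driver domain, which demultiplexes on the destination address.
class BackEnd:
    def __init__(self):
        self.tx_queue = deque()          # packets headed for the NIC
        self.front_ends = {}             # address -> front-end rx ring

    def attach(self, addr):
        self.front_ends[addr] = deque()
        return self.front_ends[addr]

    def transmit(self, packet):
        # Copied/remapped from a front end, then queued for the NIC.
        self.tx_queue.append(packet)

    def receive(self, packet):
        # Route on the destination address to the right front end; a
        # virtual interrupt would then wake that virtual machine up.
        self.front_ends[packet["dst"]].append(packet)

backend = BackEnd()
vm1_rx = backend.attach("vm1")
vm2_rx = backend.attach("vm2")
backend.receive({"dst": "vm2", "payload": b"hello"})
assert len(vm2_rx) == 1 and len(vm1_rx) == 0   # only vm2 sees the packet
backend.transmit({"dst": "outside", "payload": b"out"})
assert len(backend.tx_queue) == 1
```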
Concurrent Direct Network Access

A modern NIC is usually organized with more than one queue for packet transmission and reception. This is done to support the aggregation of several cards into one logical network device, configured with two or more IP addresses: if a network card fails, the others will continue to work, with minimal service disruption (also, on multi-core machines, this avoids global locking on network resources, as it's possible to assign a queue to one specific core).
This hardware feature is employed in Concurrent Direct Network Access (CDNA), where each of these queues can be assigned to a specific virtual machine, as figure 2.5 shows.

The hypervisor treats each queue as if it were a physical network card, assigning ownership of it to a virtual machine, without the need to define an I/O domain, resulting in near-zero overhead: interrupts are routed directly to the virtual machine owning the queue, and the virtual machine reads and writes directly on the queue.
Memory protection is a bit more complex, as there is no longer a driver domain that can validate memory accesses to the device. The problem is exacerbated in the Intel x86 architecture, where I/O devices see only physical addresses. On this architecture, the hypervisor must validate each buffer, ensuring that a virtual machine does not add packets to, or remove packets from, a queue it does not own.

Figure 2.5: CDNA architecture.

These checks are analogous to those made by an MMU with respect to memory accesses, so the general availability of an IOMMU will eliminate these burdens from the hypervisor. The performance of CDNA is such that the transmission throughput grows linearly with the increasing number of virtual machines, while the shared device approach à la Xen decreases exponentially. The performance gap for receiving is smaller, as Xen works better when demultiplexing received packets [91]. Intel has developed the Virtual Machine Device Queues to effectively implement CDNA-style network virtualization [11].
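The per-buffer validation that CDNA shifts onto the hypervisor is essentially an ownership check; a minimal sketch with hypothetical ownership tables:

```python
# Toy CDNA-style check: before a DMA descriptor is enqueued, the
# hypervisor verifies that the queue and the buffer's physical frame
# belong to the same virtual machine.
queue_owner = {0: "vm1", 1: "vm2"}           # NIC queue -> owning VM
frame_owner = {0x10: "vm1", 0x20: "vm2"}     # physical frame -> owning VM

def enqueue(queue, frame, vm):
    if queue_owner[queue] != vm:
        raise PermissionError("queue not owned by this VM")
    if frame_owner[frame] != vm:
        raise PermissionError("buffer frame not owned by this VM")
    return True   # accepted; with an IOMMU this check moves to hardware

assert enqueue(0, 0x10, "vm1")
try:
    enqueue(0, 0x20, "vm1")      # vm1 tries to DMA from vm2's memory
except PermissionError:
    pass                         # rejected, as an MMU would reject it
```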
2.6 Concluding remarks

Virtualization dates back in computer history, and comes in many different forms. Providing and leveraging hardware features to get the best from this approach to computation is a complex process, as there are many inter-dependencies and many different approaches, which must be evaluated against each specific scenario.

3 Virtualization architectures
The different hypervisors that have been presented throughout Chapter 2 are merely techniques that can be used when virtualization has to be put in place in a system. In this chapter we analyze the next logical step, where virtualization is an architectural asset that brings value to a distributed system, and not only an available feature. It has to be noted that this goes beyond the most common scenario, where some (and possibly many) legacy systems are virtualized. That is a cost-savvy strategy, but virtualization has a lot more to offer when it's an integral part of a distributed system.
A distributed system is usually designed and built up (often with a trial-and-error approach) with a list of desired features, both measurable and non-measurable, that drive the design process. Some of the features that could take great benefit from virtualization are:
• efficiency: the average server usually works at 10-15% of its capacity, with some temporary surges. By packing some (virtual) servers into a physical one, it's possible to cut down electricity and maintenance costs;

• availability: by leveraging live migration, it's possible to migrate a running virtual machine from one physical host to another, in a proactive way (allowing for ordinary maintenance) or a reactive one (as in a disaster recovery scenario). Recovery Oriented Computing [21] is a remarkable approach to reach this goal;

• load distribution: virtualized machines could be the building block of a cluster that spans different underlying physical machines, hiding the heterogeneous underlying hardware.
Virtualization can be effective in achieving these sometimes conflicting goals, as it defines a central management function that is implemented by another intermediate layer. Having another layer also means adding complexity, as complex interactions with the rest of the stack result from it. As virtualization becomes pervasive, the fighting arena for manufacturers will progressively shift from performance to management tools: new generations of computer processors, operating systems and peripherals are designed with virtualization in mind, and the burden of hypervisors will move from making a computing platform virtualization-friendly (i.e. virtualizing a legacy server) to managing hundreds and thousands of different virtualized systems that offer different services to different users.
This will be a common scenario for a large datacenter. Today's offerings include shared hosting for low-traffic sites and dedicated hosting, but both of them are far from ideal. Shared hosting is acceptable only when both the traffic and the requested level of service are low, whilst dedicated hosting is rigid (changing the configuration requires a new installation, especially when the customer wants to downgrade) and it will usually result in a waste of money from the customer's point of view, as the average server is usually under-utilized. For this very reason a modern datacenter should use virtualization in its core architecture, to rapidly adapt to its customers' needs. One of the pioneers of this approach is Amazon, with the Amazon Elastic Compute Cloud (EC2) service [1].
With Amazon EC2, a customer can lease a virtual server and pay only for the time the server is up and running. When the server is offline, the server's image is stored offline (a collateral service of Amazon, the Amazon Simple Storage Service (S3) [2], can take care of that). Amazon has a large pool of machines, so it can instantiate even thousands of servers for a customer within minutes of the request. These machines can be used for the time required to perform their job, such as doing a number-crunching computation (that was the case when the New York Times needed to reprocess and convert its entire archive to electronic format [19]), acting as a backup system, or giving some extra capacity to offload some computations for a site that is experiencing a surge in traffic. A start-up company could lease its servers and expand its pool when needed, avoiding long-term investments: when all cost factors are taken into account, this approach is usually cheaper in the short term than the traditional one.

Figure 3.1: A virtualization architecture.
This unifying approach could be worth using even when the customer and the provider (of computing machines) are the same organization, i.e. the IT department of a corporation. Instead of having many different clusters, each one devoted to a specific business function, there could be only one cluster, with some isles on top of it, each one for a high-level task. These isles could be expanded or reduced in size (associated resources) according to the evolution of the business. Figure 3.1 shows an example where the different area sizes of the isles reflect the different amounts of resources assigned to them.
At the price of a more complex setup and planning, this architecture will bring benefits to the organization that uses it. Each computer is a computing block that is globally managed and assigned. As long as the hypervisor remains the same, it's possible to mix and match different hardware solutions, maximizing the efficiency of the infrastructure and obtaining, with less effort, high-profile features like high availability, disaster recovery, rapid deployment and so on.
As a result, virtualization must be included in the design process of a distributed system. In this chapter, we present some standard techniques to develop a modern distributed system and show how virtualization can be integrated into it since the design process. We consider this in the more general context of autonomic computing, which will provide a useful framework.
3.1 Reference architecture

The reference architecture we consider is a multi-tier distributed system, shown in figure 3.2.

Each tier is functionally distinct from the others. Each tier is made up of nodes of the same type and with the same associated resources (CPU, disks, memory, ...).

Figure 3.2: A multi-tier distributed system.

In this architecture, tier N receives the incoming requests and, to serve them, requires services from the next tier, N − 1, which in turn relays to tier N − 2, and so on. It is assumed that a request flows only from one tier to the next one to be served, although it could be possible that it doesn't need to traverse all the tiers. After reaching the last tier, the computed result flows upwards, is aggregated by the different traversed tiers, and is finally sent to the client. Requests are grouped in classes, distinguished by the amount of requests and the type of resources they require from each tier.
In the context of web services, distributed systems are usually made up of three tiers:

• front-end tier, which is the connection point for external clients accessing the web service, with its nodes serving only static content. This tier usually does not suffer from scalability problems, as a modern web server can serve static content fast enough to saturate the available Internet connection;

• application tier, where the application logic resides. This is made up of programs running in the context of an HTTP request, written in languages like PHP, Java and so on. Each customer interacting with the web service usually produces more than one service request, and these are all interdependent (as an example, a buy order follows a search catalog function), so requests are grouped into sessions. To avoid replicating session stateful information along all the nodes constituting the application tier, the load distribution function must be session-aware, and this introduces some limits to the scalability of this tier;

• database tier, which handles all queries to the database. This tier holds the persistent, shared state of the application, which makes it difficult to realize concurrent write access to the data.
More tiers could be added to this model: as an example, a front-end tier could manage web authorization and access, and an inner tier could model and interact with legacy mainframe systems. In each case, although this architecture is the de facto standard for web services, its scalability must be carefully addressed: adding some extra nodes to a specific tier wouldn't necessarily lead to a general performance improvement, as some common capacity planning problems could arise:

• the extra nodes are not added to the tier that is the bottleneck of the distributed system;

• the extra nodes are added to a tier that has almost no need of them;

• the speed-up of the tier being increased is such that there's a waste of processing power.
The last point is critical. Each tier has an associated load distribution function, which dispatches the incoming requests from the upper tier to a specific node. This dispatching first requires retrieving some state information (such as the load on each node), and it could be more or less efficient for the specific semantics of the operations performed by the tier. As a result, adding nodes to a tier that is already experiencing the saturation of its distribution function will have no effect. In some cases it will be required to change the strategy used to distribute the load between the nodes of the same tier, which could in turn require changing the implementation of the software running on the nodes.
3.1.1 Modeling of multi-tier systems

Modeling of multi-tier systems has attracted a lot of interest in the last several years, as pervasive web services are usually best implemented in this framework.

As for each performance model, there's a trade-off between its accuracy and the feasibility of the implementation: a more precise model that requires too much instrumentation of the real system to feed its model solver, or that is so complex that solving it requires a lot of processing time, is of no more than theoretical interest, as the predictions originating from it cannot be applied in an on-line system; so the level of detail in the model must be carefully chosen. Each server is comprised of physical resources like CPU, memory and disks, and could interact with other servers to accomplish the requested functions.
In [98] the developed model is session-based and takes into account caching effects and concurrency limits on the tiers. In this model, each tier is modeled via a queue, and the system is solved via the Mean Value Analysis algorithm ([90]). This model allows for performance prediction and dynamic capacity provisioning, and it can handle multi-class models.
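The core of such a solver is the classical exact MVA recursion. A minimal single-class sketch (illustrative service demands; the model in [98] is multi-class and adds caching and concurrency limits on top of this):

```python
# Exact single-class Mean Value Analysis for a closed queueing network:
# each tier i has a total service demand demands[i] (seconds/request),
# and clients wait think_time seconds between requests.
def mva(demands, n_customers, think_time=0.0):
    q = [0.0] * len(demands)      # mean queue length per tier, 0 customers
    x, r_total = 0.0, 0.0
    for n in range(1, n_customers + 1):
        # Arrival theorem: an arriving request sees each queue as it is
        # with one customer fewer in the network.
        r = [d * (1.0 + qi) for d, qi in zip(demands, q)]
        r_total = sum(r)                   # total response time
        x = n / (think_time + r_total)     # system throughput
        q = [x * ri for ri in r]           # Little's law, per tier
    return x, r_total

# Three tiers (say front end, application, database), 50 sessions:
throughput, response = mva([0.005, 0.020, 0.010], 50, think_time=1.0)
print(f"X = {throughput:.1f} req/s, R = {response * 1000:.1f} ms")
```

Throughput is bounded by the tier with the largest demand (here 1/0.020 = 50 req/s), which is exactly the bottleneck argument made above about adding nodes to the wrong tier.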
In [105] the most important input for the MVA model, the average service time for the CPU at each tier, is estimated via a regression technique, minimizing the quadratic difference between the observed and the estimated utilization rates.
In [99] the focus is on provisioning the application tier. First, it is shown that over limited timescales a tier can be modeled as an M/G/1/PS queue, i.e. the arrival rate of requests at the application tier is described by a Poisson distribution (this of course is not true on large and different timescales, as self-similarity appears due to a high degree of correlation between arrivals; the timescales over which correlations exist are delimited by an upper bound, called the Critical Time Scale [93]). Then the calculated allocation is evaluated against three different approximations. All of these approximation methods achieve allocations with costs near the minimum possible, while simpler heuristics incur significantly higher costs. This model can also be used to determine the optimal number of servers to be deployed for the application tier, ignoring the provisioning of the other tiers.
The assumption that all nodes are work-conserving and that the discipline is processor sharing (PS) is generally applicable, while it may not be the case that the arrivals are well modeled by a Poisson distribution. To have the most general multi-tier model, we assume that each server is modeled via a G/G/1 queue. The behavior of this server is described by the following equation from queuing theory [72]:
λ_i ≥ [ s_i + (σ_a² + σ_b²) / (2 (d_i − s_i)) ]^(−1)    (3.1)

where:

• λ_i is the arrival rate for tier i;

• s_i is the average service time for requests on tier i;

• d_i is the mean response time for requests on tier i;

• σ_a² is the variance of the inter-arrival times for requests on tier i;

• σ_b² is the variance of the service times for requests on tier i.
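Equation (3.1) is straightforward to evaluate from measured statistics; a small numeric sketch with illustrative values:

```python
# Evaluate the lower bound (3.1) on the request rate lambda_i that a
# G/G/1 server can sustain while keeping the mean response time d_i.
def capacity_lower_bound(s_i, d_i, var_arrival, var_service):
    assert d_i > s_i, "target response time must exceed mean service time"
    return 1.0 / (s_i + (var_arrival + var_service) / (2.0 * (d_i - s_i)))

# Illustrative tier: 10 ms mean service time, 50 ms response-time target.
lam = capacity_lower_bound(s_i=0.010, d_i=0.050,
                           var_arrival=0.0004, var_service=0.0001)
print(f"lambda_i >= {lam:.1f} requests/s")   # about 61.5 requests/s
```

Note how the bound tightens as the variances grow or as the response-time target d_i approaches the bare service time s_i.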
Quantities such as d_i and s_i and their variances are known or can be estimated on-line (instrumenting the distributed system to record, for each transaction, the request time and the completion time), so a lower bound on λ_i can be evaluated for each server. If each session has a think time of Z, by Little's Law a session arrival rate of λ can be translated into a request arrival rate of λτ / Z, where τ is the average session duration. So, when the capacity λ_i of a single server is computed, the number n_i of servers required for tier i is simply defined as:

n_i = β_i λτ / (λ_i Z)    (3.2)
where β_i is a correction factor specific for each tier, which takes into account competing effects such as caching, load distribution and speed-up. If the speed-up behavior is not constant but a function of