Contents lists available at ScienceDirect

Journal of Systems Architecture

journal homepage: www.elsevier.com/locate/sysarc
An integrated hardware/software design methodology for signal processing systems
Lin Li a,∗, Carlo Sau b, Tiziana Fanni b, Jingui Li c, Timo Viitanen c, François Christophe e, Francesca Palumbo d, Luigi Raffo b, Heikki Huttunen c, Jarmo Takala c, Shuvra S. Bhattacharyya a,c

a University of Maryland, ECE Department, College Park, MD 20742, United States
b University of Cagliari, Department of Electrical and Electronic Engineering, Italy
c Tampere University, Finland
d University of Sassari, PolComIng-Information Engineering Unit, Italy
e Department of Computer Science, University of Helsinki, Finland
Keywords: Dataflow; Model-based design; Hardware/software co-design; Low power techniques; Deep learning; Signal processing systems

Abstract
This paper presents a new methodology for design and implementation of signal processing systems on system-on-chip (SoC) platforms. The methodology is centered on the use of lightweight application programming interfaces for applying principles of dataflow design at different layers of abstraction. The development processes integrated in our approach are software implementation, hardware implementation, hardware-software co-design, and optimized application mapping. The proposed methodology facilitates development and integration of signal processing hardware and software modules that involve heterogeneous programming languages and platforms. As a demonstration of the proposed design framework, we present a dataflow-based deep neural network (DNN) implementation for vehicle classification that is streamlined for real-time operation on embedded SoC devices. Using the proposed methodology, we apply and integrate a variety of dataflow graph optimizations that are important for efficient mapping of the DNN system into a resource constrained implementation that involves cooperating multicore CPUs and field-programmable gate array subsystems. Through experiments, we demonstrate the flexibility and effectiveness with which different design transformations can be applied and integrated across multiple scales of the targeted computing system.
1. Introduction
Model-based design has been widely studied and applied over the years in many domains of embedded processing. Dataflow is well-known as a paradigm for model-based design that is effective for embedded digital signal processing (DSP) systems [1]. In dataflow-based modeling, signal processing applications are represented as directed graphs (dataflow graphs), and computational functions are modeled as vertices (actors) in these graphs. Actors exchange data packets (tokens) through unidirectional, first-in, first-out (FIFO) communication channels that correspond to dataflow graph edges. Many dataflow-based design methods for DSP systems have been explored in recent years to support various aspects of design and implementation, including modeling and simulation; scheduling and mapping of actors to heterogeneous platforms; and buffer management (e.g. see [1,2]).
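To make the actor/token/FIFO terminology concrete, the following minimal C sketch models a dataflow edge as a fixed-capacity ring buffer. The names (fifo_t, fifo_write, fifo_read) and the capacity are illustrative assumptions, not part of any LIDE API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fixed-capacity FIFO modeling a dataflow graph edge. */
#define FIFO_CAP 8

typedef struct {
    int buffer[FIFO_CAP];
    size_t head, tail, count; /* read index, write index, tokens stored */
} fifo_t;

static void fifo_init(fifo_t *f) { f->head = f->tail = f->count = 0; }

/* A producing actor deposits one token on the edge. */
static bool fifo_write(fifo_t *f, int token) {
    if (f->count == FIFO_CAP) return false;   /* edge is full */
    f->buffer[f->tail] = token;
    f->tail = (f->tail + 1) % FIFO_CAP;
    f->count++;
    return true;
}

/* A consuming actor removes one token from the edge, in FIFO order. */
static bool fifo_read(fifo_t *f, int *token) {
    if (f->count == 0) return false;          /* no tokens available */
    *token = f->buffer[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return true;
}
```

Tokens are delivered to the consumer in exactly the order the producer wrote them, which is the defining property of a dataflow edge.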
The diversity of design scales and dataflow techniques that are relevant to signal processing systems poses major challenges to achieving
∗Corresponding author.
E-mail addresses: lli12311@umd.edu (L. Li), carlo.sau@diee.unica.it (C. Sau), tiziana.fanni@diee.unica.it (T. Fanni), jingui.li@tuni.fi (J. Li), timo.2.viitanen@tuni.fi (T. Viitanen), francois.christophe@helsinki.fi (F. Christophe), fpalumbo@uniss.it (F. Palumbo), raffo@unica.it (L. Raffo), heikki.huttunen@tuni.fi (H. Huttunen), jarmo.takala@tuni.fi (J. Takala), ssb@umd.edu (S.S. Bhattacharyya).
the full potential that is offered by signal processing platforms under stringent time-to-market constraints. While automated techniques, such as those referred to above for scheduling and buffer mapping, are effective for specialized combinations of platforms and dataflow models (e.g., multicore CPUs and synchronous dataflow, respectively), they are limited in their ability to support more comprehensive assessment of the design space, where the models and target platforms themselves have great influence on addressing implementation constraints and optimization objectives. System designers must therefore resort to ad-hoc methods to explore design alternatives that span multiple implementation scales, platform types, or dataflow modeling techniques.
In this work, we propose a design methodology and an integrated set of tools and libraries that are developed to help bridge this gap. We refer to this methodology as the STMC Methodology or STMCM, which is named after the different institutions across which it is developed (Sassari, Tampere, Maryland, Cagliari). STMCM focuses on enabling experimentation across different levels of abstraction throughout
https://doi.org/10.1016/j.sysarc.2018.12.010
Received 27 April 2018; Received in revised form 4 November 2018; Accepted 31 December 2018. Available online 31 December 2018.
(LWDF) programming [3], and its underlying core functional dataflow (CFDF) model of computation [4]. LWDF provides a compact set of application programming interfaces (APIs) that allows one to apply signal-processing-oriented dataflow techniques relatively easily and efficiently in the context of existing design processes, target platforms, and simulation- and platform-oriented languages, such as MATLAB, C, CUDA, and VHDL. Additionally, CFDF is a general form of dataflow that accommodates more specialized forms of dataflow, such as Boolean dataflow [5], cyclo-static dataflow [6], synchronous dataflow [7], and RVC-CAL [8] as natural special cases. This accommodation of different dataflow models in turn provides potential to integrate designs with other dataflow frameworks and DSP libraries, such as those described in [8–13]. Furthermore, LWDF is granularity-agnostic, in the sense that actor complexity does not limit the applicability of the framework.
To demonstrate the capabilities of STMCM in addressing the challenges of mapping practical dataflow-based structures on heterogeneous signal processing platforms, we explore different implementations of a deep neural network (DNN) for vehicle classification on a heterogeneous, embedded system-on-chip (SoC), the Xilinx Zynq Z-7020 SoC. DNN applications pose great challenges in their deployment on embedded devices. Investigation of DNN implementations on embedded SoC devices is challenging due to the limited resources for processing and storage in these devices, and especially, due to the high computational complexity of DNNs. They involve very large and complex signal flow structures with intensive computation, data exchange, and multi-layer processing. These characteristics make embedded DNN implementation highly relevant as a case study for STMCM.
2. Related work
Dataflow provides valuable model-based design properties for signal processing systems, and has been adopted in a wide variety of tools for both software and hardware design. For example, LWDF APIs for CUDA and C have been targeted in the DIF-GPU tool for automated synthesis of hybrid CPU/GPU implementations [14]. The CAL programming language and the Open RVC-CAL Compiler (Orcc) toolset provide a dataflow environment for generating dataflow implementations in a number of languages, such as C, Jade, and Verilog [8,9,15] (note that the Verilog backend of Orcc has been discontinued and replaced by the Xronos synthesizer). The CAPH language and framework generate hardware description language (HDL) code from high-level dataflow descriptions [10].
The work in [16] presents an integrated design flow and tools for the automatic optimization of dataflow specifications to generate HDL designs. The Multi-Dataflow Composer (MDC) tool is a dataflow-to-hardware framework able to automatically create multi-functional reconfigurable architectures. In addition to this baseline functionality, MDC offers three additional features: (1) a structural profiler to perform a complete design space exploration, evaluating trade-offs among resource usage, power consumption and operating frequency [17]; (2) a dynamic power manager to perform, at the dataflow level, the logic partitioning of the substrate to implement at the hardware level, and apply a power saving strategy [18]; (3) a coprocessor generator to perform the complete dataflow-to-hardware customization of a Xilinx compliant multi-functional IP [16].
STMCM is complementary to these efforts that emphasize dataflow design automation. By applying LWDF APIs in novel ways, STMCM facilitates implementation of and iterative experimentation with new dataflow-based hardware/software architectures and design optimization techniques. LWDF is applied as an integral part of STMCM because of LWDF's minimal infrastructure requirements and its potential for rapid retargetability to different platforms and actor implementation languages. Furthermore, LWDF does not have any restriction in terms of actor granularity and can be extended with different combinations of dataflow graph transformations, as well as other forms of signal processing optimizations (e.g., see [1]).
In [21], we presented an efficient integration of the LWDF methodology with hardware description languages (HDLs). Building on this HDL-integrated form of LWDF, we developed methods for low power signal processing hardware implementation, and system-level trade-off exploration. In this paper, we apply the hardware design techniques introduced in [21] as part of a general methodology that spans software, hardware, and mixed hardware/software design, implementation, and trade-off exploration. Thus, while the focus in [21] is on rigorous integration across digital hardware design, lightweight dataflow programming, and low power optimization, the emphasis in this paper is on a methodology for applying LWDF concepts in an integrated manner across complete hardware/software development processes for embedded signal processing systems.
In summary, STMCM provides methods to seamlessly and comprehensively integrate LWDF-based actor implementation techniques with design processes for real-time, resource-constrained signal processing systems. STMCM can be used as an alternative to or in conjunction with more conventional automated dataflow tools (e.g., for disjoint subsystems). STMCM requires more effort in programming compared to fully automated toolchains; however, it provides more agility in terms of retargetability and experimentation, as described above. This is a useful trade-off point to have available for model-based design of complex signal processing systems.
3. Proposed design methodology
Our proposed methodology STMCM is illustrated in Fig. 1. As motivated in Section 1 and Section 2, STMCM is a design methodology that emphasizes LWDF concepts, and is specialized for SoC-based signal processing systems. The upper part of Fig. 1 represents application-specific and algorithmic aspects, while the lower part represents the general part of the methodology that is reusable across different applications. The upper part is illustrated concretely in the context of DNN system design; this part can be replaced with other application/algorithm level design aspects when applying STMCM to other applications.
In STMCM, we apply the LWDF programming model through the Lightweight Dataflow Environment (LIDE). LIDE is a software tool for dataflow-based design and implementation of signal processing systems [3,22]. LIDE is based on a compact set of application programming interfaces (APIs) that is used for instantiating, connecting, and executing dataflow actors. These APIs have been implemented in a variety of implementation languages. For example, LIDE-C [22] and LIDE-V [21] provide C and Verilog language implementations of the LIDE APIs, respectively.
Fig.1. An illustration of STMCM in the context of DNN system design.
As mentioned in Section 1 and illustrated in Fig. 1, core functional dataflow (CFDF) [4] is the form of dataflow that LWDF is based on. In CFDF, each actor is specified as a set of modes. Each actor firing operates according to one of the specified modes (called the "current mode" associated with the firing), and determines a unique next mode, which will be the current mode for the next firing. The production and consumption rates (dataflow rates) for the actor ports are constant for a given mode. However, different modes of the same actor can have different rates, which allows actors to exhibit dynamic dataflow behavior. We present the switch actor as an example of a CFDF actor. The switch actor has three modes: Control, True and False. In Control mode, the switch actor consumes one token from its Control port. In True or False mode, the switch actor consumes one token from its Data port and forwards that token to the True or False output port, accordingly. The dataflow table and mode transition diagram between CFDF modes of the switch actor are illustrated in Fig. 2.
Fig.2. Switch actor in CFDF. (a) Switch Actor, (b) Dataflow Table, (c) Mode Transition Diagram between CFDF Modes.
The definition of a CFDF actor includes two functions called the enable function and invoke function of the actor. The enable function checks whether there is sufficient data available on the actor's input edges and sufficient empty space available on the output edges to fire the actor in its next mode. The invoke function executes an actor firing according to the actor's current mode, consuming and producing amounts of data that are determined by the fixed dataflow rates of the current mode. The invoke function also determines the actor's next mode, as described above.
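The switch actor described above can be used to illustrate the enable/invoke pattern. The following C fragment is a simplified sketch, with scalar stand-ins for the FIFO ports and invented names; it is not actual LIDE-C code, but it follows the CFDF semantics: the enable function only tests firability for the current mode, and the invoke function fires once and selects the next mode.

```c
#include <assert.h>
#include <stdbool.h>

/* CFDF modes of the switch actor (illustrative enum). */
typedef enum { MODE_CONTROL, MODE_TRUE, MODE_FALSE } switch_mode_t;

typedef struct {
    switch_mode_t mode;       /* current CFDF mode */
    int true_out, false_out;  /* stand-ins for the two output ports */
    bool true_valid, false_valid;
} switch_actor_t;

/* Enable: exactly one token is needed on the port read in the current mode. */
static bool switch_enable(const switch_actor_t *a,
                          int control_tokens, int data_tokens) {
    if (a->mode == MODE_CONTROL) return control_tokens >= 1;
    return data_tokens >= 1;  /* True or False mode reads the Data port */
}

/* Invoke: fire once in the current mode and determine the next mode. */
static void switch_invoke(switch_actor_t *a, int control_token, int data_token) {
    switch (a->mode) {
    case MODE_CONTROL:        /* control token selects True or False mode */
        a->mode = control_token ? MODE_TRUE : MODE_FALSE;
        break;
    case MODE_TRUE:           /* forward data token to the True output */
        a->true_out = data_token;
        a->true_valid = true;
        a->mode = MODE_CONTROL;
        break;
    case MODE_FALSE:          /* forward data token to the False output */
        a->false_out = data_token;
        a->false_valid = true;
        a->mode = MODE_CONTROL;
        break;
    }
}
```

Because each mode has fixed rates (one token on one port), the actor as a whole still exhibits dynamic dataflow behavior through its mode transitions.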
In the remainder of this section, we discuss in detail the application-, software-, and hardware-specific processes illustrated in Fig. 1.

3.1. Application-specific tools and processes
In Fig. 1, application-specific tools and associated design processes are illustrated by gray blocks. Throughout this paper, we adopt a DNN application as a concrete demonstration of how such application-specific aspects are used as an integral part of STMCM. The DNN-focused design process illustrated in Fig. 1 starts with the derivation of DNN hyperparameters and the network configuration. Then the parameters associated with the derived DNN structure are extracted and the DNN algorithm is carefully validated to ensure that target levels of accuracy are satisfied. The block labeled "Design Requirements and Constraints" refers to the application- and platform-specific requirements and constraints on the DNN implementation. Examples of these include the accuracy and throughput requirements for image classification DNN systems, and constraints on available power and hardware resources for a targeted SoC platform.
In the remainder of this section, we introduce the software-related and hardware-related design processes that provide the core of STMCM. These processes are applied in an integrated manner for hardware/software co-design, as represented by the lower left hand part of Fig. 1. Detailed explanations of the major components in STMCM are provided in Section 4.3.
3.2. Software-related process
In the next main phase of the proposed design methodology, the DNN network configuration derived using application-specific, algorithm-level tools is mapped to a software implementation using LIDE-C. Note that LIDE-C is in no way restricted to DNN systems, and is instead designed to support a broad class of dataflow-based signal and information processing systems. For example, in the work of [23], the design space exploration of a digital predistortion system for wireless communication is based on implementation using LIDE-C. In [24], LIDE-C is extended to support parameterized synchronous dataflow [25] modeling and applied to the implementation of an adaptive wireless communication receiver. In [26], optimized vectorization techniques are applied to LIDE-based actors for throughput optimization, and demonstrated using an Orthogonal Frequency Division Multiplexing (OFDM) receiver. For more details about LIDE-C and the development of DNN components in LIDE-C, we refer the reader to [22,27].
Working with the LIDE-C implementation of the DNN, a number of optimization processes are carried out iteratively to streamline the software implementation in terms of the relevant design objectives and constraints. This iterative optimization process is illustrated in Fig. 1 by the cyclic path that involves the blocks labeled Dataflow Representation, LIDE-C Implementation, and Optimized LIDE-C Implementation. The proposed approach supports efficient application of commonly-used DNN software optimization methods such as for-loop tiling and buffer memory sharing among dataflow graph edges. We refer the reader to Section 4.1 for more details about these optimization methods and their integration with the LIDE-C implementation.
Next, software profiling is performed on the optimized LIDE-C implementation of the DNN system to extract profiling data. This data is extracted for each dataflow component of the DNN architecture. In the profiling process applied in STMCM, the memory sizes of the buffers and the execution time of the actors in the graph are measured. According to the characteristics of the DNN architecture, the DNN system is divided into multiple computation layers. In our application of STMCM, software profiling is specialized to DNN implementation by measuring the total memory sizes for the buffers both inside each layer and between pairs of adjacent layers. We also measure the total time complexity of each DNN layer.
3.3. Hardware-related process
The dataflow model of the subgraph to accelerate is implemented in hardware using LIDE-V. Hardware profiling based on the specific implementation platform is performed on the LIDE-V implementation. This profiling is used to collect measurements on hardware performance and help identify possible optimizations. Details on hardware profiling are demonstrated concretely through the case study presented in Section 4.2. Like the software implementation, the hardware implementation will in general go through multiple optimization iterations before it is finalized.
In LIDE-V, the hardware implementation of a dataflow actor is decomposed into implementations of its enable function and invoke function. These components are implemented as two coupled Verilog modules: the actor enable module (AEM), and the actor invoke module (AIM). Dataflow edges are implemented as dataflow edge modules (DEMs); we informally refer to DEMs also as "FIFOs".
To provide fully distributed scheduling of actors, one can connect a LIDE-V actor scheduling module (ASM) to each actor. The ASM initiates a new firing of its associated actor any time the actor is not already in the firing mode, has sufficient data on its input edges, and has sufficient empty space on its output edges. Scheduling of LIDE-V actors is not restricted to such a fully distributed scheduling approach. For example, with appropriately-designed control logic, subsets of actors can be serialized to allow sharing of resources within the subsets. In this paper, however, we restrict our attention to fully distributed scheduling. Fully distributed scheduling of dataflow graphs has been analyzed in various contexts. For example, Ghamarian et al. have developed methods for throughput analysis of synchronous dataflow graphs that are scheduled in a fully distributed manner [28]. Such analysis techniques can be applied to hardware subsystems in STMCM.
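The ASM firing condition can be paraphrased as a simple predicate. The following C fragment is a software analogue of that hardware logic, offered only as an illustration; the structure and field names are assumptions, not the LIDE-V interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative snapshot of one actor's scheduling state. */
typedef struct {
    bool firing;     /* actor is currently in its firing mode */
    int in_tokens;   /* tokens available on the input edge */
    int in_needed;   /* consumption rate of the next mode */
    int out_free;    /* free slots on the output edge */
    int out_needed;  /* production rate of the next mode */
} actor_state_t;

/* ASM condition: not already firing, enough input data, enough output space. */
static bool asm_may_fire(const actor_state_t *s) {
    return !s->firing
        && s->in_tokens >= s->in_needed
        && s->out_free >= s->out_needed;
}
```

In the fully distributed scheme, each actor's ASM evaluates this predicate independently and continuously, so no centralized scheduler is needed.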
The orthogonality (separation of concerns) among actor, edge, and scheduler design in LIDE-V lays a valuable foundation for rigorous integration of power management within the associated APIs. In particular, we demonstrated in [29] and [21] that methods for asynchronous design, Globally Asynchronous Locally Synchronous (GALS) design, and clock gating can be applied efficiently through natural extensions of the LIDE-V APIs. We also demonstrated the use of these extensions for power optimization.
To manage complexity and improve reuse of subsystems within and across designs, one can encapsulate subgraphs in LIDE-V within hierarchical actors (HAs). An HA in LIDE-V appears from the outside as a regular (non-hierarchical) LIDE-V actor with an associated AEM, AIM, and ASM. Execution of an HA as an actor in the enclosing dataflow graph is coordinated by the external scheduler associated with the HA. When an HA is fired by its external scheduler, the internal scheduler of the HA coordinates the firings of actors that are encapsulated within the HA (nested actors). The internal scheduler carries out the set of nested actor firings that must be completed for a given firing of the HA. An example of an HA with internal and external schedulers is discussed in detail and illustrated in Fig. 6.
Since it appears from the outside as a regular actor, an HA can be clock gated in exactly the same way, allowing the designer to efficiently switch off the whole subgraph at appropriate times during operation.

4. Case study: A deep neural network for vehicle classification
As a concrete demonstration of STMCM, we adopt a DNN use case for automatic discrimination among four types of vehicles: bus, car, truck, and van. This implementation is based on a neural network design presented in [30], where a network configuration (i.e., the number and types of layers and other DNN hyperparameters) was carefully derived and demonstrated to have very high accuracy. The accuracy of the methods was validated with a database of over 6500 images, and the resulting prediction accuracy was found to be over 97%. The work in this paper and the work in [30] have different focuses. The work of [30] focuses on deriving hyperparameters, network design, and demonstrating network accuracy, and does not address aspects of resource-constrained implementation or hardware/software co-design. In this paper, we go beyond the developments of [30] by investigating resource constrained implementation on a relevant SoC platform, and optimized hardware/software co-design involving an embedded multicore processor and FPGA acceleration fabric that are integrated on the platform. In [30], the proposed DNN architectures are evaluated based on the classification accuracy, while in our work on STMCM, the objectives that we are trying to optimize are system throughput, memory footprint and power efficiency. In addition, our work in this paper can be generalized to the design and implementation of arbitrary DNN architectures, and also it can be generalized beyond DNN applications to other signal and information processing applications; the architecture of [30] is selected as a case study to concretely demonstrate the usage of the methodology proposed in this paper.
In relation to Fig. 1, we apply the results from [30] in the block labeled "derivation of hyperparameters and DNN design" as part of the design methodology that is demonstrated in this paper. Fig. 3 illustrates the complete DNN architecture that we implement in this work. For more details about this use case, such as the dataset, the derivation of the DNN architecture and the application of the use case in vehicle classification, we refer the reader to [30].
The DNN network design is composed of two convolutional layers, two dense layers and one classifier layer, as depicted in Fig. 3. The first convolutional layer takes an RGB image (3×96×96) as input, and produces 32 feature maps, each with dimensions (48×48). The second convolutional layer takes these 32 feature maps as input and produces 32 smaller feature maps, each having dimensions (24×24). We refer to a subsystem that processes multiple input images to produce a single feature map as a branch. Thus, the first and second convolutional layers have 32 branches each. The two dense layers combine to transform the feature maps into a (1×100) vector, which is then multiplied in the classifier layer by a (100×4) matrix to determine the (1×4) classification result. Each of the four values in the result corresponds to the likelihood that the vehicle in the input image belongs to one of the four vehicle types (i.e., bus, car, truck and van).
The studied use case is relatively easy to solve compared to common image recognition benchmarks, such as MSCOCO [31] or ImageNet [32]. Therefore, one can reach high accuracy with a relatively simple network requiring significantly lower resources than common network topologies intended for mobile use (such as Mobilenets). As such, the focus of our work is not on mobile devices (e.g., smartphones), but on simpler IoT devices targeted to solving less complex machine learning problems at low cost. For further details on the DNN network design and hyperparameter specifications, we refer the reader to [30].
The specific platform and associated platform-based tools that we employ are based on the Xilinx Zynq Z-7020 SoC. The remainder of this section focuses on details associated with STMCM and its associated design processes. These details are presented concretely through the development of this DNN case study.
4.1. Software implementation and optimization
In this section, we discuss dataflow-graph- and actor-level optimizations and associated design iterations, as illustrated in Fig. 1 by the blocks labeled Dataflow Representation, LIDE-C Implementation, and Optimized LIDE-C Implementation. We start with a dataflow graph implementation that is derived using LIDE-C [3,22], which provides a C-language implementation of the LWDF APIs so that CFDF-based actors and dataflow graphs can be implemented in a structured manner using C. The initial (sequential) LIDE-C design is developed in a design phase that corresponds to the block labeled LIDE-C Implementation in Fig. 1.

After validating the correct, dataflow-based operation of the initial DNN dataflow graph implementation in LIDE-C, we experiment with various transformations at the actor, subgraph, and dataflow graph levels. Here, we exploit the orthogonality of actor, edge, and graph implementation in LIDE-C, which allows designers to flexibly and efficiently perform experimentation with a wide variety of transformations, and with different combinations of applied transformations. The actor-level transformations performed here are focused on optimization methods applied to the convolution actor, which is a major performance bottleneck in the design. The subgraph-level transformations involve memory management optimizations performed on FIFOs both inside each subgraph (DNN layer) and between pairs of adjacent layers.
4.1.1. Actor-level optimization
We demonstrate actor-level optimization at this stage of the design process using the convolution actor in our DNN example. In our LIDE-C implementation of this actor, we apply a transformation of the convolution computation that is commonly used to simplify the design and improve classification speed. The transformation involves loop tiling to reduce the cache miss rate. The utility of loop tiling in DNN implementation has been demonstrated previously, for example, in [33]. Using loop tiling, we decompose the main loop of the convolution computation into an inner loop that iterates within contiguous "strips" of data, and an outer loop that iterates across strips. Applying loop tiling in this way allows one to enhance cache reuse based on an array size (strip length) that fits within the cache.
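The tiling idea can be illustrated with a small, self-contained C example. This is a sketch only: a 1-D convolution with an assumed strip length, much simpler than the actual actor code of Fig. 4, but it shows the inner-loop-within-strips, outer-loop-across-strips decomposition described above.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative strip (tile) length; chosen so a strip fits in cache. */
#define STRIP 64

/* 1-D convolution with the output loop tiled into strips. */
static void conv1d_tiled(const float *x, size_t n,
                         const float *k, size_t klen, float *y) {
    size_t outlen = n - klen + 1;
    for (size_t s = 0; s < outlen; s += STRIP) {          /* across strips */
        size_t end = (s + STRIP < outlen) ? s + STRIP : outlen;
        for (size_t i = s; i < end; i++) {                /* within a strip */
            float acc = 0.0f;
            for (size_t j = 0; j < klen; j++)
                acc += x[i + j] * k[j];
            y[i] = acc;
        }
    }
}
```

The tiling changes only the iteration order, not the computed values, so it can be validated against an untiled reference implementation.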
Fig. 4 shows a segment of code from our application of the tiling transformation to the convolution actor.
Fig.3. DNN for automatic discrimination of four types of vehicles.
Through the orthogonality provided by the model-based design rules in LIDE-C, this transformation can be applied at a late stage in our design process, in a way that is interoperable with previously applied transformations, and in a way that requires no modifications to other actor or edge implementations. In this case, no modification is needed to the dataflow graph scheduler implementation either, although for some transformations, scheduler adjustments can be useful to integrate transformed actors into the overall system in an optimized way. The CFDF-based APIs (enable and invoke functions) in LIDE-C for scheduler implementation allow the designer to experiment efficiently with such scheduling adjustments as needed.
4.1.2. Buffer memory management
A major challenge in resource-constrained implementation of a DNN architecture is managing the large volume of data transfers that are carried out during network operation. Each DNN layer typically processes
Fig.4. The code segment that implements loop tiling within the LIDE-C actor for convolution.
a large amount of data, and requires memory to store the input data from the previous layer or subsystem, the intermediate data during the computation processing, and the computation results that will be transmitted to the following layer or subsystem.
Consider, for example, the buffer memory costs (the storage costs associated with the dataflow graph edges) for the DNN of Fig. 3. In our LIDE-C implementation, the second convolutional layer requires the most buffer memory. In this layer, each of the 32 branches is composed of 32 convolution actors, 31 addition actors and one actor performing both max pooling and ReLU (Rectified Linear Unit). Given that the size of the input feature map processed by each branch is 48×48 pixels, the buffer memory required for actor communication inside each branch is image_size × (number_of_conv_actors + number_of_output_feature_maps), which is 48 × 48 × (32 + 1) = 76,032 pixels. Thus, the total buffer memory inside the second convolutional layer is 76,032 × 32 = 2,433,024 pixels. The buffer memory required for data communication between the first and the second layer can be computed as 48 × 48 × 32 = 73,728 pixels.
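The buffer-size arithmetic above can be captured in a few lines of C; the constants are taken directly from the text, and the function names are illustrative.

```c
#include <assert.h>

/* Buffer-memory arithmetic for the second convolutional layer
 * (all sizes in pixels; constants from the DNN of Fig. 3). */
enum {
    IMAGE_SIZE  = 48 * 48,  /* feature map processed by each branch */
    CONV_ACTORS = 32,       /* convolution actors per branch */
    BRANCHES    = 32
};

/* image_size * (number_of_conv_actors + number_of_output_feature_maps) */
static long branch_buffer(void) {
    return (long)IMAGE_SIZE * (CONV_ACTORS + 1);
}

/* Total buffer memory inside the second convolutional layer. */
static long layer_buffer(void) {
    return branch_buffer() * BRANCHES;
}

/* Buffer memory between the first and second convolutional layers. */
static long interlayer_buffer(void) {
    return (long)IMAGE_SIZE * BRANCHES;
}
```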
In STMCM, we apply a buffer memory optimization technique that is useful for resource-constrained DNN implementation. In particular, we incorporate a new FIFO abstract data type (ADT) implementation in LIDE-C, called shared FIFO, that enables multiple dataflow edges in a graph to be implemented through FIFO ADT instances that share the same region of memory. Such buffer sharing in dataflow implementations has been investigated in different forms for various contexts of automated scheduling and software synthesis (e.g., see [34–36]). In STMCM, we make it easy for the system designer to apply buffer sharing explicitly within her or his implementation rather than depending on its implicit support through the toolset that is used. This is an example of the agility that is supported in STMCM, as described at the end of Section 2.
Again, by exploiting the orthogonality among dataflow components, buffer sharing in STMCM is performed only on the targeted dataflow edges and requires no modification to other actors or subgraphs. Through the support for such separation of concerns in LIDE-C, different ADT implementations for a FIFO or group of FIFOs can be interchanged without affecting overall system functionality.
There are three key aspects to our application of shared FIFOs in our LIDE-C DNN implementation. First, at the input of each convolutional layer L, input data from the previous layer is stored centrally instead of being copied separately into each branch of L. Second, edges in different layers share the same memory so that the memory is time-division multiplexed between the layers: the processing of a given layer overwrites memory in its shared FIFOs without introducing conflicts that affect the computation results. Third, actors operate on data from shared input FIFOs directly through their read pointers into the FIFO (rather than first copying the data locally within the actor's internal memory). This kind of copy-elimination is similar to dataflow memory management techniques introduced by Oh and Ha [35].
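The shared FIFO and read-pointer ideas can be sketched as follows. The names and structure are assumptions made for illustration, not the LIDE-C ADT; the sketch shows two FIFO instances attached to one externally owned region (memory sharing) and a zero-copy read that hands back a pointer into that region (copy elimination). Wraparound handling is omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative "shared FIFO": storage is owned outside the FIFO, so
 * several FIFO instances can view the same memory region. */
typedef struct {
    int   *storage;    /* shared region, owned by the caller */
    size_t capacity;
    size_t read_idx, write_idx;
} shared_fifo_t;

static void sfifo_attach(shared_fifo_t *f, int *region, size_t cap) {
    f->storage = region;
    f->capacity = cap;
    f->read_idx = f->write_idx = 0;
}

static void sfifo_write(shared_fifo_t *f, int token) {
    f->storage[f->write_idx++ % f->capacity] = token;
}

/* Zero-copy read: return a pointer into the shared region instead of
 * copying tokens into actor-local memory. */
static const int *sfifo_read_ptr(shared_fifo_t *f, size_t count) {
    const int *p = &f->storage[f->read_idx % f->capacity];
    f->read_idx += count;
    return p;
}
```

The time-division multiplexing described above corresponds to a later layer's FIFO writing into the region only after the earlier layer's reads have completed.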
Improvements resulting from our application of shared FIFOs are demonstrated quantitatively in Section 5.1.
Table 1
Layer-level software profiling. Here, the row labeled "T" gives the execution time of each layer, and the row labeled "T%" gives the percentage of the total DNN execution time that is attributed to each layer.

Layer     1       2       3        4        5        Total
T [ms]    18.71   22.08   0.0149   0.0034   0.0036   40.812
T%        45.84   54.10   0.04     0.01     0.01     100
Table 2
Actor-level software profiling.

Layer      Convolutional layer 1      Convolutional layer 2
Actor      Conv     Add     M&ReLU    Conv    Add     M&ReLU
Tic [μs]   230.10   0.03    0.025     59.77   0.005   0.006

Layer      Dense layer 3    Dense layer 4    Output layer 5
Actor      Mult    ReLU     Mult    ReLU     Mult     Softmax
Tic [μs]   5.1     0.0012   0.029   0.0012   0.0023   0.0031
4.1.3. Software profiling
In this subsection, we demonstrate the process of software profiling, as illustrated in Fig. 1, in the context of our optimized LIDE-C implementation of the DNN architecture. The implementation platform is an Intel i7-2600K running at 3.4 GHz. Table 1 and Table 2 show layer- and actor-level software profiling measurements, respectively.
In Table 2, Tic denotes the invoke to firing completion time of a given actor. This is the average time that elapses between the time that an actor firing is initiated and when the firing completes. We also refer to Tic as the average execution time of the associated actor. The abbreviations Add, Conv, Mult, and M&ReLU stand, respectively, for Addition, Convolution, Multiplication, and Maxpool-and-ReLU.
Layer- and actor-level software profiling provide insight into the processing complexity of actors in each layer. According to Table 1, the convolutional layers account for 99.94% of the system execution time. Also, the execution time of Layer 2 is very close to that of Layer 1. In both convolutional layers, the Conv actors account for most of the processing time compared with the other two actors (Add and M&ReLU) in the convolutional layers. Additionally, the average execution time of the Conv actors in Layer 2 is only about a quarter of that of the Conv actors in Layer 1. This is primarily because each of the Conv actors in Layer 1 processes input images of size 96×96, while the Conv actors in Layer 2 process input feature maps that have size 48×48.
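As a sanity check, the quoted figures follow directly from the profiling tables; the following C fragment reproduces the 99.94% share and the roughly one-quarter Conv-actor ratio from the values in Table 1 and Table 2.

```c
#include <assert.h>

/* Share of total execution time taken by the two convolutional layers
 * (layer times in ms, from Table 1). */
static double conv_share_percent(void) {
    const double t1 = 18.71, t2 = 22.08, total = 40.812;
    return 100.0 * (t1 + t2) / total;
}

/* Ratio of Conv actor execution times, Layer 2 vs. Layer 1
 * (Tic values in microseconds, from Table 2). */
static double conv2_over_conv1(void) {
    return 59.77 / 230.10;
}
```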
4.2. Hardware implementation and design exploration
In this section, we describe the main capabilities of the design flow depicted in Fig. 1 with respect to design and implementation of hardware accelerators. These capabilities are represented by the blocks in the region labeled "Hardware-related Process".
Fig.5. LIDE-V implementation for the accelerated SFM.
For example, through a preliminary hardware profiling phase of the DNN application described in Section 4.2.1, we can identify three hardware design aspects that are interesting to investigate in detail: the adoption of clock gating techniques, exploitation of asynchrony that is inherent in dataflow, and exploration of different levels of actor granularity.
We demonstrate the hardware-related design process of STMCM using a hardware accelerator that is introduced in [21]. The accelerator provides a subsystem for producing feature maps from the first convolutional layer of the DNN application. In the remainder of this paper, we refer to this subsystem as the Subtree for Feature Map (SFM).
Due to the interfacing consistency that is maintained across LIDE actor implementations in different languages, one can readily convert the LIDE-C based SFM subsystem implementation into hardware by replacing each software actor with a hardware module that is designed in LIDE-V, and by connecting the derived hardware actors with LIDE-V FIFOs. Following the general approach of realizing LIDE actors in hardware, each LIDE-V actor implementation is decomposed into an AEM and AIM. The AEM is reusable among different actors in our implementation, although in general it can be useful to have specialized AEM implementations that are streamlined for the specific requirements of individual actors [21].
The hardware implementation diverges from the LIDE-C design in two major ways. First, we feed the input data in an interleaved format, reducing the complexity of the hardware interface and driver software, since there is only one input FIFO to manage. Second, the hardware actors are designed to produce one row per firing instead of entire images. This reduces the FIFO size requirements in the first layer from 96×96 pixels to only 96 pixels. The hardware actors in our implementation are scheduled using a fully distributed approach.
The resulting SFM is shown in Fig. 5. The implemented hardware is verified against reference outputs extracted from the LIDE-C implementation. In this figure, production and consumption rates (dataflow rates) are annotated next to actor ports, and w is the input image width. The convolution actor has multiple operating modes (CFDF modes) with different consumption rates.
4.2.1. Hardware profiling
We employ hardware profiling in STMCM to extract execution time data, which is later used to guide the process of iterative design optimization. In this section, we demonstrate hardware profiling in the context of our DNN application. Profiling is performed using the target platform, which in our demonstration is the Zynq Z-7020 SoC. We profile the LIDE-C implementation on the ARM A9 MPCores provided by the target platform, develop a first version implementation of the SFM on this platform, and extract execution time data from this implementation.
Table 3 presents various data associated with execution times and waiting times for the SFM hardware accelerator illustrated in Fig. 5. Here, the symbol t_tot represents the total time (in clock cycles) necessary to execute the SFM; T_ic is the average time period between an actor invocation and its corresponding firing completion; T_ci is the average time period that an actor has to wait to be fired after its previous firing completion; firings is the number of firings of a given actor during execution of the SFM; Tot, calculated as (T_ic) × (firings), gives the total execution time of a given actor during the execution of the SFM; T_ii = (T_ic + T_ci) denotes the average time period between the beginning of one invocation and the beginning of the next; and the ratio T_ii/T_ic measures the extent of actor idleness.

Table 3
Measured data associated with actor execution times and waiting (idle) times (t_tot for the SFM: 232,831 cycles).

Actor           T_ic    T_ci    firings   Tot (Tot%)        T_ii/T_ic
Deinterleave    3       2       9216      27,648 (11.87)    1.67
Convolution     2402    2       96        230,592 (99.04)   1.00
Sum             107     2297    96        10,272 (4.41)     22.46
Maxpool&ReLU    195     4613    48        9360 (4.02)       24.66
This rich collection of metrics, which is supported by the underlying CFDF model of computation, provides various insights on the dataflow-based system architecture and its implementation. For example, the T_ii/T_ic ratio provides insight on differences in processing speed that are useful in exploiting the inherent asynchrony between dataflow actors.

From analysis of our hardware profiling results (Table 3), we can derive different versions of the SFM hardware accelerator with different trade-offs among power consumption, system throughput, and hardware resource cost. Firstly, looking at column Tot%, we see that all of the actors except for Convolution are inactive throughout most of the execution time. The maximum proportion of active time among these actors is 11.87%, reached by Deinterleave. Gating the clock of these frequently inactive actors can provide more energy efficient accelerator operation by eliminating dynamic power consumption during idle phases.
Furthermore, the Deinterleave and Convolution actors have relatively small idleness levels (T_ii/T_ic), with a waiting time T_ci equal to 2 clock cycles for both of them. On the other hand, Sum and Maxpool&ReLU exhibit much larger waiting times and idleness levels. An important hint coming from the T_ci values is that, thanks to the inherent asynchrony of dataflow actors, it is possible to partition the design into different clock regions working at different frequencies, thus obtaining a globally asynchronous, locally synchronous (GALS) design. In particular, the Deinterleave and Convolution actors can be placed in one clock region (Region 1), driven by clock1, while Sum and Maxpool&ReLU can be placed in another region (Region 2), driven by clock2. On the basis of the measured T_ii/T_ic values, we can set clock2 to be 20 times slower than clock1.
Moreover, the subgraph included in Region 2 can be encapsulated into a hierarchical actor (see Section 3.3). This actor, seen from the “outside”, is like any other LIDE-V actor. The actor and its encapsulated subsystem can be clock gated or clocked with a different frequency, providing additional candidate solutions for SFM accelerator optimization.

4.2.2. SFM exploration
Based on the hardware profiling analysis discussed in Section 4.2.1, we explored six different variants of the SFM design:
Fig. 6. An illustration of the hierarchical actor associated with Design SFMh.
• SFMa: This is an asynchronous design where actors belonging to different logic regions run at different clock frequencies. In particular, the clock frequency for clock1 is set to 100 MHz, and the clock frequency for clock2 is set to 5 MHz. Referring to Fig. 5, the only modification required in the design is the replacement of the FIFOs that are placed between the two clock regions. These FIFOs need to be replaced with asynchronous FIFOs; for this purpose, we employ the clock domain crossing (CDC) FIFOs presented in [21]. CDC FIFOs are designed with read and write logic that can be driven by different clocks. At the same time, their module interfaces conform to standard LIDE-V edge interfaces, so they can replace other FIFO implementations without requiring changes to the actors that communicate with them.
• SFMCG: Based on our hardware profiling results, we apply clock gating to the Deinterleave, Sum and Maxpool&ReLU actors. To be clock gated, a LIDE-V actor needs only the instantiation of a clock gating module (CGM) [21]. The CGM involves a BUFG primitive that physically enables/disables the clock signal in the target SoC. Thus, for each clock gated actor A in SFMCG, a CGM is instantiated and connected to the clock inputs of A and to the read- and write-clock inputs, respectively, of the FIFOs that A reads from and writes to.
• SFMaCG: This design incorporates both asynchronous design and clock gating techniques. As in SFMa, the FIFOs between the two clock regions are replaced with CDC FIFOs. Additionally, the Deinterleave, Sum and Maxpool&ReLU actors are clock gated as in SFMCG, and a CGM is instantiated for each of these actors.
• SFMh: This is a hierarchical SFM design, which can be viewed as a baseline for evaluating our enhanced hierarchical design SFMhCG (defined below). In SFMh, Region 2 (see Fig. 5) is encapsulated in a hierarchical actor H. An illustration of this hierarchical actor is provided in Fig. 6. The subgraph that is encapsulated by H contains three actors A1, A2 and B. We denote this subgraph by G_H. Actors A1 and A2 correspond to Sum1 and Sum2, respectively, which are two actors that add outputs from the three convolution actors. Actor B corresponds to the Maxpool&ReLU actor.
When H is viewed as a single actor from the outside, a firing of H starts when the internal scheduler I_ASM_HA for G_H receives the invoke_HA signal from the external scheduler E_ASM_HA. Inside the subgraph G_H, the invoke_HA signal is received by ASM_A1, which is the ASM of actor A1. Once ASM_A1 receives the invoke_HA signal, the firing of the subgraph G_H starts.
• SFMhCG: This design is the same as SFMh, except that the Deinterleave actor and the hierarchical actor are clock gated. It is important to highlight that the application of clock gating at the region level is advantageous if the execution times of the actors within the region are overlapped. In this design, however, the execution times of the three actors are not overlapped. When one actor is executed, the others wait in an idle state and waste power. Therefore, we expect that this configuration would not be as effective in reducing power consumption as SFMCG in the targeted DNN case. However, we include the test in our explorations to present the complete wide variety of options made available by STMCM (even if some of them may be less efficient than others for this particular application scenario).
• SFMauto: This is a version of the SFM that is synthesized and implemented by enabling the automatic power optimization available within the adopted Xilinx Vivado environment. This design applies fine-grain clock-gating and fine-grain logic-gating at the Verilog level, and excludes all of the higher-level, dataflow-based optimizations (coarse-grain asynchronous design, clock-gating, and hierarchical decomposition) that are applied in the other five investigated designs. Thus, SFMauto is useful as a common baseline to assess the higher-level models and transformations provided by STMCM compared to existing off-the-shelf synthesis techniques.
4.3. Joint hardware/software implementation and optimization
This section shows how the proposed design flow (summarized in Fig. 1) provides a variety of interesting hardware/software co-design implementation choices and optimization possibilities. In particular, these features are represented by the “Co-design-related Process” area of Fig. 1. For a given high-level LWDF model, the interaction between software (see Section 4.1) and hardware (see Section 4.2) actors or subgraphs can be shaped and refined depending on the specific constraints and requirements of the application.
In particular, we demonstrate two main implementation aspects that can be efficiently explored with STMCM: parallelism across actor execution, and the adopted communication interfaces. The degree of parallelism can be tuned depending on the number of software and/or hardware […]. The first convolutional layer is the portion of the application that will be accelerated in the PL, while the remaining part, involving the second convolutional layer, two dense layers, and the final classification layer, will be executed by the PS.
Note that the first convolutional layer constitutes only one of the main computationally intensive steps of the DNN application. According to software profiling results that are based on the SoC platform that we applied for hardware/software co-design (see Table 8), the first convolutional layer only accounts for about 27% of the prediction time. For this reason, the speedup brought by hardware acceleration to the overall DNN application is not dramatic, as will be discussed further in Section 5. However, the results concretely demonstrate how STMCM can be applied to perform extensive design space exploration across a variety of diverse designs to achieve system performance enhancement under highly-constrained hardware resource availability.
The SFM accelerator has been integrated into the LIDE-C design presented in Section 4.1 by replacing the SFM software implementation with function calls to driver software that is capable of offloading the computation to the PL. We have experimented with using a Linux kernel driver based on the Userspace I/O (UIO) framework [37], and a driver that is independent of the Linux kernel and operates by directly accessing memory with the mmap system call. The UIO approach is more suitable for production use, while mmap works well for prototyping, and this latter approach has been used in this work for evaluation. The PS and PL can communicate by means of AXI interfaces exploiting General Purpose (GP) ports, which are 32-bit width PS master or slave ports with 600 Mbps bandwidth for both read and write channels, and High Performance (HP) ports or Accelerator Coherency Ports (ACP), which are 64-bit width PS slave ports with 1200 Mbps bandwidth for both read and write channels.
Fig. 7 depicts the reference configuration for the co-design explorations. In order to integrate the accelerator into the SoC, a generic AXI wrapper for hardware dataflow subgraphs has to be provided. The wrapper is compliant with the adopted AXI interface and lets the programmer access the input and output FIFOs of the dataflow graph and monitor their populations. For this purpose, the wrapper includes all the necessary logic for communication management.
In our hardware acceleration approach, we map the SFM subsystem to hardware. This subsystem produces a 48×48 feature map on each execution. Thus, in order to perform the entire first convolutional layer of the DNN application, which must produce 32 48×48 feature maps, the SFM accelerator has to be executed 32 times with the appropriate convolution coefficients. For each of these SFM executions, the PS sends the corresponding convolution coefficients to the accelerator. The input image, which remains the same across all 32 executions, is sent only once from the PS and stored within a local buffer within the accelerator. All 32 executions of the SFM access the input image from this local buffer. In this way, we avoid the large amount of data transfer that would be required if the input image had to be sent separately from the PS to the PL for each SFM execution. Upon completion of each SFM execution, the PS retrieves the resulting feature map from the accelerator.

In the remainder of this section, we discuss in detail three different sets of co-design implementations and optimizations that are facilitated by STMCM:
• the amount of parallelism that is exploited in the software and hardware subsystems;
• the communication interfaces used between the software and hardware subsystems; and
• the local buffering of data within the hardware accelerator.
4.3.1. Exploiting parallelism

STMCM allows the designer to explore different amounts of parallelism that are exploited in both the hardware and software subsystems (see the dashed squares in Fig. 7). In particular, depending on the specific application requirements, multiple parallel instances of software cores or hardware accelerators can be utilized. While software cores are able to execute all DNN application steps, hardware accelerators can only perform the steps that they have been conceived for. Generally speaking, hardware accelerators achieve higher efficiency than software cores when executing a given computational step, both in terms of execution time and resource efficiency (resource utilization and consumption).
In the targeted Xilinx Zynq Z-7020 SoC platform, a pair of homogeneous cores is available, so that the maximum degree of software parallelism in our implementations is 2. The available cores are both ARM A9 MPCores with two levels of cache and access to a 512 MB off-chip DDR RAM. In our experiments, we have exploited software parallelism for the two most computationally intensive steps of the application, namely the two convolutional layers.
When using FPGA fabric, designers have the possibility to utilize as much parallelism as the FPGA resources allow. In this work, we have investigated three alternative designs that utilize 1, 2 or 4 parallel SFM instances, respectively, in the same hardware accelerator. In the first case, the accelerator is executed 32 times in order to complete the 32 branches of the first convolutional layer. This design executes a different branch with different convolution coefficients for each accelerator invocation. In the second case (2 parallel SFM instances), the accelerator execution time is halved, but for each run, two new sets of convolution coefficients are necessary. Finally, with 4 parallel SFM instances, only 8 accelerator executions are needed, with each requiring the updating of four different sets of coefficients.
4.3.2. Communication interfaces
During the process of co-design exploration, STMCM gives the designer significant flexibility to select interfaces for communicating data between the hardware and software subsystems. This flexibility is provided by the general dataflow model of computation that underlies STMCM. Flexibility in selecting a communication interface can be very useful in the context of resource- or performance-constrained design. This is demonstrated, for example, by the work of Silva et al., which analyzes trade-offs among the different AXI interface options [38].
We investigated the usage of two different AXI communication interfaces located at the extremes of the resource-versus-performance trade-off:
• the memory-mapped AXI4-Lite (mm-lite) interface; and
• the FIFO-based AXI4-Stream (stream) interface.
Compared to the stream interface, the mm-lite interface has lower resource requirements, but it also exhibits lower performance. The mm-lite interface uses memory-mapped, one-by-one transfer of data items. The interface is particularly intended for control signals and small-scale data accesses. It does not need any additional modules beyond those depicted in Fig. 7, and it uses only one of the PS master GP ports. For example, the execution of one branch of the first layer requires the input images (3 RGB images with 96×96 pixels each) and kernel coefficients (3 kernels with 5×5 coefficients each). Since the mm-lite interface uses a separate data transfer operation for each pixel, this results in a total of 3 × 96 × 96 + 3 × 5 × 5 data transfer operations. Once the accelerator
Fig. 7. Reference configuration for hardware/software co-design exploration in our experiments.
completes its computation, the mm-lite interface requires 48×48 data transfer operations to enable the processor to read the output feature map.
Unlike the mm-lite interface, which performs data transfers one-by-one, the stream interface employs a DMA engine that transfers data between processor memory and the accelerator in blocks, where the block size can be up to 256 bytes. Successive data items within a block are transferred in consecutive clock cycles. The stream interface requires a DMA engine, as mentioned above, and additional FIFO buffers, and therefore incurs significant overhead in terms of resource requirements. Note that the additional hardware required by the stream interface is not depicted in Fig. 7. The DMA engine is configured through one of the PS master GP ports, and requires two different PS slave HP ports to directly access the memory where data to be transferred to/from the accelerator is stored.
To execute one branch of the first DNN layer, the stream interface performs (a) 96 memory-to-accelerator DMA operations to send the input images, with 96×3 pixels for each DMA operation, and (b) one memory-to-accelerator DMA operation to send the 5×5×3 kernel coefficients. Additionally, the stream interface needs 48 accelerator-to-memory DMA operations to retrieve the computed feature map, with 48 pixels for each DMA operation.
4.3.3. Local buffering
As mentioned previously, we incorporate local buffering of image pixels in the SFM accelerator to avoid redundant transmission of common data across different branches of the accelerator. This local buffering optimization is applied to both the mm-lite-interface- and stream-interface-based accelerator implementations.
For an accelerator configuration with a single SFM instance, the input image data is transferred to the accelerator only during execution of the first branch. After being transferred, this data is retained in a local buffer within the accelerator for reuse by the remaining 31 executions. For accelerator configurations that have multiple (parallel) SFM instances, the input image is also transferred only once to the accelerator. For these configurations, the image data is reused by the remaining executions of all of the SFM instances. Thus, our incorporation of the local buffering optimization eliminates input image data transfers for all branches except the first one.
5. Results
In this section, we present experimental results to demonstrate the design and implementation methods provided by STMCM based on the detailed case study presented in Section 4. The main contribution of this section is to demonstrate that the proposed methodology facilitates efficient experimentation with alternative dataflow-based architectures, design optimization methods, and implementation trade-offs.
5.1. Embedded software implementation
In this section we present results of our experimentation using STMCM to explore alternative embedded software implementations. We focus specifically on the optimized application of loop tiling and buffer memory management.
5.1.1. Loop tiling
As introduced in Section 4.1.1, in the optimization of our LIDE-C implementation of the DNN application, we explored loop-tiled convolution actor designs with different tile sizes. Specifically, we measured the number of cache load misses and the cache load miss rates during execution of a convolution actor. The valid tile sizes for each convolution actor were those within the range of 1 to D, where D is the dimension of input images to the actor. For example, for the convolution actors in Layer 1, which process input images with size 96×96 pixels, we explored tile sizes within the range of 1–96.
Fig. 8 shows the number of cache load misses and the cache load miss rate under different tile sizes for convolution actors with different input image dimensions (48×48, 96×96, 750×750, and 1500×1500). As we can see from the results, the cache load miss rates are very small for image dimensions D ∈ {48, 96, 750}. This indicates that the data can be fully stored or almost fully stored in the cache with any valid tile size.
For D = 1500, however, there is significant variation in the cache load miss rate across different tile sizes. The rate reaches its lowest value when the tile size is approximately 400. With careful setting of the tile size, loop tiling significantly reduces the cache miss rate for convolution actors that have relatively large image dimensions.
Additionally, we can see that there is a large average CPU cycle count for small tile sizes in all figures. We expect that this is due to the overhead caused by the additional for loops that are introduced by the loop tiling transformation.

In summary, based on our simulation analysis, for small image dimensions (96×96 and 48×48), loop tiling does not help to reduce the cache miss rate on the target platform, and furthermore, it introduces overhead due to the additional for loops. Thus, loop tiling should not be applied to this DNN application for low image dimensions. However, our experiments also show that for larger image dimensions, loop tiling does help to improve efficiency by reducing the cache load miss rate.

5.1.2. Buffer memory management
Fig. 9 shows the amount of memory required for data storage in each DNN layer. We report memory requirements in this section in terms of pixels. In our experiments, we used a 4-byte floating point data type for each pixel. Fig. 9 also shows the amount of data communication that is needed between adjacent layers, and the amount of memory that must be active simultaneously during the computation associated with each layer. The memory needed for input is calculated as input_image_size × number_of_input_images. The memory needed for execution of each layer is calculated as input_image_size × (number_of_input_images + 1) × number_of_output_feature_maps.
As we can see from Fig. 9, the processing in Layer 2 requires the largest amount of active memory, and a minimum of 2,525,184 pixels must be allocated for buffer storage. The memory size can be optimized subject to this constraint through the application of shared FIFOs, which were introduced in Section 4.1.2. The buffer memory allocation that we propose for this DNN application based on shared FIFOs is illustrated in Fig. 10.
Table 4 summarizes the memory requirements for dataflow edges (FIFO buffers) and actors in the two convolutional layers, which require most of the memory among the five layers. These memory requirements are shown both with and without the use of shared FIFOs. As discussed in Section 4.1.2, actors operate on data from shared input FIFOs directly without copying the data to their internal memory. Thus, convolution actors only need memory for their intermediate computation results, and the Add and Maxpool&ReLU actors do not require additional memory. The results presented in this table quantitatively demonstrate the utility of shared FIFOs for this application. In particular, the application of shared FIFOs reduces the memory requirements by 65%.

Table 5
Resource utilization. In parentheses: the percentage of utilization with respect to the resources available on the targeted FPGA.

Design      LUTs           REGs          BUFGs      BRAMs      DSPs
Available   53,200         106,400       32         140        220
SFM         5188 (9.75)    3472 (3.26)   1 (3.1)    11 (7.9)   13 (5.9)
SFMa        5430 (10.20)   3687 (3.47)   2 (6.3)    11 (7.9)   13 (5.9)
SFMCG       5206 (9.79)    3496 (3.29)   5 (15.6)   11 (7.9)   13 (5.9)
SFMaCG      5479 (10.30)   3704 (3.48)   6 (18.8)   11 (7.9)   13 (5.9)
SFMh        5170 (9.72)    3472 (3.26)   1 (3.1)    11 (7.9)   13 (5.9)
SFMhCG      5198 (9.77)    3480 (3.27)   3 (9.4)    11 (7.9)   13 (5.9)
SFMauto     5230 (9.83)    3472 (3.26)   1 (3.1)    11 (7.9)   13 (5.9)
5.2. Hardware implementation
In this section, we investigate trade-offs among the variants of the SFM design that were introduced in Section 4.2.2. STMCM and the underlying LIDE-V approach allow one to perform such trade-off exploration, based on different combinations of high-level optimization techniques, in a systematic manner. In particular, STMCM allows the designer to focus on different strategies for instantiating, configuring, and coordinating different combinations of actor and buffer (edge) implementations, and eliminates the need for modification inside the actor and edge implementations. We exploited these advantages of STMCM when deriving the results presented in this section.
Table 5 depicts resource utilization data that is extracted from the post-place-and-route reports generated by the Xilinx Vivado tool using the targeted Zynq Z-7020 SoC. From the results in Table 5, we see that the different design variants all exhibit similar levels of resource cost. The asynchronous designs SFMa and SFMaCG incur the highest resource costs due to the additional logic required by the CDC FIFOs. The number of BUFGs varies significantly among the different designs, depending on the number of clock domains and the number of clock gated actors.
Each of the implemented designs has been simulated in order to generate a switching activity file, which has been back-annotated to Vivado Power Estimation to extract power consumption data. Since the designs have different execution times, the energy consumption levels do not vary in the same proportions as the power consumption levels. Table 6 summarizes the power consumption, execution time and energy consumption of the six alternative designs.
In these experiments, the clock frequencies of the synchronous designs and of Region 1 (CLK1) in the asynchronous designs are all set to 100 MHz, which is the maximum achievable frequency for the targeted platform. For Region 2 (CLK2) in the asynchronous designs, the frequency is set to 5 MHz. This setting of 5 MHz is derived from the hardware profiling data (see Table 3) as 1/20 of CLK1. These clock frequencies are specified in Table 6 with the suffix _F, where F represents the frequency value in MHz.
Fig. 9. Buffer memory and communication requirements in the DNN architecture.
Fig. 10. Buffer memory allocation for the DNN application.
Table 6
Dynamic power consumption, execution time and energy consumption of the different SFM variants. In parentheses: the percentage difference with respect to the baseline SFM.

Design     Power [mW]      Time [ns]              Energy [μJ]
SFM        115             2,329,165              268
SFMa_5     89 (-22.61)     2,407,300 (+3.354)     214 (-20.01)
SFMCG      89 (-22.61)     2,329,245 (+0.003)     207 (-22.61)
SFMaCG_5   88 (-23.48)     2,408,100 (+3.389)     212 (-20.89)
SFMh       117 (+1.74)     2,329,155 (-0.000)     273 (+1.74)
SFMhCG     105 (-8.70)     2,329,175 (+0.000)     244 (-8.70)
SFMauto    113 (-1.74)     2,329,165 (+0.000)     263 (-1.74)
According to Table 6, the clock gated designs SFMCG and SFMaCG_5 have the best capabilities for saving energy, reducing the total energy consumption by 22.61% and 20.89%, respectively. Design SFMaCG_5 saves less energy than SFMCG since the former employs one more BUFG. Furthermore, in SFMaCG_5, the actors in the slower domain (Region 2) are active for a relatively large portion of the execution time, and thus, they cannot be switched off for large proportions of time. In contrast, according to Table 3, the Deinterleave actor in Region 1 can be switched off for almost 90% of the total execution time.
The designs SFMaCG_5 and SFMa_5, both of which employ two clock domains with CLK1 at 100 MHz and CLK2 at 5 MHz, have similar capabilities to save energy. The former design is slightly more energy efficient compared to the latter. The results for these two designs show that the energy saved by switching off the actors, when inactive, and also the saving of the unused logic in the CDC FIFOs counterbalance the energy overhead due to the additional circuitry.
As expected, SFMh has a small amount of energy overhead due to the logic necessary to encapsulate Sum1, Sum2 and Maxpool&ReLU into the hierarchical actor. The design SFMhCG, among the clock gated designs, is not as advantageous as the previously analyzed designs in terms of energy saving. This is because, even though it employs only three BUFGs, the hierarchical actor is switched off only when none of the underlying actors are working. This means that, for instance, while Sum1 is active, the actors Sum2 and Maxpool&ReLU will have an active clock even when they are in an idle state (so that they keep wasting energy). Finally, SFMauto is the design with the smallest energy saving, only 1.74% compared to SFM. Even considering the same optimization technique (clock gating), the level at which it is applied turns out to be fundamental: at a low level (single flip-flops in SFMauto), only the dynamic power of a restricted number of gates can be saved. On the other hand, at a coarse-grain level (groups of dataflow actors in SFMCG), it is possible to act also on the clock tree, which is highly effective for improving power saving.
5.3. Hardware/software co-design results
In this section, we investigate different hardware/software co-design configurations. As anticipated in Section 4.1, depending on the portion of the application that is accelerated in hardware and on the given requirements and constraints, different design choices regarding the hardware/software communication interface lead to different trade-offs between resource requirements and performance. For the SFM accelerator, we investigated several implementation and optimization solutions, exploring three key aspects: exploiting parallelism, communication interfaces, and local buffering (see Section 4.3). In this section, by an SFM accelerator, we mean specifically a hardware accelerator.
The different software and hardware configurations that we explored in our co-design exploration are summarized as follows.
• SW1: The application runs in software on a single ARM core. This design can be viewed as a baseline design without any optimization or hardware acceleration. Comparisons between this baseline design and alternative designs are discussed in the remainder of this section.
• SW2: The application runs in software by using both of the ARM cores on the target platform.
• HW1: A single-branch SFM accelerator is employed to execute the first convolutional layer.
• HW2: An SFM accelerator with two parallel branches. In this configuration, a local buffer is shared between the branches.
• HW4: An SFM accelerator with four parallel branches. Again, a local buffer is shared among the branches.
For multicore software implementations and hardware implementations with multiple branches, the layer or layers that are executed in parallel (i.e., for which intra-layer parallelism is exploited) are indicated in parentheses. Similarly, hardware configurations are annotated with -mm or -s depending, respectively, on whether a memory-mapped AXI-Lite communication interface or a FIFO-based AXI-Stream interface is used.
For example, SW2(L1,L2) represents a software-only implementation in which Layer 1 and Layer 2 are executed in parallel. As another example, SW2(L2)/HW2(L1)-mm represents a hardware/software implementation based on configurations SW2 and HW2; in this implementation, Layer 2 is executed across multiple cores, Layer 1 is parallelized in hardware with 2 parallel branches, and AXI-Lite is used as the communication interface.
Note that the SFM accelerators are able to execute only the first convolutional layer. Thus, in all of the DNN system implementations, the accelerators are coupled with one of the software configurations.

5.3.1. Resource costs of accelerator implementations
Table 7 depicts the resource occupancy in the targeted Zynq Z-7020 device for the different SFM accelerator implementations that we experimented with. As expected, a higher level of parallelism (going from HW1-mm to HW4-mm) requires more resources, and our experiments here help to quantify the associated trends. For example, fine-grained and computation-related resources (LUTs, REGs and DSPs) increase linearly with the number of parallel branches placed in the accelerator (about +100% with one more branch and about +300% with three more branches), while coarse-grained memory resources (BRAMs) exhibit a gentler slope. We expect that this gentler slope results because the primary BRAM-demanding module, the local buffer, is shared across parallel branches.

Table 7
Resource occupancy for different SFM accelerator implementations. In parentheses: the percentage of utilization with respect to the resources available on the targeted FPGA. The bottom part of the table depicts the percentage of variation with respect to HW1-mm.

Design      LUTs            REGs            BRAMs        DSPs
Available   53,200          106,400         140          220
HW1-mm      5395 (10.14)    4668 (4.39)     43 (30.71)   13 (5.91)
HW2-mm      10890 (20.47)   8197 (7.70)     54 (38.57)   26 (11.82)
HW4-mm      21474 (40.36)   16331 (15.35)   76 (54.29)   52 (23.64)
HW2-mm      +101.85         +75.60          +25.58       +100.00
HW4-mm      +298.04         +249.85         +76.74       +300.00

Table 8
Performance of different co-design solutions. The top part of the table depicts execution time in milliseconds (ms). The bottom part depicts the percentage of execution time variation for each configuration with respect to SW1.

Configuration          input    Layer 1   Layer 2   Layers 3:5   Prediction
SW1                    118.9    640.3     1594.7    34.4         2388.2
SW2(L1)                118.7    368.3     1609.8    34.0         1639.5
SW2(L1,L2)             117.4    354.7     842.1     33.8         1348.0
SW2(L2)/HW1(L1)-mm     118.9    118.5     856.7     35.4         1129.5
SW2(L2)/HW2(L1)-mm     118.4    74.6      866.5     35.1         1094.5
SW2(L2)/HW4(L1)-mm     117.9    54.5      859.0     35.4         1066.8
SW2(L1)                -0.13    -42.48    +0.95     +1.12        -10.75
SW2(L1,L2)             -1.22    -44.61    -47.19    -1.72        -43.56
SW2(L2)/HW1(L1)-mm     -0.00    -81.49    -46.28    +2.89        -52.71
SW2(L2)/HW2(L1)-mm     -0.40    -88.36    -45.66    +2.09        -54.17
SW2(L2)/HW4(L1)-mm     -0.81    -91.48    -46.13    +2.85        -55.33
The results above indicate that when the DNN architecture is made deeper (i.e., as the number of convolutional layers is increased), the biggest restriction will be the hardware resource limitations. Usually, as a DNN is made deeper, more parallel branches are needed to complete the computation without compromising the processing speed, and more memory resources are needed to store the intermediate feature maps. However, deeper networks do not necessarily imply more computational complexity. For example, the well-known ResNet101, which has 101 layers, needs less computation than the 16-layer VGG16 because the VGG layers are significantly larger [39,40].
5.3.2. Comparison of co-design solutions
Table 8 presents performance results for the different software-only and hardware/software solutions that we investigated using STMCM. In particular, the table reports the execution time in terms of milliseconds (ms) for different execution phases: reading the input file (column input), computing the first and the second layers, and computing the deep layers (Layers 3, 4 and 5). The table also reports the execution time of the overall application (prediction) for different degrees of software and hardware parallelism.
The reference time is given by the execution of the entire DNN application on a single ARM core (SW1), which is capable of completing the prediction in about 2.4 seconds. From this reference configuration, it is also possible to appreciate the computational load of the different application phases. The heaviest part is Layer 2, which is responsible for more than 65% of the overall execution time, while most of the remaining load is attributable to Layer 1 (around 25%), and to reading of the input file (about 5%). For this reason, software parallelization has been evaluated only on Layer 1 (SW2(L1)), and on both Layers 1 and 2 together (SW2(L1,L2)).
Table 9 (excerpt)
Resource costs related to the communication interfaces, with utilization percentages of the target platform given in parentheses. The bottom rows report the variation of each configuration with respect to HW1-mm, in percent.

                   LUTs           FFs            BRAMs         DSPs
(3) DMA (stream)   1490 (2.80)    1881 (1.77)    3 (2.14)      0 (0.00)
(1)+(2)+(3)        7486 (14.07)   6480 (6.09)    56 (40.00)    13 (5.91)

Variation w.r.t. HW1-mm [%]
HW1-s              +7.21          -6.66          +0.00         +0.00
(1)+(2)+(3)        +38.75         +38.81         +53.49        +0.00
The execution time needed by each of the major execution phases is almost halved when two cores are adopted. A precise 50% reduction is not reached because of the software overhead necessary to manage multitasking. With software parallelization only, the overall execution time is reduced to 1.13 seconds, about 44% less than the SW1 configuration. Hardware acceleration and related parallelization are only applied to the first convolutional layer, while only software parallelization is applied to Layer 2. If we consider only the execution time of Layer 1, then SW2(L2)/HW1(L1) reduces execution time by more than 80% compared to SW1, and more than 65% compared to SW2(L1,L2).
If multiple branches of Layer 1 are processed in parallel, the hardware accelerator achieves further performance benefits: a time saving up to 88% for a 2-branch configuration (SW2(L2)/HW2(L1)-mm), and up to 91% for a 4-branch configuration (SW2(L2)/HW4(L1)-mm). These performance improvements are with respect to SW1. Note that the speed-up obtained by doubling the number of branches (going from 1 to 2 and from 2 to 4) is less than 2 in either case (1.6 from 1 to 2, and 1.4 from 2 to 4). This is due to the software overhead related to managing multiple branches. Due to the limited computational load of Layer 1, the benefits of hardware acceleration and parallelization on the overall system are somewhat limited. The best solution, SW2(L2)/HW4(L1)-mm, requires 1.07 seconds to perform the whole application, 55% less than a full software execution on a single core (SW1) and 21% less than a full software execution on two cores (SW2(L1,L2)).
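These sub-linear speed-ups can be reproduced from the Layer 1 column of Table 8:

```python
# Layer 1 execution times [ms] from Table 8, by number of hardware branches.
layer1_ms = {"SW1": 640.3, "HW1": 118.5, "HW2": 74.6, "HW4": 54.5}

# Speed-up from doubling the number of accelerator branches.
speedup_1_to_2 = layer1_ms["HW1"] / layer1_ms["HW2"]  # 1 -> 2 branches
speedup_2_to_4 = layer1_ms["HW2"] / layer1_ms["HW4"]  # 2 -> 4 branches

# Layer 1 time saving of the 4-branch accelerator relative to SW1.
saving_vs_sw1 = 1 - layer1_ms["HW4"] / layer1_ms["SW1"]

print(f"{speedup_1_to_2:.1f}x, {speedup_2_to_4:.1f}x, "
      f"{100 * saving_vs_sw1:.0f}% saving vs SW1")
```

Both doublings fall well short of the ideal 2x, consistent with the branch-management overhead discussed above.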
Another aspect that has been studied in our co-design experiments is the interfacing between system components. As discussed in Section 4.3.2, the adopted communication interface between software and hardware portions of a design can have a significant impact on overall system performance. In our co-design experiments, we have applied two very different interfaces, mm-lite and stream, which are discussed in Section 4.3.2.
Table 9 helps to understand differences between the resource costs of these two interfaces. The first row of this table shows resource availability on the target platform. The second and third rows show resource costs for the HW1-mm accelerator and the HW1-s accelerator. The fourth and fifth rows show resource costs for FIFO and DMA modules (external to the accelerator) that are necessary for the stream interface. The sixth row shows the total resource costs induced by use of the stream interface
((1)+(2)+(3)). Compared to HW1-mm, the complete stream-based configuration requires almost 39% more LUTs and flip-flops, while over 50% more BRAM cost is incurred.

To make the stream interface a useful option in our system design, its significant increase in resource costs should be accompanied by tangible advantages in execution performance. Table 10 shows results pertaining to the impact of the communication interface on execution time. In order to better expose the effects of the selected communication interface, details on data transfers (input, convolution coefficients and outputs) between the hardware (accelerator) and software subsystems are reported. For the HW1-s design, two different sets of results are reported depending on whether program data is directly accessible by the DMA engine. For one set, the program data is located in a memory that is not directly accessible by the DMA. This scenario corresponds to the design that we have implemented. It requires an additional copy of the program data in a memory that is accessed directly by the DMA. For the other set, the program data is located in a memory that is directly accessible by the DMA. This set is indicated in Table 10 using the annotation HW1-s dir. We have not implemented HW1-s dir; instead, we have estimated the corresponding results to gain some idea about the maximum achievable performance. Details on the estimation approach are omitted for brevity.

The results in Table 10 demonstrate the utility of the resource-hungry HW1-s design, and quantify its clear ability to outperform HW1-mm. In particular, the input data transmission and the transmission of convolution coefficients are respectively about 84% and 67% faster when the AXI-stream protocol is adopted. This leads to an estimated time saving of up to 92% and 80%, respectively, when the DMA has direct access to the program data (HW1-s dir).
On the other hand, the output data transmission time is the same among all of the reported configurations. We expect that this is because the outputs are produced in a row-by-row fashion (48 data units at a time), and the timing of output production is determined by the computation latency, which is greater than the communication latency for all of the interfacing configurations. However, looking at the total Layer 1 and DNN application execution times, we see that the advantages of adopting the stream interface are no longer visible. Indeed, for the considered SFM accelerator, the input data is transmitted only during the first branch execution due to our use of local buffering. Additionally, even though the coefficients are transmitted for each branch, their transmission requires a relatively small amount of time.
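The observation that output-transfer time is interface-independent can be captured with a simple producer-limited pipeline model. The sketch below uses illustrative latency values, not measured ones; it only shows that once computation paces the pipeline, a faster link cannot shorten the total output time.

```python
def output_time_us(n_rows, compute_per_row_us, transfer_per_row_us):
    """Each output row (48 data units) can only be sent once computed,
    so the slower of computation and communication paces every row."""
    return n_rows * max(compute_per_row_us, transfer_per_row_us)

n_rows = 48          # assumed number of output rows (illustrative)
compute_us = 50.0    # assumed per-row computation latency (illustrative)
mm_lite_us = 20.0    # assumed per-row transfer latency, slower interface
stream_us = 5.0      # assumed per-row transfer latency, faster interface

# Both links are faster than the computation, so the totals coincide.
t_mm = output_time_us(n_rows, compute_us, mm_lite_us)
t_stream = output_time_us(n_rows, compute_us, stream_us)
print(t_mm == t_stream)
```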
These results involving communication interface selection illustrate the importance of comprehensive system-level evaluation of alternative design options, which is one of the key parts of the design process that is facilitated by STMCM.
Table 10
Results pertaining to the impact of the communication interface on execution time. The top part of the table depicts the execution time of the different DNN application steps. The bottom part depicts the execution time variation of each configuration with respect to HW1-mm, in percent.

            File input   Layer 1                                                        Layer 2   Layers 3, 4, 5   Prediction
            [ms]         input tx [us]   coeffs tx [us]   output tx [us]   total [ms]   [ms]      [ms]             [ms]
HW1-mm      118.9        5222            15               2339             118.5        856.7     35.4             1129.5
HW1-s       117.5        854             5                2333             114.6        864.3     34.9             1131.3
HW1-s dir   118.3        418             3                2333             114.1        869.5     35.0             1137.0

Variation w.r.t. HW1-mm [%]
HW1-s       -1.17        -83.65          -66.67           -0.26            -3.27        +0.87     -1.39            +0.16