
Contents lists available at ScienceDirect

Journal of Systems Architecture

journal homepage: www.elsevier.com/locate/sysarc

An integrated hardware/software design methodology for signal processing systems

Lin Li a,∗, Carlo Sau b, Tiziana Fanni b, Jingui Li c, Timo Viitanen c, François Christophe e, Francesca Palumbo d, Luigi Raffo b, Heikki Huttunen c, Jarmo Takala c, Shuvra S. Bhattacharyya a,c

a University of Maryland, ECE Department, College Park, MD 20742, United States
b University of Cagliari, Department of Electrical and Electronic Engineering, Italy
c Tampere University, Finland
d University of Sassari, PolComIng-Information Engineering Unit, Italy
e Department of Computer Science, University of Helsinki, Finland

∗ Corresponding author.
E-mail addresses: lli12311@umd.edu (L. Li), carlo.sau@diee.unica.it (C. Sau), tiziana.fanni@diee.unica.it (T. Fanni), jingui.li@tuni.fi (J. Li), timo.2.viitanen@tuni.fi (T. Viitanen), francois.christophe@helsinki.fi (F. Christophe), fpalumbo@uniss.it (F. Palumbo), raffo@unica.it (L. Raffo), heikki.huttunen@tuni.fi (H. Huttunen), jarmo.takala@tuni.fi (J. Takala), ssb@umd.edu (S.S. Bhattacharyya).

https://doi.org/10.1016/j.sysarc.2018.12.010
Received 27 April 2018; Received in revised form 4 November 2018; Accepted 31 December 2018; Available online 31 December 2018

Keywords: Dataflow; Model-based design; Hardware/software co-design; Low power techniques; Deep learning; Signal processing systems

Abstract

This paper presents a new methodology for design and implementation of signal processing systems on system-on-chip (SoC) platforms. The methodology is centered on the use of lightweight application programming interfaces for applying principles of dataflow design at different layers of abstraction. The development processes integrated in our approach are software implementation, hardware implementation, hardware-software co-design, and optimized application mapping. The proposed methodology facilitates development and integration of signal processing hardware and software modules that involve heterogeneous programming languages and platforms. As a demonstration of the proposed design framework, we present a dataflow-based deep neural network (DNN) implementation for vehicle classification that is streamlined for real-time operation on embedded SoC devices. Using the proposed methodology, we apply and integrate a variety of dataflow graph optimizations that are important for efficient mapping of the DNN system into a resource constrained implementation that involves cooperating multicore CPUs and field-programmable gate array subsystems. Through experiments, we demonstrate the flexibility and effectiveness with which different design transformations can be applied and integrated across multiple scales of the targeted computing system.

1. Introduction

Model-based design has been widely studied and applied over the years in many domains of embedded processing. Dataflow is well-known as a paradigm for model-based design that is effective for embedded digital signal processing (DSP) systems [1]. In dataflow-based modeling, signal processing applications are represented as directed graphs (dataflow graphs), and computational functions are modeled as vertices (actors) in these graphs. Actors exchange data packets (tokens) through unidirectional, first-in, first-out (FIFO) communication channels that correspond to dataflow graph edges. Many dataflow-based design methods for DSP systems have been explored in recent years to support various aspects of design and implementation, including modeling and simulation; scheduling and mapping of actors to heterogeneous platforms; and buffer management (e.g., see [1,2]).

The diversity of design scales and dataflow techniques that are relevant to signal processing systems poses major challenges to achieving the full potential that is offered by signal processing platforms under stringent time-to-market constraints. While automated techniques, such as those referred to above for scheduling and buffer mapping, are effective for specialized combinations of platforms and dataflow models (e.g., multicore CPUs and synchronous dataflow, respectively), they are limited in their ability to support more comprehensive assessment of the design space, where the models and target platforms themselves have great influence on addressing implementation constraints and optimization objectives. System designers must therefore resort to ad-hoc methods to explore design alternatives that span multiple implementation scales, platform types, or dataflow modeling techniques.

In this work, we propose a design methodology and an integrated set of tools and libraries that are developed to help bridge this gap. We refer to this methodology as the STMCM Methodology or STMCM, which is named after the different institutions across which it is developed (Sassari, Tampere, Maryland, Cagliari). STMCM focuses on enabling experimentation across different levels of abstraction throughout the development process. The methodology builds on lightweight dataflow (LWDF) programming [3], and its underlying core functional dataflow (CFDF) model of computation [4]. LWDF provides a compact set of application programming interfaces (APIs) that allows one to apply signal-processing-oriented dataflow techniques relatively easily and efficiently in the context of existing design processes, target platforms, and simulation- and platform-oriented languages, such as MATLAB, C, CUDA, and VHDL. Additionally, CFDF is a general form of dataflow that accommodates more specialized forms of dataflow, such as Boolean dataflow [5], cyclo-static dataflow [6], synchronous dataflow [7], and RVC-CAL [8], as natural special cases. This accommodation of different dataflow models in turn provides potential to integrate designs with other dataflow frameworks and DSP libraries, such as those described in [8–13]. Furthermore, LWDF is granularity-agnostic, in the sense that actor complexity does not limit the applicability of the framework.

To demonstrate the capabilities of STMCM in addressing the challenges of mapping practical dataflow-based structures on heterogeneous signal processing platforms, we explore different implementations of a deep neural network (DNN) for vehicle classification on a heterogeneous, embedded system-on-chip (SoC), the Xilinx Zynq Z-7020 SoC. DNN applications pose great challenges in their deployment on embedded devices. Investigation of DNN implementations on embedded SoC devices is challenging due to the limited resources for processing and storage in these devices, and especially, due to the high computational complexity of DNNs. They involve very large and complex signal flow structures that involve intensive computation, data exchange, and multi-layer processing. These characteristics make embedded DNN implementation highly relevant as a case study for STMCM.

2. Related work

Dataflow provides valuable model-based design properties for signal processing systems, and has been adopted in a wide variety of tools for both software and hardware design. For example, LWDF APIs for CUDA and C have been targeted in the DIF-GPU tool for automated synthesis of hybrid CPU/GPU implementations [14]. The CAL programming language and the Open RVC-CAL Compiler (Orcc) toolset provide a dataflow environment for generating dataflow implementations in a number of languages, such as C, Jade, and Verilog [8,9,15] (note that the Verilog backend of Orcc has been discontinued and replaced by the Xronos synthesizer). The CAPH language and framework generate hardware description language (HDL) code from high-level dataflow descriptions [10].

The work in [16] presents an integrated design flow and tools for the automatic optimization of dataflow specifications to generate HDL designs. The Multi-Dataflow Composer (MDC) tool is a dataflow-to-hardware framework able to automatically create multi-functional reconfigurable architectures. In addition to this baseline functionality, MDC offers three additional features: (1) a structural profiler to perform a complete design space exploration, evaluating trade-offs among resource usage, power consumption and operating frequency [17]; (2) a dynamic power manager to perform, at the dataflow level, the logic partitioning of the substrate to implement at the hardware level, and apply a power saving strategy [18]; (3) a coprocessor generator to perform the complete dataflow-to-hardware customization of a Xilinx-compliant multi-functional IP [16].

STMCM is complementary to these efforts that emphasize dataflow design automation. By applying LWDF APIs in novel ways, STMCM facilitates implementation of, and iterative experimentation with, new dataflow-based hardware/software architectures and design optimization techniques. LWDF is applied as an integral part of STMCM because of LWDF's minimal infrastructure requirements and its potential for rapid retargetability to different platforms and actor implementation languages. Furthermore, LWDF does not have any restriction in terms of actor granularity and can be extended with different combinations of dataflow graph transformations, as well as other forms of signal processing optimizations (e.g., see [1]).

In [21], we presented an efficient integration of the LWDF methodology with hardware description languages (HDLs). Building on this HDL-integrated form of LWDF, we developed methods for low power signal processing hardware implementation, and system-level trade-off exploration. In this paper, we apply the hardware design techniques introduced in [21] as part of a general methodology that spans software, hardware, and mixed hardware/software design, implementation, and trade-off exploration. Thus, while the focus in [21] is on rigorous integration across digital hardware design, lightweight dataflow programming, and low power optimization, the emphasis in this paper is on a methodology for applying LWDF concepts in an integrated manner across complete hardware/software development processes for embedded signal processing systems.

In summary, STMCM provides methods to seamlessly and comprehensively integrate LWDF-based actor implementation techniques with design processes for real-time, resource-constrained signal processing systems. STMCM can be used as an alternative to, or in conjunction with, more conventional automated dataflow tools (e.g., for disjoint subsystems). STMCM requires more effort in programming compared to fully automated toolchains; however, it provides more agility in terms of retargetability and experimentation, as described above. This is a useful trade-off point to have available for model-based design of complex signal processing systems.

3. Proposed design methodology

Our proposed methodology STMCM is illustrated in Fig. 1. As motivated in Section 1 and Section 2, STMCM is a design methodology that emphasizes LWDF concepts, and is specialized for SoC-based signal processing systems. The upper part of Fig. 1 represents application-specific and algorithmic aspects, while the lower part represents the general part of the methodology that is reusable across different applications. The upper part is illustrated concretely in the context of DNN system design; this part can be replaced with other application/algorithm level design aspects when applying STMCM to other applications.

In STMCM, we apply the LWDF programming model through the Lightweight Dataflow Environment (LIDE). LIDE is a software tool for dataflow-based design and implementation of signal processing systems [3,22]. LIDE is based on a compact set of application programming interfaces (APIs) that is used for instantiating, connecting, and executing dataflow actors. These APIs have been implemented in a variety of implementation languages. For example, LIDE-C [22] and LIDE-V [21] provide C and Verilog language implementations of the LIDE APIs, respectively.


Fig. 1. An illustration of STMCM in the context of DNN system design.

As mentioned in Section 1 and illustrated in Fig. 1, core functional dataflow (CFDF) [4] is the form of dataflow that LWDF is based on. In CFDF, each actor is specified as a set of modes. Each actor firing operates according to one of the specified modes (called the "current mode" associated with the firing), and determines a unique next mode, which will be the current mode for the next firing. The production and consumption rates (dataflow rates) for the actor ports are constant for a given mode. However, different modes of the same actor can have different rates, which allows actors to exhibit dynamic dataflow behavior. We present the switch actor as an example of a CFDF actor. The switch actor has three modes: Control, True and False. In Control mode, the switch actor consumes one token from its Control port. In True or False mode, the switch actor consumes one token from its Data port and forwards that token to its True or False output port, accordingly. The dataflow table and the mode transition diagram between CFDF modes of the switch actor are illustrated in Fig. 2.


Fig. 2. Switch actor in CFDF. (a) Switch actor, (b) dataflow table, (c) mode transition diagram between CFDF modes.

The definition of a CFDF actor includes two functions called the enable function and invoke function of the actor. The enable function checks whether there is sufficient data available on the actor's input edges and sufficient empty space available on the output edges to fire the actor in its next mode. The invoke function executes an actor firing according to the actor's current mode, consuming and producing amounts of data that are determined by the fixed dataflow rates of the current mode. The invoke function also determines the actor's next mode, as described above.
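To make the enable/invoke contract concrete, the following is a minimal, self-contained C sketch of a CFDF-style switch actor, corresponding to the example of Fig. 2. The FIFO helper and all type and function names here are illustrative assumptions chosen to mirror the structure described above; they are not the actual LIDE-C API.

    #include <stdbool.h>

    /* Minimal token FIFO, included only to make the sketch self-contained. */
    #define FIFO_CAP 64
    typedef struct { int buf[FIFO_CAP]; int head, count; } fifo;
    static int  population(fifo *f)         { return f->count; }
    static int  read_token(fifo *f)         { int v = f->buf[f->head];
                                              f->head = (f->head + 1) % FIFO_CAP;
                                              f->count--; return v; }
    static void write_token(fifo *f, int v) { f->buf[(f->head + f->count) % FIFO_CAP] = v;
                                              f->count++; }

    typedef enum { MODE_CONTROL, MODE_TRUE, MODE_FALSE } switch_mode;

    typedef struct {
        switch_mode mode;            /* current CFDF mode */
        fifo *control, *data;        /* input edges */
        fifo *out_true, *out_false;  /* output edges */
    } switch_actor;

    /* Enable: side-effect-free check that the NEXT firing has enough
       input tokens and enough output space, given the current mode. */
    static bool switch_enable(switch_actor *a) {
        switch (a->mode) {
        case MODE_CONTROL: return population(a->control) >= 1;
        case MODE_TRUE:    return population(a->data) >= 1 &&
                                  population(a->out_true) < FIFO_CAP;
        case MODE_FALSE:   return population(a->data) >= 1 &&
                                  population(a->out_false) < FIFO_CAP;
        }
        return false;
    }

    /* Invoke: fire once in the current mode (fixed rates per mode),
       then select the unique next mode. */
    static void switch_invoke(switch_actor *a) {
        switch (a->mode) {
        case MODE_CONTROL:
            a->mode = read_token(a->control) ? MODE_TRUE : MODE_FALSE;
            break;
        case MODE_TRUE:
            write_token(a->out_true, read_token(a->data));
            a->mode = MODE_CONTROL;
            break;
        case MODE_FALSE:
            write_token(a->out_false, read_token(a->data));
            a->mode = MODE_CONTROL;
            break;
        }
    }

A canonical scheduler then simply alternates the two functions, e.g., while (switch_enable(&s)) switch_invoke(&s); this separation is what allows the same actor code to be driven by either centralized or fully distributed schedulers.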

In the remainder of this section, we discuss in detail the application-, software-, and hardware-specific processes illustrated in Fig. 1.

3.1. Application-specific tools and processes

In Fig. 1, application-specific tools and associated design processes are illustrated by gray blocks. Throughout this paper, we adopt a DNN application as a concrete demonstration of how such application-specific aspects are used as an integral part of STMCM. The DNN-focused design process illustrated in Fig. 1 starts with the derivation of DNN hyperparameters and the network configuration. Then the parameters associated with the derived DNN structure are extracted and the DNN algorithm is carefully validated to ensure that target levels of accuracy are satisfied. The block labeled "Design Requirements and Constraints" refers to the application- and platform-specific requirements and constraints on the DNN implementation. Examples of these include the accuracy and throughput requirements for image classification DNN systems, and constraints on available power and hardware resources for a targeted SoC platform.

In the remainder of this section, we introduce the software-related and hardware-related design processes that provide the core of STMCM. These processes are applied in an integrated manner for hardware/software co-design, as represented by the lower left-hand part of Fig. 1. Detailed explanations of the major components in STMCM are provided in Section 4.3.

3.2. Software-related process

In the next main phase of the proposed design methodology, the DNN network configuration derived using application-specific, algorithm-level tools is mapped to a software implementation using LIDE-C. Note that LIDE-C is in no way restricted to DNN systems, and is instead designed to support a broad class of dataflow-based signal and information processing systems. For example, in the work of [23], the design space exploration of a digital predistortion system for wireless communication is based on implementation using LIDE-C. In [24], LIDE-C is extended to support parameterized synchronous dataflow [25] modeling and applied to the implementation of an adaptive wireless communication receiver. In [26], optimized vectorization techniques are applied to LIDE-based actors for throughput optimization, and demonstrated using an Orthogonal Frequency Division Multiplexing (OFDM) receiver. For more details about LIDE-C and the development of DNN components in LIDE-C, we refer the reader to [22,27].

Working with the LIDE-C implementation of the DNN, a number of optimization processes are carried out iteratively to streamline the software implementation in terms of the relevant design objectives and constraints. This iterative optimization process is illustrated in Fig. 1 by the cyclic path that involves the blocks labeled Dataflow Representation, LIDE-C Implementation, and Optimized LIDE-C Implementation. The proposed approach supports efficient application of commonly-used DNN software optimization methods such as for-loop tiling and buffer memory sharing among dataflow graph edges. We refer the reader to Section 4.1 for more details about these optimization methods and their integration with the LIDE-C implementation.

Next, software profiling is performed on the optimized LIDE-C implementation of the DNN system to extract profiling data. This data is extracted for each dataflow component of the DNN architecture. In the profiling process applied in STMCM, the memory sizes of the buffers and the execution times of the actors in the graph are measured. According to the characteristics of the DNN architecture, the DNN system is divided into multiple computation layers. In our application of STMCM, software profiling is specialized to DNN implementation by measuring the total memory sizes for the buffers both inside each layer and between pairs of adjacent layers. We also measure the total time complexity of each DNN layer.

3.3. Hardware-related process

The dataflow model of the subgraph to accelerate is implemented in hardware using LIDE-V. Hardware profiling based on the specific implementation platform is performed on the LIDE-V implementation. This profiling is used to collect measurements on hardware performance and help identify possible optimizations. Details on hardware profiling are demonstrated concretely through the case study presented in Section 4.2. Like the software implementation, the hardware implementation will in general go through multiple optimization iterations before it is finalized.

In LIDE-V, the hardware implementation of a dataflow actor is decomposed into implementations of its enable function and invoke function. These components are implemented as two coupled Verilog modules: the actor enable module (AEM) and the actor invoke module (AIM). Dataflow edges are implemented as dataflow edge modules (DEMs); we informally refer to DEMs also as "FIFOs".

To provide fully distributed scheduling of actors, one can connect a LIDE-V actor scheduling module (ASM) to each actor. The ASM initiates a new firing of its associated actor any time the actor is not already in the firing mode, has sufficient data on its input edges, and has sufficient empty space on its output edges. Scheduling of LIDE-V actors is not restricted to such a fully distributed scheduling approach. For example, with appropriately-designed control logic, subsets of actors can be serialized to allow sharing of resources within the subsets. In this paper, however, we restrict our attention to fully distributed scheduling. Fully distributed scheduling of dataflow graphs has been analyzed in various contexts. For example, Ghamarian et al. have developed methods for throughput analysis of synchronous dataflow graphs that are scheduled in a fully distributed manner [28]. Such analysis techniques can be applied to hardware subsystems in STMCM.

The orthogonality (separation of concerns) among actor, edge, and scheduler design in LIDE-V lays a valuable foundation for rigorous integration of power management within the associated APIs. In particular, we demonstrated in [29] and [21] that methods for asynchronous design, Globally Asynchronous Locally Synchronous (GALS) design, and clock gating can be applied efficiently through natural extensions of the LIDE-V APIs. We also demonstrated the use of these extensions for power optimization.

To manage complexity and improve reuse of subsystems within and across designs, one can encapsulate subgraphs in LIDE-V within hierarchical actors (HAs). An HA in LIDE-V appears from the outside as a regular (non-hierarchical) LIDE-V actor with an associated AEM, AIM, and ASM. Execution of an HA as an actor in the enclosing dataflow graph is coordinated by the external scheduler associated with the HA. When an HA is fired by its external scheduler, the internal scheduler of the HA coordinates the firings of actors that are encapsulated within the HA (nested actors). The internal scheduler carries out the set of nested actor firings that must be completed for a given firing of the HA. An example of an HA with internal and external schedulers is discussed in detail and illustrated in Fig. 6.
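In software terms, the external/internal scheduler relationship can be sketched as an actor whose invoke function runs an internal scheduler over the nested actors. The C sketch below is an illustrative analogy under assumed names (actor, ha_invoke, firings_per_invoke); in LIDE-V this coordination is realized by the E_ASM_HA and I_ASM_HA Verilog modules discussed in Section 4.2.2.

    #include <stdbool.h>

    /* Generic actor interface: enable/invoke function pointers plus state. */
    typedef struct actor actor;
    struct actor {
        bool (*enable)(actor *self);
        void (*invoke)(actor *self);
        void *state;
    };

    /* Hierarchical actor: externally a regular actor; internally it owns a
       subgraph and the nested firing counts required per HA firing. */
    typedef struct {
        actor base;               /* external view of the HA */
        actor **nested;           /* actors of the encapsulated subgraph */
        int *firings_per_invoke;  /* nested firings for one HA firing */
        int n_nested;
    } hier_actor;

    /* One firing of the HA: the internal scheduler carries out the full
       set of nested firings associated with this HA firing. */
    static void ha_invoke(actor *self) {
        hier_actor *h = (hier_actor *)self;
        for (int i = 0; i < h->n_nested; i++)
            for (int f = 0; f < h->firings_per_invoke[i]; f++)
                if (h->nested[i]->enable(h->nested[i]))
                    h->nested[i]->invoke(h->nested[i]);
    }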

Since it appears from the outside as a regular actor, an HA can be clock gated in exactly the same way, allowing the designer to efficiently switch off the whole subgraph at appropriate times during operation.

4. Case study: A deep neural network for vehicle classification

As a concrete demonstration of STMCM, we adopt a DNN use case for automatic discrimination among four types of vehicles: bus, car, truck, and van. This implementation is based on a neural network design presented in [30], where a network configuration (i.e., the number and types of layers and other DNN hyperparameters) was carefully derived and demonstrated to have very high accuracy. The accuracy of the methods was validated with a database of over 6500 images, and the resulting prediction accuracy was found to be over 97%. The work in this paper and the work in [30] have different focuses. The work of [30] focuses on deriving hyperparameters, network design, and demonstrating network accuracy, and does not address aspects of resource-constrained implementation or hardware/software co-design. In this paper, we go beyond the developments of [30] by investigating resource-constrained implementation on a relevant SoC platform, and optimized hardware/software co-design involving an embedded multicore processor and FPGA acceleration fabric that are integrated on the platform. In [30], the proposed DNN architectures are evaluated based on the classification accuracy, while in our work on STMCM, the objectives that we are trying to optimize are system throughput, memory footprint and power efficiency. In addition, our work in this paper can be generalized to the design and implementation of arbitrary DNN architectures, and it can also be generalized beyond DNN applications to other signal and information processing applications; the architecture of [30] is selected as a case study to concretely demonstrate the usage of the methodology proposed in this paper.

In relation to Fig. 1, we apply the results from [30] in the block labeled "derivation of hyperparameters and DNN design" as part of the design methodology that is demonstrated in this paper. Fig. 3 illustrates the complete DNN architecture that we implement in this work. For more details about this use case, such as the dataset, the derivation of the DNN architecture and the application of the use case in vehicle classification, we refer the reader to [30].

The DNN network design is composed of two convolutional layers, two dense layers and one classifier layer, as depicted in Fig. 3. The first convolutional layer takes an RGB image (3×96×96) as input, and produces 32 feature maps, each with dimensions (48×48). The second convolutional layer takes these 32 feature maps as input and produces 32 smaller feature maps, each having dimensions (24×24). We refer to a subsystem that processes multiple input images to produce a single feature map as a branch. Thus, the first and second convolutional layers have 32 branches each. The two dense layers combine to transform the feature maps into a (1×100) vector, which is then multiplied in the classifier layer by a (100×4) matrix to determine the (1×4) classification result. Each of the four values in the result corresponds to the likelihood that the vehicle in the input image belongs to one of the four vehicle types (i.e., bus, car, truck and van).

The studied use case is relatively easy to solve compared to common image recognition benchmarks, such as MSCOCO [31] or ImageNet [32]. Therefore, one can reach high accuracy with a relatively simple network requiring significantly lower resources than common network topologies intended for mobile use (such as MobileNets). As such, the focus of our work is not on mobile devices (e.g., smartphones), but on simpler IoT devices targeted to solving less complex machine learning problems at low cost. For further details on the DNN network design and hyperparameter specifications, we refer the reader to [30].

The specific platform and associated platform-based tools that we employ are based on the Xilinx Zynq Z-7020 SoC. The remainder of this section focuses on details associated with STMCM and its associated design processes. These details are presented concretely through the development of this DNN case study.

4.1. Software implementation and optimization

In this section, we discuss dataflow-graph- and actor-level optimizations and associated design iterations, as illustrated in Fig. 1 by the blocks labeled Dataflow Representation, LIDE-C Implementation, and Optimized LIDE-C Implementation. We start with a dataflow graph implementation that is derived using LIDE-C [3,22], which provides a C-language implementation of the LWDF APIs so that CFDF-based actors and dataflow graphs can be implemented in a structured manner using C. The initial (sequential) LIDE-C design is developed in a design phase that corresponds to the block labeled LIDE-C Implementation in Fig. 1.

After validating the correct, dataflow-based operation of the initial DNN dataflow graph implementation in LIDE-C, we experiment with various transformations at the actor, subgraph, and dataflow graph levels. Here, we exploit the orthogonality of actor, edge, and graph implementation in LIDE-C, which allows designers to flexibly and efficiently perform experimentation with a wide variety of transformations, and with different combinations of applied transformations. The actor-level transformations performed here are focused on optimization methods applied to the convolution actor, which is a major performance bottleneck in the design. The subgraph-level transformations involve memory management optimizations performed on FIFOs both inside each subgraph (DNN layer) and between pairs of adjacent layers.

4.1.1. Actor-level optimization

We demonstrate actor-level optimization at this stage of the design process using the convolution actor in our DNN example. In our LIDE-C implementation of this actor, we apply a transformation of the convolution computation that is commonly used to simplify the design and improve classification speed. The transformation involves loop tiling to reduce the cache miss rate. The utility of loop tiling in DNN implementation has been demonstrated previously, for example, in [33]. Using loop tiling, we decompose the main loop of the convolution computation into an inner loop that iterates within contiguous "strips" of data, and an outer loop that iterates across strips. Applying loop tiling in this way allows one to enhance cache reuse based on an array size (strip length) that fits within the cache.

Fig. 4 shows a segment of code from our application of the tiling transformation to the convolution actor.
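Since Fig. 4 is reproduced here only as a caption, the following C sketch illustrates the same tiling pattern on a plain 2-D convolution. The function and the strip length STRIP are hypothetical stand-ins for the actor-internal code, not the actual listing of Fig. 4; the point is the outer loop over strips and the inner loops that stay within one strip.

    #define STRIP 64  /* strip length, chosen so one strip stays cache-resident */

    /* 2-D convolution with loop tiling over the column dimension.
       in:  (H + K - 1) x (W + K - 1) input plane, row-major
       out: H x W output plane */
    static void conv2d_tiled(const float *in, int H, int W,
                             const float *kernel, int K, float *out) {
        const int in_w = W + K - 1;
        for (int x0 = 0; x0 < W; x0 += STRIP) {        /* outer: across strips */
            const int x1 = (x0 + STRIP < W) ? (x0 + STRIP) : W;
            for (int y = 0; y < H; y++) {              /* rows revisit the strip */
                for (int x = x0; x < x1; x++) {        /* inner: within a strip */
                    float acc = 0.0f;
                    for (int ky = 0; ky < K; ky++)
                        for (int kx = 0; kx < K; kx++)
                            acc += kernel[ky * K + kx] *
                                   in[(y + ky) * in_w + (x + kx)];
                    out[y * W + x] = acc;
                }
            }
        }
    }

With column tiling, the working set touched while y advances is a strip of roughly (STRIP + K - 1) columns rather than an entire image row span, which is what improves cache reuse for large image dimensions (see Section 5.1.1).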


Fig. 3. DNN for automatic discrimination of four types of vehicles.

Through the orthogonality provided by the model-based design rules in LIDE-C, this transformation can be applied at a late stage in our design process, in a way that is interoperable with previously applied transformations, and in a way that requires no modifications to other actor or edge implementations. In this case, no modification is needed to the dataflow graph scheduler implementation either, although for some transformations, scheduler adjustments can be useful to integrate transformed actors into the overall system in an optimized way. The CFDF-based APIs (enable and invoke functions) in LIDE-C for scheduler implementation allow the designer to experiment efficiently with such scheduling adjustments as needed.

4.1.2. Buffer memory management

A major challenge in resource-constrained implementation of a DNN architecture is managing the large volume of data transfers that are carried out during network operation. Each DNN layer typically processes a large amount of data, and requires memory to store the input data from the previous layer or subsystem, the intermediate data during the computation processing, and the computation results that will be transmitted to the following layer or subsystem.

Fig. 4. The code segment that implements loop tiling within the LIDE-C actor for convolution.

Consider, for example, the buffer memory costs (the storage costs associated with the dataflow graph edges) for the DNN of Fig. 3. In our LIDE-C implementation, the second convolutional layer requires the most buffer memory. In this layer, each of the 32 branches is composed of 32 convolution actors, 31 addition actors and one actor performing both max pooling and ReLU (Rectified Linear Unit). Given that the size of the input feature map processed by each branch is 48×48 pixels, the buffer memory required for actor communication inside each branch is image_size × (number_of_conv_actors + number_of_output_feature_maps), which is 48 × 48 × (32 + 1) = 76,032 pixels. Thus, the total buffer memory inside the second convolutional layer is 76,032 × 32 = 2,433,024 pixels. The buffer memory required for data communication between the first and the second layer can be computed as 48 × 48 × 32 = 73,728 pixels.

In STMCM, we apply a buffer memory optimization technique that is useful for resource-constrained DNN implementation. In particular, we incorporate a new FIFO abstract data type (ADT) implementation in LIDE-C, called shared FIFO, that enables multiple dataflow edges in a graph to be implemented through FIFO ADT instances that share the same region of memory. Such buffer sharing in dataflow implementations has been investigated in different forms for various contexts of automated scheduling and software synthesis (e.g., see [34–36]). In STMCM, we make it easy for the system designer to apply buffer sharing explicitly within her or his implementation rather than depending on its implicit support through the toolset that is used. This is an example of the agility that is supported in STMCM, as described at the end of Section 2.

Again, by exploiting the orthogonality among dataflow components, buffer sharing in STMCM is performed only on the targeted dataflow edges and requires no modification to other actors or subgraphs. Through the support for such separation of concerns in LIDE-C, different ADT implementations for a FIFO or group of FIFOs can be interchanged without affecting overall system functionality.

There are three key aspects to our application of shared FIFOs in our LIDE-C DNN implementation. First, at the input of each convolutional layer L, input data from the previous layer is stored centrally instead of being copied separately into each branch of L. Second, edges in different layers share the same memory so that the memory is time-division multiplexed between the layers: the processing of a given layer overwrites memory in its shared FIFOs without introducing conflicts that affect the computation results. Third, actors operate on data from shared input FIFOs directly through their read pointers into the FIFO (rather than first copying the data locally within the actor's internal memory). This kind of copy elimination is similar to dataflow memory management techniques introduced by Oh and Ha [35].
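The following is a minimal C sketch of how a shared-FIFO ADT can expose these three aspects. The types and functions here (mem_region, shared_fifo_attach, read_ptr) are hypothetical illustrations of the idea, not the actual LIDE-C shared FIFO API.

    #include <stddef.h>
    #include <stdlib.h>

    /* One backing store that several dataflow edges may alias. */
    typedef struct { float *base; size_t cap; } mem_region;

    /* Per-edge state is kept separate from the (shared) storage. */
    typedef struct { mem_region *region; size_t head, count; } shared_fifo;

    static shared_fifo *shared_fifo_attach(mem_region *r) {
        shared_fifo *f = malloc(sizeof *f);
        f->region = r; f->head = 0; f->count = 0;
        return f;
    }

    /* Third aspect: actors read through a pointer into the FIFO instead
       of copying tokens into actor-local memory (copy elimination). */
    static const float *read_ptr(const shared_fifo *f) {
        return f->region->base + f->head;
    }

    /* Usage: edges of Layer 1 and Layer 2 time-division multiplex one
       region, since the two layers' buffer lifetimes do not overlap. */
    static void example(void) {
        static mem_region big;
        big.cap = 2525184;  /* peak active-memory bound from Section 5.1.2 */
        big.base = malloc(big.cap * sizeof(float));
        shared_fifo *edge_layer1 = shared_fifo_attach(&big); /* live in Layer 1 */
        shared_fifo *edge_layer2 = shared_fifo_attach(&big); /* reuses memory  */
        (void)edge_layer1; (void)edge_layer2; (void)read_ptr;
    }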

Improvements resulting from our application of shared FIFOs are demonstrated quantitatively in Section 5.1.

Table 1
Layer-level software profiling. Here, the row labeled "T" gives the execution time of each layer, and the row labeled "T%" gives the percentage of the total DNN execution time that is attributed to each layer.

Layer    1       2       3        4        5        Total
T [ms]   18.71   22.08   0.0149   0.0034   0.0036   40.812
T%       45.84   54.10   0.04     0.01     0.01     100

Table 2
Actor-level software profiling.

Layer      Convolutional layer 1        Convolutional layer 2
Actor      Conv     Add     M&ReLU      Conv    Add     M&ReLU
Tic [μs]   230.10   0.03    0.025       59.77   0.005   0.006

Layer      Dense layer 3     Dense layer 4     Output layer 5
Actor      Mult    ReLU      Mult    ReLU      Mult     Softmax
Tic [μs]   5.1     0.0012    0.029   0.0012    0.0023   0.0031

4.1.3. Software profiling

In this subsection, we demonstrate the process of software profiling, as illustrated in Fig. 1, in the context of our optimized LIDE-C implementation of the DNN architecture. The implementation platform is an Intel i7-2600K running at 3.4 GHz. Tables 1 and 2 show layer- and actor-level software profiling measurements, respectively.

In Table 2, Tic denotes the invoke-to-firing-completion time of a given actor. This is the average time that elapses between the time that an actor firing is initiated and when the firing completes. We also refer to Tic as the average execution time of the associated actor. The abbreviations Add, Conv, Mult, and M&ReLU stand, respectively, for Addition, Convolution, Multiplication, and Maxpool-and-ReLU.

Layer- and actor-level software profiling provide insight into the processing complexity of actors in each layer. According to Table 1, the convolutional layers account for 99.94% of the system execution time. Also, the execution time of Layer 2 is very close to that of Layer 1. In both convolutional layers, the Conv actors account for most of the processing time compared with the other two actors (Add and M&ReLU). Additionally, the average execution time of the Conv actors in Layer 2 is only about a quarter of that of the Conv actors in Layer 1. This is primarily because each of the Conv actors in Layer 1 processes input images of size 96×96, while the Conv actors in Layer 2 process input feature maps that have size 48×48.

4.2. Hardware implementation and design exploration

In this section, we describe the main capabilities of the design flow depicted in Fig. 1 with respect to design and implementation of hardware accelerators. These capabilities are represented by the blocks in the region labeled "Hardware-related Process".


Fig. 5. LIDE-V implementation for the accelerated SFM.

For example, through a preliminary hardware profiling phase of the DNN application described in Section 4.2.1, we can identify three hardware design aspects that are interesting to investigate in detail: the adoption of clock gating techniques, exploitation of asynchrony that is inherent in dataflows, and exploration of different levels of actor granularity.

We demonstrate the hardware-related design process of STMCM using a hardware accelerator that is introduced in [21]. The accelerator provides a subsystem for producing feature maps from the first convolutional layer of the DNN application. In the remainder of this paper, we refer to this subsystem as the Subtree for Feature Map (SFM).

Due to the interfacing consistency that is maintained across LIDE actor implementations in different languages, one can readily convert the LIDE-C-based SFM subsystem implementation into hardware by replacing each software actor with a hardware module that is designed in LIDE-V, and by connecting the derived hardware actors with LIDE-V FIFOs. Following the general approach of realizing LIDE actors in hardware, each LIDE-V actor implementation is decomposed into an AEM and AIM. The AEM is reusable among different actors in our implementation, although in general it can be useful to have specialized AEM implementations that are streamlined for the specific requirements of individual actors [21].

The hardware implementation diverges from the LIDE-C design in two major ways. First, we feed the input data in an interleaved format, reducing the complexity of the hardware interface and driver software since there is only one input FIFO to manage. Second, the hardware actors are designed to produce one row per firing instead of entire images. This reduces the FIFO size requirements in the first layer from 96×96 pixels to only 96 pixels. The hardware actors in our implementation are scheduled using a fully distributed approach.

The resulting SFM is shown in Fig. 5. The implemented hardware is verified against reference outputs extracted from the LIDE-C implementation. In this figure, production and consumption rates (dataflow rates) are annotated next to actor ports, and w is the input image width. The convolution actor has multiple operating modes (CFDF modes) with different consumption rates.

4.2.1. Hardware profiling

We employ hardware profiling in STMCM to extract execution time data, which is later used to guide the process of iterative design optimization. In this section, we demonstrate hardware profiling in the context of our DNN application. Profiling is performed using the target platform, which in our demonstration is the Zynq Z-7020 SoC. We profile the LIDE-C implementation on the ARM A9 MPCores provided by the target platform, develop a first version implementation of the SFM on this platform, and extract execution time data from this implementation.

Table 3 depicts various data associated with execution times and waiting times for the SFM hardware accelerator illustrated in Fig. 5. Here, the symbol t_tot represents the total time necessary to execute the SFM; Tic is the average time period between an actor invocation and its corresponding firing completion; Tci is the average time period that an actor has to wait to be fired after its previous firing completion; firings is the number of firings of a given actor during execution of the SFM; Tot, calculated as (Tic) × (firings), gives the total execution time of a given actor during the execution of the SFM; Tii = (Tic + Tci) denotes the average time period between the beginning of one invocation and the beginning of the next; and the ratio Tii/Tic measures the extent of actor idleness. All times are given in clock cycles.

Table 3
Measured data associated with actor execution times and waiting (idle) times.

SFM: t_tot = 232,831

               Tic    Tci    firings   Tot (Tot%)        Tii/Tic
Deinterleave   3      2      9216      27,648 (11.87)    1.67
Convolution    2402   2      96        230,592 (99.04)   1.00
Sum            107    2297   96        10,272 (4.41)     22.46
Maxpool&ReLU   195    4613   48        9360 (4.02)       24.66
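As a worked check of these definitions, consider the Sum actor in Table 3: Tii = Tic + Tci = 107 + 2297 = 2404 cycles, so Tii/Tic = 2404/107 ≈ 22.46, and Tot = 107 × 96 = 10,272 cycles, i.e., 4.41% of t_tot = 232,831.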

This rich collection of metrics, which is supported by the underlying CFDF model of computation, provides various insights on the dataflow-based system architecture and its implementation. For example, the Tii/Tic ratio provides insight on differences in processing speed that are useful in exploiting the inherent asynchrony between dataflow actors.

From analysis of our hardware profiling results (Table 3), we can derive different versions of the SFM hardware accelerator with different trade-offs among power consumption, system throughput, and hardware resource cost. First, looking at column Tot%, we see that all of the actors except for Convolution are inactive throughout most of the execution time. The maximum proportion of active time among these actors is 11.87%, reached by Deinterleave. Gating the clock of these frequently inactive actors can provide more energy efficient accelerator operation by eliminating dynamic power consumption during idle phases.

Furthermore, the Deinterleave and Convolution actors have relatively small idleness levels (Tii/Tic), with a waiting time Tci equal to 2 clock cycles for both of them. On the other hand, Sum and Maxpool&ReLU exhibit much larger waiting times and idleness levels. An important hint coming from the Tci values is that, thanks to the inherent asynchrony of dataflow actors, it is possible to partition the design into different clock regions working at different frequencies, thus obtaining a GALS design. In particular, the Deinterleave and Convolution actors can be placed in one clock region (Region 1), driven by clock1, while Sum and Maxpool&ReLU can be placed in another region (Region 2), driven by clock2. On the basis of the measured Tii/Tic values, we can set clock2 to be 20 times slower than clock1.

Moreover, the subgraph included in Region 2 can be encapsulated into a hierarchical actor (see Section 3.3). This actor, seen from the "outside", is like any other LIDE-V actor. The actor and its encapsulated subsystem can be clock gated or clocked with a different frequency, providing additional candidate solutions for SFM accelerator optimization.

4.2.2. SFM exploration

Based on the hardware profiling analysis discussed in Section 4.2.1, we explored six different variants of the SFM design:


Fig. 6. An illustration of the hierarchical actor associated with design SFM_h.

• SFM_a: This is an asynchronous design where actors belonging to different logic regions run at different clock frequencies. In particular, the clock frequency for clock1 is set to 100 MHz, and the clock frequency for clock2 is set to 5 MHz. Referring to Fig. 5, the only modification required in the design is the replacement of the FIFOs that are placed between the two clock regions. These FIFOs need to be replaced with asynchronous FIFOs; for this purpose, we employ the clock domain crossing (CDC) FIFOs presented in [21]. CDC FIFOs are designed with read and write logic that can be driven by different clocks. At the same time, their module interfaces conform to standard LIDE-V edge interfaces, so they can replace other FIFO implementations without requiring changes to actors that communicate with them.

• SFM_CG: Based on our hardware profiling results, we apply clock gating to the Deinterleave, Sum and Maxpool&ReLU actors. To be clock gated, a LIDE-V actor needs only the instantiation of a clock gating module (CGM) [21]. The CGM involves a BUFG primitive that physically enables/disables the clock signal in the target SoC. Thus, for each clock gated actor A in SFM_CG, a CGM is instantiated and connected to the clock input of A and to the read- and write-clock inputs, respectively, of the FIFOs that A reads from and writes to.

• SFM_aCG: This design incorporates both asynchronous design and clock gating techniques. As in SFM_a, the FIFOs between the two clock regions are replaced with CDC FIFOs. Additionally, the Deinterleave, Sum and Maxpool&ReLU actors are clock gated as in SFM_CG, and a CGM is instantiated for each of these actors.

• SFM_h: This is a hierarchical SFM design, which can be viewed as a baseline for evaluating our enhanced hierarchical design SFM_hCG (defined below). In SFM_h, Region 2 (see Fig. 5) is encapsulated in a hierarchical actor H. An illustration of this hierarchical actor is provided in Fig. 6. The subgraph that is encapsulated by H contains three actors A1, A2 and B. We denote this subgraph by G_H. Actors A1 and A2 correspond to Sum1 and Sum2, respectively, which are two actors that add outputs from the three convolution actors. Actor B corresponds to the Maxpool&ReLU actor. When H is viewed as a single actor from the outside, a firing of H starts when the internal scheduler I_ASM_HA for G_H receives the invoke_HA signal from the external scheduler E_ASM_HA. Inside the subgraph G_H, the invoke_HA signal is received by ASM_A1, which is the ASM of actor A1. Once ASM_A1 receives the invoke_HA signal, the firing of the subgraph G_H starts.

• SFM_hCG: This design is the same as SFM_h, except that the Deinterleave actor and the hierarchical actor are clock gated. It is important to highlight that the application of clock gating at the region level is advantageous if the execution times of the actors within the region are overlapped. In this design, however, the execution times of the three actors are not overlapped. When one actor is executing, the others wait in an idle state and waste power. Therefore, we expect that this configuration would not be as effective in reducing power consumption as SFM_CG in the targeted DNN case. However, we include the test in our explorations to present the complete wide variety of options made available by STMCM (even if some of them may be less efficient than others for this particular application scenario).

• SFM_auto: This is a version of the SFM that is synthesized and implemented by enabling the automatic power optimization available within the adopted Xilinx Vivado environment. This design applies fine-grain clock-gating and fine-grain logic-gating at the Verilog level and excludes all of the higher-level, dataflow-based optimizations (coarse-grain asynchronous design, clock-gating, and hierarchical decomposition) that are applied in the other five investigated designs. Thus, SFM_auto is useful as a common baseline to assess the higher-level models and transformations provided by STMCM compared to existing off-the-shelf synthesis techniques.

4.3. Joint hardware/software implementation and optimization

This section shows how the proposed design flow (summarized in Fig. 1) provides a variety of interesting hardware/software co-design implementation choices and optimization possibilities. In particular, these features are represented by the "Co-design-related Process" area of Fig. 1. For a given high-level LWDF model, the interaction between software (see Section 4.1) and hardware (see Section 4.2) actors or subgraphs can be shaped and refined depending on the specific constraints and requirements of the application.

In particular, we demonstrate two main implementation aspects that can be efficiently explored with STMCM: parallelism across actor execution, and the adopted communication interfaces. The degree of parallelism can be tuned depending on the number of software and/or hardware processing units devoted to each part of the application. In our case study, the first convolutional layer is the portion of the application that will be accelerated in the PL, while the remaining part, involving the second convolutional layer, two dense layers and the final classification layer, will be executed by the PS.

Note that the first convolutional layer constitutes only one of the main computationally intensive steps of the DNN application. According to software profiling results that are based on the SoC platform that we applied for hardware/software co-design (see Table 8), the first convolutional layer only accounts for about 27% of the prediction time. For this reason, the speedup brought by hardware acceleration to the overall DNN application is not dramatic, as will be discussed further in Section 5. However, the results concretely demonstrate how STMCM can be applied to perform extensive design space exploration across a variety of diverse designs to achieve system performance enhancement under highly-constrained hardware resource availability.

The SFM accelerator has been integrated into the LIDE-C design presented in Section 4.1 by replacing the SFM software implementation with function calls to driver software that is capable of offloading the computation to the PL. We have experimented with using a Linux kernel driver based on the Userspace I/O (UIO) framework [37], and a driver that is independent of the Linux kernel and operates by directly accessing memory with the mmap system call. The UIO approach is more suitable for production use, while mmap works well for prototyping, and this latter approach has been used in this work for evaluation. The PS and PL can communicate by means of AXI interfaces exploiting General Purpose (GP) ports, which are 32-bit-width PS master or slave ports with 600 Mbps bandwidth for both read and write channels, and High Performance (HP) ports or Accelerator Coherency Ports (ACP), which are 64-bit-width PS slave ports with 1200 Mbps bandwidth for both read and write channels.
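As an illustration of the kernel-independent driver path, the following C sketch maps the accelerator's AXI register space through /dev/mem with mmap and performs one mm-lite-style transfer. The base address ACC_BASE and the register offsets are hypothetical placeholders (in practice they come from the Vivado address map), and the sketch is not the actual driver used in this work.

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define ACC_BASE   0x43C00000u  /* hypothetical PL base address */
    #define ACC_SPAN   0x10000u
    #define REG_INPUT  0x0          /* write: push token to input FIFO   */
    #define REG_OUTPUT 0x4          /* read:  pop token from output FIFO */
    #define REG_STATUS 0x8          /* read:  output-available flag      */

    static volatile uint32_t *acc_map(void) {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) return NULL;
        void *p = mmap(NULL, ACC_SPAN, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, ACC_BASE);
        close(fd);  /* the mapping remains valid after close */
        return (p == MAP_FAILED) ? NULL : (volatile uint32_t *)p;
    }

    /* One mm-lite-style transfer: write a token, poll, read a result. */
    static uint32_t acc_transfer(volatile uint32_t *regs, uint32_t token) {
        regs[REG_INPUT / 4] = token;
        while ((regs[REG_STATUS / 4] & 1u) == 0)
            ;  /* busy-wait until the accelerator produces output */
        return regs[REG_OUTPUT / 4];
    }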

Fig. 7 depicts the reference configuration for the co-design explorations. In order to integrate the accelerator into the SoC, a generic AXI wrapper for hardware dataflow subgraphs has to be provided. The wrapper is compliant with the adopted AXI interface and lets the programmer access the input and output FIFOs of the dataflow graph and monitor their populations. For this purpose, the wrapper includes all the necessary logic for communication management.

In our hardware acceleration approach, we map the SFM subsystem to hardware. This subsystem produces a 48×48 feature map on each execution. Thus, in order to perform the entire first convolutional layer of the DNN application, which must produce 32 48×48 feature maps, the SFM accelerator has to be executed 32 times with the appropriate convolution coefficients. For each of these SFM executions, the PS will send the corresponding convolution coefficients to the accelerator. The input image, which remains the same across all 32 executions, is sent only once from the PS and stored within a local buffer within the accelerator. All 32 executions of the SFM access the input image from this local buffer. In this way, we avoid the large amount of data transfer that would be required if the input image had to be sent separately from the PS to the PL for each SFM execution. Upon completion of each SFM execution, the PS retrieves the resulting feature map from the accelerator.

In the remainder of this section, we discuss in detail three different sets of co-design implementations and optimizations that are facilitated by STMCM:

• the amount of parallelism that is exploited in the software and hardware subsystems;
• the communication interfaces adopted for data transfers between the software and hardware subsystems; and
• local buffering of data within the hardware accelerator.

4.3.1. Exploiting parallelism

STMCM allows the designer to tune the amounts of parallelism that are exploited in both the hardware and software subsystems (see the dashed squares in Fig. 7). In particular, depending on the specific application requirements, multiple parallel instances of software cores or hardware accelerators can be utilized. While software cores are able to execute all DNN application steps, hardware accelerators can only perform the steps that they have been conceived for. Generally speaking, hardware accelerators achieve higher efficiency than software cores when executing a given computational step, both in terms of execution time and resource efficiency (resource utilization and consumption).

In the targeted Xilinx Zynq Z-7020 SoC platform, a pair of homogeneous cores is available, so that the maximum degree of software parallelism in our implementations is 2. The available cores are both ARM A9 MPCores with two levels of cache and access to a 512 MB off-chip DDR RAM. In our experiments, we have exploited software parallelism for the two most computationally intensive steps of the application: the two convolutional layers.

When using FPGA fabric, designers have the possibility to utilize as much parallelism as the FPGA resources allow. In this work, we have investigated three alternative designs that utilize 1, 2 or 4 parallel SFM instances, respectively, in the same hardware accelerator. In the first case, the accelerator is executed 32 times in order to complete the 32 branches of the first convolutional layer. This design executes a different branch with different convolution coefficients for each accelerator invocation. In the second case (2 parallel SFM instances), the accelerator execution time is halved, but for each run, two new sets of convolution coefficients are necessary. Finally, with 4 parallel SFM instances, only 8 accelerator executions are needed, with each requiring the updating of four different sets of coefficients.

4.3.2. Communication interfaces

During the process of co-design exploration, STMCM gives the designer significant flexibility to select interfaces for communicating data between the hardware and software subsystems. This flexibility is provided by the general dataflow model of computation that underlies STMCM. Flexibility in selecting a communication interface can be very useful in the context of resource- or performance-constrained design. This is demonstrated, for example, by the work of Silva et al., which analyzes trade-offs among the different AXI interface options [38].

We investigated the usage of two different AXI communication interfaces located at the extremes of the resource-versus-performance trade-off:

• the memory-mapped AXI4-Lite (mm-lite) interface; and
• the FIFO-based AXI4-Stream (stream) interface.

Compared to the stream interface, the mm-lite interface has lower resource requirements, but it also exhibits lower performance. The mm-lite interface uses memory-mapped, one-by-one transfer of data items. The interface is particularly intended for control signals and small-scale data accesses. It does not need any additional modules beyond those depicted in Fig. 7, and it uses only one of the PS master GP ports. For example, the execution of one branch of the first layer requires the input images (3 RGB images with 96×96 pixels each) and kernel coefficients (3 kernels with 5×5 coefficients each). Since the mm-lite interface uses a separate data transfer operation for each pixel, this results in a total of 3 × 96 × 96 + 3 × 5 × 5 = 27,723 data transfer operations. Once the accelerator completes its computation, the mm-lite interface requires 48 × 48 = 2304 data transfer operations to enable the processor to read the output feature map.

Fig. 7. Reference configuration for hardware/software co-design exploration in our experiments.

Unlike the mm-lite interface, which performs data transfers one-by-one, the stream interface employs a DMA engine that transfers data between processor memory and the accelerator in blocks, where the block size can be up to 256 bytes. Successive data items within a block are transferred in consecutive clock cycles. The stream interface requires a DMA engine, as mentioned above, and additional FIFO buffers, and therefore incurs significant overhead in terms of resource requirements. Note that the additional hardware required by the stream interface is not depicted in Fig. 7. The DMA engine is configured through one of the PS master GP ports, and requires two different PS slave HP ports to directly access the memory where data to be transferred to/from the accelerator is stored.

To execute one branch of the first DNN layer, the stream interface performs (a) 96 memory-to-accelerator DMA operations to send the input images, with 96×3 pixels for each DMA operation, and (b) one memory-to-accelerator DMA operation to send the 5×5×3 kernel coefficients. Additionally, the stream interface needs 48 accelerator-to-memory DMA operations to retrieve the computed feature map, with 48 pixels for each DMA operation.

4.3.3. Local buffering

As mentioned previously, we incorporate local buffering of image pixels in the SFM accelerator to avoid redundant transmission of common data across different branches of the accelerator. This local buffering optimization is applied to both the mm-lite-interface- and stream-interface-based accelerator implementations.

For an accelerator configuration with a single SFM instance, the input image data is transferred to the accelerator only during execution of the first branch. After being transferred, this data is retained in a local buffer within the accelerator for reuse by the remaining 31 executions. For accelerator configurations that have multiple (parallel) SFM instances, the input image is also transferred only once to the accelerator. For these configurations, the image data is reused by the remaining executions of all of the SFM instances. Thus, our incorporation of the local buffering optimization eliminates input image data transfers for all branches except the first one.
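Putting the pieces together, the host-side control flow implied by this local-buffering scheme, for the single-SFM-instance configuration, can be sketched in C as below. The acc_* helpers are hypothetical wrappers over the mm-lite or stream interface of Section 4.3.2, declared here as prototypes only.

    /* Hypothetical driver wrappers (prototypes only, for illustration). */
    void acc_send_image(const float *rgb_image);       /* 3 x 96 x 96 pixels */
    void acc_send_coefficients(const float *kernels);  /* 3 x 5 x 5 weights  */
    void acc_run(void);                                /* one SFM execution  */
    void acc_read_feature_map(float *map_48x48);       /* 48 x 48 result     */

    /* First convolutional layer: the image is sent once and retained in the
       accelerator's local buffer; only coefficients change per branch. */
    void run_first_conv_layer(const float *image,
                              const float coeffs[32][75],
                              float maps[32][48 * 48]) {
        acc_send_image(image);                /* transferred for branch 0 only */
        for (int b = 0; b < 32; b++) {        /* 32 branches = 32 feature maps */
            acc_send_coefficients(coeffs[b]); /* new 5x5x3 kernel set          */
            acc_run();
            acc_read_feature_map(maps[b]);    /* retrieve the 48x48 output     */
        }
    }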

5. Results

In this section, we present experimental results to demonstrate the design and implementation methods provided by STMCM based on the detailed case study presented in Section 4. The main contribution of this section is to demonstrate that the proposed methodology facilitates efficient experimentation with alternative dataflow-based architectures, design optimization methods, and implementation trade-offs.

5.1. Embedded software implementation

In this section we present results of our experimentation using STMCM to explore alternative embedded software implementations. We focus specifically on the optimized application of loop tiling and buffer memory management.


5.1.1. Loop tiling

As introduced in Section 4.1.1, in the optimization of our LIDE-C implementation of the DNN application, we explored loop-tiled convolution actor designs with different tile sizes. Specifically, we measured the number of cache load misses and the cache load miss rates during execution of a convolution actor. The valid tile sizes for each convolution actor were those within the range of 1 to D, where D is the dimension of input images to the actor. For example, for the convolution actors in Layer 1, which process input images with size 96×96 pixels, we explored tile sizes within the range 1–96.

Fig. 8 shows the number of cache load misses and the cache load miss rate under different tile sizes for convolution actors with different input image dimensions (48×48, 96×96, 750×750, and 1500×1500). As we can see from the results, the cache load miss rates are very small for image dimensions D ∈ {48, 96, 750}. This indicates that the data can be fully stored or almost fully stored in the cache with any valid tile size.

For D = 1500, however, there is significant variation in the cache load miss rate across different tile sizes. The rate reaches its lowest value when the tile size is approximately 400. With careful setting of the tile size, loop tiling significantly reduces the cache miss rate for convolution actors that have relatively large image dimensions.

Additionally, we can see that there is a large average CPU cycle count for small tile sizes in all figures. We expect that this is due to the overhead caused by the additional for loops that are introduced by the loop tiling transformation.

In summary, based on our simulation analysis for small image dimensions (96×96 and 48×48), loop tiling does not help to reduce the cache miss rate on the target platform, and furthermore, it introduces overhead due to the additional for loops. Thus, loop tiling should not be applied to this DNN application for low image dimensions. However, our experiments also show that for larger image dimensions, loop tiling does help to improve efficiency by reducing the cache load miss rate.

5.1.2. Buffer memory management

Fig. 9 shows the amount of memory required for data storage in each DNN layer. We report memory requirements in this section in terms of pixels. In our experiments, we used a 4-byte floating point data type for each pixel. Fig. 9 also shows the amount of data communication that is needed between adjacent layers, and the amount of memory that must be active simultaneously during the computation associated with each layer. The memory needed for input is calculated as input_image_size × number_of_input_images. The memory needed for execution of each layer is calculated as input_image_size × (number_of_input_images + 1) × number_of_output_feature_maps. For Layer 2, for example, this gives 48 × 48 × (32 + 1) × 32 = 2,433,024 pixels, matching the per-branch computation in Section 4.1.2.

As we can see from Fig. 9, the processing in Layer 2 requires the largest amount of active memory, and a minimum of 2,525,184 pixels must be allocated for buffer storage (73,728 pixels of input from Layer 1, 2,433,024 pixels inside the layer, and 18,432 pixels of output feature maps). The memory size can be optimized subject to this constraint through the application of shared FIFOs, which were introduced in Section 4.1.2. The buffer memory allocation that we propose for this DNN application based on shared FIFOs is illustrated in Fig. 10.

Table 4 summarizes the memory requirements for dataflow edges (FIFO buffers) and actors in the two convolutional layers, which require most of the memory among the five layers. These memory requirements are shown both with and without the use of shared FIFOs. As discussed in Section 4.1.2, actors operate on data from shared input FIFOs directly without copying data to their internal memory. Thus, convolution actors only need memory for their intermediate computation results, and Add and Maxpool&ReLU actors do not require additional memory. The results presented in this table quantitatively demonstrate the utility of shared FIFOs for this application. In particular, the application of shared FIFOs reduces the memory requirements by 65%.

Table 5
Resource utilization. In parentheses: the percentage of utilization with respect to the resources available on the targeted FPGA.

            LUTs           REGs          BUFGs     BRAMs     DSPs
Available   53200          106400        32        140       220
SFM         5188 (9.75)    3472 (3.26)   1 (3.1)   11 (7.9)  13 (5.9)
SFM_a       5430 (10.20)   3687 (3.47)   2 (6.3)   11 (7.9)  13 (5.9)
SFM_CG      5206 (9.79)    3496 (3.29)   5 (15.6)  11 (7.9)  13 (5.9)
SFM_aCG     5479 (10.30)   3704 (3.48)   6 (18.8)  11 (7.9)  13 (5.9)
SFM_h       5170 (9.72)    3472 (3.26)   1 (3.1)   11 (7.9)  13 (5.9)
SFM_hCG     5198 (9.77)    3480 (3.27)   3 (9.4)   11 (7.9)  13 (5.9)
SFM_auto    5230 (9.83)    3472 (3.26)   1 (3.1)   11 (7.9)  13 (5.9)

5.2. Hardware implementation

In this section, we investigate trade-offs among the variants of the SFM design that were introduced in Section 4.2.2. STMCM and the underlying LIDE-V approach allow one to perform such trade-off exploration, based on different combinations of high-level optimization techniques, in a systematic manner. In particular, STMCM allows the designer to focus on different strategies for instantiating, configuring, and coordinating different combinations of actor and buffer (edge) implementations, and eliminates the need for modification inside the actor and edge implementations. We exploited these advantages of STMCM when deriving the results presented in this section.

Table 5 depicts resource utilization data that is extracted from the post-place-and-route reports generated by the Xilinx Vivado tool using the targeted Zynq Z-7020 SoC. From the results in Table 5, we see that the different design variants all exhibit similar levels of resource cost. The asynchronous designs SFM_a and SFM_aCG incur the highest resource costs due to the additional logic required by the CDC FIFOs. The number of BUFGs varies significantly among the different designs, depending on the number of clock domains and the number of clock gated actors.

Each of the implemented designs has been simulated in order to generate a switching activity file, which has been back-annotated to Vivado Power Estimation to extract power consumption data. Since the designs have different execution times, the energy consumption levels do not vary in the same proportions as the power consumption levels. Table 6 summarizes the power consumption, execution time and energy consumption of the six alternative designs.

In these experiments, the clock frequencies of the synchronous designs and of Region 1 (CLK1) in the asynchronous designs are all set to 100 MHz, which is the maximum achievable frequency for the targeted platform. For Region 2 (CLK2) in the asynchronous designs, the frequency is set to 5 MHz. This setting of 5 MHz is derived from the hardware profiling data (see Table 3) as 1/20 of CLK1. These clock frequencies are specified in Table 6 with the suffix _F, where F represents the frequency value in MHz.


Fig. 9. Buffer memory and communication requirements in the DNN architecture.

Fig. 10. Buffer memory allocation for the DNN application.

Table 6
Dynamic power consumption, execution time and energy consumption of the different SFM variants. In parentheses: the percentage difference with respect to the baseline SFM.

            Power [mW]     Time [ns]             Energy [μJ]
SFM         115            2,329,165             268
SFM_a_5     89 (-22.61)    2,407,300 (+3.354)    214 (-20.01)
SFM_CG      89 (-22.61)    2,329,245 (+0.003)    207 (-22.61)
SFM_aCG_5   88 (-23.48)    2,408,100 (+3.389)    212 (-20.89)
SFM_h       117 (+1.74)    2,329,155 (-0.000)    273 (+1.74)
SFM_hCG     105 (-8.70)    2,329,175 (+0.000)    244 (-8.70)
SFM_auto    113 (-1.74)    2,329,165 (+0.000)    263 (-1.74)

According to Table 6, the clock-gated designs SFMCG and SFMaCG_5 have the best capabilities for saving energy, reducing the total energy consumption by 22.61% and 20.89%, respectively. Design SFMaCG_5 saves less energy than SFMCG since the former employs one more BUFG. Furthermore, in SFMaCG_5, the actors in the slower domain (Region 2) are active for a relatively large portion of the execution time, and thus, they cannot be switched off for large proportions of time. In contrast, according to Table 3, the Deinterleave actor in Region 1 can be switched off for almost 90% of the total execution time.

The designs SFMaCG_5 and SFMa_5, both of which employ two clock domains with CLK1 at 100 MHz and CLK2 at 5 MHz, have similar capabilities to save energy. The former design is slightly more energy efficient compared to the latter. The results for these two designs show that the energy saved by switching off the actors when inactive, together with the saving from the unused logic in the CDC FIFOs, counterbalances the energy overhead due to the additional circuitry.

As expected, SFMh has a small amount of energy overhead due to the logic necessary to encapsulate Sum1, Sum2 and Maxpool&ReLU into the hierarchical actor. The design SFMhCG, among the clock-gated designs, is not as advantageous as the previously analyzed designs in terms of energy saving. This is because, even though it employs only three BUFGs, the hierarchical actor is switched off only when none of the underlying actors are working. This means that, for instance, while Sum1 is active, the actors Sum2 and Maxpool&ReLU will have an active clock even when they are in an idle state (so that they keep wasting energy). Finally, SFMauto is the design with the smallest energy saving, only 1.74% compared to SFM. Even considering the same optimization technique (clock gating), the level at which it is applied turns out to be fundamental: at a low level (single flip-flops in SFMauto), only the dynamic power of a restricted number of gates can be saved. On the other hand, at a coarse-grain level (groups of dataflow actors in SFMCG), it is possible to act also on the clock tree, which is highly effective for improving power saving.
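As a rough, first-order illustration of this granularity effect, the following C sketch splits dynamic power into a clock-tree component and a gated-register component: actor-level gating can suppress both while an actor is idle, whereas flip-flop-level gating leaves the clock tree toggling. The power-share fractions below are illustrative assumptions, not values measured in our experiments; only the ratio between the two gating styles is of interest.

#include <stdio.h>

/* Illustrative first-order model of clock-gating savings. The shares
 * below are assumed, not measured. */
int main(void) {
    double idle = 0.9;        /* idle fraction, cf. Deinterleave (Table 3) */
    double tree_share = 0.4;  /* assumed share of dynamic power in the clock tree */
    double reg_share = 0.3;   /* assumed share in the gated registers */

    /* Coarse-grained (actor-level) gating stops tree and registers. */
    double coarse = idle * (tree_share + reg_share);
    /* Fine-grained (flip-flop-level) gating stops registers only. */
    double fine = idle * reg_share;

    printf("actor-level gating saves    ~%.0f%% of dynamic power\n", 100 * coarse);
    printf("flip-flop-level gating saves ~%.0f%%\n", 100 * fine);
    return 0;
}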

5.3. Hardware/software co-design results

In this section, we investigate different hardware/software co-design configurations. As anticipated in Section 4.1, depending on the portion of the application that is accelerated in hardware and on the given requirements and constraints, different design choices regarding the hardware/software communication interface lead to different trade-offs between resource requirements and performance. For the SFM accelerator, we investigated several implementation and optimization solutions, exploring three key aspects: exploiting parallelism, communication interfaces, and local buffering (see Section 4.3). In this section, by an SFM accelerator, we mean specifically a hardware accelerator.

The different software and hardware configurations that we explored in our co-design exploration are summarized as follows.

• SW1 — The application runs in software on a single ARM core. This design can be viewed as a baseline design without any optimization or hardware acceleration. Comparisons between this baseline design and alternative designs are discussed in the remainder of this section.
• SW2 — The application runs in software by using both of the ARM cores on the target platform.
• HW1 — A single-branch SFM accelerator is employed to execute the first convolutional layer.
• HW2 — An SFM accelerator with two parallel branches. In this configuration, a local buffer is shared between the branches.
• HW4 — An SFM accelerator with four parallel branches. Again, a local buffer is shared among the branches.

For multicore software implementations and hardware implementations with multiple branches, the layer or layers that are executed in parallel (i.e., for which intra-layer parallelism is exploited) are indicated in parentheses. Similarly, hardware configurations are annotated with -mm or -s depending, respectively, on whether a memory-mapped AXI-lite communication interface or a FIFO-based AXI-stream interface is used.

For example, SW2(L1,L2) represents a software-only implementation in which Layer 1 and Layer 2 are executed in parallel. As another example, SW2(L2)/HW2(L1)-mm represents a hardware/software implementation based on configurations SW2 and HW2; in this implementation, Layer 2 is executed across multiple cores, Layer 1 is parallelized in hardware with two parallel branches, and AXI-lite is used as the communication interface.

Note that the SFM accelerators are able to execute only the first convolutional layer. Thus, in all of the DNN system implementations, the accelerators are coupled with one of the software configurations.

5.3.1. Resource costs of accelerator implementations

Table 7 depicts the resource occupancy in the targeted Zynq Z-7020 device for the different SFM accelerator implementations that we experimented with. As expected, a higher level of parallelism (going from HW1-mm to HW4-mm) requires more resources, and our experiments here help to quantify the associated trends. For example, fine-grained and computation-related resources (LUTs, REGs and DSPs) increase linearly with the number of parallel branches placed in the accelerator (about +100% with one more branch and about +300% with three more branches), while coarse-grained memory resources (BRAMs) exhibit a gentler slope. We expect that this gentler slope results because the primary BRAM-demanding module, the local buffer, is shared across the parallel branches.

Table 7
Resource occupancy for different SFM accelerator implementations. In parentheses: the percentage of utilization with respect to the resources available on the targeted FPGA. The bottom part of the table depicts the percentage of variation with respect to HW1-mm.

           LUTs            REGs            BRAMs        DSPs
Available  53200           106400          140          220
HW1-mm     5395 (10.14)    4668 (4.39)     43 (30.71)   13 (5.91)
HW2-mm     10890 (20.47)   8197 (7.70)     54 (38.57)   26 (11.82)
HW4-mm     21474 (40.36)   16331 (15.35)   76 (54.29)   52 (23.64)
HW2-mm     +101.85         +75.60          +25.58       +100.00
HW4-mm     +298.04         +249.85         +76.74       +300.00

Table 8
Performance of different co-design solutions. The top part of the table depicts execution time in milliseconds (ms). The bottom part depicts the percentage of execution time variation for each configuration with respect to SW1.

                      input   Layer 1   Layer 2   Layers 3:5   Prediction
SW1                   118.9   640.3     1594.7    34.4         2388.2
SW2(L1)               118.7   368.3     1609.8    34.0         1639.5
SW2(L1,L2)            117.4   354.7     842.1     33.8         1348.0
SW2(L2)/HW1(L1)-mm    118.9   118.5     856.7     35.4         1129.5
SW2(L2)/HW2(L1)-mm    118.4   74.6      866.5     35.1         1094.5
SW2(L2)/HW4(L1)-mm    117.9   54.5      859.0     35.4         1066.8
SW2(L1)               -0.13   -42.48    +0.95     +1.12        -10.75
SW2(L1,L2)            -1.22   -44.61    -47.19    -1.72        -43.56
SW2(L2)/HW1(L1)-mm    -0.00   -81.49    -46.28    +2.89        -52.71
SW2(L2)/HW2(L1)-mm    -0.40   -88.36    -45.66    +2.09        -54.17
SW2(L2)/HW4(L1)-mm    -0.81   -91.48    -46.13    +2.85        -55.33


The results above indicate that when the DNN architecture is made deeper (i.e., as the number of convolutional layers is increased), the biggest restriction will be the hardware resource limitations. Usually, as a DNN is made deeper, more parallel branches are needed to complete the computation without compromising the processing speed, and more memory resources are needed to store the intermediate feature maps. However, deeper networks do not necessarily imply more computational complexity. For example, the well-known ResNet101, which has 101 layers, needs less computation than the 16-layer VGG16 because the VGG layers are significantly larger [39,40].
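A convolutional layer's computational load can be approximated by its multiply-accumulate (MAC) count, K·K·Cin·Cout·H·W for a K×K kernel producing an H×W output map with Cin input and Cout output channels. The C sketch below uses hypothetical layer dimensions (not taken from our DNN or from the cited networks) purely to show how a deeper stack of small layers can require fewer MACs than a single large layer.

#include <stdio.h>

/* Approximate MAC count of one convolutional layer:
 * K*K*Cin*Cout*H*W for a KxK kernel and an HxW output map. */
static long long conv_macs(int k, int c_in, int c_out, int h, int w) {
    return (long long)k * k * c_in * c_out * h * w;
}

int main(void) {
    /* One large layer: 3x3 kernel, 512->512 channels, 28x28 output. */
    long long one_large = conv_macs(3, 512, 512, 28, 28);
    /* Four small layers: 3x3 kernel, 128->128 channels, 28x28 output. */
    long long four_small = 4 * conv_macs(3, 128, 128, 28, 28);
    printf("1 large layer : %lld MACs\n", one_large);   /* ~1.85e9 */
    printf("4 small layers: %lld MACs\n", four_small);  /* ~0.46e9 */
    return 0;
}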

5.3.2. Comparison of co-design solutions

Table 8 presents performance results for different software-only and hardware/software solutions that we investigated using STMCM. In particular, the table reports the execution time in milliseconds (ms) for different execution phases: reading the input file (column input), computing the first and the second layers, and computing the deep layers (Layers 3, 4 and 5). The table also reports the execution time of the overall application (prediction) for different degrees of software and hardware parallelism.

The reference time is given by the execution of the entire DNN application on a single ARM core (SW1), which is capable of completing the prediction in about 2.4 seconds. From this reference configuration, it is also possible to appreciate the computational load of the different application phases. The heaviest part is Layer 2, which is responsible for more than 65% of the overall execution time, while most of the remaining load is attributable to Layer 1 (around 25%), and to reading of the input file (about 5%). For this reason, software parallelization has been evaluated only on Layer 1 (SW2(L1)), and on both Layers 1 and 2 together (SW2(L1,L2)).
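These load percentages follow directly from the SW1 row of Table 8, as the short computation below confirms.

#include <stdio.h>

/* Per-phase share of the SW1 execution time (values from Table 8). */
int main(void) {
    const char *phase[] = { "input", "Layer 1", "Layer 2", "Layers 3:5" };
    double ms[] = { 118.9, 640.3, 1594.7, 34.4 };
    double total = 2388.2; /* SW1 prediction time [ms] */
    for (int i = 0; i < 4; i++)
        printf("%-10s %5.1f%%\n", phase[i], 100.0 * ms[i] / total);
    return 0; /* prints ~5.0%, 26.8%, 66.8%, 1.4% */
}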


Table 9
Resource costs related to the hardware/software communication interfaces. In parentheses: the percentage of utilization with respect to the resources available on the targeted FPGA. The bottom part of the table depicts the percentage of variation with respect to HW1-mm.

                   LUTs           REGs          BRAMs        DSPs
...
(3) DMA (stream)   1490 (2.80)    1881 (1.77)   3 (2.14)     0 (0.00)
(1)+(2)+(3)        7486 (14.07)   6480 (6.09)   56 (40.00)   13 (5.91)

HW1-s              +7.21          -6.66         +0.00        +0.00
(1)+(2)+(3)        +38.75         +38.81        +53.49       +0.00

The execution time needed by each of the major execution phases is almost halved when two cores are adopted. A precise 50% reduction is not reached because of the software overhead necessary to manage multitasking. With software parallelization only, the overall execution time is reduced to 1.35 seconds, about 44% less than the SW1 configuration. Hardware acceleration and related parallelization are only applied to the first convolutional layer, while only software parallelization is applied to Layer 2. If we consider only the execution time of Layer 1, then SW2(L2)/HW1(L1) reduces execution time by more than 80% compared to SW1, and by more than 65% compared to SW2(L1,L2).

If multiple branches of Layer 1 are processed in parallel, the hardware accelerator achieves further performance benefits: a time saving of up to 88% for a 2-branch configuration (SW2(L2)/HW2(L1)-mm), and up to 91% for a 4-branch configuration (SW2(L2)/HW4(L1)-mm). These performance improvements are with respect to SW1. Note that the speed-up obtained by doubling the number of branches (going from 1 to 2 and from 2 to 4) is less than 2 in either case (1.6 from 1 to 2, and 1.4 from 2 to 4). This is due to the software overhead related to managing multiple branches. Due to the limited computational load of Layer 1, the benefits of hardware acceleration and parallelization on the overall system are somewhat limited. The best solution, SW2(L2)/HW4(L1)-mm, requires 1.07 seconds to perform the whole application, 55% less than a full software execution on a single core (SW1) and 21% less than a full software execution on two cores (SW2(L1,L2)).
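The quoted branch-doubling speed-ups can be reproduced from the Layer 1 totals in Table 8:

#include <stdio.h>

/* Layer 1 execution times [ms] for 1, 2 and 4 accelerator branches
 * (Table 8, rows SW2(L2)/HW1(L1)-mm, HW2 and HW4). */
int main(void) {
    double hw1 = 118.5, hw2 = 74.6, hw4 = 54.5;
    printf("speed-up, 1 -> 2 branches: %.1f\n", hw1 / hw2); /* ~1.6 */
    printf("speed-up, 2 -> 4 branches: %.1f\n", hw2 / hw4); /* ~1.4 */
    return 0;
}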

Another aspect that has been studied in our co-design experiments is the interfacing between system components. As discussed in Section 4.3.2, the adopted communication interface between the software and hardware portions of a design can have a significant impact on overall system performance. In our co-design experiments, we have applied two different interfaces, mm-lite and stream, which are discussed in Section 4.3.2.
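To make the contrast between the two interfaces concrete, the C sketch below shows, from the software side, the two interaction styles in simplified form: with mm-lite the CPU moves every word itself through an AXI-lite register window, while with stream the CPU only programs a DMA engine that feeds the accelerator's AXI-stream port. All base addresses, register offsets, and status-flag layouts are hypothetical placeholders, not the actual register maps of our accelerator or of the Xilinx IP.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical base addresses; real values depend on the block design. */
#define ACC_BASE 0x43C00000u /* placeholder AXI-lite window of the accelerator */
#define DMA_BASE 0x40400000u /* placeholder DMA controller registers */

/* mm-lite style: the CPU copies each word through a register window,
 * paying one bus transaction (plus a status poll) per data word. */
static void send_mm(volatile uint32_t *acc, const uint32_t *buf, size_t n) {
    for (size_t i = 0; i < n; i++) {
        while (acc[1] & 0x1u) /* hypothetical "input full" status bit */
            ;
        acc[0] = buf[i];      /* hypothetical data register */
    }
}

/* stream style: the CPU programs one DMA transfer and waits; the DMA
 * engine streams the whole buffer to the accelerator's AXI-stream port. */
static void send_stream(volatile uint32_t *dma, const uint32_t *buf, size_t n) {
    dma[0] = (uint32_t)(uintptr_t)buf; /* source address (must be DMA-visible) */
    dma[1] = (uint32_t)(n * 4u);       /* length in bytes; starts the transfer */
    while (dma[2] & 0x1u)              /* hypothetical DMA "busy" flag */
        ;
}

/* Usage (conceptual): send_mm((volatile uint32_t *)ACC_BASE, data, n);
 *                     send_stream((volatile uint32_t *)DMA_BASE, data, n); */

The per-word polling loop in send_mm is what the stream interface eliminates for large transfers such as the input image; the requirement that the source buffer be DMA-visible is what distinguishes the HW1-s and HW1-s dir scenarios discussed below.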

Table 9 helps to understand differences between the resource costs of these two interfaces. The first row of this table shows resource availability on the target platform. The second and third rows show resource costs for the HW1-mm accelerator and the HW1-s accelerator. The fourth and fifth rows show resource costs for the FIFO and DMA modules (external to the accelerator) that are necessary for the stream interface. The sixth row shows the total resource costs induced by use of the stream interface ((1)+(2)+(3)).

Relative to HW1-mm, the stream interface increases the LUT and REG costs by roughly 39% each, while over 50% more BRAM cost is incurred. To make the stream interface a useful option in our system design, its significant increase in resource costs should be accompanied by tangible advantages in execution performance. Table 10 shows results pertaining to the impact of the communication interface on execution time. In order to better expose the effects of the selected communication interface, details on data transfers (input, convolution coefficients and outputs) between the hardware (accelerator) and software subsystems are reported. For the HW1-s design, two different sets of results are reported, depending on whether program data is directly accessible by the DMA engine. For one set, the program data is located in a memory that is not directly accessible by the DMA. This scenario corresponds to the design that we have implemented. It requires an additional copy of the program data into a memory that is accessed directly by the DMA. For the other set, the program data is located in a memory that is directly accessible by the DMA. This set is indicated in Table 10 using the annotation HW1-s dir. We have not implemented HW1-s dir; instead, we have estimated the corresponding results to gain some idea about the maximum achievable performance. Details on the estimation approach are omitted for brevity.

The results in Table 10 demonstrate the utility of the resource-hungry HW1-s design, and quantify its clear ability to outperform HW1-mm. In particular, the input data transmission and the transmission of convolution coefficients are respectively about 84% and 67% faster when the AXI-stream protocol is adopted. This leads to an estimated time saving of up to 92% and 80%, respectively, when the DMA has direct access to the program data (HW1-s dir).

On the other hand, the output data transmission time is the same among all of the reported configurations. We expect that this is because the outputs are produced in a row-by-row fashion (48 data units at a time), and the timing of output production is determined by the computation latency, which is greater than the communication latency for all of the interfacing configurations. However, looking at the total Layer 1 and DNN application execution times, we see that the advantages of adopting the stream interface are no longer visible. Indeed, for the considered SFM accelerator, the input data is transmitted only during the first branch execution due to our use of local buffering. Additionally, even though the coefficients are transmitted for each branch, their transmission requires a relatively small amount of time.

These results involving communication interface selection illustrate the importance of comprehensive system-level evaluation of alternative design options, which is one of the key parts of the design process that is facilitated by STMCM.

Table 10
Results pertaining to the impact of the communication interface on execution time. The top part of the table depicts the execution time of the different DNN application steps. The bottom part depicts the execution time variation of each configuration with respect to HW1-mm.

            File input   Layer 1                                                         Layer 2   Layers 3,4,5   Prediction
            [ms]         input tx [μs]   coeffs tx [μs]   output tx [μs]   total [ms]    [ms]      [ms]           [ms]
HW1-mm      118.9        5222            15               2339             118.5         856.7     35.4           1129.5
HW1-s       117.5        854             5                2333             114.6         864.3     34.9           1131.3
HW1-s dir   118.3        418             3                2333             114.1         869.5     35.0           1137.0

HW1-s       -1.17        -83.65          -66.67           -0.26            -3.27         +0.87     -1.39          +0.16
