Contents lists available at ScienceDirect

Journal of Systems Architecture

journal homepage: www.elsevier.com/locate/sysarc
An integrated hardware/software design methodology for signal processing systems
Lin Li a,∗, Carlo Sau b, Tiziana Fanni b, Jingui Li c, Timo Viitanen c, François Christophe e, Francesca Palumbo d, Luigi Raffo b, Heikki Huttunen c, Jarmo Takala c, Shuvra S. Bhattacharyya a,c

a University of Maryland, ECE Department, College Park, MD 20742, United States
b University of Cagliari, Department of Electrical and Electronic Engineering, Italy
c Tampere University, Finland
d University of Sassari, PolComIng-Information Engineering Unit, Italy
e Department of Computer Science, University of Helsinki, Finland
Keywords: Dataflow; Model-based design; Hardware/software co-design; Low power techniques; Deep learning; Signal processing systems

Abstract
This paper presents a new methodology for design and implementation of signal processing systems on system-on-chip (SoC) platforms. The methodology is centered on the use of lightweight application programming interfaces for applying principles of dataflow design at different layers of abstraction. The development processes integrated in our approach are software implementation, hardware implementation, hardware-software co-design, and optimized application mapping. The proposed methodology facilitates development and integration of signal processing hardware and software modules that involve heterogeneous programming languages and platforms. As a demonstration of the proposed design framework, we present a dataflow-based deep neural network (DNN) implementation for vehicle classification that is streamlined for real-time operation on embedded SoC devices. Using the proposed methodology, we apply and integrate a variety of dataflow graph optimizations that are important for efficient mapping of the DNN system into a resource constrained implementation that involves cooperating multicore CPUs and field-programmable gate array subsystems. Through experiments, we demonstrate the flexibility and effectiveness with which different design transformations can be applied and integrated across multiple scales of the targeted computing system.
1. Introduction
Model-based design has been widely studied and applied over the years in many domains of embedded processing. Dataflow is well-known as a paradigm for model-based design that is effective for embedded digital signal processing (DSP) systems [1]. In dataflow-based modeling, signal processing applications are represented as directed graphs (dataflow graphs), and computational functions are modeled as vertices (actors) in these graphs. Actors exchange data packets (tokens) through unidirectional, first-in, first-out (FIFO) communication channels that correspond to dataflow graph edges. Many dataflow-based design methods for DSP systems have been explored in recent years to support various aspects of design and implementation, including modeling and simulation; scheduling and mapping of actors to heterogeneous platforms; and buffer management (e.g. see [1,2]).
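To make the actor/token/FIFO terminology concrete, the following minimal C sketch models a dataflow edge as a fixed-capacity ring buffer. The names (fifo_t, fifo_write, fifo_read) and the capacity are illustrative assumptions, not part of any LIDE API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fixed-capacity FIFO modeling a dataflow graph edge. */
#define FIFO_CAP 8

typedef struct {
    int buffer[FIFO_CAP];
    size_t head, tail, count; /* read index, write index, tokens stored */
} fifo_t;

static void fifo_init(fifo_t *f) { f->head = f->tail = f->count = 0; }

/* A producing actor deposits one token on the edge. */
static bool fifo_write(fifo_t *f, int token) {
    if (f->count == FIFO_CAP) return false;   /* edge is full */
    f->buffer[f->tail] = token;
    f->tail = (f->tail + 1) % FIFO_CAP;
    f->count++;
    return true;
}

/* A consuming actor removes one token from the edge, in FIFO order. */
static bool fifo_read(fifo_t *f, int *token) {
    if (f->count == 0) return false;          /* no tokens available */
    *token = f->buffer[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return true;
}
```

Tokens are delivered to the consumer in exactly the order the producer wrote them, which is the defining property of a dataflow edge.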
The diversity of design scales and dataflow techniques that are relevant to signal processing systems poses major challenges to achieving
∗Corresponding author.
E-mail addresses: lli12311@umd.edu (L. Li), carlo.sau@diee.unica.it (C. Sau), tiziana.fanni@diee.unica.it (T. Fanni), jingui.li@tuni.fi (J. Li), timo.2.viitanen@tuni.fi (T. Viitanen), francois.christophe@helsinki.fi (F. Christophe), fpalumbo@uniss.it (F. Palumbo), raffo@unica.it (L. Raffo), heikki.huttunen@tuni.fi (H. Huttunen), jarmo.takala@tuni.fi (J. Takala), ssb@umd.edu (S.S. Bhattacharyya).
the full potential that is offered by signal processing platforms under stringent time-to-market constraints. While automated techniques, such as those referred to above for scheduling and buffer mapping, are effective for specialized combinations of platforms and dataflow models (e.g., multicore CPUs and synchronous dataflow, respectively), they are limited in their ability to support more comprehensive assessment of the design space, where the models and target platforms themselves have great influence on addressing implementation constraints and optimization objectives. System designers must therefore resort to ad-hoc methods to explore design alternatives that span multiple implementation scales, platform types, or dataflow modeling techniques.
In this work, we propose a design methodology and an integrated set of tools and libraries that are developed to help bridge this gap. We refer to this methodology as the STMC Methodology or STMCM, which is named after the different institutions across which it is developed (Sassari, Tampere, Maryland, Cagliari). STMCM focuses on enabling experimentation across different levels of abstraction throughout
https://doi.org/10.1016/j.sysarc.2018.12.010
Received 27 April 2018; Received in revised form 4 November 2018; Accepted 31 December 2018. Available online 31 December 2018.
(LWDF) programming [3], and its underlying core functional dataflow (CFDF) model of computation [4]. LWDF provides a compact set of application programming interfaces (APIs) that allows one to apply signal-processing-oriented dataflow techniques relatively easily and efficiently in the context of existing design processes, target platforms, and simulation- and platform-oriented languages, such as MATLAB, C, CUDA, and VHDL. Additionally, CFDF is a general form of dataflow that accommodates more specialized forms of dataflow, such as Boolean dataflow [5], cyclo-static dataflow [6], synchronous dataflow [7], and RVC-CAL [8] as natural special cases. This accommodation of different dataflow models in turn provides potential to integrate designs with other dataflow frameworks and DSP libraries, such as those described in [8–13]. Furthermore, LWDF is granularity-agnostic, in the sense that actor complexity does not limit the applicability of the framework.
To demonstrate the capabilities of STMCM in addressing the challenges of mapping practical dataflow-based structures on heterogeneous signal processing platforms, we explore different implementations of a deep neural network (DNN) for vehicle classification on a heterogeneous, embedded system-on-chip (SoC), the Xilinx Zynq Z-7020 SoC. DNN applications pose great challenges in their deployment on embedded devices. Investigation of DNN implementations on embedded SoC devices is challenging due to the limited resources for processing and storage in these devices, and especially, due to the high computational complexity of DNNs. They involve very large and complex signal flow structures with intensive computation, data exchange, and multi-layer processing. These characteristics make embedded DNN implementation highly relevant as a case study for STMCM.
2. Related work
Dataflow provides valuable model-based design properties for signal processing systems, and has been adopted in a wide variety of tools for both software and hardware design. For example, LWDF APIs for CUDA and C have been targeted in the DIF-GPU tool for automated synthesis of hybrid CPU/GPU implementations [14]. The CAL programming language and the Open RVC-CAL Compiler (Orcc) toolset provide a dataflow environment for generating dataflow implementations in a number of languages, such as C, Jade, and Verilog [8,9,15] (note that the Verilog backend of Orcc has been discontinued and replaced by the Xronos synthesizer). The CAPH language and framework generate hardware description language (HDL) code from high-level dataflow descriptions [10].
The work in [16] presents an integrated design flow and tools for the automatic optimization of dataflow specifications to generate HDL designs. The Multi-Dataflow Composer (MDC) tool is a dataflow-to-hardware framework able to automatically create multi-functional reconfigurable architectures. In addition to this baseline functionality, MDC offers three additional features: (1) a structural profiler to perform a complete design space exploration, evaluating trade-offs among resource usage, power consumption and operating frequency [17]; (2) a dynamic power manager to perform, at the dataflow level, the logic partitioning of the substrate to implement at the hardware level, and apply a power saving strategy [18]; (3) a coprocessor generator to perform the complete dataflow-to-hardware customization of a Xilinx compliant multi-functional IP [16].
STMCM is complementary to these efforts that emphasize dataflow design automation. By applying LWDF APIs in novel ways, STMCM facilitates implementation of and iterative experimentation with new dataflow-based hardware/software architectures and design optimization techniques. LWDF is applied as an integral part of STMCM because of LWDF's minimal infrastructure requirements and its potential for rapid retargetability to different platforms and actor implementation languages. Furthermore, LWDF does not have any restriction in terms of actor granularity and can be extended with different combinations of dataflow graph transformations, as well as other forms of signal processing optimizations (e.g., see [1]).
In [21], we presented an efficient integration of the LWDF methodology with hardware description languages (HDLs). Building on this HDL-integrated form of LWDF, we developed methods for low power signal processing hardware implementation, and system-level trade-off exploration. In this paper, we apply the hardware design techniques introduced in [21] as part of a general methodology that spans software, hardware, and mixed hardware/software design, implementation, and trade-off exploration. Thus, while the focus in [21] is on rigorous integration across digital hardware design, lightweight dataflow programming, and low power optimization, the emphasis in this paper is on a methodology for applying LWDF concepts in an integrated manner across complete hardware/software development processes for embedded signal processing systems.
In summary, STMCM provides methods to seamlessly and comprehensively integrate LWDF-based actor implementation techniques with design processes for real-time, resource-constrained signal processing systems. STMCM can be used as an alternative to or in conjunction with more conventional automated dataflow tools (e.g., for disjoint subsystems). STMCM requires more effort in programming compared to fully automated toolchains; however, it provides more agility in terms of retargetability and experimentation, as described above. This is a useful trade-off point to have available for model-based design of complex signal processing systems.
3. Proposed design methodology
Our proposed methodology STMCM is illustrated in Fig. 1. As motivated in Section 1 and Section 2, STMCM is a design methodology that emphasizes LWDF concepts, and is specialized for SoC-based signal processing systems. The upper part of Fig. 1 represents application-specific and algorithmic aspects, while the lower part represents the general part of the methodology that is reusable across different applications. The upper part is illustrated concretely in the context of DNN system design; this part can be replaced with other application/algorithm level design aspects when applying STMCM to other applications.
In STMCM, we apply the LWDF programming model through the Lightweight Dataflow Environment (LIDE). LIDE is a software tool for dataflow-based design and implementation of signal processing systems [3,22]. LIDE is based on a compact set of application programming interfaces (APIs) that is used for instantiating, connecting, and executing dataflow actors. These APIs have been implemented in a variety of implementation languages. For example, LIDE-C [22] and LIDE-V [21] provide C and Verilog language implementations of the LIDE APIs, respectively.
Fig.1. An illustration of STMCM in the context of DNN system design.
As mentioned in Section 1 and illustrated in Fig. 1, core functional dataflow (CFDF) [4] is the form of dataflow that LWDF is based on. In CFDF, each actor is specified as a set of modes. Each actor firing operates according to one of the specified modes (called the "current mode" associated with the firing), and determines a unique next mode, which will be the current mode for the next firing. The production and consumption rates (dataflow rates) for the actor ports are constant for a given mode. However, different modes of the same actor can have different rates, which allows actors to exhibit dynamic dataflow behavior. We present the switch actor as an example of a CFDF actor. The switch actor has three modes: Control, True and False. In Control mode, the switch actor consumes one token from its Control port. In True or False mode, the switch actor consumes one token from its Data port and forwards that token to the True or False output port, accordingly. The dataflow table and mode transition diagram between CFDF modes of the switch actor are illustrated in Fig. 2.
Fig.2. Switch actor in CFDF. (a) Switch Actor, (b) Dataflow Table, (c) Mode Transition Diagram between CFDF Modes.
The definition of a CFDF actor includes two functions called the enable function and invoke function of the actor. The enable function checks whether there is sufficient data available on the actor's input edges and sufficient empty space available on the output edges to fire the actor in its next mode. The invoke function executes an actor firing according to the actor's current mode, consuming and producing amounts of data that are determined by the fixed dataflow rates of the current mode. The invoke function also determines the actor's next mode, as described above.
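The switch actor described above can be used to illustrate the enable/invoke pattern. The following C fragment is a simplified sketch, with scalar stand-ins for the FIFO ports and invented names; it is not actual LIDE-C code, but it follows the CFDF semantics: the enable function only tests firability for the current mode, and the invoke function fires once and selects the next mode.

```c
#include <assert.h>
#include <stdbool.h>

/* CFDF modes of the switch actor (illustrative enum). */
typedef enum { MODE_CONTROL, MODE_TRUE, MODE_FALSE } switch_mode_t;

typedef struct {
    switch_mode_t mode;       /* current CFDF mode */
    int true_out, false_out;  /* stand-ins for the two output ports */
    bool true_valid, false_valid;
} switch_actor_t;

/* Enable: exactly one token is needed on the port read in the current mode. */
static bool switch_enable(const switch_actor_t *a,
                          int control_tokens, int data_tokens) {
    if (a->mode == MODE_CONTROL) return control_tokens >= 1;
    return data_tokens >= 1;  /* True or False mode reads the Data port */
}

/* Invoke: fire once in the current mode and determine the next mode. */
static void switch_invoke(switch_actor_t *a, int control_token, int data_token) {
    switch (a->mode) {
    case MODE_CONTROL:        /* control token selects True or False mode */
        a->mode = control_token ? MODE_TRUE : MODE_FALSE;
        break;
    case MODE_TRUE:           /* forward data token to the True output */
        a->true_out = data_token;
        a->true_valid = true;
        a->mode = MODE_CONTROL;
        break;
    case MODE_FALSE:          /* forward data token to the False output */
        a->false_out = data_token;
        a->false_valid = true;
        a->mode = MODE_CONTROL;
        break;
    }
}
```

Because each mode has fixed rates (one token on one port), the actor as a whole still exhibits dynamic dataflow behavior through its mode transitions.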
In the remainder of this section, we discuss in detail the application-, software-, and hardware-specific processes illustrated in Fig. 1.

3.1. Application-specific tools and processes
In Fig. 1, application-specific tools and associated design processes are illustrated by gray blocks. Throughout this paper, we adopt a DNN application as a concrete demonstration of how such application-specific aspects are used as an integral part of STMCM. The DNN-focused design process illustrated in Fig. 1 starts with the derivation of DNN hyperparameters and the network configuration. Then the parameters associated with the derived DNN structure are extracted and the DNN algorithm is carefully validated to ensure that target levels of accuracy are satisfied. The block labeled "Design Requirements and Constraints" refers to the application- and platform-specific requirements and constraints on the DNN implementation. Examples of these include the accuracy and throughput requirements for image classification DNN systems, and constraints on available power and hardware resources for a targeted SoC platform.
In the remainder of this section, we introduce the software-related and hardware-related design processes that provide the core of STMCM. These processes are applied in an integrated manner for hardware/software co-design, as represented by the lower left hand part of Fig. 1. Detailed explanations of the major components in STMCM are provided in Section 4.3.
3.2. Software-related process
In the next main phase of the proposed design methodology, the DNN network configuration derived using application-specific, algorithm-level tools is mapped to a software implementation using LIDE-C. Note that LIDE-C is in no way restricted to DNN systems, and is instead designed to support a broad class of dataflow-based signal and information processing systems. For example, in the work of [23], the design space exploration of a digital predistortion system for wireless communication is based on implementation using LIDE-C. In [24], LIDE-C is extended to support parameterized synchronous dataflow [25] modeling and applied to the implementation of an adaptive wireless communication receiver. In [26], optimized vectorization techniques are applied to LIDE-based actors for throughput optimization, and demonstrated using an Orthogonal Frequency Division Multiplexing (OFDM) receiver. For more details about LIDE-C and the development of DNN components in LIDE-C, we refer the reader to [22,27].
Working with the LIDE-C implementation of the DNN, a number of optimization processes are carried out iteratively to streamline the software implementation in terms of the relevant design objectives and constraints. This iterative optimization process is illustrated in Fig. 1 by the cyclic path that involves the blocks labeled Dataflow Representation, LIDE-C Implementation, and Optimized LIDE-C Implementation. The proposed approach supports efficient application of commonly-used DNN software optimization methods such as for-loop tiling and buffer memory sharing among dataflow graph edges. We refer the reader to Section 4.1 for more details about these optimization methods and their integration with the LIDE-C implementation.
Next, software profiling is performed on the optimized LIDE-C implementation of the DNN system to extract profiling data. This data is extracted for each dataflow component of the DNN architecture. In the profiling process applied in STMCM, the memory sizes of the buffers and the execution time of the actors in the graph are measured. According to the characteristics of the DNN architecture, the DNN system is divided into multiple computation layers. In our application of STMCM, software profiling is specialized to DNN implementation by measuring the total memory sizes for the buffers both inside each layer and between pairs of adjacent layers. We also measure the total time complexity of each DNN layer.
3.3. Hardware-related process
The dataflow model of the subgraph to accelerate is implemented in hardware using LIDE-V. Hardware profiling based on the specific implementation platform is performed on the LIDE-V implementation. This profiling is used to collect measurements on hardware performance and help identify possible optimizations. Details on hardware profiling are demonstrated concretely through the case study presented in Section 4.2. Like the software implementation, the hardware implementation will in general go through multiple optimization iterations before it is finalized.
In LIDE-V, the hardware implementation of a dataflow actor is decomposed into implementations of its enable function and invoke function. These components are implemented as two coupled Verilog modules: the actor enable module (AEM), and the actor invoke module (AIM). Dataflow edges are implemented as dataflow edge modules (DEMs); we informally refer to DEMs also as "FIFOs".
To provide fully distributed scheduling of actors, one can connect a LIDE-V actor scheduling module (ASM) to each actor. The ASM initiates a new firing of its associated actor any time the actor is not already in the firing mode, has sufficient data on its input edges, and has sufficient empty space on its output edges. Scheduling of LIDE-V actors is not restricted to such a fully distributed scheduling approach. For example, with appropriately-designed control logic, subsets of actors can be serialized to allow sharing of resources within the subsets. In this paper, however, we restrict our attention to fully distributed scheduling. Fully distributed scheduling of dataflow graphs has been analyzed in various contexts. For example, Ghamarian et al. have developed methods for throughput analysis of synchronous dataflow graphs that are scheduled in a fully distributed manner [28]. Such analysis techniques can be applied to hardware subsystems in STMCM.
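The ASM firing condition can be paraphrased as a simple predicate. The following C fragment is a software analogue of that hardware logic, offered only as an illustration; the structure and field names are assumptions, not the LIDE-V interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative snapshot of one actor's scheduling state. */
typedef struct {
    bool firing;     /* actor is currently in its firing mode */
    int in_tokens;   /* tokens available on the input edge */
    int in_needed;   /* consumption rate of the next mode */
    int out_free;    /* free slots on the output edge */
    int out_needed;  /* production rate of the next mode */
} actor_state_t;

/* ASM condition: not already firing, enough input data, enough output space. */
static bool asm_may_fire(const actor_state_t *s) {
    return !s->firing
        && s->in_tokens >= s->in_needed
        && s->out_free >= s->out_needed;
}
```

In the fully distributed scheme, each actor's ASM evaluates this predicate independently and continuously, so no centralized scheduler is needed.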
The orthogonality (separation of concerns) among actor, edge, and scheduler design in LIDE-V lays a valuable foundation for rigorous integration of power management within the associated APIs. In particular, we demonstrated in [29] and [21] that methods for asynchronous design, Globally Asynchronous Locally Synchronous (GALS) design, and clock gating can be applied efficiently through natural extensions of the LIDE-V APIs. We also demonstrated the use of these extensions for power optimization.
To manage complexity and improve reuse of subsystems within and across designs, one can encapsulate subgraphs in LIDE-V within hierarchical actors (HAs). An HA in LIDE-V appears from the outside as a regular (non-hierarchical) LIDE-V actor with an associated AEM, AIM, and ASM. Execution of an HA as an actor in the enclosing dataflow graph is coordinated by the external scheduler associated with the HA. When an HA is fired by its external scheduler, the internal scheduler of the HA coordinates the firings of actors that are encapsulated within the HA (nested actors). The internal scheduler carries out the set of nested actor firings that must be completed for a given firing of the HA. An example of an HA with internal and external schedulers is discussed in detail and illustrated in Fig. 6.
Since it appears from the outside as a regular actor, an HA can be clock gated in exactly the same way, allowing the designer to efficiently switch off the whole subgraph at appropriate times during operation.

4. Case study: A deep neural network for vehicle classification
As a concrete demonstration of STMCM, we adopt a DNN use case for automatic discrimination among four types of vehicles: bus, car, truck, and van. This implementation is based on a neural network design presented in [30], where a network configuration (i.e., the number and types of layers and other DNN hyperparameters) was carefully derived and demonstrated to have very high accuracy. The accuracy of the methods was validated with a database of over 6500 images, and the resulting prediction accuracy was found to be over 97%. The work in this paper and the work in [30] have different focuses. The work of [30] focuses on deriving hyperparameters, network design, and demonstrating network accuracy, and does not address aspects of resource-constrained implementation or hardware/software co-design. In this paper, we go beyond the developments of [30] by investigating resource constrained implementation on a relevant SoC platform, and optimized hardware/software co-design involving an embedded multicore processor and FPGA acceleration fabric that are integrated on the platform. In [30], the proposed DNN architectures are evaluated based on the classification accuracy, while in our work on STMCM, the objectives that we are trying to optimize are system throughput, memory footprint and power efficiency. In addition, our work in this paper can be generalized to the design and implementation of arbitrary DNN architectures, and also it can be generalized beyond DNN applications to other signal and information processing applications; the architecture of [30] is selected as a case study to concretely demonstrate the usage of the methodology proposed in this paper.
In relation to Fig. 1, we apply the results from [30] in the block labeled "derivation of hyperparameters and DNN design" as part of the design methodology that is demonstrated in this paper. Fig. 3 illustrates the complete DNN architecture that we implement in this work. For more details about this use case, such as the dataset, the derivation of the DNN architecture and the application of the use case in vehicle classification, we refer the reader to [30].
The DNN network design is composed of two convolutional layers, two dense layers and one classifier layer, as depicted in Fig. 3. The first convolutional layer takes an RGB image (3×96×96) as input, and produces 32 feature maps, each with dimensions (48×48). The second convolutional layer takes these 32 feature maps as input and produces 32 smaller feature maps, each having dimensions (24×24). We refer to a subsystem that processes multiple input images to produce a single feature map as a branch. Thus, the first and second convolutional layers have 32 branches each. The two dense layers combine to transform the feature maps into a (1×100) vector, which is then multiplied in the classifier layer by a (100×4) matrix to determine the (1×4) classification result. Each of the four values in the result corresponds to the likelihood that the vehicle in the input image belongs to one of the four vehicle types (i.e., bus, car, truck and van).
The studied use case is relatively easy to solve compared to common image recognition benchmarks, such as MSCOCO [31] or ImageNet [32]. Therefore, one can reach high accuracy with a relatively simple network requiring significantly lower resources than common network topologies intended for mobile use (such as Mobilenets). As such, the focus of our work is not on mobile devices (e.g., smartphones), but on simpler IoT devices targeted to solving less complex machine learning problems at low cost. For further details on the DNN network design and hyperparameter specifications, we refer the reader to [30].
The specific platform and associated platform-based tools that we employ are based on the Xilinx Zynq Z-7020 SoC. The remainder of this section focuses on details associated with STMCM and its associated design processes. These details are presented concretely through the development of this DNN case study.
4.1. Software implementation and optimization
In this section, we discuss dataflow-graph- and actor-level optimizations and associated design iterations, as illustrated in Fig. 1 by the blocks labeled Dataflow Representation, LIDE-C Implementation, and Optimized LIDE-C Implementation. We start with a dataflow graph implementation that is derived using LIDE-C [3,22], which provides a C-language implementation of the LWDF APIs so that CFDF-based actors and dataflow graphs can be implemented in a structured manner using C. The initial (sequential) LIDE-C design is developed in a design phase that corresponds to the block labeled LIDE-C Implementation in Fig. 1.

After validating the correct, dataflow-based operation of the initial DNN dataflow graph implementation in LIDE-C, we experiment with various transformations at the actor, subgraph, and dataflow graph levels. Here, we exploit the orthogonality of actor, edge, and graph implementation in LIDE-C, which allows designers to flexibly and efficiently perform experimentation with a wide variety of transformations, and with different combinations of applied transformations. The actor-level transformations performed here are focused on optimization methods applied to the convolution actor, which is a major performance bottleneck in the design. The subgraph-level transformations involve memory management optimizations performed on FIFOs both inside each subgraph (DNN layer) and between pairs of adjacent layers.
4.1.1. Actor-level optimization
We demonstrate actor-level optimization at this stage of the design process using the convolution actor in our DNN example. In our LIDE-C implementation of this actor, we apply a transformation of the convolution computation that is commonly used to simplify the design and improve classification speed. The transformation involves loop tiling to reduce the cache miss rate. The utility of loop tiling in DNN implementation has been demonstrated previously, for example, in [33]. Using loop tiling, we decompose the main loop of the convolution computation into an inner loop that iterates within contiguous "strips" of data, and an outer loop that iterates across strips. Applying loop tiling in this way allows one to enhance cache reuse based on an array size (strip length) that fits within the cache.
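The tiling idea can be illustrated with a small, self-contained C example. This is a sketch only: a 1-D convolution with an assumed strip length, much simpler than the actual actor code of Fig. 4, but it shows the inner-loop-within-strips, outer-loop-across-strips decomposition described above.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative strip (tile) length; chosen so a strip fits in cache. */
#define STRIP 64

/* 1-D convolution with the output loop tiled into strips. */
static void conv1d_tiled(const float *x, size_t n,
                         const float *k, size_t klen, float *y) {
    size_t outlen = n - klen + 1;
    for (size_t s = 0; s < outlen; s += STRIP) {          /* across strips */
        size_t end = (s + STRIP < outlen) ? s + STRIP : outlen;
        for (size_t i = s; i < end; i++) {                /* within a strip */
            float acc = 0.0f;
            for (size_t j = 0; j < klen; j++)
                acc += x[i + j] * k[j];
            y[i] = acc;
        }
    }
}
```

The tiling changes only the iteration order, not the computed values, so it can be validated against an untiled reference implementation.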
Fig. 4 shows a segment of code from our application of the tiling transformation to the convolution actor.
Fig.3. DNN for automatic discrimination of four types of vehicles.
Through the orthogonality provided by the model-based design rules in LIDE-C, this transformation can be applied at a late stage in our design process, in a way that is interoperable with previously applied transformations, and in a way that requires no modifications to other actor or edge implementations. In this case, no modification is needed to the dataflow graph scheduler implementation either, although for some transformations, scheduler adjustments can be useful to integrate transformed actors into the overall system in an optimized way. The CFDF-based APIs (enable and invoke functions) in LIDE-C for scheduler implementation allow the designer to experiment efficiently with such scheduling adjustments as needed.
4.1.2. Buffer memory management
A major challenge in resource-constrained implementation of a DNN architecture is managing the large volume of data transfers that are carried out during network operation. Each DNN layer typically processes
Fig.4. The code segment that implements loop tiling within the LIDE-C actor for convolution.
a large amount of data, and requires memory to store the input data from the previous layer or subsystem, the intermediate data during the computation processing, and the computation results that will be transmitted to the following layer or subsystem.
Consider, for example, the buffer memory costs (the storage costs associated with the dataflow graph edges) for the DNN of Fig. 3. In our LIDE-C implementation, the second convolutional layer requires the most buffer memory. In this layer, each of the 32 branches is composed of 32 convolution actors, 31 addition actors and one actor performing both max pooling and ReLU (Rectified Linear Unit). Given that the size of the input feature map processed by each branch is 48×48 pixels, the buffer memory required for actor communication inside each branch is image_size × (number_of_conv_actors + number_of_output_feature_maps), which is 48 × 48 × (32 + 1) = 76,032 pixels. Thus, the total buffer memory inside the second convolutional layer is 76,032 × 32 = 2,433,024 pixels. The buffer memory required for data communication between the first and the second layer can be computed as 48 × 48 × 32 = 73,728 pixels.
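The buffer-size arithmetic above can be captured in a few lines of C; the constants are taken directly from the text, and the function names are illustrative.

```c
#include <assert.h>

/* Buffer-memory arithmetic for the second convolutional layer
 * (all sizes in pixels; constants from the DNN of Fig. 3). */
enum {
    IMAGE_SIZE  = 48 * 48,  /* feature map processed by each branch */
    CONV_ACTORS = 32,       /* convolution actors per branch */
    BRANCHES    = 32
};

/* image_size * (number_of_conv_actors + number_of_output_feature_maps) */
static long branch_buffer(void) {
    return (long)IMAGE_SIZE * (CONV_ACTORS + 1);
}

/* Total buffer memory inside the second convolutional layer. */
static long layer_buffer(void) {
    return branch_buffer() * BRANCHES;
}

/* Buffer memory between the first and second convolutional layers. */
static long interlayer_buffer(void) {
    return (long)IMAGE_SIZE * BRANCHES;
}
```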
In STMCM, we apply a buffer memory optimization technique that is useful for resource-constrained DNN implementation. In particular, we incorporate a new FIFO abstract data type (ADT) implementation in LIDE-C, called shared FIFO, that enables multiple dataflow edges in a graph to be implemented through FIFO ADT instances that share the same region of memory. Such buffer sharing in dataflow implementations has been investigated in different forms for various contexts of automated scheduling and software synthesis (e.g., see [34–36]). In STMCM, we make it easy for the system designer to apply buffer sharing explicitly within her or his implementation rather than depending on its implicit support through the toolset that is used. This is an example of the agility that is supported in STMCM, as described at the end of Section 2.
Again, by exploiting the orthogonality among dataflow components, buffer sharing in STMCM is performed only on the targeted dataflow edges and requires no modification to other actors or subgraphs. Through the support for such separation of concerns in LIDE-C, different ADT implementations for a FIFO or group of FIFOs can be interchanged without affecting overall system functionality.
There are three key aspects to our application of shared FIFOs in our LIDE-C DNN implementation. First, at the input of each convolutional layer L, input data from the previous layer is stored centrally instead of being copied separately into each branch of L. Second, edges in different layers share the same memory so that the memory is time-division multiplexed between the layers: the processing of a given layer overwrites memory in its shared FIFOs without introducing conflicts that affect the computation results. Third, actors operate on data from shared input FIFOs directly through their read pointers into the FIFO (rather than first copying the data locally within the actor's internal memory). This kind of copy-elimination is similar to dataflow memory management techniques introduced by Oh and Ha [35].
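The shared FIFO and read-pointer ideas can be sketched as follows. The names and structure are assumptions made for illustration, not the LIDE-C ADT; the sketch shows two FIFO instances attached to one externally owned region (memory sharing) and a zero-copy read that hands back a pointer into that region (copy elimination). Wraparound handling is omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative "shared FIFO": storage is owned outside the FIFO, so
 * several FIFO instances can view the same memory region. */
typedef struct {
    int   *storage;    /* shared region, owned by the caller */
    size_t capacity;
    size_t read_idx, write_idx;
} shared_fifo_t;

static void sfifo_attach(shared_fifo_t *f, int *region, size_t cap) {
    f->storage = region;
    f->capacity = cap;
    f->read_idx = f->write_idx = 0;
}

static void sfifo_write(shared_fifo_t *f, int token) {
    f->storage[f->write_idx++ % f->capacity] = token;
}

/* Zero-copy read: return a pointer into the shared region instead of
 * copying tokens into actor-local memory. */
static const int *sfifo_read_ptr(shared_fifo_t *f, size_t count) {
    const int *p = &f->storage[f->read_idx % f->capacity];
    f->read_idx += count;
    return p;
}
```

The time-division multiplexing described above corresponds to a later layer's FIFO writing into the region only after the earlier layer's reads have completed.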
Improvements resulting from our application of shared FIFOs are demonstrated quantitatively in Section 5.1.
Table 1
Layer-level software profiling. Here, the row labeled "T" gives the execution time of each layer, and the row labeled "T%" gives the percentage of the total DNN execution time that is attributed to each layer.

Layer     1       2       3        4        5        Total
T [ms]    18.71   22.08   0.0149   0.0034   0.0036   40.812
T%        45.84   54.10   0.04     0.01     0.01     100
Table 2
Actor-level software profiling.

Layer      Convolutional layer 1      Convolutional layer 2
Actor      Conv     Add     M&ReLU    Conv    Add     M&ReLU
Tic [μs]   230.10   0.03    0.025     59.77   0.005   0.006

Layer      Dense layer 3    Dense layer 4    Output layer 5
Actor      Mult    ReLU     Mult    ReLU     Mult     Softmax
Tic [μs]   5.1     0.0012   0.029   0.0012   0.0023   0.0031
4.1.3. Software profiling
In this subsection, we demonstrate the process of software profiling, as illustrated in Fig. 1, in the context of our optimized LIDE-C implementation of the DNN architecture. The implementation platform is an Intel i7-2600K running at 3.4 GHz. Table 1 and Table 2 show layer- and actor-level software profiling measurements, respectively.
In Table 2, Tic denotes the invoke to firing completion time of a given actor. This is the average time that elapses between the time that an actor firing is initiated and when the firing completes. We also refer to Tic as the average execution time of the associated actor. The abbreviations Add, Conv, Mult, and M&ReLU stand, respectively, for Addition, Convolution, Multiplication, and Maxpool-and-ReLU.
Layer- and actor-level software profiling provide insight into the processing complexity of actors in each layer. According to Table 1, the convolutional layers account for 99.94% of the system execution time. Also, the execution time of Layer 2 is very close to that of Layer 1. In both convolutional layers, the Conv actors account for most of the processing time compared with the other two actors (Add and M&ReLU) in the convolutional layers. Additionally, the average execution time of the Conv actors in Layer 2 is only about a quarter of that of the Conv actors in Layer 1. This is primarily because each of the Conv actors in Layer 1 processes input images of size 96×96, while the Conv actors in Layer 2 process input feature maps that have size 48×48.
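As a sanity check, the quoted figures follow directly from the profiling tables; the following C fragment reproduces the 99.94% share and the roughly one-quarter Conv-actor ratio from the values in Table 1 and Table 2.

```c
#include <assert.h>

/* Share of total execution time taken by the two convolutional layers
 * (layer times in ms, from Table 1). */
static double conv_share_percent(void) {
    const double t1 = 18.71, t2 = 22.08, total = 40.812;
    return 100.0 * (t1 + t2) / total;
}

/* Ratio of Conv actor execution times, Layer 2 vs. Layer 1
 * (Tic values in microseconds, from Table 2). */
static double conv2_over_conv1(void) {
    return 59.77 / 230.10;
}
```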
4.2. Hardware implementation and design exploration
In this section, we describe the main capabilities of the design flow depicted in Fig. 1 with respect to design and implementation of hardware accelerators. These capabilities are represented by the blocks in the region labeled "Hardware-related Process".
Fig.5. LIDE-V implementation for the accelerated SFM.
For example, through a preliminary hardware profiling phase of the DNN application described in Section 4.2.1, we can identify three hardware design aspects that are interesting to investigate in detail: the adoption of clock gating techniques, exploitation of asynchrony that is inherent in dataflow, and exploration of different levels of actor granularity.
We demonstrate the hardware-related design process of STMCM using a hardware accelerator that is introduced in [21]. The accelerator provides a subsystem for producing feature maps from the first convolutional layer of the DNN application. In the remainder of this paper, we refer to this subsystem as the Subtree for Feature Map (SFM).
Due to the interfacing consistency that is maintained across LIDE actor implementations in different languages, one can readily convert the LIDE-C based SFM subsystem implementation into hardware by replacing each software actor with a hardware module that is designed in LIDE-V, and by connecting the derived hardware actors with LIDE-V FIFOs. Following the general approach of realizing LIDE actors in hardware, each LIDE-V actor implementation is decomposed into an AEM and AIM. The AEM is reusable among different actors in our implementation, although in general it can be useful to have specialized AEM implementations that are streamlined for the specific requirements of individual actors [21].
The hardware implementation diverges from the LIDE-C design in two major ways. First, we feed the input data in an interleaved format, reducing the complexity of the hardware interface and driver software, since there is only one input FIFO to manage. Second, the hardware actors are designed to produce one row per firing instead of entire images. This reduces the FIFO size requirements in the first layer from 96×96 pixels to only 96 pixels. The hardware actors in our implementation are scheduled using a fully distributed approach.
The resulting SFM is shown in Fig. 5. The implemented hardware is verified against reference outputs extracted from the LIDE-C implementation. In this figure, production and consumption rates (dataflow rates) are annotated next to actor ports, and w is the input image width. The convolution actor has multiple operating modes (CFDF modes) with different consumption rates.
4.2.1. Hardware profiling
We employ hardware profiling in STMCM to extract execution time data, which is later used to guide the process of iterative design optimization. In this section, we demonstrate hardware profiling in the context of our DNN application. Profiling is performed using the target platform, which in our demonstration is the Zynq Z-7020 SoC. We profile the LIDE-C implementation on the ARM A9 MPCores provided by the target platform, develop a first version implementation of the SFM on this platform, and extract execution time data from this implementation.
Table 3 presents various data associated with execution times and waiting times for the SFM hardware accelerator illustrated in Fig. 5. Here, the symbol t_tot represents the total time (in clock cycles) necessary to execute the SFM; T_ic is the average time period between an actor invocation and its corresponding firing completion; T_ci is the average time period that an actor has to wait to be fired after its previous firing completion; firings is the number of firings of a given actor during execution of the SFM; Tot, calculated as (T_ic) × (firings), gives the total execution time of a given actor during the execution of the SFM; T_ii = (T_ic + T_ci) denotes the average time period between the beginning of one invocation and the beginning of the next; and the ratio T_ii/T_ic measures the extent of actor idleness.

Table 3
Measured data associated with actor execution times and waiting (idle) times (t_tot for the SFM: 232,831 cycles).

Actor           T_ic    T_ci    firings   Tot (Tot%)        T_ii/T_ic
Deinterleave    3       2       9216      27,648 (11.87)    1.67
Convolution     2402    2       96        230,592 (99.04)   1.00
Sum             107     2297    96        10,272 (4.41)     22.46
Maxpool&ReLU    195     4613    48        9360 (4.02)       24.66
This rich collection of metrics, which is supported by the underlying CFDF model of computation, provides various insights on the dataflow-based system architecture and its implementation. For example, the T_ii/T_ic ratio provides insight on differences in processing speed that are useful in exploiting the inherent asynchrony between dataflow actors.

From analysis of our hardware profiling results (Table 3), we can derive different versions of the SFM hardware accelerator with different trade-offs among power consumption, system throughput, and hardware resource cost. Firstly, looking at column Tot%, we see that all of the actors except for Convolution are inactive throughout most of the execution time. The maximum proportion of active time among these actors is 11.87%, reached by Deinterleave. Gating the clock of these frequently inactive actors can provide more energy efficient accelerator operation by eliminating dynamic power consumption during idle phases.
Furthermore, the Deinterleave and Convolution actors have relatively small idleness levels (T_ii/T_ic), with a waiting time T_ci equal to 2 clock cycles for both of them. On the other hand, Sum and Maxpool&ReLU exhibit much larger waiting times and idleness levels. An important hint coming from the T_ci values is that, thanks to the inherent asynchrony of dataflow actors, it is possible to partition the design into different clock regions working at different frequencies, thus obtaining a globally asynchronous, locally synchronous (GALS) design. In particular, the Deinterleave and Convolution actors can be placed in one clock region (Region 1), driven by clock1, while Sum and Maxpool&ReLU can be placed in another region (Region 2), driven by clock2. On the basis of the measured T_ii/T_ic values, we can set clock2 to be 20 times slower than clock1.
Moreover, the subgraph included in Region 2 can be encapsulated into a hierarchical actor (see Section 3.3). This actor, seen from the “outside”, is like any other LIDE-V actor. The actor and its encapsulated subsystem can be clock gated or clocked with a different frequency, providing additional candidate solutions for SFM accelerator optimization.

4.2.2. SFM exploration
Based on the hardware profiling analysis discussed in Section 4.2.1, we explored six different variants of the SFM design:
Fig. 6. An illustration of the hierarchical actor associated with Design SFMh.
• SFMa: This is an asynchronous design where actors belonging to different logic regions run at different clock frequencies. In particular, the clock frequency for clock1 is set to 100 MHz, and the clock frequency for clock2 is set to 5 MHz. Referring to Fig. 5, the only modification required in the design is the replacement of the FIFOs that are placed between the two clock regions. These FIFOs need to be replaced with asynchronous FIFOs; for this purpose, we employ the clock domain crossing (CDC) FIFOs presented in [21]. CDC FIFOs are designed with read and write logic that can be driven by different clocks. At the same time, their module interfaces conform to standard LIDE-V edge interfaces, so they can replace other FIFO implementations without requiring changes to the actors that communicate with them.
• SFMCG: Based on our hardware profiling results, we apply clock gating to the Deinterleave, Sum and Maxpool&ReLU actors. To be clock gated, a LIDE-V actor needs only the instantiation of a clock gating module (CGM) [21]. The CGM involves a BUFG primitive that physically enables/disables the clock signal in the target SoC. Thus, for each clock gated actor A in SFMCG, a CGM is instantiated and connected to the clock inputs of A and to the read- and write-clock inputs, respectively, of the FIFOs that A reads from and writes to.
• SFMaCG: This design incorporates both asynchronous design and clock gating techniques. As in SFMa, the FIFOs between the two clock regions are replaced with CDC FIFOs. Additionally, the Deinterleave, Sum and Maxpool&ReLU actors are clock gated as in SFMCG, and a CGM is instantiated for each of these actors.
• SFMh: This is a hierarchical SFM design, which can be viewed as a baseline for evaluating our enhanced hierarchical design SFMhCG (defined below). In SFMh, Region 2 (see Fig. 5) is encapsulated in a hierarchical actor H. An illustration of this hierarchical actor is provided in Fig. 6. The subgraph that is encapsulated by H contains three actors A1, A2 and B. We denote this subgraph by G_H. Actors A1 and A2 correspond to Sum1 and Sum2, respectively, which are two actors that add outputs from the three convolution actors. Actor B corresponds to the Maxpool&ReLU actor.
When H is viewed as a single actor from the outside, a firing of H starts when the internal scheduler I_ASM_HA for G_H receives the invoke_HA signal from the external scheduler E_ASM_HA. Inside the subgraph G_H, the invoke_HA signal is received by ASM_A1, which is the ASM of actor A1. Once ASM_A1 receives the invoke_HA signal, the firing of the subgraph G_H starts.
• SFMhCG: This design is the same as SFMh, except that the Deinterleave actor and the hierarchical actor are clock gated. It is important to highlight that the application of clock gating at the region level is advantageous if the execution times of the actors within the region are overlapped. In this design, however, the execution times of the three actors are not overlapped. When one actor is executed, the others wait in an idle state and waste power. Therefore, we expect that this configuration would not be as effective in reducing power consumption as SFMCG in the targeted DNN case. However, we include the test in our explorations to present the complete wide variety of options made available by STMCM (even if some of them may be less efficient than others for this particular application scenario).
• SFMauto: This is a version of the SFM that is synthesized and implemented by enabling the automatic power optimization available within the adopted Xilinx Vivado environment. This design applies fine-grain clock-gating and fine-grain logic-gating at the Verilog level, and excludes all of the higher-level, dataflow-based optimizations (coarse-grain asynchronous design, clock-gating, and hierarchical decomposition) that are applied in the other five investigated designs. Thus, SFMauto is useful as a common baseline to assess the higher-level models and transformations provided by STMCM compared to existing off-the-shelf synthesis techniques.
4.3. Joint hardware/software implementation and optimization
This section shows how the proposed design flow (summarized in Fig. 1) provides a variety of interesting hardware/software co-design implementation choices and optimization possibilities. In particular, these features are represented by the “Co-design-related Process” area of Fig. 1. For a given high-level LWDF model, the interaction between software (see Section 4.1) and hardware (see Section 4.2) actors or subgraphs can be shaped and refined depending on the specific constraints and requirements of the application.
In particular, we demonstrate two main implementation aspects that can be efficiently explored with STMCM: parallelism across actor execution, and the adopted communication interfaces. The degree of parallelism can be tuned depending on the number of software and/or hardware […]. The first convolutional layer is the portion of the application that will be accelerated in the PL, while the remaining part, involving the second convolutional layer, two dense layers, and the final classification layer, will be executed by the PS.
Note that the first convolutional layer constitutes only one of the main computationally intensive steps of the DNN application. According to software profiling results that are based on the SoC platform that we applied for hardware/software co-design (see Table 8), the first convolutional layer only accounts for about 27% of the prediction time. For this reason, the speedup brought by hardware acceleration to the overall DNN application is not dramatic, as will be discussed further in Section 5. However, the results concretely demonstrate how STMCM can be applied to perform extensive design space exploration across a variety of diverse designs to achieve system performance enhancement under highly-constrained hardware resource availability.
The SFM accelerator has been integrated into the LIDE-C design presented in Section 4.1 by replacing the SFM software implementation with function calls to driver software that is capable of offloading the computation to the PL. We have experimented with using a Linux kernel driver based on the Userspace I/O (UIO) framework [37], and a driver that is independent of the Linux kernel and operates by directly accessing memory with the mmap system call. The UIO approach is more suitable for production use, while mmap works well for prototyping, and this latter approach has been used in this work for evaluation. The PS and PL can communicate by means of AXI interfaces exploiting General Purpose (GP) ports, which are 32-bit width PS master or slave ports with 600 Mbps bandwidth for both read and write channels, and High Performance (HP) ports or Accelerator Coherency Ports (ACP), which are 64-bit width PS slave ports with 1200 Mbps bandwidth for both read and write channels.
Fig. 7 depicts the reference configuration for the co-design explorations. In order to integrate the accelerator into the SoC, a generic AXI wrapper for hardware dataflow subgraphs has to be provided. The wrapper is compliant with the adopted AXI interface and lets the programmer access the input and output FIFOs of the dataflow graph and monitor their populations. For this purpose, the wrapper includes all the necessary logic for communication management.
In our hardware acceleration approach, we map the SFM subsystem to hardware. This subsystem produces a 48×48 feature map on each execution. Thus, in order to perform the entire first convolutional layer of the DNN application, which must produce 32 48×48 feature maps, the SFM accelerator has to be executed 32 times with the appropriate convolution coefficients. For each of these SFM executions, the PS sends the corresponding convolution coefficients to the accelerator. The input image, which remains the same across all 32 executions, is sent only once from the PS and stored within a local buffer within the accelerator. All 32 executions of the SFM access the input image from this local buffer. In this way, we avoid the large amount of data transfer that would be required if the input image had to be sent separately from the PS to the PL for each SFM execution. Upon completion of each SFM execution, the PS retrieves the resulting feature map from the accelerator.

In the remainder of this section, we discuss in detail three different sets of co-design implementations and optimizations that are facilitated by STMCM:
• the amount of parallelism that is exploited in the software and hardware subsystems;
• the communication interfaces used between the software and hardware subsystems; and
• the local buffering of data within the hardware accelerator.
4.3.1. Exploiting parallelism

STMCM allows the designer to explore different amounts of parallelism that are exploited in both the hardware and software subsystems (see the dashed squares in Fig. 7). In particular, depending on the specific application requirements, multiple parallel instances of software cores or hardware accelerators can be utilized. While software cores are able to execute all DNN application steps, hardware accelerators can only perform the steps that they have been conceived for. Generally speaking, hardware accelerators achieve higher efficiency than software cores when executing a given computational step, both in terms of execution time and resource efficiency (resource utilization and consumption).
In the targeted Xilinx Zynq Z-7020 SoC platform, a pair of homogeneous cores is available, so that the maximum degree of software parallelism in our implementations is 2. The available cores are both ARM A9 MPCores with two levels of cache and access to a 512 MB off-chip DDR RAM. In our experiments, we have exploited software parallelism for the two most computationally intensive steps of the application, namely the two convolutional layers.
When using FPGA fabric, designers have the possibility to utilize as much parallelism as the FPGA resources allow. In this work, we have investigated three alternative designs that utilize 1, 2 or 4 parallel SFM instances, respectively, in the same hardware accelerator. In the first case, the accelerator is executed 32 times in order to complete the 32 branches of the first convolutional layer. This design executes a different branch with different convolution coefficients for each accelerator invocation. In the second case (2 parallel SFM instances), the accelerator execution time is halved, but for each run, two new sets of convolution coefficients are necessary. Finally, with 4 parallel SFM instances, only 8 accelerator executions are needed, with each requiring the updating of four different sets of coefficients.
4.3.2. Communication interfaces
During the process of co-design exploration, STMCM gives the designer significant flexibility to select interfaces for communicating data between the hardware and software subsystems. This flexibility is provided by the general dataflow model of computation that underlies STMCM. Flexibility in selecting a communication interface can be very useful in the context of resource- or performance-constrained design. This is demonstrated, for example, by the work of Silva et al., which analyzes trade-offs among the different AXI interface options [38].
We investigated the usage of two different AXI communication interfaces located at the extremes of the resource-versus-performance trade-off:
• the memory-mapped AXI4-Lite (mm-lite) interface; and
• the FIFO-based AXI4-Stream (stream) interface.
Compared to the stream interface, the mm-lite interface has lower resource requirements, but it also exhibits lower performance. The mm-lite interface uses memory-mapped, one-by-one transfer of data items. The interface is particularly intended for control signals and small-scale data accesses. It does not need any additional modules beyond those depicted in Fig. 7, and it uses only one of the PS master GP ports. For example, the execution of one branch of the first layer requires the input images (3 RGB images with 96×96 pixels each) and kernel coefficients (3 kernels with 5×5 coefficients each). Since the mm-lite interface uses a separate data transfer operation for each pixel, this results in a total of 3 × 96 × 96 + 3 × 5 × 5 data transfer operations. Once the accelerator
Fig. 7. Reference configuration for hardware/software co-design exploration in our experiments.
completes its computation, the mm-lite interface requires 48×48 data transfer operations to enable the processor to read the output feature map.
Unlike the mm-lite interface, which performs data transfers one-by-one, the stream interface employs a DMA engine that transfers data between processor memory and the accelerator in blocks, where the block size can be up to 256 bytes. Successive data items within a block are transferred in consecutive clock cycles. The stream interface requires a DMA engine, as mentioned above, and additional FIFO buffers, and therefore incurs significant overhead in terms of resource requirements. Note that the additional hardware required by the stream interface is not depicted in Fig. 7. The DMA engine is configured through one of the PS master GP ports, and requires two different PS slave HP ports to directly access the memory where data to be transferred to/from the accelerator is stored.
To execute one branch of the first DNN layer, the stream interface performs (a) 96 memory-to-accelerator DMA operations to send the input images, with 96×3 pixels for each DMA operation, and (b) one memory-to-accelerator DMA operation to send the 5×5×3 kernel coefficients. Additionally, the stream interface needs 48 accelerator-to-memory DMA operations to retrieve the computed feature map, with 48 pixels for each DMA operation.
4.3.3. Local buffering
As mentioned previously, we incorporate local buffering of image pixels in the SFM accelerator to avoid redundant transmission of common data across different branches of the accelerator. This local buffering optimization is applied to both the mm-lite-interface- and stream-interface-based accelerator implementations.
For an accelerator configuration with a single SFM instance, the input image data is transferred to the accelerator only during execution of the first branch. After being transferred, this data is retained in a local buffer within the accelerator for reuse by the remaining 31 executions. For accelerator configurations that have multiple (parallel) SFM instances, the input image is also transferred only once to the accelerator. For these configurations, the image data is reused by the remaining executions of all of the SFM instances. Thus, our incorporation of the local buffering optimization eliminates input image data transfers for all branches except the first one.
5. Results
In this section, we present experimental results to demonstrate the design and implementation methods provided by STMCM based on the detailed case study presented in Section 4. The main contribution of this section is to demonstrate that the proposed methodology facilitates efficient experimentation with alternative dataflow-based architectures, design optimization methods, and implementation trade-offs.
5.1. Embedded software implementation
In this section we present results of our experimentation using STMCM to explore alternative embedded software implementations. We focus specifically on the optimized application of loop tiling and buffer memory management.
5.1.1. Loop tiling
As introduced in Section 4.1.1, in the optimization of our LIDE-C implementation of the DNN application, we explored loop-tiled convolution actor designs with different tile sizes. Specifically, we measured the number of cache load misses and the cache load miss rates during execution of a convolution actor. The valid tile sizes for each convolution actor were those within the range of 1 to D, where D is the dimension of input images to the actor. For example, for the convolution actors in Layer 1, which process input images with size 96×96 pixels, we explored tile sizes within the range of 1–96.
Fig. 8 shows the number of cache load misses and the cache load miss rate under different tile sizes for convolution actors with different input image dimensions (48×48, 96×96, 750×750, and 1500×1500). As we can see from the results, the cache load miss rates are very small for image dimensions D ∈ {48, 96, 750}. This indicates that the data can be fully stored or almost fully stored in the cache with any valid tile size.
For D = 1500, however, there is significant variation in the cache load miss rate across different tile sizes. The rate reaches its lowest value when the tile size is approximately 400. With careful setting of the tile size, loop tiling significantly reduces the cache miss rate for convolution actors that have relatively large image dimensions.
Additionally, we can see that there is a large average CPU cycle count for small tile sizes in all figures. We expect that this is due to the overhead caused by the additional for loops that are introduced by the loop tiling transformation.

In summary, based on our simulation analysis, for small image dimensions (96×96 and 48×48), loop tiling does not help to reduce the cache miss rate on the target platform, and furthermore, it introduces overhead due to the additional for loops. Thus, loop tiling should not be applied to this DNN application for low image dimensions. However, our experiments also show that for larger image dimensions, loop tiling does help to improve efficiency by reducing the cache load miss rate.

5.1.2. Buffer memory management
Fig. 9 shows the amount of memory required for data storage in each DNN layer. We report memory requirements in this section in terms of pixels. In our experiments, we used a 4-byte floating point data type for each pixel. Fig. 9 also shows the amount of data communication that is needed between adjacent layers, and the amount of memory that must be active simultaneously during the computation associated with each layer. The memory needed for input is calculated as input_image_size × number_of_input_images. The memory needed for execution of each layer is calculated as input_image_size × (number_of_input_images + 1) × number_of_output_feature_maps.
As we can see from Fig. 9, the processing in Layer 2 requires the largest amount of active memory, and a minimum of 2,525,184 pixels must be allocated for buffer storage. The memory size can be optimized subject to this constraint through the application of shared FIFOs, which were introduced in Section 4.1.2. The buffer memory allocation that we propose for this DNN application based on shared FIFOs is illustrated in Fig. 10.
Table 4 summarizes the memory requirements for dataflow edges (FIFO buffers) and actors in the two convolutional layers, which require most of the memory among the five layers. These memory requirements are shown both with and without the use of shared FIFOs. As discussed in Section 4.1.2, actors operate on data from shared input FIFOs directly without copying the data to their internal memory. Thus, convolution actors only need memory for their intermediate computation results, and the Add and Maxpool&ReLU actors do not require additional memory. The results presented in this table quantitatively demonstrate the utility of shared FIFOs for this application. In particular, the application of shared FIFOs reduces the memory requirements by 65%.

Table 5
Resource utilization. In parentheses: the percentage of utilization with respect to the resources available on the targeted FPGA.

Design      LUTs           REGs          BUFGs      BRAMs      DSPs
Available   53,200         106,400       32         140        220
SFM         5188 (9.75)    3472 (3.26)   1 (3.1)    11 (7.9)   13 (5.9)
SFMa        5430 (10.20)   3687 (3.47)   2 (6.3)    11 (7.9)   13 (5.9)
SFMCG       5206 (9.79)    3496 (3.29)   5 (15.6)   11 (7.9)   13 (5.9)
SFMaCG      5479 (10.30)   3704 (3.48)   6 (18.8)   11 (7.9)   13 (5.9)
SFMh        5170 (9.72)    3472 (3.26)   1 (3.1)    11 (7.9)   13 (5.9)
SFMhCG      5198 (9.77)    3480 (3.27)   3 (9.4)    11 (7.9)   13 (5.9)
SFMauto     5230 (9.83)    3472 (3.26)   1 (3.1)    11 (7.9)   13 (5.9)
5.2. Hardware implementation
In this section, we investigate trade-offs among the variants of the SFM design that were introduced in Section 4.2.2. STMCM and the underlying LIDE-V approach allow one to perform such trade-off exploration, based on different combinations of high-level optimization techniques, in a systematic manner. In particular, STMCM allows the designer to focus on different strategies for instantiating, configuring, and coordinating different combinations of actor and buffer (edge) implementations, and eliminates the need for modification inside the actor and edge implementations. We exploited these advantages of STMCM when deriving the results presented in this section.
Table 5 depicts resource utilization data that is extracted from the post-place-and-route reports generated by the Xilinx Vivado tool using the targeted Zynq Z-7020 SoC. From the results in Table 5, we see that the different design variants all exhibit similar levels of resource cost. The asynchronous designs SFMa and SFMaCG incur the highest resource costs due to the additional logic required by the CDC FIFOs. The number of BUFGs varies significantly among the different designs, depending on the number of clock domains and the number of clock gated actors.
Each of the implemented designs has been simulated in order to generate a switching activity file, which has been back-annotated to Vivado Power Estimation to extract power consumption data. Since the designs have different execution times, the energy consumption levels do not vary in the same proportions as the power consumption levels. Table 6 summarizes the power consumption, execution time and energy consumption of the six alternative designs.
In these experiments, the clock frequencies of the synchronous designs and of Region 1 (CLK1) in the asynchronous designs are all set to 100 MHz, which is the maximum achievable frequency for the targeted platform. For Region 2 (CLK2) in the asynchronous designs, the frequency is set to 5 MHz. This setting of 5 MHz is derived from the hardware profiling data (see Table 3) as 1/20 of CLK1. These clock frequencies are specified in Table 6 with the suffix _F, where F represents the frequency value in MHz.
Fig. 9. Buffer memory and communication requirements in the DNN architecture.
Fig. 10. Buffer memory allocation for the DNN application.
Table 6
Dynamic power consumption, execution time and energy consumption of the different SFM variants. In parentheses: the percentage difference with respect to the baseline SFM.

Design     Power [mW]      Time [ns]              Energy [μJ]
SFM        115             2,329,165              268
SFMa_5     89 (-22.61)     2,407,300 (+3.354)     214 (-20.01)
SFMCG      89 (-22.61)     2,329,245 (+0.003)     207 (-22.61)
SFMaCG_5   88 (-23.48)     2,408,100 (+3.389)     212 (-20.89)
SFMh       117 (+1.74)     2,329,155 (-0.000)     273 (+1.74)
SFMhCG     105 (-8.70)     2,329,175 (+0.000)     244 (-8.70)
SFMauto    113 (-1.74)     2,329,165 (+0.000)     263 (-1.74)
According to Table 6, the clock gated designs SFMCG and SFMaCG_5 have the best capabilities for saving energy, reducing the total energy consumption by 22.61% and 20.89%, respectively. Design SFMaCG_5 saves less energy than SFMCG since the former employs one more BUFG. Furthermore, in SFMaCG_5, the actors in the slower domain (Region 2) are active for a relatively large portion of the execution time, and thus, they cannot be switched off for large proportions of time. In contrast, according to Table 3, the Deinterleave actor in Region 1 can be switched off for almost 90% of the total execution time.
The designs SFMaCG_5 and SFMa_5, both of which employ two clock domains with CLK1 at 100 MHz and CLK2 at 5 MHz, have similar capabilities to save energy. The former design is slightly more energy efficient compared to the latter. The results for these two designs show that the energy saved by switching off the actors, when inactive, and also the saving of the unused logic in the CDC FIFOs counterbalance the energy overhead due to the additional circuitry.
As expected, SFMh has a small amount of energy overhead due to the logic necessary to encapsulate Sum1, Sum2 and Maxpool&ReLU into the hierarchical actor. The design SFMhCG, among the clock gated designs, is not as advantageous as the previously analyzed designs in terms of energy saving. This is because, even though it employs only three BUFGs, the hierarchical actor is switched off only when none of the underlying actors are working. This means that, for instance, while Sum1 is active, the actors Sum2 and Maxpool&ReLU will have an active clock even when they are in an idle state (so that they keep wasting energy). Finally, SFMauto is the design with the smallest energy saving, only 1.74% compared to SFM. Even considering the same optimization technique (clock gating), the level at which it is applied turns out to be fundamental: at a low level (single flip-flops in SFMauto), only the dynamic power of a restricted number of gates can be saved. On the other hand, at a coarse-grain level (groups of dataflow actors in SFMCG), it is possible to act also on the clock tree, which is highly effective for improving power saving.
5.3. Hardware/software co-design results
In this section, we investigate different hardware/software co-design configurations. As anticipated in Section 4.1, depending on the portion of the application that is accelerated in hardware and on the given requirements and constraints, different design choices regarding the hardware/software communication interface lead to different trade-offs between resource requirements and performance. For the SFM accelerator, we investigated several implementation and optimization solutions, exploring three key aspects: exploiting parallelism, communication interfaces, and local buffering (see Section 4.3). In this section, by an SFM accelerator, we mean specifically a hardware accelerator.
The different software and hardware configurations that we explored in our co-design exploration are summarized as follows.
• SW1: The application runs in software on a single ARM core. This design can be viewed as a baseline design without any optimization or hardware acceleration. Comparisons between this baseline design and alternative designs are discussed in the remainder of this section.
• SW2: The application runs in software by using both of the ARM cores on the target platform.
• HW1: A single-branch SFM accelerator is employed to execute the first convolutional layer.
• HW2: An SFM accelerator with two parallel branches. In this configuration, a local buffer is shared between the branches.
• HW4: An SFM accelerator with four parallel branches. Again, a local buffer is shared among the branches.
For multicore software implementations and hardware implementations with multiple branches, the layer or layers that are executed in parallel (i.e., for which intra-layer parallelism is exploited) are indicated in parentheses. Similarly, hardware configurations are annotated with -mm or -s depending, respectively, on whether a memory-mapped AXI-Lite communication interface or a FIFO-based AXI-Stream interface is used.
For example, SW2(L1,L2) represents a software-only implementation in which Layer 1 and Layer 2 are executed in parallel. As another example, SW2(L2)/HW2(L1)-mm represents a hardware/software implementation based on configurations SW2 and HW2; in this implementation, Layer 2 is executed across multiple cores, Layer 1 is parallelized in hardware with 2 parallel branches, and AXI-Lite is used as the communication interface.
Note that the SFM accelerators are able to execute only the first convolutional layer. Thus, in all of the DNN system implementations, the accelerators are coupled with one of the software configurations.

5.3.1. Resource costs of accelerator implementations
Table 7 depicts the resource occupancy in the targeted Zynq Z-7020 device for the different SFM accelerator implementations that we experimented with. As expected, a higher level of parallelism (going from HW1-mm to HW4-mm) requires more resources, and our experiments here help to quantify the associated trends. For example, fine-grained and computation-related resources (LUTs, REGs and DSPs) increase linearly with the number of parallel branches placed in the accelerator (about +100% with one more branch and about +300% with three more branches), while coarse-grained memory resources (BRAMs) exhibit a gentler slope. We expect that this gentler slope results because the primary BRAM-demanding module, the local buffer, is shared across parallel branches.

Table 7
Resource occupancy for different SFM accelerator implementations. In parentheses: the percentage of utilization with respect to the resources available on the targeted FPGA. The bottom part of the table depicts the percentage of variation with respect to HW1-mm.

Design      LUTs            REGs            BRAMs        DSPs
Available   53,200          106,400         140          220
HW1-mm      5395 (10.14)    4668 (4.39)     43 (30.71)   13 (5.91)
HW2-mm      10890 (20.47)   8197 (7.70)     54 (38.57)   26 (11.82)
HW4-mm      21474 (40.36)   16331 (15.35)   76 (54.29)   52 (23.64)
HW2-mm      +101.85         +75.60          +25.58       +100.00
HW4-mm      +298.04         +249.85         +76.74       +300.00

Table 8
Performance of different co-design solutions. The top part of the table depicts execution time in milliseconds (ms). The bottom part depicts the percentage of execution time variation for each configuration with respect to SW1.

Configuration          input    Layer 1   Layer 2   Layers 3:5   Prediction
SW1                    118.9    640.3     1594.7    34.4         2388.2
SW2(L1)                118.7    368.3     1609.8    34.0         1639.5
SW2(L1,L2)             117.4    354.7     842.1     33.8         1348.0
SW2(L2)/HW1(L1)-mm     118.9    118.5     856.7     35.4         1129.5
SW2(L2)/HW2(L1)-mm     118.4    74.6      866.5     35.1         1094.5
SW2(L2)/HW4(L1)-mm     117.9    54.5      859.0     35.4         1066.8
SW2(L1)                -0.13    -42.48    +0.95     +1.12        -10.75
SW2(L1,L2)             -1.22    -44.61    -47.19    -1.72        -43.56
SW2(L2)/HW1(L1)-mm     -0.00    -81.49    -46.28    +2.89        -52.71
SW2(L2)/HW2(L1)-mm     -0.40    -88.36    -45.66    +2.09        -54.17
SW2(L2)/HW4(L1)-mm     -0.81    -91.48    -46.13    +2.85        -55.33
The results above indicate that when the DNN architecture is made deeper (i.e., as the number of convolutional layers is increased), the biggest restriction will be the hardware resource limitations. Usually, as a DNN is made deeper, more parallel branches are needed to complete the computation without compromising the processing speed, and more memory resources are needed to store the intermediate feature maps. However, deeper networks do not necessarily imply more computational complexity. For example, the well-known ResNet101, which has 101 layers, needs less computation than the 16-layer VGG16 because the VGG layers are significantly larger [39,40].
5.3.2. Comparison of co-design solutions
Table 8 presents performance results for the different software-only and hardware/software solutions that we investigated using STMCM. In particular, the table reports the execution time in terms of milliseconds (ms) for different execution phases: reading the input file (column input), computing the first and the second layers, and computing the deep layers (Layers 3, 4 and 5). The table also reports the execution time of the overall application (prediction) for different degrees of software and hardware parallelism.
The reference time is given by the execution of the entire DNN application on a single ARM core (SW1), which is capable of completing the prediction in about 2.4 seconds. From this reference configuration, it is also possible to appreciate the computational load of the different application phases. The heaviest part is Layer 2, which is responsible for more than 65% of the overall execution time, while most of the remaining load is attributable to Layer 1 (around 25%), and to reading of the input file (about 5%). For this reason, software parallelization has been evaluated only on Layer 1 (SW2(L1)), and on both Layers 1 and 2 together (SW2(L1,L2)).
Table 9 (excerpt)
Resource costs related to the communication interfaces, with utilization percentages of the target platform given in parentheses. The bottom rows report the variation of each configuration with respect to HW1-mm, in percent.

                   LUTs           FFs            BRAMs         DSPs
(3) DMA (stream)   1490 (2.80)    1881 (1.77)    3 (2.14)      0 (0.00)
(1)+(2)+(3)        7486 (14.07)   6480 (6.09)    56 (40.00)    13 (5.91)

Variation w.r.t. HW1-mm [%]
HW1-s              +7.21          -6.66          +0.00         +0.00
(1)+(2)+(3)        +38.75         +38.81         +53.49        +0.00
The execution time needed by each of the major execution phases is almost halved when two cores are adopted. A precise 50% reduction is not reached because of the software overhead necessary to manage multitasking. With software parallelization only, the overall execution time is reduced to 1.13 seconds, about 44% less than the SW1 configuration. Hardware acceleration and related parallelization are only applied to the first convolutional layer, while only software parallelization is applied to Layer 2. If we consider only the execution time of Layer 1, then SW2(L2)/HW1(L1) reduces execution time by more than 80% compared to SW1, and more than 65% compared to SW2(L1,L2).
If multiple branches of Layer 1 are processed in parallel, the hardware accelerator achieves further performance benefits: a time saving up to 88% for a 2-branch configuration (SW2(L2)/HW2(L1)-mm), and up to 91% for a 4-branch configuration (SW2(L2)/HW4(L1)-mm). These performance improvements are with respect to SW1. Note that the speed-up obtained by doubling the number of branches (going from 1 to 2 and from 2 to 4) is less than 2 in either case (1.6 from 1 to 2, and 1.4 from 2 to 4). This is due to the software overhead related to managing multiple branches. Due to the limited computational load of Layer 1, the benefits of hardware acceleration and parallelization on the overall system are somewhat limited. The best solution, SW2(L2)/HW4(L1)-mm, requires 1.07 seconds to perform the whole application, 55% less than a full software execution on a single core (SW1) and 21% less than a full software execution on two cores (SW2(L1,L2)).
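These sub-linear speed-ups can be reproduced from the Layer 1 column of Table 8:

```python
# Layer 1 execution times [ms] from Table 8, by number of hardware branches.
layer1_ms = {"SW1": 640.3, "HW1": 118.5, "HW2": 74.6, "HW4": 54.5}

# Speed-up from doubling the number of accelerator branches.
speedup_1_to_2 = layer1_ms["HW1"] / layer1_ms["HW2"]  # 1 -> 2 branches
speedup_2_to_4 = layer1_ms["HW2"] / layer1_ms["HW4"]  # 2 -> 4 branches

# Layer 1 time saving of the 4-branch accelerator relative to SW1.
saving_vs_sw1 = 1 - layer1_ms["HW4"] / layer1_ms["SW1"]

print(f"{speedup_1_to_2:.1f}x, {speedup_2_to_4:.1f}x, "
      f"{100 * saving_vs_sw1:.0f}% saving vs SW1")
```

Both doublings fall well short of the ideal 2x, consistent with the branch-management overhead discussed above.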
Another aspect that has been studied in our co-design experiments is the interfacing between system components. As discussed in Section 4.3.2, the adopted communication interface between software and hardware portions of a design can have a significant impact on overall system performance. In our co-design experiments, we have applied two very different interfaces, mm-lite and stream, which are discussed in Section 4.3.2.
Table 9 helps to understand differences between the resource costs of these two interfaces. The first row of this table shows resource availability on the target platform. The second and third rows show resource costs for the HW1-mm accelerator and the HW1-s accelerator. The fourth and fifth rows show resource costs for FIFO and DMA modules (external to the accelerator) that are necessary for the stream interface. The sixth row shows the total resource costs induced by use of the stream interface
((1)+(2)+(3)). Compared to HW1-mm, the complete stream-based configuration requires almost 39% more LUTs and flip-flops, while over 50% more BRAM cost is incurred.

To make the stream interface a useful option in our system design, its significant increase in resource costs should be accompanied by tangible advantages in execution performance. Table 10 shows results pertaining to the impact of the communication interface on execution time. In order to better expose the effects of the selected communication interface, details on data transfers (input, convolution coefficients and outputs) between the hardware (accelerator) and software subsystems are reported. For the HW1-s design, two different sets of results are reported depending on whether program data is directly accessible by the DMA engine. For one set, the program data is located in a memory that is not directly accessible by the DMA. This scenario corresponds to the design that we have implemented. It requires an additional copy of the program data in a memory that is accessed directly by the DMA. For the other set, the program data is located in a memory that is directly accessible by the DMA. This set is indicated in Table 10 using the annotation HW1-s dir. We have not implemented HW1-s dir; instead, we have estimated the corresponding results to gain some idea about the maximum achievable performance. Details on the estimation approach are omitted for brevity.

The results in Table 10 demonstrate the utility of the resource-hungry HW1-s design, and quantify its clear ability to outperform HW1-mm. In particular, the input data transmission and the transmission of convolution coefficients are respectively about 84% and 67% faster when the AXI-stream protocol is adopted. This leads to an estimated time saving of up to 92% and 80%, respectively, when the DMA has direct access to the program data (HW1-s dir).
On the other hand, the output data transmission time is the same among all of the reported configurations. We expect that this is because the outputs are produced in a row-by-row fashion (48 data units at a time), and the timing of output production is determined by the computation latency, which is greater than the communication latency for all of the interfacing configurations. However, looking at the total Layer 1 and DNN application execution times, we see that the advantages of adopting the stream interface are no longer visible. Indeed, for the considered SFM accelerator, the input data is transmitted only during the first branch execution due to our use of local buffering. Additionally, even though the coefficients are transmitted for each branch, their transmission requires a relatively small amount of time.
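The observation that output-transfer time is interface-independent can be captured with a simple producer-limited pipeline model. The sketch below uses illustrative latency values, not measured ones; it only shows that once computation paces the pipeline, a faster link cannot shorten the total output time.

```python
def output_time_us(n_rows, compute_per_row_us, transfer_per_row_us):
    """Each output row (48 data units) can only be sent once computed,
    so the slower of computation and communication paces every row."""
    return n_rows * max(compute_per_row_us, transfer_per_row_us)

n_rows = 48          # assumed number of output rows (illustrative)
compute_us = 50.0    # assumed per-row computation latency (illustrative)
mm_lite_us = 20.0    # assumed per-row transfer latency, slower interface
stream_us = 5.0      # assumed per-row transfer latency, faster interface

# Both links are faster than the computation, so the totals coincide.
t_mm = output_time_us(n_rows, compute_us, mm_lite_us)
t_stream = output_time_us(n_rows, compute_us, stream_us)
print(t_mm == t_stream)
```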
These results involving communication interface selection illustrate the importance of comprehensive system-level evaluation of alternative design options, which is one of the key parts of the design process that is facilitated by STMCM.
Table 10
Results pertaining to the impact of the communication interface on execution time. The top part of the table depicts the execution time of the different DNN application steps. The bottom part depicts the execution time variation of each configuration with respect to HW1-mm, in percent.

            File input   Layer 1                                                        Layer 2   Layers 3, 4, 5   Prediction
            [ms]         input tx [us]   coeffs tx [us]   output tx [us]   total [ms]   [ms]      [ms]             [ms]
HW1-mm      118.9        5222            15               2339             118.5        856.7     35.4             1129.5
HW1-s       117.5        854             5                2333             114.6        864.3     34.9             1131.3
HW1-s dir   118.3        418             3                2333             114.1        869.5     35.0             1137.0

Variation w.r.t. HW1-mm [%]
HW1-s       -1.17        -83.65          -66.67           -0.26            -3.27        +0.87     -1.39            +0.16