Algorithms for Advanced Processor Architectures - AFAPA 12/10/21
Cell BE Architecture
Authors
Class Objectives – Things you will learn
Current microprocessor design constraints
Cell processor organization and components
– Power processor element, block diagram, PXU pipeline
– Synergistic processor element, block diagram, SXU pipeline
– Memory flow controller and MFC commands
– Element interconnect bus, command and data topology
– I/O and memory interfaces
Class Agenda
Current microprocessor design
Cell processor components
– Power processor element
– Synergistic processor element
– Memory flow controller
– Element interconnect bus
– I/O and memory interfaces
– Resource allocation management
References
Jim Kahle, Cell Broadband Engine and Cell Broadband Engine Architecture
J.J. Porta, Cell Overview
Trademarks - Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc.
Current Microprocessor Design
Three Major Limiters to Processor Performance
Frequency Wall
– Diminishing returns from deeper pipelines
Memory Wall
– Processor frequency vs. DRAM memory latency
– Latency introduced by multiple levels of memory
Power Wall
– Limits in CMOS technology
– Hard limit to acceptable system power
Where have all of the transistors gone?
Memory and memory management
– Larger and larger cache sizes needed for visible improvement
Trying to extract parallelism
– Long pipelines with lots of interlock logic
– Superscalar designs (multiple pipelines) with even more interlocks!
– Branch prediction to effectively utilize the pipelines
– Speculative execution
– Out-of-order execution
– Hardware threads
The number of transistors doing direct computation is shrinking relative to the total number of transistors.
Techniques for Better Efficiency
Chip level multi-processors
– Go back to simpler core designs and use the extra available chip area for multiple cores.
Vector Units/SIMD
– Have the same instruction execution logic operate on multiple pieces of data concurrently; allows more throughput with little increase in overhead (see the sketch after this list)
Rethink memory organization
– At the register level, hidden registers with register renaming are transparent to the end user but expensive to implement. Can a compiler do better with more architected registers in the instruction set?
– Caches are automatic but not necessarily the most efficient use of chip area.
Temporal and spatial locality mean little when processing streams of data.
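To make the SIMD point above concrete, here is a minimal sketch in C using the SPU intrinsics that appear later in this deck (spu_intrinsics.h); the function name and the alignment/size assumptions are ours, not from any particular library.

```c
#include <spu_intrinsics.h>

/* Illustrative SIMD kernel: y[i] = a*x[i] + y[i], four floats per
 * instruction. Assumes x and y are 16-byte aligned and n is a
 * multiple of 4 (hypothetical helper, not from the SDK). */
void saxpy4(float a, vector float *x, vector float *y, unsigned int n)
{
    vector float va = spu_splats(a);       /* replicate a into all 4 slots */
    for (unsigned int i = 0; i < n / 4; i++)
        y[i] = spu_madd(va, x[i], y[i]);   /* one fused multiply-add, 4 lanes */
}
```

One multiply-add instruction per iteration does the work of four scalar multiply-adds, with essentially no extra control overhead.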
Today's Microprocessor Design: Power Constrained
Transistor performance scaling
– Channel off-current & gate-oxide tunneling challenge supply-voltage scaling
– Material and structure changes required to stay on Moore’s Law
– Power per (switched) transistor decreases only slowly
Chip power consumption increases faster than before
Microprocessor performance limited by processor transistor power efficiency
We know how to design processors we cannot reasonably cool or power
Net: Increasing Performance requires Increasing Power Efficiency
Moore's Law: 2x transistor density every 18-24 months
Hardware Accelerators Concept
Streaming systems use architecture to address the power/area/performance challenge
– Processing unit with many specialized co-processor cores
– Storage hierarchy (on-chip private, on-chip shared, off-chip) explicit at the software level
– Applications coded for parallelism with localized data access to minimize off-chip accesses
Advantage demonstrated per chip and per Watt of ~100x on well-behaved applications (graphics, media, DSP, …)
Concept/Key Ideas
Augment a standard CPU with many (8-32) small, efficient hardware accelerators (SIMD or vector) with private memories, visible to applications.
Rewrite code to a highly parallel model with explicit on-chip vs. off-chip memory.
Stream data between cores on the same chip to reduce off-chip accesses.
Cell BE Solutions
Increase concurrency
– Multiple cores
– SIMD/Vector operations in a core
– Start memory movement early so that memory is available when needed
Increase efficiency
– Simpler cores devote more resources to actual computation
– Programmer managed memory is more efficient than dragging data through caches
– Large register files give the compiler more flexibility and eliminate transistors needed for register renaming
– Specialize processor cores for specific tasks
Other (Micro)Architectural Decisions
Large shared register file
Local store size tradeoffs
Dual issue, In order
Software branch prediction
Channels
Microarchitecture decisions, more so than architecture decisions, show a bias towards compute-intensive codes
The Cell BE Concept
Compatibility with 64b Power Architecture™
– Builds on and leverages IBM investment and community
Increased efficiency and performance
– Attacks on the "Power Wall"
• Non-homogeneous coherent multiprocessor
• High design frequency at a low operating voltage with advanced power management
– Attacks on the "Memory Wall"
• Streaming DMA architecture
• 3-level memory model: main storage, local storage, register files
– Attacks on the "Frequency Wall"
• Highly optimized implementation
• Large shared register files and software-controlled branching to allow deeper pipelines
Interface between user and networked world
– Image-rich information, virtual reality, shared reality
– Flexibility and security
– Multi-OS support, including RTOS / non-RTOS
Cell Synergy
Cell is not a collection of different processors, but a synergistic whole
– Operation paradigms, data formats and semantics consistent
– Share address translation and memory protection model
PPE for operating systems and program control
SPE optimized for efficient data processing
– SPEs share Cell system functions provided by Power Architecture
– MFC implements interface to memory
• Copy in/copy out to local storage
PowerPC provides system functions
– Virtualization
– Address translation and protection
– External exception handling
EIB integrates the system as a data transport hub
State of the Art: Intel Core 2 Duo
Guess where the cache is?
State of the Art: IBM Power 5
How about on this one?
Unconventional State of the Art: Cell BE
memory warehousing vs. in-time data processing
Today's x86 quad-core processors are dual-chip modules (DCMs): two dual-core dies stacked vertically & packaged together
Cell/B.E. – ½ the space & power vs. traditional approaches
On any traditional processor, the ratio of cores to cache, prediction, & related items illustrated here remains at ~50% of the chip area
Example dual core: 349 mm², 3.4 GHz @ 150 W, 2 cores, ~54 SP GFlops
Cell/B.E.: 3.2 GHz, 9 cores, ~230 SP GFlops
Cell BE Architecture
Cell Architecture is … 64b Power Architecture™
[Block diagram: Power ISA cores, each with an MMU/BIU, on a coherent bus with I/O translation and memory, including coherence/memory support – compatible with 32/64b Power Arch. applications and OSs]

Cell Architecture is … 64b Power Architecture™ plus Memory Flow Control (MFC)
[Block diagram: the coherent bus gains resource allocation groups (+RAG), the cores and their MMU/BIUs gain resource management tables (+RMT), and MFC units (MMU/DMA +RMT) with Local Store memories are added; each Local Store is aliased into the address space (LS alias)]

Cell Architecture is … 64b Power Architecture™ + MFC plus Synergistic Processors
[Block diagram: Synergistic Processor ISA cores are attached to the MFC units and their Local Stores]
Coherent Offload Model
DMA into and out of Local Store equivalent to Power core loads & stores
Governed by Power Architecture page and segment tables for translation and protection
Shared memory model
– Power Architecture compatible addressing
– MMIO capabilities for SPEs
– Local Store is mapped (aliased), allowing LS-to-LS DMA transfers
– DMA equivalents of locking loads & stores
– OS management/virtualization of SPEs
• Pre-emptive context switch is supported (but not efficient)
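As a minimal sketch of what "DMA equivalent to loads & stores" looks like from the SPU side (using the SDK's spu_mfcio.h wrappers; the buffer name and sizes are illustrative):

```c
#include <spu_mfcio.h>

/* Pull one 4 KB block from an effective address into Local Store, then
 * block until the tag group completes. Because Local Stores can be
 * aliased into the EA space, the same call also performs LS-to-LS
 * transfers when 'ea' points at another SPE's mapped Local Store. */
static char buf[4096] __attribute__((aligned(128)));

void fetch_block(unsigned long long ea)
{
    const unsigned int tag = 3;        /* any tag group id, 0..31 */
    mfc_get(buf, ea, sizeof(buf), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();         /* low-power wait for completion */
}
```

The EA is translated and protected by the same page and segment tables that govern PPE loads and stores, which is exactly the coherent-offload point above.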
Cell Broadband Engine TM :
A Heterogeneous Multi-core Architecture
1 PPE core:
– VMX unit
– 32k L1 caches
– 512k L2 cache
– 2-way SMT
8 SPE cores:
– 128-bit SIMD instruction set
– Register file: 128 x 128-bit
– Local store: 256KB
– MFC
– Isolation mode
Cell BE Block Diagram
[Block diagram: the EIB (up to 96B/cycle) connects the PPE (PXU, L1, PPU, and L2: 16B/cycle to L1, 32B/cycle to L2), eight SPEs (each an SPU/SXU with Local Store and MFC, 16B/cycle each), the MIC (dual XDR™, 16B/cycle), and the BIC (FlexIO™, 16B/cycle); 64-bit Power Architecture with VMX]
PPE Block Diagram
[Pipeline diagram: fetch is steered by branch scan and fetch control into the L1 instruction cache (backed by the L2 interface), then pre-decode, microcode, the SMT dispatch queue, decode/dependency checking, and issue. Issue feeds the VMX/FPU issue queue (VMX load/store/permute, VMX arithmetic/logic unit, FPU load/store, FPU arithmetic/logic unit) plus the load/store unit with its L1 data cache, the branch execution unit, and the fixed-point unit, all draining into FPU/VMX completion and a completion/flush stage. Thread A and Thread B alternate fetch and dispatch cycles.]
SPE Block Diagram
[Block diagram of the SPE]
– SPU Core (SXU): registers & logic
– Channel Unit: message-passing interface for I/O
– Local Store: 256KB of SRAM private to the SPU Core
– MFC (DMA Unit): transfers data between Local Store and main memory
Local Store
Never misses
– No tags, backing store, or prefetch engine
– Predictable real-time behavior
– Less wasted bandwidth
– Easier programming model to achieve very high performance
– Software-managed caching
– Large register file –or– local memory
– DMAs are fast to set up – almost like normal load instructions
– Can move data from one local store to another
No translation
– A multiuser operating system runs on the control processor
– Can be mapped as system memory – cached copies are non-coherent with respect to SPU loads/stores
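One way to picture "software-managed caching" on the Local Store is a toy one-line cache. All names here are our own (the IBM SDK ships a real software-cache library for this job), and the sketch ignores write-back and concurrency:

```c
#include <spu_mfcio.h>

#define LINE 1024                      /* software cache line size */

static char line[LINE] __attribute__((aligned(128)));
static unsigned long long cached_base = ~0ULL;   /* no line resident yet */

/* Return a Local Store pointer for byte 'ea' of system memory,
 * refilling the line by DMA only on a miss. */
char *sw_cache_lookup(unsigned long long ea)
{
    unsigned long long base = ea & ~(unsigned long long)(LINE - 1);
    if (base != cached_base) {         /* miss: refill the line */
        mfc_get(line, base, LINE, 0, 0, 0);
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();     /* wait for the refill */
        cached_base = base;
    }
    return &line[ea & (LINE - 1)];     /* hit costs no DMA at all */
}
```

Unlike a hardware cache, the hit path, line size, and replacement policy are all visible to and tunable by the programmer.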
Channels: Message Passing I/O
The interface from the SPU core to the rest of the system
Channels have capacity => allows pipelining
Instructions: read, write, read capacity
Effects appear at the SPU core interface in instruction order
Blocking reads & writes stall the SPU in a low-power wait mode
Example facilities accessible through the channel interface:
– DMA control
– Counter-timer
– Interrupt controller
– Mailboxes
– Status
– Interrupts, BISLED (branch indirect and set link if external data)
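A sketch of the channel interface in use, via the SDK's mailbox wrappers in spu_mfcio.h (the handshake itself is our example, not a prescribed protocol):

```c
#include <spu_mfcio.h>

/* SPU side of a mailbox handshake. Blocking channel accesses stall
 * the SPU in its low-power wait mode, exactly as described above. */
void handshake(void)
{
    unsigned int work = spu_read_in_mbox();   /* blocks until the PPE writes */
    /* ... process 'work' ... */
    spu_write_out_mbox(work);                 /* blocks if the outbox is full */
}

/* The "read capacity" instruction form supports polling instead: */
int work_available(void)
{
    return spu_stat_in_mbox() > 0;            /* entries waiting in the inbox */
}
```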
SPU Organization
[Detailed SPE block diagram: the register file, with result forwarding and staging, feeds the permute, load-store, floating-point, fixed-point, branch, and channel units; the instruction issue unit / instruction line buffer and the DMA unit share the single-ported 256kB Local Store SRAM (128B read, 128B write), which connects to the on-chip coherent bus. Datapath widths range from 8 to 128 bytes/cycle.]
SPE Issue
In-order
Dual issue requires alignment according to type
– The instruction from address 0 goes to pipe 0, the instruction from address 4 to pipe 1
– Instruction swap forces single issue
– Saves a pipeline stage & simplifies resource checking
9 units & 2 pipes chosen for best balance
– Pipe 0: Simple Fixed, Shift, Single Precision, Floating Integer, Byte
– Pipe 1: Permute, Local Store, Channel, Branch
SPE Pipeline Diagram

Instruction Class       Pipe   Latency (cycles)
Simple Fixed (FX)        0      2
Shift (FX)               0      4
Single Precision (FP)    0      6
Floating Integer (FP)    0      7
Byte (BYTE)              0      4
Permute (PERM)           1      4
Load (LS)                1      6
Branch                   1      4
Channel                  1      6
SPE Branch Considerations
Mispredicts cost 18 cycles
– Not bad for 11 FO4 processor
No hardware history mechanism
Branch penalty avoidance techniques
– Write frequent path as inline code
– Compute both paths and use a select instruction
– Unroll loops – also reduces dependency stalls
– Load target into branch target buffer
• Branch hint instruction
• 16 cycles ahead of the branch instruction
• Single-entry BTB
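Two of these techniques sketched in C for spu-gcc (vmax, mostly_zero, and slow_path are our illustrative names):

```c
#include <spu_intrinsics.h>

/* "Compute both paths and use select": branch-free per-lane maximum. */
vector float vmax(vector float a, vector float b)
{
    vector unsigned int gt = spu_cmpgt(a, b);  /* all-ones where a > b */
    return spu_sel(b, a, gt);                  /* take a where gt, else b */
}

/* Hint-based route: with spu-gcc, __builtin_expect steers code layout
 * and branch-hint placement so the common path falls through. */
extern int slow_path(int x);                   /* placeholder */

int mostly_zero(int x)
{
    if (__builtin_expect(x != 0, 0))           /* declare this path rare */
        return slow_path(x);
    return 0;
}
```

The select version trades a few extra instructions for the certainty of never paying the 18-cycle mispredict.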
SPE Instructions
Scalar processing supported on a data-parallel substrate
– All instructions are data parallel and operate on vectors of elements
– Scalar operation defined by instruction use, not opcode
• Vector instruction form used to perform the operation
Preferred slot paradigm
– Scalar arguments to instructions found in the "preferred slot"
– Computation can be performed in any slot
Register Scalar Data Layout
Preferred slot in bytes 0-3
– By convention for procedure interfaces
– Used by instructions expecting scalar data
• Addresses, branch conditions, generate controls for insert
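In C this paradigm is mostly invisible (the compiler keeps scalars in the preferred slot for you), but it can be made explicit with intrinsics; a small sketch:

```c
#include <spu_intrinsics.h>

/* Move a scalar into the preferred slot (element 0, bytes 0-3),
 * operate on it with an ordinary vector instruction, and read it back. */
unsigned int add_one(unsigned int s)
{
    vector unsigned int v = spu_promote(s, 0); /* scalar into slot 0 */
    v = spu_add(v, spu_splats(1u));            /* a vector add does the work */
    return spu_extract(v, 0);                  /* scalar back out of slot 0 */
}
```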
Memory Management & Mapping
System Memory Interface:
- 16 B/cycle
- 25.6 GB/s (1.6 GHz)
Communication Performance
Internal bus (Element Interconnect Bus EIB) with peak performance of 204.8 Gbytes/second
Memory Bandwidth 25.6 Gbytes/second
Impressive I/O bandwidth
– 25 Gbytes/second inbound
– 35 Gbytes/second outbound
Many outstanding memory requests
– Up to 128, typical of multi-threaded processors
Moving the Spotlight from Processor Performance to Communication Performance
Traditionally the focus is on (raw) processor performance
Emphasis is now shifting towards communication performance
Lots of (peak) processing power inside a chip (approaching a Teraflop)
– Small fraction is delivered to applications
Lots of (peak) aggregate communication bandwidth inside the chip (approaching Terabytes/sec)
– But processing units do not interact frequently
Small on chip local memories
– Little data reuse
Main memory bandwidth is the primary bottleneck
And then I/O and network bandwidth
Performance and Programmability
Programming model is already a critical issue
– And it is going to get worse
Low data-reuse increases the algorithmic complexity
Memory and Network bandwidth are key to achieve performance and simplify the programming model
Multi-core, uni-bus
SPE Internal Architecture
[Block diagram: the SPU (3.2 GHz) talks through its channel interface and Local Store (16B/cycle in and out) to the MFC (1.6 GHz), which contains the DMAC with its DMA queues, the MMU and TLB, and the BIU onto the EIB; the MIC connects the EIB to off-chip memory. Numbered callouts (1)-(7) mark the stages of a DMA transfer from issue to data delivery.]
Cell BE Communication Architecture
SPUs can only access programs and data in their local storage
SPEs have a DMA controller that performs transfers between local stores, main memory and I/O
SPUs can post a list of DMAs
SPUs can also use mailboxes and signals to perform basic synchronizations
More complex synchronization mechanisms can support atomic operations
All resources can be memory mapped
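For orientation, this is roughly what the PPE side looks like with libspe2 (the SDK's SPE runtime library); spu_kernel stands in for an embedded SPU program handle and is our placeholder:

```c
#include <stdio.h>
#include <libspe2.h>

extern spe_program_handle_t spu_kernel;   /* hypothetical embedded SPU binary */

int main(void)
{
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    if (!ctx) { perror("spe_context_create"); return 1; }

    if (spe_program_load(ctx, &spu_kernel)) {
        perror("spe_program_load");
        return 1;
    }

    unsigned int entry = SPE_DEFAULT_ENTRY;
    /* Runs the SPU program to completion on the calling thread. */
    if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0)
        perror("spe_context_run");

    spe_context_destroy(ctx);
    return 0;
}
```

In practice one PPE thread per SPE context is spawned so that all eight SPEs run concurrently.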
DMA & Multibuffering
[Timing diagram: a single thread alternates init DMA fetch, wait for data, compute, init DMA store; with two software threads, control switches to thread 2 while thread 1's DMA transfers are in flight, and back again.]
DMA commands move data between system memory & Local Storage
DMA commands are processed in parallel with software execution
– Double buffering
– Software multithreading
16 queued commands – up to 16 kB/command
Up to 16 transfers of 128 bytes each in flight on the on-chip interconnect
Richer than typical cache prefetch instructions
– Scatter-gather
– Flexible DMA command status; can achieve low-power wait
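A minimal double-buffering loop, assuming a 16 KB chunk size (the largest single DMA transfer) and hypothetical process()/nchunks application code:

```c
#include <spu_mfcio.h>

#define CHUNK 16384                        /* 16 KB: max size of one DMA */

extern void process(char *buf, unsigned int n);    /* placeholder */

/* Stream 'nchunks' chunks from EA 'ea': fetch chunk i+1 while
 * computing on chunk i, using the buffer index as the DMA tag. */
void stream(unsigned long long ea, unsigned int nchunks)
{
    static char buf[2][CHUNK] __attribute__((aligned(128)));
    unsigned int cur = 0, i;

    if (nchunks == 0) return;
    mfc_get(buf[0], ea, CHUNK, 0, 0, 0);           /* prime buffer 0 */
    for (i = 0; i < nchunks; i++) {
        unsigned int nxt = cur ^ 1;
        if (i + 1 < nchunks)                       /* prefetch next chunk */
            mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();                 /* wait for current only */
        process(buf[cur], CHUNK);                  /* overlaps the next DMA */
        cur = nxt;
    }
}
```

Because the wait is masked to the current buffer's tag group, the compute on chunk i fully overlaps the DMA for chunk i+1.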
Basic Latencies (3.2 GHz)

Latency component                Cycles   Nanoseconds
DMA issue                          10       3.125
DMA to EIB                         30       9.375
List element fetch                 10       3.125
Coherence protocol                100      31.25
Data transfer for inter-SPE put   140      43.75
TOTAL                             290      90.625

[Charts: DMA latency, DMA bandwidth, DMA batches (put)]
Element Interconnect Bus (EIB)
– 96B/cycle bandwidth
Internal Structure of the Cell BE
[Diagram: the twelve EIB elements – MIC, SPE0, SPE2, SPE4, SPE6, BIF/IOIF0 on one side and PPE, SPE1, SPE3, SPE5, SPE7, IOIF1 on the other – each attach through their own ramp controller (ramps 0-11) to the central data arbiter.]
[Charts: aggregate behavior, hot spot, and latency distribution under hot-spot]
I/O Interface:
- 16 B/cycle x 2
I/O and Memory Interfaces
I/O provides wide bandwidth
– Dual XDR™ controller (25.6GB/s @ 3.2Gbps)
– Two configurable interfaces (76.8GB/s @ 6.4Gbps)
• Configurable number of bytes
• Coherent or I/O protection
– Allows for multiple system configurations
Cell BE Processor Can Support Many Systems
Game console systems
Blades
HDTV
Home media servers
Supercomputers
[Configuration diagrams: a single Cell BE processor with XDR™ memory and two FlexIO interfaces (IOIF0, IOIF1); two Cell BE processors coherently coupled over the BIF, each with its own XDR™ memory and IOIF; and four Cell BE processors joined through a switch on the BIF.]
Is this a processor or a supercomputer on a chip?
Striking similarities with high-performance networks for supercomputers
– E.g., Quadrics Elan4
DMAs overlap computation and communication
Similar programming model
Similar synchronization algorithms
– Barriers, allreduces, scatter & gather
We can adopt the same techniques that we already use in high-performance clusters and supercomputers!
[Block diagram of the MFC, with data, snoop, and control buses: the SPU and its Local Store sit behind the DMA engine and DMA queue, the atomic facility, the MMU with RMT, and the bus interface control; MMIO and load/store translation come in from the bus side.]
Memory Flow Control System
• DMA Unit
– LS <-> LS, LS <-> system memory, LS <-> I/O transfers
– 8 PPE-side command queue entries
– 16 SPU-side command queue entries
• MMU similar to the PowerPC MMU
– 8 SLBs, 256 TLBs
– 4K, 64K, 1M, 16M page sizes
– Software/HW page table walk
– PT/SLB misses interrupt the PPE
• Atomic Cache Facility
– 4 cache lines for atomic updates
– 2 cache lines for cast-out/MMU reload
• Up to 16 outstanding DMA requests in the BIU
• Resource / Bandwidth Management Tables
– Token-based bus access management
– TLB locking
Isolation Mode Support (Security Feature)
Hardware-enforced "isolation"
– SPU and Local Store not visible (bus or JTAG)
– Small LS "untrusted area" for communication
Secure Boot
– Chip-specific key
– Decrypt/authenticate boot code
"Secure Vault" – runtime isolation support
– Isolate Load feature
– Isolate Exit feature
Per SPE Resources (PPE Side)
Problem State (each region on its own 4K physical page boundary):
– 8-entry MFC command queue interface
– DMA command and queue status
– DMA tag status query mask
– DMA tag status
– 32-bit mailbox status and data from SPU
– 32-bit mailbox status and data to SPU (4-deep FIFO)
– Signal notification 1 and 2
– SPU run control
– SPU next program counter
– SPU execution status
– Optionally mapped 256K Local Store
Privileged 1 State (OS):
– SPU privileged control
– SPU channel counter initialize
– SPU channel data initialize
– SPU signal notification control
– SPU decrementer status & control
– MFC DMA control
– MFC context save/restore registers
– SLB management registers
Privileged 2 State (OS or Hypervisor):
– SPU master run control
– SPU ID
– SPU ECC control, status, and address
– SPU 32-bit PU interrupt mailbox
– MFC interrupt mask and status
– MFC DMA privileged control
– MFC command error register
– MFC command translation fault register
– MFC SDR (PT anchor)
– MFC ACCR (address compare)
– MFC DSSR (DSI status) and MFC DAR (DSI address)
– MFC LPID (logical partition ID)
– MFC TLB management registers
– Optionally mapped 256K Local Store
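On the PPE, the problem-state mailbox registers above are normally reached through libspe2 rather than raw MMIO; a sketch of the PPE half of a mailbox handshake (ctx is assumed to be an already-running SPE context):

```c
#include <libspe2.h>

void ppe_handshake(spe_context_ptr_t ctx, unsigned int work)
{
    unsigned int reply;

    /* Write into the SPU's 4-deep inbound mailbox, blocking if full. */
    spe_in_mbox_write(ctx, &work, 1, SPE_MBOX_ALL_BLOCKING);

    /* Poll the SPU's outbound mailbox status until a reply arrives. */
    while (spe_out_mbox_status(ctx) == 0)
        ;
    spe_out_mbox_read(ctx, &reply, 1);
}
```

This is the PPE-side counterpart of the SPU handshake sketched earlier in the Channels section.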
Per SPE Resources (SPU Side)
SPU direct-access resources:
– 128 x 128-bit GPRs
– External event status (channel 0): decrementer, tag status update, DMA queue vacancy, SPU incoming mailbox, signal 1 and signal 2 notification, and reservation lost events
– External event mask (channel 1)
– External event acknowledgement (channel 2)
– Signal notification 1 (channel 3) and signal notification 2 (channel 4)
– Set decrementer count (channel 7), read decrementer count (channel 8)
– 16-entry MFC command queue interface (channels 16-21)
– DMA tag group query mask (channel 22)
– Request tag status update (channel 23): immediate, conditional-ALL, conditional-ANY
– Read DMA tag group status (channel 24)
– DMA list stall-and-notify tag status (channel 25)
– DMA list stall-and-notify tag acknowledgement (channel 26)
– Lock line command status (channel 27)
– Outgoing mailbox to PU (channel 28)
SPU indirect-access resources (via EA-addressed DMA):
– System memory
– Memory-mapped I/O
– This SPU's Local Store
– Other SPUs' Local Stores
– Other SPUs' signal registers
– Atomic update (cacheable memory)
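As one example of the direct-access resources, channels 7 and 8 give each SPU a private cycle counter; a timing sketch using the SDK wrappers (what you measure between the two reads is up to you):

```c
#include <spu_mfcio.h>

/* Measure a code region with the SPU decrementer (channels 7 and 8).
 * The decrementer counts *down* at the timebase rate, so elapsed
 * time is start minus end. */
unsigned int time_region(void)
{
    unsigned int start, end;

    spu_write_decrementer(0xFFFFFFFFu);   /* Set Decrementer Count (ch 7) */
    start = spu_read_decrementer();       /* Read Decrementer Count (ch 8) */

    /* ... code under measurement ... */

    end = spu_read_decrementer();
    return start - end;                   /* elapsed timebase ticks */
}
```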
Memory Flow Controller Commands
Command parameters:
– LSA – local store address (32-bit)
– EA – effective address (32- or 64-bit)
– TS – transfer size (16 bytes to 16 KB)
– LS – DMA list size (8 bytes to 16 KB)
– TG – tag group (5-bit)
– CL – cache management / bandwidth class
DMA commands:
– Put – transfer from Local Store to EA space
– Puts – transfer and start SPU execution
– Putr – put result (arch.: scarf into L2)
– Putl – put using DMA list in Local Store
– Putrl – put result using DMA list in LS (arch.)
– Get – transfer from EA space to Local Store
– Gets – transfer and start SPU execution
– Getl – get using DMA list in Local Store
– Sndsig – send signal to SPU
Command modifiers: <f,b>
– f: embedded tag-specific fence – the command will not start until all previous commands in the same tag group have completed
– b: embedded tag-specific barrier – the command and all subsequent commands in the same tag group will not start until previous commands in the same tag group have completed
SL1 cache management commands:
– sdcrt – data cache region touch (DMA get hint)
– sdcrtst – data cache region touch for store (DMA put hint)
– sdcrz – data cache region zero
– sdcrs – data cache region store
– sdcrf – data cache region flush
Synchronization commands:
– Lock-line (atomic update) commands:
• getllar – DMA 128 bytes from EA to LS and set reservation
• putllc – conditionally DMA 128 bytes from LS to EA
• putlluc – unconditionally DMA 128 bytes from LS to EA
– barrier – all previous commands complete before subsequent commands are started
– mfcsync – results of all previous commands in the tag group are remotely visible
– mfceieio – results of all preceding Put commands in the same group visible with respect to succeeding Get commands
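The lock-line commands are enough to build atomic read-modify-write on a word in cacheable memory; a sketch using the SDK wrappers (the status-mask name MFC_PUTLLC_STATUS is as we recall it from spu_mfcio.h, and ea must be 128-byte aligned):

```c
#include <spu_mfcio.h>

/* Atomically add 'delta' to the word at EA 'ea' using getllar/putllc,
 * retrying until the conditional put succeeds. Returns the old value. */
unsigned int atomic_add(unsigned long long ea, unsigned int delta)
{
    static volatile unsigned int line[32] __attribute__((aligned(128)));
    unsigned int old;

    do {
        mfc_getllar(line, ea, 0, 0);       /* load line, set reservation */
        (void)mfc_read_atomic_status();    /* wait for the getllar */
        old = line[0];
        line[0] = old + delta;
        mfc_putllc(line, ea, 0, 0);        /* store only if still reserved */
    } while (mfc_read_atomic_status() & MFC_PUTLLC_STATUS);

    return old;
}
```

This is the same load-reserved/store-conditional pattern PowerPC uses with lwarx/stwcx., transplanted onto 128-byte DMA lock lines.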
Element Interconnect Bus
Internal Bandwidth Capability
EIB data ring for internal communication
– Four 16-byte data rings, supporting multiple transfers
– 96B/cycle peak bandwidth
– Over 100 outstanding requests
Each EIB Bus data port supports 25.6GBytes/sec* in each direction
The EIB Command Bus streams commands fast enough to support 102.4 GB/sec for coherent commands, and 204.8 GB/sec for non-coherent commands.
The EIB data rings can sustain 204.8GB/sec for certain workloads, with transient rates as high as 307.2GB/sec between bus units
* The above numbers assume a 3.2GHz core frequency – internal bandwidth scales with core frequency
Element Interconnect Bus - Data Topology
Four 16B data rings connecting 12 bus elements
– Two clockwise / Two counter-clockwise
Physically overlaps all processor elements
Central arbiter supports up to three concurrent transfers per data ring
– Two stage, dual round robin arbiter
Each element port simultaneously supports 16B in and 16B out data path
– Ring topology is transparent to element data interface
[Diagram: the central data arbiter connects via 16B links in each direction to MIC, SPE0, SPE2, SPE4, SPE6, and BIF/IOIF0 on one side, and to PPE, SPE1, SPE3, SPE5, SPE7, and IOIF1 on the other.]