Algorithms for Advanced Processor Architectures - AFAPA 12/10/21
Cell BE Architecture
Authors
Class Objectives – Things you will learn
Current microprocessor design constraints
Cell processor organization and components
– Power processor element, block diagram, PXU pipeline
– Synergistic processor element, block diagram, SXU pipeline
– Memory flow controller and MFC commands
– Element interconnect bus, command and data topology
– I/O and memory interfaces
Class Agenda
Current microprocessor design
Cell processor components
– Power processor element
– Synergistic processor element
– Memory flow controller
– Element interconnect bus
– I/O and memory interfaces
– Resource allocation management
References
Jim Kahle, Cell Broadband Engine and Cell Broadband Engine Architecture
J.J. Porta, Cell Overview
Trademarks - Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc.
Current Microprocessor Design
Three Major Limiters to Processor Performance
Frequency Wall
– Diminishing returns from deeper pipelines
Memory Wall
– Processor frequency vs. DRAM memory latency
– Latency introduced by multiple levels of memory
Power Wall
– Limits in CMOS technology
– Hard limit to acceptable system power
Where have all of the transistors gone?
Memory and memory management
– Larger and larger cache sizes needed for visible improvement
Trying to extract parallelism
– Long pipelines with lots of interlock logic
– Superscalar designs (multiple pipelines) with even more interlocks!
– Branch prediction to effectively utilize the pipelines
– Speculative execution
– Out-of-order execution
– Hardware threads
The number of transistors doing direct computation is shrinking relative to the total number of transistors.
Techniques for Better Efficiency
Chip level multi-processors
– Go back to simpler core designs and use the extra available chip area for multiple cores.
Vector Units/SIMD
– Have the same instruction execution logic operate on multiple pieces of data concurrently; allows more throughput with little increase in overhead (see the sketch after this list)
Rethink memory organization
– At the register level, hidden registers with register renaming are transparent to the end user but expensive to implement. Can a compiler do better with more architected registers in the instruction set?
– Caches are automatic but not necessarily the most efficient use of chip area.
Temporal and spatial locality mean little when processing streams of data.
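To make the SIMD point above concrete, here is a minimal sketch in C using the SPU intrinsics that appear later in this deck (spu_intrinsics.h); the function name and the alignment/size assumptions are ours, not from any particular library.

```c
#include <spu_intrinsics.h>

/* Illustrative SIMD kernel: y[i] = a*x[i] + y[i], four floats per
 * instruction. Assumes x and y are 16-byte aligned and n is a
 * multiple of 4 (hypothetical helper, not from the SDK). */
void saxpy4(float a, vector float *x, vector float *y, unsigned int n)
{
    vector float va = spu_splats(a);       /* replicate a into all 4 slots */
    for (unsigned int i = 0; i < n / 4; i++)
        y[i] = spu_madd(va, x[i], y[i]);   /* one fused multiply-add, 4 lanes */
}
```

One multiply-add instruction per iteration does the work of four scalar multiply-adds, with essentially no extra control overhead.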
Today's Microprocessor Design: Power Constrained
Transistor performance scaling
– Channel off-current & gate-oxide tunneling challenge supply-voltage scaling
– Material and structure changes required to stay on Moore’s Law
– Power per (switched) transistor decreases only slowly
Chip power consumption increases faster than before
Microprocessor performance limited by processor transistor power efficiency
We know how to design processors we cannot reasonably cool or power
Net: Increasing Performance requires Increasing Power Efficiency
Moore's Law: 2x transistor density every 18-24 months
Hardware Accelerators Concept
Streaming systems use architecture to address the power/area/performance challenge
– Processing unit with many specialized co-processor cores
– Storage hierarchy (on-chip private, on-chip shared, off-chip) explicit at the software level
– Applications coded for parallelism with localized data access to minimize off-chip accesses
Advantage demonstrated per chip and per Watt of ~100x on well-behaved applications (graphics, media, DSP, …)
Concept/Key Ideas
Augment a standard CPU with many (8-32) small, efficient hardware accelerators (SIMD or vector) with private memories, visible to applications.
Rewrite code to a highly parallel model with explicit on-chip vs. off-chip memory.
Stream data between cores on the same chip to reduce off-chip accesses.
Cell BE Solutions
Increase concurrency
– Multiple cores
– SIMD/Vector operations in a core
– Start memory movement early so that memory is available when needed
Increase efficiency
– Simpler cores devote more resources to actual computation
– Programmer managed memory is more efficient than dragging data through caches
– Large register files give the compiler more flexibility and eliminate transistors needed for register renaming
– Specialize processor cores for specific tasks
Other (Micro)Architectural Decisions
Large shared register file
Local store size tradeoffs
Dual issue, In order
Software branch prediction
Channels
Microarchitecture decisions, more so than architecture decisions, show a bias towards compute-intensive codes
The Cell BE Concept
Compatibility with 64b Power Architecture™
– Builds on and leverages IBM investment and community
Increased efficiency and performance
– Attacks on the "Power Wall"
• Non-homogeneous coherent multiprocessor
• High design frequency at a low operating voltage with advanced power management
– Attacks on the "Memory Wall"
• Streaming DMA architecture
• 3-level memory model: main storage, local storage, register files
– Attacks on the "Frequency Wall"
• Highly optimized implementation
• Large shared register files and software-controlled branching to allow deeper pipelines
Interface between user and networked world
– Image-rich information, virtual reality, shared reality
– Flexibility and security
– Multi-OS support, including RTOS / non-RTOS
Cell Synergy
Cell is not a collection of different processors, but a synergistic whole
– Operation paradigms, data formats and semantics consistent
– Share address translation and memory protection model
PPE for operating systems and program control
SPE optimized for efficient data processing
– SPEs share Cell system functions provided by Power Architecture
– MFC implements interface to memory
• Copy in/copy out to local storage
PowerPC provides system functions
– Virtualization
– Address translation and protection
– External exception handling
EIB integrates the system as a data transport hub
State of the Art: Intel Core 2 Duo
Guess where the cache is?
State of the Art: IBM Power 5
How about on this one?
Unconventional State of the Art: Cell BE
memory warehousing vs. in-time data processing
Today's x86 quad-core processors are dual-chip modules (DCMs): two dual-core dies stacked vertically & packaged together
Cell/B.E. – ½ the space & power vs. traditional approaches
On any traditional processor, the ratio of cores to cache, prediction, & related items illustrated here remains at ~50% of the chip area
Example dual core: 349 mm², 3.4 GHz @ 150 W, 2 cores, ~54 SP GFlops
Cell/B.E.: 3.2 GHz, 9 cores, ~230 SP GFlops
Cell BE Architecture
Cell Architecture is … 64b Power Architecture™
[Block diagram: Power ISA cores, each with an MMU/BIU, on a coherent bus with I/O translation and memory, including coherence/memory support – compatible with 32/64b Power Arch. applications and OSs]

Cell Architecture is … 64b Power Architecture™ plus Memory Flow Control (MFC)
[Block diagram: the coherent bus gains resource allocation groups (+RAG), the cores and their MMU/BIUs gain resource management tables (+RMT), and MFC units (MMU/DMA +RMT) with Local Store memories are added; each Local Store is aliased into the address space (LS alias)]

Cell Architecture is … 64b Power Architecture™ + MFC plus Synergistic Processors
[Block diagram: Synergistic Processor ISA cores are attached to the MFC units and their Local Stores]
Coherent Offload Model
DMA into and out of Local Store equivalent to Power core loads & stores
Governed by Power Architecture page and segment tables for translation and protection
Shared memory model
– Power Architecture compatible addressing
– MMIO capabilities for SPEs
– Local Store is mapped (aliased), allowing LS-to-LS DMA transfers
– DMA equivalents of locking loads & stores
– OS management/virtualization of SPEs
• Pre-emptive context switch is supported (but not efficient)
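As a minimal sketch of what "DMA equivalent to loads & stores" looks like from the SPU side (using the SDK's spu_mfcio.h wrappers; the buffer name and sizes are illustrative):

```c
#include <spu_mfcio.h>

/* Pull one 4 KB block from an effective address into Local Store, then
 * block until the tag group completes. Because Local Stores can be
 * aliased into the EA space, the same call also performs LS-to-LS
 * transfers when 'ea' points at another SPE's mapped Local Store. */
static char buf[4096] __attribute__((aligned(128)));

void fetch_block(unsigned long long ea)
{
    const unsigned int tag = 3;        /* any tag group id, 0..31 */
    mfc_get(buf, ea, sizeof(buf), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();         /* low-power wait for completion */
}
```

The EA is translated and protected by the same page and segment tables that govern PPE loads and stores, which is exactly the coherent-offload point above.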
Cell Broadband Engine TM :
A Heterogeneous Multi-core Architecture
1 PPE core:
– VMX unit
– 32k L1 caches
– 512k L2 cache
– 2-way SMT
8 SPE cores:
– 128-bit SIMD instruction set
– Register file: 128 x 128-bit
– Local store: 256KB
– MFC
– Isolation mode
Cell BE Block Diagram
[Block diagram: the EIB (up to 96B/cycle) connects the PPE (PXU, L1, PPU, and L2: 16B/cycle to L1, 32B/cycle to L2), eight SPEs (each an SPU/SXU with Local Store and MFC, 16B/cycle each), the MIC (dual XDR™, 16B/cycle), and the BIC (FlexIO™, 16B/cycle); 64-bit Power Architecture with VMX]
PPE Block Diagram
[Pipeline diagram: fetch is steered by branch scan and fetch control into the L1 instruction cache (backed by the L2 interface), then pre-decode, microcode, the SMT dispatch queue, decode/dependency checking, and issue. Issue feeds the VMX/FPU issue queue (VMX load/store/permute, VMX arithmetic/logic unit, FPU load/store, FPU arithmetic/logic unit) plus the load/store unit with its L1 data cache, the branch execution unit, and the fixed-point unit, all draining into FPU/VMX completion and a completion/flush stage. Thread A and Thread B alternate fetch and dispatch cycles.]
SPE Block Diagram
[Block diagram of the SPE]
– SPU Core (SXU): registers & logic
– Channel Unit: message-passing interface for I/O
– Local Store: 256KB of SRAM private to the SPU Core
– MFC (DMA Unit): transfers data between Local Store and main memory
Local Store
Never misses
– No tags, backing store, or prefetch engine
– Predictable real-time behavior
– Less wasted bandwidth
– Easier programming model to achieve very high performance
– Software-managed caching
– Large register file –or– local memory
– DMAs are fast to set up – almost like normal load instructions
– Can move data from one local store to another
No translation
– A multiuser operating system runs on the control processor
– Can be mapped as system memory – cached copies are non-coherent with respect to SPU loads/stores
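One way to picture "software-managed caching" on the Local Store is a toy one-line cache. All names here are our own (the IBM SDK ships a real software-cache library for this job), and the sketch ignores write-back and concurrency:

```c
#include <spu_mfcio.h>

#define LINE 1024                      /* software cache line size */

static char line[LINE] __attribute__((aligned(128)));
static unsigned long long cached_base = ~0ULL;   /* no line resident yet */

/* Return a Local Store pointer for byte 'ea' of system memory,
 * refilling the line by DMA only on a miss. */
char *sw_cache_lookup(unsigned long long ea)
{
    unsigned long long base = ea & ~(unsigned long long)(LINE - 1);
    if (base != cached_base) {         /* miss: refill the line */
        mfc_get(line, base, LINE, 0, 0, 0);
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();     /* wait for the refill */
        cached_base = base;
    }
    return &line[ea & (LINE - 1)];     /* hit costs no DMA at all */
}
```

Unlike a hardware cache, the hit path, line size, and replacement policy are all visible to and tunable by the programmer.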
Channels: Message Passing I/O
The interface from the SPU core to the rest of the system
Channels have capacity => allows pipelining
Instructions: read, write, read capacity
Effects appear at the SPU core interface in instruction order
Blocking reads & writes stall the SPU in a low-power wait mode
Example facilities accessible through the channel interface:
– DMA control
– Counter-timer
– Interrupt controller
– Mailboxes
– Status
– Interrupts, BISLED (branch indirect and set link if external data)
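A sketch of the channel interface in use, via the SDK's mailbox wrappers in spu_mfcio.h (the handshake itself is our example, not a prescribed protocol):

```c
#include <spu_mfcio.h>

/* SPU side of a mailbox handshake. Blocking channel accesses stall
 * the SPU in its low-power wait mode, exactly as described above. */
void handshake(void)
{
    unsigned int work = spu_read_in_mbox();   /* blocks until the PPE writes */
    /* ... process 'work' ... */
    spu_write_out_mbox(work);                 /* blocks if the outbox is full */
}

/* The "read capacity" instruction form supports polling instead: */
int work_available(void)
{
    return spu_stat_in_mbox() > 0;            /* entries waiting in the inbox */
}
```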
SPU Organization
[Detailed SPE block diagram: the register file, with result forwarding and staging, feeds the permute, load-store, floating-point, fixed-point, branch, and channel units; the instruction issue unit / instruction line buffer and the DMA unit share the single-ported 256kB Local Store SRAM (128B read, 128B write), which connects to the on-chip coherent bus. Datapath widths range from 8 to 128 bytes/cycle.]
SPE Issue
In-order
Dual issue requires alignment according to type
– The instruction from address 0 goes to pipe 0, the instruction from address 4 to pipe 1
– Instruction swap forces single issue
– Saves a pipeline stage & simplifies resource checking
9 units & 2 pipes chosen for best balance
– Pipe 0: Simple Fixed, Shift, Single Precision, Floating Integer, Byte
– Pipe 1: Permute, Local Store, Channel, Branch
SPE Pipeline Diagram

Instruction Class       Pipe   Latency (cycles)
Simple Fixed (FX)        0      2
Shift (FX)               0      4
Single Precision (FP)    0      6
Floating Integer (FP)    0      7
Byte (BYTE)              0      4
Permute (PERM)           1      4
Load (LS)                1      6
Branch                   1      4
Channel                  1      6
SPE Branch Considerations
Mispredicts cost 18 cycles
– Not bad for 11 FO4 processor
No hardware history mechanism
Branch penalty avoidance techniques
– Write frequent path as inline code
– Compute both paths and use a select instruction
– Unroll loops – also reduces dependency stalls
– Load target into branch target buffer
• Branch hint instruction
• 16 cycles ahead of the branch instruction
• Single-entry BTB
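Two of these techniques sketched in C for spu-gcc (vmax, mostly_zero, and slow_path are our illustrative names):

```c
#include <spu_intrinsics.h>

/* "Compute both paths and use select": branch-free per-lane maximum. */
vector float vmax(vector float a, vector float b)
{
    vector unsigned int gt = spu_cmpgt(a, b);  /* all-ones where a > b */
    return spu_sel(b, a, gt);                  /* take a where gt, else b */
}

/* Hint-based route: with spu-gcc, __builtin_expect steers code layout
 * and branch-hint placement so the common path falls through. */
extern int slow_path(int x);                   /* placeholder */

int mostly_zero(int x)
{
    if (__builtin_expect(x != 0, 0))           /* declare this path rare */
        return slow_path(x);
    return 0;
}
```

The select version trades a few extra instructions for the certainty of never paying the 18-cycle mispredict.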
SPE Instructions
Scalar processing supported on a data-parallel substrate
– All instructions are data parallel and operate on vectors of elements
– Scalar operation defined by instruction use, not opcode
• Vector instruction form used to perform the operation
Preferred slot paradigm
– Scalar arguments to instructions found in the "preferred slot"
– Computation can be performed in any slot
Register Scalar Data Layout
Preferred slot in bytes 0-3
– By convention for procedure interfaces
– Used by instructions expecting scalar data
• Addresses, branch conditions, generate controls for insert
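In C this paradigm is mostly invisible (the compiler keeps scalars in the preferred slot for you), but it can be made explicit with intrinsics; a small sketch:

```c
#include <spu_intrinsics.h>

/* Move a scalar into the preferred slot (element 0, bytes 0-3),
 * operate on it with an ordinary vector instruction, and read it back. */
unsigned int add_one(unsigned int s)
{
    vector unsigned int v = spu_promote(s, 0); /* scalar into slot 0 */
    v = spu_add(v, spu_splats(1u));            /* a vector add does the work */
    return spu_extract(v, 0);                  /* scalar back out of slot 0 */
}
```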
Memory Management & Mapping
System Memory Interface:
- 16 B/cycle
- 25.6 GB/s (1.6 GHz)
Communication Performance
Internal bus (Element Interconnect Bus EIB) with peak performance of 204.8 Gbytes/second
Memory Bandwidth 25.6 Gbytes/second
Impressive I/O bandwidth
– 25 Gbytes/second inbound
– 35 Gbytes/second outbound
Many outstanding memory requests
– Up to 128, typical of multi-threaded processors
Moving the Spotlight from Processor Performance to Communication Performance
Traditionally the focus is on (raw) processor performance
Emphasis is now shifting towards communication performance
Lots of (peak) processing power inside a chip (approaching a Teraflop)
– Small fraction is delivered to applications
Lots of (peak) aggregate communication bandwidth inside the chip (approaching Terabytes/sec)
– But processing units do not interact frequently
Small on chip local memories
– Little data reuse
Main memory bandwidth is the primary bottleneck
And then I/O and network bandwidth
Performance and Programmability
Programming model is already a critical issue
– And it is going to get worse
Low data-reuse increases the algorithmic complexity
Memory and Network bandwidth are key to achieve performance and simplify the programming model
Multi-core, uni-bus
SPE Internal Architecture
[Block diagram: the SPU (3.2 GHz) talks through its channel interface and Local Store (16B/cycle in and out) to the MFC (1.6 GHz), which contains the DMAC with its DMA queues, the MMU and TLB, and the BIU onto the EIB; the MIC connects the EIB to off-chip memory. Numbered callouts (1)-(7) mark the stages of a DMA transfer from issue to data delivery.]
Cell BE Communication Architecture
SPUs can only access programs and data in their local storage
SPEs have a DMA controller that performs transfers between local stores, main memory and I/O
SPUs can post a list of DMAs
SPUs can also use mailboxes and signals to perform basic synchronizations
More complex synchronization mechanisms can support atomic operations
All resources can be memory mapped
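For orientation, this is roughly what the PPE side looks like with libspe2 (the SDK's SPE runtime library); spu_kernel stands in for an embedded SPU program handle and is our placeholder:

```c
#include <stdio.h>
#include <libspe2.h>

extern spe_program_handle_t spu_kernel;   /* hypothetical embedded SPU binary */

int main(void)
{
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    if (!ctx) { perror("spe_context_create"); return 1; }

    if (spe_program_load(ctx, &spu_kernel)) {
        perror("spe_program_load");
        return 1;
    }

    unsigned int entry = SPE_DEFAULT_ENTRY;
    /* Runs the SPU program to completion on the calling thread. */
    if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0)
        perror("spe_context_run");

    spe_context_destroy(ctx);
    return 0;
}
```

In practice one PPE thread per SPE context is spawned so that all eight SPEs run concurrently.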
DMA & Multibuffering
[Timing diagram: a single thread alternates init DMA fetch, wait for data, compute, init DMA store; with two software threads, control switches to thread 2 while thread 1's DMA transfers are in flight, and back again.]
DMA commands move data between system memory & Local Storage
DMA commands are processed in parallel with software execution
– Double buffering
– Software multithreading
16 queued commands – up to 16 kB/command
Up to 16 transfers of 128 bytes each in flight on the on-chip interconnect
Richer than typical cache prefetch instructions
– Scatter-gather
– Flexible DMA command status; can achieve low-power wait
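A minimal double-buffering loop, assuming a 16 KB chunk size (the largest single DMA transfer) and hypothetical process()/nchunks application code:

```c
#include <spu_mfcio.h>

#define CHUNK 16384                        /* 16 KB: max size of one DMA */

extern void process(char *buf, unsigned int n);    /* placeholder */

/* Stream 'nchunks' chunks from EA 'ea': fetch chunk i+1 while
 * computing on chunk i, using the buffer index as the DMA tag. */
void stream(unsigned long long ea, unsigned int nchunks)
{
    static char buf[2][CHUNK] __attribute__((aligned(128)));
    unsigned int cur = 0, i;

    if (nchunks == 0) return;
    mfc_get(buf[0], ea, CHUNK, 0, 0, 0);           /* prime buffer 0 */
    for (i = 0; i < nchunks; i++) {
        unsigned int nxt = cur ^ 1;
        if (i + 1 < nchunks)                       /* prefetch next chunk */
            mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();                 /* wait for current only */
        process(buf[cur], CHUNK);                  /* overlaps the next DMA */
        cur = nxt;
    }
}
```

Because the wait is masked to the current buffer's tag group, the compute on chunk i fully overlaps the DMA for chunk i+1.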
Basic Latencies (3.2 GHz)

Latency component                Cycles   Nanoseconds
DMA issue                          10       3.125
DMA to EIB                         30       9.375
List element fetch                 10       3.125
Coherence protocol                100      31.25
Data transfer for inter-SPE put   140      43.75
TOTAL                             290      90.625

[Charts: DMA latency, DMA bandwidth, DMA batches (put)]
Element Interconnect Bus (EIB)
– 96B/cycle bandwidth
Internal Structure of the Cell BE
[Diagram: the twelve EIB elements – MIC, SPE0, SPE2, SPE4, SPE6, BIF/IOIF0 on one side and PPE, SPE1, SPE3, SPE5, SPE7, IOIF1 on the other – each attach through their own ramp controller (ramps 0-11) to the central data arbiter.]
[Charts: aggregate behavior, hot spot, and latency distribution under hot-spot]
I/O Interface:
- 16 B/cycle x 2
I/O and Memory Interfaces
I/O provides wide bandwidth
– Dual XDR™ controller (25.6GB/s @ 3.2Gbps)
– Two configurable interfaces (76.8GB/s @ 6.4Gbps)
• Configurable number of bytes
• Coherent or I/O protection
– Allows for multiple system configurations
Cell BE Processor Can Support Many Systems
Game console systems
Blades
HDTV
Home media servers
Supercomputers
[Configuration diagrams: a single Cell BE processor with XDR™ memory and two FlexIO interfaces (IOIF0, IOIF1); two Cell BE processors coherently coupled over the BIF, each with its own XDR™ memory and IOIF; and four Cell BE processors joined through a switch on the BIF.]
Is this a processor or a supercomputer on a chip?
Striking similarities with high-performance networks for supercomputers
– E.g., Quadrics Elan4
DMAs overlap computation and communication
Similar programming model
Similar synchronization algorithms
– Barriers, allreduces, scatter & gather
We can adopt the same techniques that we already use in high-performance clusters and supercomputers!
[Block diagram of the MFC, with data, snoop, and control buses: the SPU and its Local Store sit behind the DMA engine and DMA queue, the atomic facility, the MMU with RMT, and the bus interface control; MMIO and load/store translation come in from the bus side.]
Memory Flow Control System
• DMA Unit
– LS <-> LS, LS <-> system memory, LS <-> I/O transfers
– 8 PPE-side command queue entries
– 16 SPU-side command queue entries
• MMU similar to the PowerPC MMU
– 8 SLBs, 256 TLBs
– 4K, 64K, 1M, 16M page sizes
– Software/HW page table walk
– PT/SLB misses interrupt the PPE
• Atomic Cache Facility
– 4 cache lines for atomic updates
– 2 cache lines for cast-out/MMU reload
• Up to 16 outstanding DMA requests in the BIU
• Resource / Bandwidth Management Tables
– Token-based bus access management
– TLB locking
Isolation Mode Support (Security Feature)
Hardware-enforced "isolation"
– SPU and Local Store not visible (bus or JTAG)
– Small LS "untrusted area" for communication
Secure Boot
– Chip-specific key
– Decrypt/authenticate boot code
"Secure Vault" – runtime isolation support
– Isolate Load feature
– Isolate Exit feature
Per SPE Resources (PPE Side)
Problem State (each region on its own 4K physical page boundary):
– 8-entry MFC command queue interface
– DMA command and queue status
– DMA tag status query mask
– DMA tag status
– 32-bit mailbox status and data from SPU
– 32-bit mailbox status and data to SPU (4-deep FIFO)
– Signal notification 1 and 2
– SPU run control
– SPU next program counter
– SPU execution status
– Optionally mapped 256K Local Store
Privileged 1 State (OS):
– SPU privileged control
– SPU channel counter initialize
– SPU channel data initialize
– SPU signal notification control
– SPU decrementer status & control
– MFC DMA control
– MFC context save/restore registers
– SLB management registers
Privileged 2 State (OS or Hypervisor):
– SPU master run control
– SPU ID
– SPU ECC control, status, and address
– SPU 32-bit PU interrupt mailbox
– MFC interrupt mask and status
– MFC DMA privileged control
– MFC command error register
– MFC command translation fault register
– MFC SDR (PT anchor)
– MFC ACCR (address compare)
– MFC DSSR (DSI status) and MFC DAR (DSI address)
– MFC LPID (logical partition ID)
– MFC TLB management registers
– Optionally mapped 256K Local Store
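On the PPE, the problem-state mailbox registers above are normally reached through libspe2 rather than raw MMIO; a sketch of the PPE half of a mailbox handshake (ctx is assumed to be an already-running SPE context):

```c
#include <libspe2.h>

void ppe_handshake(spe_context_ptr_t ctx, unsigned int work)
{
    unsigned int reply;

    /* Write into the SPU's 4-deep inbound mailbox, blocking if full. */
    spe_in_mbox_write(ctx, &work, 1, SPE_MBOX_ALL_BLOCKING);

    /* Poll the SPU's outbound mailbox status until a reply arrives. */
    while (spe_out_mbox_status(ctx) == 0)
        ;
    spe_out_mbox_read(ctx, &reply, 1);
}
```

This is the PPE-side counterpart of the SPU handshake sketched earlier in the Channels section.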
Per SPE Resources (SPU Side)
SPU direct-access resources:
– 128 x 128-bit GPRs
– External event status (channel 0): decrementer, tag status update, DMA queue vacancy, SPU incoming mailbox, signal 1 and signal 2 notification, and reservation lost events
– External event mask (channel 1)
– External event acknowledgement (channel 2)
– Signal notification 1 (channel 3) and signal notification 2 (channel 4)
– Set decrementer count (channel 7), read decrementer count (channel 8)
– 16-entry MFC command queue interface (channels 16-21)
– DMA tag group query mask (channel 22)
– Request tag status update (channel 23): immediate, conditional-ALL, conditional-ANY
– Read DMA tag group status (channel 24)
– DMA list stall-and-notify tag status (channel 25)
– DMA list stall-and-notify tag acknowledgement (channel 26)
– Lock line command status (channel 27)
– Outgoing mailbox to PU (channel 28)
SPU indirect-access resources (via EA-addressed DMA):
– System memory
– Memory-mapped I/O
– This SPU's Local Store
– Other SPUs' Local Stores
– Other SPUs' signal registers
– Atomic update (cacheable memory)
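As one example of the direct-access resources, channels 7 and 8 give each SPU a private cycle counter; a timing sketch using the SDK wrappers (what you measure between the two reads is up to you):

```c
#include <spu_mfcio.h>

/* Measure a code region with the SPU decrementer (channels 7 and 8).
 * The decrementer counts *down* at the timebase rate, so elapsed
 * time is start minus end. */
unsigned int time_region(void)
{
    unsigned int start, end;

    spu_write_decrementer(0xFFFFFFFFu);   /* Set Decrementer Count (ch 7) */
    start = spu_read_decrementer();       /* Read Decrementer Count (ch 8) */

    /* ... code under measurement ... */

    end = spu_read_decrementer();
    return start - end;                   /* elapsed timebase ticks */
}
```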
Memory Flow Controller Commands
Command parameters:
– LSA – local store address (32-bit)
– EA – effective address (32- or 64-bit)
– TS – transfer size (16 bytes to 16 KB)
– LS – DMA list size (8 bytes to 16 KB)
– TG – tag group (5-bit)
– CL – cache management / bandwidth class
DMA commands:
– Put – transfer from Local Store to EA space
– Puts – transfer and start SPU execution
– Putr – put result (arch.: scarf into L2)
– Putl – put using DMA list in Local Store
– Putrl – put result using DMA list in LS (arch.)
– Get – transfer from EA space to Local Store
– Gets – transfer and start SPU execution
– Getl – get using DMA list in Local Store
– Sndsig – send signal to SPU
Command modifiers: <f,b>
– f: embedded tag-specific fence – the command will not start until all previous commands in the same tag group have completed
– b: embedded tag-specific barrier – the command and all subsequent commands in the same tag group will not start until previous commands in the same tag group have completed
SL1 cache management commands:
– sdcrt – data cache region touch (DMA get hint)
– sdcrtst – data cache region touch for store (DMA put hint)
– sdcrz – data cache region zero
– sdcrs – data cache region store
– sdcrf – data cache region flush
Synchronization commands:
– Lock-line (atomic update) commands:
• getllar – DMA 128 bytes from EA to LS and set reservation
• putllc – conditionally DMA 128 bytes from LS to EA
• putlluc – unconditionally DMA 128 bytes from LS to EA
– barrier – all previous commands complete before subsequent commands are started
– mfcsync – results of all previous commands in the tag group are remotely visible
– mfceieio – results of all preceding Put commands in the same group visible with respect to succeeding Get commands
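The lock-line commands are enough to build atomic read-modify-write on a word in cacheable memory; a sketch using the SDK wrappers (the status-mask name MFC_PUTLLC_STATUS is as we recall it from spu_mfcio.h, and ea must be 128-byte aligned):

```c
#include <spu_mfcio.h>

/* Atomically add 'delta' to the word at EA 'ea' using getllar/putllc,
 * retrying until the conditional put succeeds. Returns the old value. */
unsigned int atomic_add(unsigned long long ea, unsigned int delta)
{
    static volatile unsigned int line[32] __attribute__((aligned(128)));
    unsigned int old;

    do {
        mfc_getllar(line, ea, 0, 0);       /* load line, set reservation */
        (void)mfc_read_atomic_status();    /* wait for the getllar */
        old = line[0];
        line[0] = old + delta;
        mfc_putllc(line, ea, 0, 0);        /* store only if still reserved */
    } while (mfc_read_atomic_status() & MFC_PUTLLC_STATUS);

    return old;
}
```

This is the same load-reserved/store-conditional pattern PowerPC uses with lwarx/stwcx., transplanted onto 128-byte DMA lock lines.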
Element Interconnect Bus
Internal Bandwidth Capability
EIB data ring for internal communication
– Four 16-byte data rings, supporting multiple transfers
– 96B/cycle peak bandwidth
– Over 100 outstanding requests
Each EIB Bus data port supports 25.6GBytes/sec* in each direction
The EIB Command Bus streams commands fast enough to support 102.4 GB/sec for coherent commands, and 204.8 GB/sec for non-coherent commands.
The EIB data rings can sustain 204.8GB/sec for certain workloads, with transient rates as high as 307.2GB/sec between bus units
* The above numbers assume a 3.2GHz core frequency – internal bandwidth scales with core frequency
Element Interconnect Bus - Data Topology
Four 16B data rings connecting 12 bus elements
– Two clockwise / Two counter-clockwise
Physically overlaps all processor elements
Central arbiter supports up to three concurrent transfers per data ring
– Two stage, dual round robin arbiter
Each element port simultaneously supports 16B in and 16B out data path
– Ring topology is transparent to element data interface
[Diagram: the central data arbiter connects via 16B links in each direction to MIC, SPE0, SPE2, SPE4, SPE6, and BIF/IOIF0 on one side, and to PPE, SPE1, SPE3, SPE5, SPE7, and IOIF1 on the other.]