High Performance Computing - Challenges on the Road to Exascale Computing
- 1. April 2011
High Performance Computing
Challenges on the Road to Exascale Computing
H. J. Schick
IBM Germany Research & Development GmbH © 2011 IBM Corporation
- 2. Agenda
Introduction
The What and Why of High Performance Computing
Exascale Challenges
Balanced Systems
Blue Gene Architecture and Blue Gene Active Storage
Supercomputers in a Sugar Cube
- 5. Supercomputer Satisfies Need for FLOPS
FLOPS = FLoating point OPerations per Second.
– Mega = 10^6, Giga = 10^9, Tera = 10^12, Peta = 10^15, Exa = 10^18
Simulation is a major application area.
Many simulations are based on the notion of a "timestep".
– At each timestep, advance the constituent parts according to their physics or chemistry.
– Example challenge: molecular dynamics has a picosecond (10^-12 s) timescale, but many biological processes have a millisecond (10^-3 s) timescale.
• Such a simulation has 10^9 timesteps!
Each timestep requires many operations!
- 6. Simulation Pseudo-code:
// Initialize state of the 40,000 atoms.
double time = 0.0;            // simulated time in seconds
const double dt = 1.0e-12;    // one picosecond per timestep
while (time < 1.0e-3) {       // run to one millisecond
    // Calculate forces on all 40,000 atoms.
    // Calculate velocities of all atoms.
    // Advance positions of all atoms.
    time += dt;               // 10^9 iterations in total
}
// Write biology result.
- 7. Supercomputing is Capability Computing
A single instance of an application using large tightly-coupled computer resources.
– For example, a single 1000-year climate simulation.
Contrast to Capacity Computing:
– Many instances of one or more applications using large loosely-coupled computer
resources.
– For example, 1000 independent 1-year climate simulations.
– Often trivial parallelism; often suited for grid or SETI@home-style systems.
- 8. Supercomputer Versus Your Desktop
Assume a 2,000-processor supercomputer delivers a simulation result in 1 day.
Assuming memory size is not a problem, your 1-processor desktop would deliver the same result in 2,000 days, roughly 5.5 years (the arithmetic is sketched below).
So supercomputers make results available on a human timescale.
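A minimal sketch of this comparison, assuming ideal strong scaling (no parallel overhead); the processor count and runtime are the example values from the slide, not measurements:

#include <stdio.h>

// Under ideal strong scaling, one processor takes P times as long
// as P processors for the same problem.
int main(void) {
    const double supercomputer_days = 1.0;  // example value from the slide
    const int processors = 2000;            // example value from the slide
    double desktop_days = supercomputer_days * processors;
    printf("Desktop: %.0f days = %.1f years\n",
           desktop_days, desktop_days / 365.25);
    return 0;
}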
- 9. But what could you do if all objects were intelligent…
…and connected?
- 10. What could you do with unlimited computing power… for pennies?
Could you predict the path of a storm down to the square kilometer?
Could you identify another 20% of proven oil reserves without drilling one hole?
- 11. Grand Challenges
“A grand challenge is a fundamental problem in science or
engineering, with broad applications, whose solution would be
enabled by the application of high performance computing resources
that could become available in the near future.”
Computational fluid dynamics calculations for the:
• Design of hypersonic aircraft, efficient automobile bodies, and extremely quiet submarines.
• Weather forecasting for short- and long-term effects.
• Efficient recovery of oil, and for many other applications.
Electronic structure calculations for the design of new materials:
• Chemical catalysts
• Immunological agents
• Superconductors
Calculations to understand the fundamental nature of matter:
• Quantum chromodynamics
• Condensed matter theory
- 12. Enough Atoms to See Grains in Solidification of Metal
http://www-phys.llnl.gov/Research/Metals_Alloys/news.html
- 13. Building Blocks of Matter
QPACE = QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
Quarks are the constituents of matter; they interact strongly by exchanging gluons.
Particular phenomena
– Confinement
– Asymptotic freedom (Nobel Prize 2004)
Theory of strong interactions = Quantum Chromodynamics (QCD)
- 15. Extrapolating an Exaflop in 2018
Standard technology scaling will not get us there in 2018
Columns: BlueGene/L (2005) → Exaflop directly using traditionally scaled technology → Exaflop "compromise" → assumption behind the compromise guess.

Node peak performance: 5.6 GF → 20 TF → 20 TF. Same node count (64k).
Hardware concurrency per node: 2 → 8,000 → 1,600. Assumes 3.5 GHz.
System power in compute chips: 1 MW → 3.5 GW → 25 MW. Expected from technology improvement through 4 technology generations (only compute-chip power scaling; I/Os scaled the same way).
Link bandwidth (each unidirectional 3-D link): 1.4 Gbps → 5 Tbps → 1 Tbps. Not possible to maintain the bandwidth ratio.
Wires per unidirectional 3-D link: 2 → 400 → 80. A large wire count will eliminate high density and drive links onto cables, where they are 100x more expensive; assumes 20 Gbps signaling.
Pins in network per node: 24 → 5,000 → 1,000. 20 Gbps differential assumed; 20 Gbps over copper will be limited to 12 inches, so in-rack interconnects will need optics. 10 Gbps is now possible in both copper and optics.
Power in network: 100 kW → 20 MW → 4 MW. 10 mW/Gbps assumed. Now: 25 mW/Gbps for long distance (greater than 2 feet on copper), both ends, one direction; 45 mW/Gbps for optics, both ends, one direction, plus 15 mW/Gbps electrical. Future electrical power: links separately optimized for power. (The power arithmetic is sketched below.)
Memory bandwidth per node: 5.6 GB/s → 20 TB/s → 1 TB/s. Not possible to maintain external bandwidth per flop.
L2 cache per node: 4 MB → 16 GB → 500 MB. About 6-7 technology generations with expected eDRAM density improvements.
Data pins for memory per node: 128 → 40,000 → 2,000. 3.2 Gbps per pin.
Power in memory I/O (not DRAM): 12.8 kW → 80 MW → 4 MW. 10 mW/Gbps assumed; most current power is in the address bus. Future: probably about 15 mW/Gbps, maybe down to 10 mW/Gbps (2.5 mW/Gbps is C·V²·f for random data on the data pins); address power is higher.
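A minimal sketch of the power arithmetic behind the table: network power scales as aggregate bandwidth times energy per bit (the table's 10 mW/Gbps), and per-pin switching power follows the table's C·V²·f relation. The six-links-per-node count is inferred from the 3D torus, and the capacitance and voltage values are illustrative assumptions chosen to reproduce the table's 2.5 mW/Gbps figure, not values from the slides:

#include <stdio.h>

int main(void) {
    // Network power: aggregate bandwidth x energy per bit.
    // 64k nodes, six 1 Tbps torus links each (compromise column),
    // at the table's assumed 10 mW/Gbps.
    const double nodes = 65536.0;
    const double gbps_per_node = 6.0 * 1000.0;
    const double mw_per_gbps = 10.0;
    double total_mw = nodes * gbps_per_node * mw_per_gbps / 1e9; // mW -> MW
    printf("Network power: %.1f MW\n", total_mw);   // ~4 MW, as in the table

    // Per-pin dynamic power for random data: P = C * V^2 * f.
    const double c = 2.5e-12;   // pin load in farads (assumption)
    const double v = 1.0;       // signal swing in volts (assumption)
    const double f = 3.2e9;     // 3.2 Gbps per pin, from the table
    double w_per_pin = c * v * v * f;
    printf("Per-pin: %.1f mW = %.2f mW/Gbps\n",
           w_per_pin * 1e3, w_per_pin * 1e3 / 3.2);
    return 0;
}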
- 16. The Big Leap from Petaflops to Exaflops
We will hit 20 petaflops in 2011/2012; research is now beginning for ~2018 exascale (a doubling-rate projection is sketched below).
The IT/CMOS industry is trying to double performance every 2 years.
The HPC industry is trying to double performance every year.
Technology disruptions in many areas.
– BAD NEWS: Scalability of current technologies?
• Silicon Power, Interconnect, Memory, Packaging.
– GOOD NEWS: Emerging technologies?
• Memory technologies (e.g. storage class memory), 3D-chips, etc.
Exploiting exascale machines.
– Want to maximize science output per €.
– Need multiple partner applications to evaluate HW trade-offs.
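A minimal sketch of what these doubling rates imply, assuming the deck's figures (20 petaflops in 2012, HPC performance doubling every year); the starting point and rate are the slide's claims, not measurements:

#include <stdio.h>

int main(void) {
    double pf = 20.0;            // petaflops in 2012, per the slide
    for (int year = 2012; year <= 2018; year++) {
        printf("%d: %6.0f PF\n", year, pf);
        pf *= 2.0;               // HPC rate: double every year
    }
    // 20 PF * 2^6 = 1280 PF, i.e. ~1.3 exaflops by 2018,
    // consistent with the deck's ~2018 exascale target.
    return 0;
}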
- 17. Exascale Challenges – Energy
Power consumption will increase in the future!
What is the critical limit?
– JSC (Jülich Supercomputing Centre) has 5 MW, with a potential of 10 MW.
– 1 MW costs about 1 M€ per year (see the cost sketch below).
– 20 MW is expected to be the critical limit.
Are exascale systems a large-scale facility?
– The LHC uses 100 MW.
Energy efficiency
– Cooling uses a significant fraction of the energy (PUE > 1.2 today; the target is 1.0).
– Hot cooling water (40°C and more) might help.
– Free cooling: use outside air to cool the water.
– Heat recycling: use waste heat for heating, cooling, etc.
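A minimal sketch of the energy-cost arithmetic, assuming the slide's rule of thumb of 1 M€ per MW-year; the PUE values are the slide's today-versus-target figures:

#include <stdio.h>

int main(void) {
    const double it_power_mw = 20.0;      // the slide's 20 MW critical limit
    const double meur_per_mw_year = 1.0;  // slide: 1 MW ~ 1 M EUR per year
    const double pues[] = { 1.2, 1.0 };   // today vs. ideal cooling
    for (int i = 0; i < 2; i++) {
        // PUE = total facility power / IT power.
        double total_mw = it_power_mw * pues[i];
        printf("PUE %.1f: %.0f MW total, %.0f MEUR/year\n",
               pues[i], total_mw, total_mw * meur_per_mw_year);
    }
    return 0;
}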
- 18. Exascale Challenges – Resiliency
An ever-increasing number of components
– O(10000) nodes
– O(100000) DIMMs of RAM
Each component's MTBF will not increase
– Optimistic: it remains constant.
– Realistic: smaller structures and lower voltages → it decreases.
The global MTBF will decrease
– Critical limit? 1 day? 1 hour? The time to write a checkpoint! (See the sketch below.)
How to handle failures
– Try to anticipate failures via monitoring.
– Software must help to handle failures:
• checkpoints, process migration, transactional computing
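A minimal sketch of why the global MTBF collapses and how it bounds the checkpoint interval. The per-component MTBF and checkpoint time below are illustrative assumptions, and the interval rule of thumb is Young's classic approximation, not something stated in the deck:

#include <stdio.h>
#include <math.h>

int main(void) {
    // With independent failures, system MTBF = component MTBF / N.
    const double component_mtbf_h = 1.0e6;  // per-DIMM MTBF; assumption
    const double n_components = 1.0e5;      // O(100000) DIMMs, per the slide
    double system_mtbf_h = component_mtbf_h / n_components;
    printf("System MTBF: %.0f hours\n", system_mtbf_h);

    // Young's approximation for the optimal checkpoint interval:
    // t_opt = sqrt(2 * t_checkpoint * MTBF).
    const double checkpoint_h = 0.1;        // 6 min per checkpoint; assumption
    double t_opt_h = sqrt(2.0 * checkpoint_h * system_mtbf_h);
    printf("Optimal checkpoint interval: %.1f hours\n", t_opt_h);
    return 0;
}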
- 19. Exascale Challenges – Applications
Ever-increasing levels of parallelism
– Thousands of nodes, hundreds of cores, dozens of registers
– Automatic parallelization vs. explicit exposure
– How large are coherency domains?
– How many languages do we have to learn?
MPI + X is most probably not sufficient (a minimal hybrid sketch follows below)
– 1 process per core makes orchestration of processes harder
– GPUs require explicit handling today (CUDA, OpenCL)
What is the future paradigm?
– MPI + X + Y? PGAS + X (+ Y)?
– PGAS: UPC, Co-Array Fortran, X10, Chapel, Fortress, …
Which applications are inherently scalable enough at all?
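A minimal sketch of the "MPI + X" hybrid model the slide refers to, taking X = OpenMP as the common choice today; this is an illustrative pattern, not a paradigm endorsed by the deck:

// Hybrid MPI + OpenMP: one MPI rank per node and OpenMP threads
// across the node's cores, instead of one MPI process per core.
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel   // X: thread-level parallelism within the node
    {
        printf("rank %d/%d, thread %d/%d\n", rank, nranks,
               omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();        // MPI: message passing between nodes
    return 0;
}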
- 20. Balanced Systems
Example: caxpy (sketched below).

Processor        FPU throughput    Memory bandwidth   Balance
                 [flops/cycle]     [words/cycle]      [flops/word]
apeNEXT          8                 2                  4
QCDOC (MM)       2                 0.63               3.2
QCDOC (LS)       2                 2                  1
Xeon             2                 0.29               7
GPU              128 x 2           17.3 (*)           14.8
Cell/B.E. (MM)   8 x 4             1                  32
Cell/B.E. (LS)   8 x 4             8 x 4              1

(MM: main memory; LS: local store)
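A minimal sketch of the caxpy kernel used for this comparison (caxpy is the BLAS routine y ← a·x + y on complex vectors). The flops-per-word accounting in the comments is the standard way such balance figures are derived; the exact counting convention behind the table is an assumption:

#include <complex.h>
#include <stdio.h>

// caxpy: y = a*x + y on complex single-precision vectors.
// Per element: one complex multiply-add = 8 real flops against
// 6 words of memory traffic (load x, load y, store y; 2 words each),
// i.e. roughly 1.3 flops/word. Machines whose balance (see table)
// is far above that run this kernel memory-bound.
void caxpy(int n, float complex a, const float complex *x, float complex *y) {
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

int main(void) {
    float complex x[4] = { 1 + 2*I, 3 + 4*I, 5 + 6*I, 7 + 8*I };
    float complex y[4] = { 0 };
    caxpy(4, 2 + 0*I, x, y);
    printf("y[1] = %.1f + %.1fi\n", crealf(y[1]), cimagf(y[1]));
    return 0;
}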
- 22. … but are they Reliable, Available and Serviceable ???
- 24. Blue Gene/P System
Chip: 4 processors; 13.6 GF/s; 8 MB eDRAM.
Compute card: 1 chip, 20 DRAMs; 13.6 GF/s; 2.0 GB DDR2 (4.0 GB from 6/30/08).
Node card: 32 chips (4x4x2); 32 compute cards, 0-1 I/O cards; 435 GF/s; 64 (128) GB.
Rack: 32 node cards; 13.9 TF/s; 2 (4) TB.
System: 72 racks (72x32x32, cabled 8x8x16); 1 PF/s; 144 (288) TB.
(The aggregation arithmetic is sketched below.)
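A minimal sketch checking the aggregation up the packaging hierarchy, using the per-chip figures from the slide:

#include <stdio.h>

int main(void) {
    // Per-chip/per-card figures from the slide.
    double gflops = 13.6;   // GF/s per compute chip
    double mem_gb = 2.0;    // GB DDR2 per compute card

    // 32 chips per node card, 32 node cards per rack, 72 racks.
    const int factors[] = { 32, 32, 72 };
    const char *names[] = { "node card", "rack", "system" };
    for (int i = 0; i < 3; i++) {
        gflops *= factors[i];
        mem_gb *= factors[i];
        printf("%-9s: %10.1f GF/s, %8.0f GB\n", names[i], gflops, mem_gb);
    }
    // System: ~1,002,701 GF/s ~ 1 PF/s and 147,456 GB ~ 144 TB,
    // matching the slide's totals.
    return 0;
}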
- 25. Blue Gene/P Compute ASIC
(Block diagram, summarized:)
– Four PPC450 cores, each with a double FPU, 32 KB I1 / 32 KB D1 caches, a snoop filter, and a private L2.
– A multiplexing switch connects the cores to two shared 4 MB eDRAM banks (512b data + 72b ECC), each usable as L3 cache (with directory) or as on-chip memory, and to a shared SRAM.
– DMA engine, arbiter (Arb), hybrid PMU with 256x64b SRAM, and JTAG access.
– Two DDR-2 controllers with ECC driving a 13.6 GB/s DRAM bus.
– Networks: torus (6 links at 3.4 Gb/s, bidirectional), collective (3 links at 6.8 Gb/s, bidirectional), 4 global barriers or interrupts, and 10 Gbit Ethernet.
- 26. Blue Gene/P Compute Card
Compute ASIC (29 mm x 29 mm FC-PBGA) with a 2 x 16B interface to 2 or 4 GB SDRAM-DDR2.
NVRAM, monitors, decoupling, Vtt termination.
All network and I/O, plus power input.
- 27. Blue Gene/P Node Board
32 compute nodes
Optional I/O card (one of 2 possible)
Local DC-DC regulators (6 required, 8 with redundancy)
10 Gb optical link
- 28. Blue Gene Interconnection Networks
Optimized for Parallel Programming and Scalable Management
3D Torus
– Interconnects all compute nodes (65,536); the wraparound neighbor addressing is sketched below
– Virtual cut-through hardware routing
– 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
– Communications backbone for computations
– 0.7/1.4 TB/s bisection bandwidth, 67 TB/s total bandwidth
Global Collective Network
– One-to-all broadcast functionality
– Reduction operations functionality
– 2.8 Gb/s of bandwidth per link; one-way global latency 2.5 µs
– ~23 TB/s total bandwidth (64k machine)
– Interconnects all compute and I/O nodes (1024)
Low Latency Global Barrier and Interrupt
– Round trip latency 1.3 µs
Control Network
– Boot, monitoring and diagnostics
Ethernet
– Incorporated into every node ASIC
– Active in the I/O nodes (1:64)
– All external comm. (file I/O, control, user interaction, etc.)
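A minimal sketch of 3D-torus addressing: each node has six neighbors, with coordinates wrapping around in every dimension. The 72x32x32 shape is the full 72-rack system from slide 24; the linearization function is illustrative, not Blue Gene's actual routing logic:

#include <stdio.h>

#define DX 72
#define DY 32
#define DZ 32

// Linearize torus coordinates to a node index.
static int node_id(int x, int y, int z) {
    return (x * DY + y) * DZ + z;
}

int main(void) {
    int x = 0, y = 0, z = 0;   // a corner node
    // Six neighbors: +/-1 in each dimension, wrapping at the edges.
    // The wraparound is what makes the network a torus, not a mesh.
    printf("x+: %d  x-: %d\n", node_id((x + 1) % DX, y, z),
           node_id((x + DX - 1) % DX, y, z));
    printf("y+: %d  y-: %d\n", node_id(x, (y + 1) % DY, z),
           node_id(x, (y + DY - 1) % DY, z));
    printf("z+: %d  z-: %d\n", node_id(x, y, (z + 1) % DZ),
           node_id(x, y, (z + DZ - 1) % DZ));
    return 0;
}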
- 29. Source: Kirk Borne, "Data Science Challenges from Distributed Petabyte Astronomical Data Collections: Preparing for the Data Avalanche through Persistence, Parallelization, and Provenance".
- 30. Blue Gene Architecture in Review
Blue Gene is not just FLOPs … it's also the torus network, power efficiency, and dense packaging.
The focus on scalability rather than configurability gives the Blue Gene family's System-on-a-Chip architecture unprecedented scalability and reliability.
- 31. Thought Experiment: A Blue Gene Active Storage Machine
• Integrate significant storage class memory (SCM) at each node.
• For now, Flash memory, maybe similar in function to the Fusion-io ioDrive Duo.
• Future systems may deploy Phase Change Memory (PCM), Memristor, or …?
• Assume node density will drop 50%: 512 nodes/rack for embedded apps.
• Objective: balance Flash bandwidth to network all-to-all throughput.
ioDrive Duo specs, one board → 512-node system:
– SLC NAND capacity: 320 GB → 160 TB
– Read BW (64K): 1,450 MB/s → 725 GB/s
– Write BW (64K): 1,400 MB/s → 700 GB/s
– Read IOPS (4K): 270,000 → 138 Mega
– Write IOPS (4K): 257,000 → 131 Mega
– Mixed R/W IOPS (75/25 @ 4K): 207,000 → 105 Mega
• Resulting system attributes:
– Rack: 0.5 petabyte, 512 Blue Gene processors, and an embedded torus network.
– 700 GB/s I/O bandwidth to Flash, competitive with ~70 large disk controllers.
– An order of magnitude less space and power than an equivalent-performance disk solution.
– Can configure fewer disk controllers and optimize them for archival use.
• With network all-to-all throughput at 1 GB/s per node, anticipate:
– A 1 TB sort from/to persistent storage in order 10 seconds (the arithmetic is sketched below).
– 130 million IOPS per rack, 700 GB/s I/O bandwidth.
• Inherits Blue Gene attributes: scalability, reliability, power efficiency.
• Research challenges (list not exhaustive):
– Packaging: can the integration succeed?
– Resilience: storage, network, system management, middleware.
– Data management: need a clear split between on-line and archival data.
– Data structures and algorithms can take specific advantage of the BGAS architecture; no one cares that it's not x86, since the software is embedded in the storage.
• Related work:
– Gordon (UCSD): http://nvsl.ucsd.edu/papers/Asplos2009Gordon.pdf
– FAWN (CMU): http://www.cs.cmu.edu/~fawnproj/papers/fawn-sosp2009.pdf
– RAMCloud (Stanford): http://www.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf
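A minimal sketch of the back-of-envelope behind the ~10-second terabyte sort, assuming the slide's per-node figures and the common external-sort pattern of read, all-to-all shuffle, and write back; the three-phase breakdown is an illustrative assumption, not the deck's stated method:

#include <stdio.h>

int main(void) {
    const double data_gb = 1000.0;                 // 1 TB sort, per the slide
    const double nodes = 512.0;                    // nodes per rack, per the slide
    const double net_gbs = 1.0;                    // all-to-all GB/s per node, per the slide
    const double flash_read_gbs = 725.0 / nodes;   // per-node share of the rack's read BW
    const double flash_write_gbs = 700.0 / nodes;  // per-node share of the rack's write BW

    double per_node_gb = data_gb / nodes;          // ~2 GB of data per node
    // Three phases: read from Flash, shuffle all-to-all, write back.
    double t = per_node_gb / flash_read_gbs
             + per_node_gb / net_gbs
             + per_node_gb / flash_write_gbs;
    printf("Estimated sort time: ~%.0f s\n", t);   // ~5 s, i.e. order 10 seconds
    return 0;
}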
- 32. From individual transistors to the globe
Energy-consumption issues (and thermal issues) propagate through hardware levels
- 33. Energy consumption of datacenters today
Source: APC, White Paper #154 (2008).
Current air-cooled datacenters are extremely inefficient: cooling needs as much energy as the IT itself, and both end up thrown away as waste heat.
Provocative: the datacenter is a huge "heater with integrated logic".
For a 10 MW datacenter, US$ 3-5M is wasted per year.
- 34. Hot-water-cooled datacenters – towards zero emission
(Figure: micro-channel liquid coolers on the CMOS at 80°C; heat exchanger; water at 60°C; direct "waste"-heat usage, e.g. heating.)
- 35. Paradigm change: Moore’s law goes 3D
(Figure: Multi-Chip Design → System on Chip → 3D Integration, approaching the brain's synapse network; after Meindl et al., 2005.)
Benefits:
High core-cache bandwidth
Separation of technologies
Reduction in wire length, including global wire lengths (equivalent to two generations of scaling)
No impact on software development
- 36. Scalable Heat Removal by Interlayer Cooling
3D integration requires (scalable) interlayer liquid cooling.
Challenge: isolate the electrical interconnects from the liquid.
A large fraction of the energy in computers is spent on data transport; shrinking computers therefore saves energy.
(Figures: cross-section through fluid port and cavities; microchannel and pin-fin structures; through-silicon-via electrical bonding and water insulation scheme; test vehicle with fluid manifold and connection.)
- 37. On the Cube Road
Paradigm changes
– Energy will cost more than servers.
– Coolers are a million-fold larger than transistors.
Moore's law goes 3D
– Single-layer scaling slows down.
– Stacking of layers allows the extension of Moore's law.
– Approaching the functional density of the human brain.
Future computers will look different
– Liquid cooling and heat re-use, e.g. Aquasar.
– Interlayer-cooled 3D chip stacks.
– Smarter energy by bionic designs.
Energy aspects are key
– Cooling, power delivery, photonics.
– Shrink a rack to a "sugar cube": 50x efficiency.