High Performance Computing - Challenges on the Road to Exascale Computing
- 1. April 2011
High Performance Computing
Challenges on the Road to Exascale Computing
H. J. Schick
IBM Germany Research & Development GmbH © 2011 IBM Corporation
- 2. Agenda
Introduction
The What and Why of High Performance Computing
Exascale Challenges
Balanced Systems
Blue Gene Architecture and Blue Gene Active Storage
Supercomputers in a Sugar Cube
- 5. Supercomputer Satisfies Need for FLOPS
FLOPS = FLoating point OPerations per Second.
– Mega = 10^6, Giga = 10^9, Tera = 10^12, Peta = 10^15, Exa = 10^18
Simulation is a major application area.
Many simulations are based on the notion of a "timestep".
– At each timestep, advance the constituent parts according to their physics or chemistry.
– Example challenge: molecular dynamics has a picosecond (10^-12 s) timescale, but many biological processes have a millisecond (10^-3 s) timescale.
• Such a simulation has 10^9 timesteps!
Each timestep requires many operations!
- 6. Simulation Pseudo-code:
// Initialize state of the 40,000 atoms.
double time = 0.0;            // simulated time in seconds
const double dt = 1.0e-12;    // one picosecond per timestep
while (time < 1.0e-3) {       // run to one millisecond
    // Calculate forces on all 40,000 atoms.
    // Calculate velocities of all atoms.
    // Advance positions of all atoms.
    time += dt;               // 10^9 iterations in total
}
// Write biology result.
- 7. Supercomputing is Capability Computing
A single instance of an application using large tightly-coupled computer resources.
– For example, a single 1000-year climate simulation.
Contrast to Capacity Computing:
– Many instances of one or more applications using large loosely-coupled computer
resources.
– For example, 1000 independent 1-year climate simulations.
– Often trivial parallelism; often suited for grid or SETI@home-style systems.
- 8. Supercomputer Versus Your Desktop
Assume a 2,000-processor supercomputer delivers a simulation result in 1 day.
Assuming memory size is not a problem, your 1-processor desktop would deliver the same result in 2,000 days, roughly 5.5 years (the arithmetic is sketched below).
So supercomputers make results available on a human timescale.
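A minimal sketch of this comparison, assuming ideal strong scaling (no parallel overhead); the processor count and runtime are the example values from the slide, not measurements:

#include <stdio.h>

// Under ideal strong scaling, one processor takes P times as long
// as P processors for the same problem.
int main(void) {
    const double supercomputer_days = 1.0;  // example value from the slide
    const int processors = 2000;            // example value from the slide
    double desktop_days = supercomputer_days * processors;
    printf("Desktop: %.0f days = %.1f years\n",
           desktop_days, desktop_days / 365.25);
    return 0;
}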
- 9. But what could you do if all objects were intelligent…
…and connected?
- 10. What could you do with unlimited computing power… for pennies?
Could you predict the path of a storm down to the square kilometer?
Could you identify another 20% of proven oil reserves without drilling one hole?
- 11. Grand Challenges
“A grand challenge is a fundamental problem in science or
engineering, with broad applications, whose solution would be
enabled by the application of high performance computing resources
that could become available in the near future.”
Computational fluid dynamics calculations for the:
• Design of hypersonic aircraft, efficient automobile bodies, and extremely quiet submarines.
• Weather forecasting for short- and long-term effects.
• Efficient recovery of oil, and for many other applications.
Electronic structure calculations for the design of new materials:
• Chemical catalysts
• Immunological agents
• Superconductors
Calculations to understand the fundamental nature of matter:
• Quantum chromodynamics
• Condensed matter theory
- 12. Enough Atoms to See Grains in Solidification of Metal
http://www-phys.llnl.gov/Research/Metals_Alloys/news.html
- 13. Building Blocks of Matter
QPACE = QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
Quarks are the constituents of matter; they interact strongly by exchanging gluons.
Particular phenomena
– Confinement
– Asymptotic freedom (Nobel Prize 2004)
Theory of strong interactions = Quantum Chromodynamics (QCD)
- 15. Extrapolating an Exaflop in 2018
Standard technology scaling will not get us there in 2018
Columns: BlueGene/L (2005) → Exaflop directly using traditionally scaled technology → Exaflop "compromise" → assumption behind the compromise guess.

Node peak performance: 5.6 GF → 20 TF → 20 TF. Same node count (64k).
Hardware concurrency per node: 2 → 8,000 → 1,600. Assumes 3.5 GHz.
System power in compute chips: 1 MW → 3.5 GW → 25 MW. Expected from technology improvement through 4 technology generations (only compute-chip power scaling; I/Os scaled the same way).
Link bandwidth (each unidirectional 3-D link): 1.4 Gbps → 5 Tbps → 1 Tbps. Not possible to maintain the bandwidth ratio.
Wires per unidirectional 3-D link: 2 → 400 → 80. A large wire count will eliminate high density and drive links onto cables, where they are 100x more expensive; assumes 20 Gbps signaling.
Pins in network per node: 24 → 5,000 → 1,000. 20 Gbps differential assumed; 20 Gbps over copper will be limited to 12 inches, so in-rack interconnects will need optics. 10 Gbps is now possible in both copper and optics.
Power in network: 100 kW → 20 MW → 4 MW. 10 mW/Gbps assumed. Now: 25 mW/Gbps for long distance (greater than 2 feet on copper), both ends, one direction; 45 mW/Gbps for optics, both ends, one direction, plus 15 mW/Gbps electrical. Future electrical power: links separately optimized for power. (The power arithmetic is sketched below.)
Memory bandwidth per node: 5.6 GB/s → 20 TB/s → 1 TB/s. Not possible to maintain external bandwidth per flop.
L2 cache per node: 4 MB → 16 GB → 500 MB. About 6-7 technology generations with expected eDRAM density improvements.
Data pins for memory per node: 128 → 40,000 → 2,000. 3.2 Gbps per pin.
Power in memory I/O (not DRAM): 12.8 kW → 80 MW → 4 MW. 10 mW/Gbps assumed; most current power is in the address bus. Future: probably about 15 mW/Gbps, maybe down to 10 mW/Gbps (2.5 mW/Gbps is C·V²·f for random data on the data pins); address power is higher.
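A minimal sketch of the power arithmetic behind the table: network power scales as aggregate bandwidth times energy per bit (the table's 10 mW/Gbps), and per-pin switching power follows the table's C·V²·f relation. The six-links-per-node count is inferred from the 3D torus, and the capacitance and voltage values are illustrative assumptions chosen to reproduce the table's 2.5 mW/Gbps figure, not values from the slides:

#include <stdio.h>

int main(void) {
    // Network power: aggregate bandwidth x energy per bit.
    // 64k nodes, six 1 Tbps torus links each (compromise column),
    // at the table's assumed 10 mW/Gbps.
    const double nodes = 65536.0;
    const double gbps_per_node = 6.0 * 1000.0;
    const double mw_per_gbps = 10.0;
    double total_mw = nodes * gbps_per_node * mw_per_gbps / 1e9; // mW -> MW
    printf("Network power: %.1f MW\n", total_mw);   // ~4 MW, as in the table

    // Per-pin dynamic power for random data: P = C * V^2 * f.
    const double c = 2.5e-12;   // pin load in farads (assumption)
    const double v = 1.0;       // signal swing in volts (assumption)
    const double f = 3.2e9;     // 3.2 Gbps per pin, from the table
    double w_per_pin = c * v * v * f;
    printf("Per-pin: %.1f mW = %.2f mW/Gbps\n",
           w_per_pin * 1e3, w_per_pin * 1e3 / 3.2);
    return 0;
}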
- 16. The Big Leap from Petaflops to Exaflops
We will hit 20 petaflops in 2011/2012; research is now beginning for ~2018 exascale (a doubling-rate projection is sketched below).
The IT/CMOS industry is trying to double performance every 2 years.
The HPC industry is trying to double performance every year.
Technology disruptions in many areas.
– BAD NEWS: Scalability of current technologies?
• Silicon Power, Interconnect, Memory, Packaging.
– GOOD NEWS: Emerging technologies?
• Memory technologies (e.g. storage class memory), 3D-chips, etc.
Exploiting exascale machines.
– Want to maximize science output per €.
– Need multiple partner applications to evaluate HW trade-offs.
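A minimal sketch of what these doubling rates imply, assuming the deck's figures (20 petaflops in 2012, HPC performance doubling every year); the starting point and rate are the slide's claims, not measurements:

#include <stdio.h>

int main(void) {
    double pf = 20.0;            // petaflops in 2012, per the slide
    for (int year = 2012; year <= 2018; year++) {
        printf("%d: %6.0f PF\n", year, pf);
        pf *= 2.0;               // HPC rate: double every year
    }
    // 20 PF * 2^6 = 1280 PF, i.e. ~1.3 exaflops by 2018,
    // consistent with the deck's ~2018 exascale target.
    return 0;
}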
- 17. Exascale Challenges – Energy
Power consumption will increase in the future!
What is the critical limit?
– JSC (Jülich Supercomputing Centre) has 5 MW, with a potential of 10 MW.
– 1 MW costs about 1 M€ per year (see the cost sketch below).
– 20 MW is expected to be the critical limit.
Are exascale systems a large-scale facility?
– The LHC uses 100 MW.
Energy efficiency
– Cooling uses a significant fraction of the energy (PUE > 1.2 today; the target is 1.0).
– Hot cooling water (40°C and more) might help.
– Free cooling: use outside air to cool the water.
– Heat recycling: use waste heat for heating, cooling, etc.
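A minimal sketch of the energy-cost arithmetic, assuming the slide's rule of thumb of 1 M€ per MW-year; the PUE values are the slide's today-versus-target figures:

#include <stdio.h>

int main(void) {
    const double it_power_mw = 20.0;      // the slide's 20 MW critical limit
    const double meur_per_mw_year = 1.0;  // slide: 1 MW ~ 1 M EUR per year
    const double pues[] = { 1.2, 1.0 };   // today vs. ideal cooling
    for (int i = 0; i < 2; i++) {
        // PUE = total facility power / IT power.
        double total_mw = it_power_mw * pues[i];
        printf("PUE %.1f: %.0f MW total, %.0f MEUR/year\n",
               pues[i], total_mw, total_mw * meur_per_mw_year);
    }
    return 0;
}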
- 18. Exascale Challenges – Resiliency
An ever-increasing number of components
– O(10000) nodes
– O(100000) DIMMs of RAM
Each component's MTBF will not increase
– Optimistic: it remains constant.
– Realistic: smaller structures and lower voltages → it decreases.
The global MTBF will decrease
– Critical limit? 1 day? 1 hour? The time to write a checkpoint! (See the sketch below.)
How to handle failures
– Try to anticipate failures via monitoring.
– Software must help to handle failures:
• checkpoints, process migration, transactional computing
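A minimal sketch of why the global MTBF collapses and how it bounds the checkpoint interval. The per-component MTBF and checkpoint time below are illustrative assumptions, and the interval rule of thumb is Young's classic approximation, not something stated in the deck:

#include <stdio.h>
#include <math.h>

int main(void) {
    // With independent failures, system MTBF = component MTBF / N.
    const double component_mtbf_h = 1.0e6;  // per-DIMM MTBF; assumption
    const double n_components = 1.0e5;      // O(100000) DIMMs, per the slide
    double system_mtbf_h = component_mtbf_h / n_components;
    printf("System MTBF: %.0f hours\n", system_mtbf_h);

    // Young's approximation for the optimal checkpoint interval:
    // t_opt = sqrt(2 * t_checkpoint * MTBF).
    const double checkpoint_h = 0.1;        // 6 min per checkpoint; assumption
    double t_opt_h = sqrt(2.0 * checkpoint_h * system_mtbf_h);
    printf("Optimal checkpoint interval: %.1f hours\n", t_opt_h);
    return 0;
}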
- 19. Exascale Challenges – Applications
Ever-increasing levels of parallelism
– Thousands of nodes, hundreds of cores, dozens of registers
– Automatic parallelization vs. explicit exposure
– How large are coherency domains?
– How many languages do we have to learn?
MPI + X is most probably not sufficient (a minimal hybrid sketch follows below)
– 1 process per core makes orchestration of processes harder
– GPUs require explicit handling today (CUDA, OpenCL)
What is the future paradigm?
– MPI + X + Y? PGAS + X (+ Y)?
– PGAS: UPC, Co-Array Fortran, X10, Chapel, Fortress, …
Which applications are inherently scalable enough at all?
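A minimal sketch of the "MPI + X" hybrid model the slide refers to, taking X = OpenMP as the common choice today; this is an illustrative pattern, not a paradigm endorsed by the deck:

// Hybrid MPI + OpenMP: one MPI rank per node and OpenMP threads
// across the node's cores, instead of one MPI process per core.
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel   // X: thread-level parallelism within the node
    {
        printf("rank %d/%d, thread %d/%d\n", rank, nranks,
               omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();        // MPI: message passing between nodes
    return 0;
}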
- 20. Balanced Systems
Example: caxpy (sketched below).

Processor        FPU throughput    Memory bandwidth   Balance
                 [flops/cycle]     [words/cycle]      [flops/word]
apeNEXT          8                 2                  4
QCDOC (MM)       2                 0.63               3.2
QCDOC (LS)       2                 2                  1
Xeon             2                 0.29               7
GPU              128 x 2           17.3 (*)           14.8
Cell/B.E. (MM)   8 x 4             1                  32
Cell/B.E. (LS)   8 x 4             8 x 4              1

(MM: main memory; LS: local store)
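A minimal sketch of the caxpy kernel used for this comparison (caxpy is the BLAS routine y ← a·x + y on complex vectors). The flops-per-word accounting in the comments is the standard way such balance figures are derived; the exact counting convention behind the table is an assumption:

#include <complex.h>
#include <stdio.h>

// caxpy: y = a*x + y on complex single-precision vectors.
// Per element: one complex multiply-add = 8 real flops against
// 6 words of memory traffic (load x, load y, store y; 2 words each),
// i.e. roughly 1.3 flops/word. Machines whose balance (see table)
// is far above that run this kernel memory-bound.
void caxpy(int n, float complex a, const float complex *x, float complex *y) {
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

int main(void) {
    float complex x[4] = { 1 + 2*I, 3 + 4*I, 5 + 6*I, 7 + 8*I };
    float complex y[4] = { 0 };
    caxpy(4, 2 + 0*I, x, y);
    printf("y[1] = %.1f + %.1fi\n", crealf(y[1]), cimagf(y[1]));
    return 0;
}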
- 22. … but are they Reliable, Available and Serviceable ???
- 24. Blue Gene/P System
Chip: 4 processors; 13.6 GF/s; 8 MB eDRAM.
Compute card: 1 chip, 20 DRAMs; 13.6 GF/s; 2.0 GB DDR2 (4.0 GB from 6/30/08).
Node card: 32 chips (4x4x2); 32 compute cards, 0-1 I/O cards; 435 GF/s; 64 (128) GB.
Rack: 32 node cards; 13.9 TF/s; 2 (4) TB.
System: 72 racks (72x32x32, cabled 8x8x16); 1 PF/s; 144 (288) TB.
(The aggregation arithmetic is sketched below.)
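A minimal sketch checking the aggregation up the packaging hierarchy, using the per-chip figures from the slide:

#include <stdio.h>

int main(void) {
    // Per-chip/per-card figures from the slide.
    double gflops = 13.6;   // GF/s per compute chip
    double mem_gb = 2.0;    // GB DDR2 per compute card

    // 32 chips per node card, 32 node cards per rack, 72 racks.
    const int factors[] = { 32, 32, 72 };
    const char *names[] = { "node card", "rack", "system" };
    for (int i = 0; i < 3; i++) {
        gflops *= factors[i];
        mem_gb *= factors[i];
        printf("%-9s: %10.1f GF/s, %8.0f GB\n", names[i], gflops, mem_gb);
    }
    // System: ~1,002,701 GF/s ~ 1 PF/s and 147,456 GB ~ 144 TB,
    // matching the slide's totals.
    return 0;
}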
- 25. Blue Gene/P Compute ASIC
(Block diagram, summarized:)
– Four PPC450 cores, each with a double FPU, 32 KB I1 / 32 KB D1 caches, a snoop filter, and a private L2.
– A multiplexing switch connects the cores to two shared 4 MB eDRAM banks (512b data + 72b ECC), each usable as L3 cache (with directory) or as on-chip memory, and to a shared SRAM.
– DMA engine, arbiter (Arb), hybrid PMU with 256x64b SRAM, and JTAG access.
– Two DDR-2 controllers with ECC driving a 13.6 GB/s DRAM bus.
– Networks: torus (6 links at 3.4 Gb/s, bidirectional), collective (3 links at 6.8 Gb/s, bidirectional), 4 global barriers or interrupts, and 10 Gbit Ethernet.
- 26. Blue Gene/P Compute Card
Compute ASIC (29 mm x 29 mm FC-PBGA) with a 2 x 16B interface to 2 or 4 GB SDRAM-DDR2.
NVRAM, monitors, decoupling, Vtt termination.
All network and I/O, plus power input.
- 27. Blue Gene/P Node Board
32 compute nodes
Optional I/O card (one of 2 possible)
Local DC-DC regulators (6 required, 8 with redundancy)
10 Gb optical link
- 28. Blue Gene Interconnection Networks
Optimized for Parallel Programming and Scalable Management
3D Torus
– Interconnects all compute nodes (65,536); the wraparound neighbor addressing is sketched below
– Virtual cut-through hardware routing
– 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
– Communications backbone for computations
– 0.7/1.4 TB/s bisection bandwidth, 67 TB/s total bandwidth
Global Collective Network
– One-to-all broadcast functionality
– Reduction operations functionality
– 2.8 Gb/s of bandwidth per link; one-way global latency 2.5 µs
– ~23 TB/s total bandwidth (64k machine)
– Interconnects all compute and I/O nodes (1024)
Low Latency Global Barrier and Interrupt
– Round trip latency 1.3 µs
Control Network
– Boot, monitoring and diagnostics
Ethernet
– Incorporated into every node ASIC
– Active in the I/O nodes (1:64)
– All external comm. (file I/O, control, user interaction, etc.)
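A minimal sketch of 3D-torus addressing: each node has six neighbors, with coordinates wrapping around in every dimension. The 72x32x32 shape is the full 72-rack system from slide 24; the linearization function is illustrative, not Blue Gene's actual routing logic:

#include <stdio.h>

#define DX 72
#define DY 32
#define DZ 32

// Linearize torus coordinates to a node index.
static int node_id(int x, int y, int z) {
    return (x * DY + y) * DZ + z;
}

int main(void) {
    int x = 0, y = 0, z = 0;   // a corner node
    // Six neighbors: +/-1 in each dimension, wrapping at the edges.
    // The wraparound is what makes the network a torus, not a mesh.
    printf("x+: %d  x-: %d\n", node_id((x + 1) % DX, y, z),
           node_id((x + DX - 1) % DX, y, z));
    printf("y+: %d  y-: %d\n", node_id(x, (y + 1) % DY, z),
           node_id(x, (y + DY - 1) % DY, z));
    printf("z+: %d  z-: %d\n", node_id(x, y, (z + 1) % DZ),
           node_id(x, y, (z + DZ - 1) % DZ));
    return 0;
}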
- 29. Source: Kirk Borne, "Data Science Challenges from Distributed Petabyte Astronomical Data Collections: Preparing for the Data Avalanche through Persistence, Parallelization, and Provenance".
- 30. Blue Gene Architecture in Review
Blue Gene is not just FLOPs … it's also the torus network, power efficiency, and dense packaging.
The focus on scalability rather than configurability gives the Blue Gene family's System-on-a-Chip architecture unprecedented scalability and reliability.
- 31. Thought Experiment: A Blue Gene Active Storage Machine
• Integrate significant storage class memory (SCM) at each node.
• For now, Flash memory, maybe similar in function to the Fusion-io ioDrive Duo.
• Future systems may deploy Phase Change Memory (PCM), Memristor, or …?
• Assume node density will drop 50%: 512 nodes/rack for embedded apps.
• Objective: balance Flash bandwidth to network all-to-all throughput.
ioDrive Duo specs, one board → 512-node system:
– SLC NAND capacity: 320 GB → 160 TB
– Read BW (64K): 1,450 MB/s → 725 GB/s
– Write BW (64K): 1,400 MB/s → 700 GB/s
– Read IOPS (4K): 270,000 → 138 Mega
– Write IOPS (4K): 257,000 → 131 Mega
– Mixed R/W IOPS (75/25 @ 4K): 207,000 → 105 Mega
• Resulting system attributes:
– Rack: 0.5 petabyte, 512 Blue Gene processors, and an embedded torus network.
– 700 GB/s I/O bandwidth to Flash, competitive with ~70 large disk controllers.
– An order of magnitude less space and power than an equivalent-performance disk solution.
– Can configure fewer disk controllers and optimize them for archival use.
• With network all-to-all throughput at 1 GB/s per node, anticipate:
– A 1 TB sort from/to persistent storage in order 10 seconds (the arithmetic is sketched below).
– 130 million IOPS per rack, 700 GB/s I/O bandwidth.
• Inherits Blue Gene attributes: scalability, reliability, power efficiency.
• Research challenges (list not exhaustive):
– Packaging: can the integration succeed?
– Resilience: storage, network, system management, middleware.
– Data management: need a clear split between on-line and archival data.
– Data structures and algorithms can take specific advantage of the BGAS architecture; no one cares that it's not x86, since the software is embedded in the storage.
• Related work:
– Gordon (UCSD): http://nvsl.ucsd.edu/papers/Asplos2009Gordon.pdf
– FAWN (CMU): http://www.cs.cmu.edu/~fawnproj/papers/fawn-sosp2009.pdf
– RAMCloud (Stanford): http://www.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf
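A minimal sketch of the back-of-envelope behind the ~10-second terabyte sort, assuming the slide's per-node figures and the common external-sort pattern of read, all-to-all shuffle, and write back; the three-phase breakdown is an illustrative assumption, not the deck's stated method:

#include <stdio.h>

int main(void) {
    const double data_gb = 1000.0;                 // 1 TB sort, per the slide
    const double nodes = 512.0;                    // nodes per rack, per the slide
    const double net_gbs = 1.0;                    // all-to-all GB/s per node, per the slide
    const double flash_read_gbs = 725.0 / nodes;   // per-node share of the rack's read BW
    const double flash_write_gbs = 700.0 / nodes;  // per-node share of the rack's write BW

    double per_node_gb = data_gb / nodes;          // ~2 GB of data per node
    // Three phases: read from Flash, shuffle all-to-all, write back.
    double t = per_node_gb / flash_read_gbs
             + per_node_gb / net_gbs
             + per_node_gb / flash_write_gbs;
    printf("Estimated sort time: ~%.0f s\n", t);   // ~5 s, i.e. order 10 seconds
    return 0;
}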
- 32. From individual transistors to the globe
Energy-consumption issues (and thermal issues) propagate through hardware levels
- 33. Energy consumption of datacenters today
Source: APC, White Paper #154 (2008).
Current air-cooled datacenters are extremely inefficient: cooling needs as much energy as the IT itself, and both end up thrown away as waste heat.
Provocative: the datacenter is a huge "heater with integrated logic".
For a 10 MW datacenter, US$ 3-5M is wasted per year.
- 34. Hot-water-cooled datacenters – towards zero emission
(Figure: micro-channel liquid coolers on the CMOS at 80°C; heat exchanger; water at 60°C; direct "waste"-heat usage, e.g. heating.)
- 35. Paradigm change: Moore’s law goes 3D
(Figure: Multi-Chip Design → System on Chip → 3D Integration, approaching the brain's synapse network; after Meindl et al., 2005.)
Benefits:
High core-cache bandwidth
Separation of technologies
Reduction in wire length, including global wire lengths (equivalent to two generations of scaling)
No impact on software development
- 36. Scalable Heat Removal by Interlayer Cooling
3D integration requires (scalable) interlayer liquid cooling.
Challenge: isolate the electrical interconnects from the liquid.
A large fraction of the energy in computers is spent on data transport; shrinking computers therefore saves energy.
(Figures: cross-section through fluid port and cavities; microchannel and pin-fin structures; through-silicon-via electrical bonding and water insulation scheme; test vehicle with fluid manifold and connection.)
- 37. On the Cube Road
Paradigm changes
– Energy will cost more than servers.
– Coolers are a million-fold larger than transistors.
Moore's law goes 3D
– Single-layer scaling slows down.
– Stacking of layers allows the extension of Moore's law.
– Approaching the functional density of the human brain.
Future computers will look different
– Liquid cooling and heat re-use, e.g. Aquasar.
– Interlayer-cooled 3D chip stacks.
– Smarter energy by bionic designs.
Energy aspects are key
– Cooling, power delivery, photonics.
– Shrink a rack to a "sugar cube": 50x efficiency.