Efficient Scheduling of OpenMP and OpenCL Workloads
Getting the most out of your APU
Objective
! software has a long life-span that exceeds the life-span of hardware
! software is very expensive to write and maintain
! next generation hardware also needs to run legacy software
! Example: IWAVE
! procedural C-code
! no object orientation
! tight integration between data structures and functions
! What do I mean by efficient scheduling?
! find ways to utilize GPU cores for code blocks
! find ways to utilize all CPU cores and GPU units at the same time


| OpenCL and OpenMP Workloads on Accelerated Processing Units |
Historical Context
GPU Compute Timeline

[Figure: timeline of GPU compute technologies with the years 2002, 2008, 2010 and 2012 marked; labelled entries include CUDA, Aparapi and C++ AMP]
Accelerator Challenges
Technology Accessibility and Performance
[Figure: technologies plotted by performance against ease of use: OpenCL & CUDA, CPU multithread, CPU single thread]
APU Opportunities
One Die - Two Computational Devices

Metric                  CPU                      APU
Memory Size             large                    small
Memory Bandwidth        small                    large
Parallelism             small                    large
General Purpose         yes                      no
Performance             application dependent    application dependent
Performance-per-Watt    application dependent    application dependent
Programming             Traditional              OpenCL
APU Opportunities

Performance and Performance-per-Watt
! Example: Luxmark OpenCL Benchmark

Metric               CPU     GPU     APU
Performance [Pts]    170     197     316
Power [W]            50      37      58
PPW [Pts/W]          3.4     5.3     5.4
Combined [Pts²/W]    578     1049    1722

! GPU has best performance-per-Watt
! Best performance by using the APU
! Similar CPU and GPU performance
! APU provides outstanding value

Luxmark OpenCL Benchmark
Ubuntu 12.10 x86_64
4 Piledriver CPU cores @ 2.5GHz
6 GPU Compute Units @ 720MHz
16GB DDR3 1600MHz
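The derived Luxmark metrics on this slide follow from the two measured ones. A minimal sketch in C, assuming PPW is points divided by Watts and the combined metric is points squared divided by Watts; the `measurement` type and both function names are illustrative and do not come from the slides:

```c
#include <assert.h>

/* One benchmark run: score in Luxmark points, power draw in Watts. */
typedef struct {
    double points;
    double watts;
} measurement;

/* Performance-per-Watt in Pts/W. */
double perf_per_watt(measurement m) {
    assert(m.watts > 0.0);
    return m.points / m.watts;
}

/* Combined metric in Pts^2/W: performance weighted by its efficiency. */
double combined_score(measurement m) {
    return m.points * perf_per_watt(m);
}
```

For the highest-scoring device, 316 Pts at 58 W gives roughly 5.4 Pts/W and roughly 1722 Pts²/W.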
Example: Luxmark Renderer

Performance and Performance-per-Watt

[Figure: bar chart of Luxmark performance and performance-per-Watt, annotated +64% and +81%]

Luxmark OpenCL Benchmark
Render “Sala” Scene
Ubuntu 12.10 x86_64
4 Piledriver cores @ 2.5GHz
6 GPU CUs @ 720MHz
16GB DDR3 1600MHz
Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE
! Know the problem you are trying to solve.
! staggered rectangular grid in 3D
! coupled first order PDE
! scalar pressure field p
! vector velocity field v = {vx, vy, vz}
! source term g
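For reference, a common first-order pressure-velocity form of the acoustic wave equation is sketched below. The bulk modulus $\kappa$ and the density $\rho$ are symbols assumed here; the slides name only $p$, $\mathbf{v}$ and $g$, and the exact scaling used by IWAVE may differ.

```latex
\begin{aligned}
\frac{\partial p}{\partial t} &= -\kappa\, \nabla \cdot \mathbf{v} + g \\
\rho\, \frac{\partial \mathbf{v}}{\partial t} &= -\nabla p
\end{aligned}
```

In this form the pressure update needs all three velocity components, while each velocity component needs only the pressure field.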

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

while(…) {                                // main simulation loop
  sgn_ts3d_210_p012_OpenMP(dom, pars);    // calculate pressure field
  sgn_ts3d_210_v0_OpenMP(dom, pars);      // calculate velocity x-axis
  sgn_ts3d_210_v1_OpenMP(dom, pars);      // calculate velocity y-axis
  sgn_ts3d_210_v2_OpenMP(dom, pars);      // calculate velocity z-axis
  …
}

[Figure: timeline. The OpenMP threads run p, vx, vy and vz one after another.]
Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! Measure the initial performance.
! pressure and velocity field simulated using OpenMP
! average time T[ms] per iteration
! OpenMP linear scaling with threads
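A baseline measurement of this kind can be taken with a small timing harness. The sketch below is an assumption about how such a measurement might look, not the harness used for the slides: `average_step_ms`, `step_fn` and `dummy_step` are hypothetical names, and wall-clock time is read with C11 `timespec_get` so that multi-threaded steps are timed correctly.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Wall-clock time in milliseconds (C11 timespec_get). */
static double now_ms(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

/* One simulation step; state is whatever the step needs (hypothetical). */
typedef void (*step_fn)(void *state);

/* Average wall-clock time T[ms] per iteration over the given number of steps. */
double average_step_ms(step_fn step, void *state, int iterations) {
    assert(step != NULL && iterations > 0);
    double start = now_ms();
    for (int i = 0; i < iterations; ++i) {
        step(state);
    }
    return (now_ms() - start) / iterations;
}

/* Placeholder step that only counts how often it was called. */
static void dummy_step(void *state) {
    ++*(long *)state;
}
```

Running the same harness with different thread counts gives the scaling curve to compare later variants against.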

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! find computational blocks
! understand dependencies between blocks
! identify sequential and parallel parts

[Figure: "Causality" dependency diagram of the OpenMP blocks p, vx, vy and vz, shown with a timeline of the same blocks on the OpenMP threads]
Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

while(…) {                                // main simulation loop
  sgn_ts3d_210_p012_OpenMP(dom, pars);    // calculate pressure field p
  sgn_ts3d_210_v0_OpenCL(dom, pars);      // calculate velocity x-axis
  sgn_ts3d_210_v1_OpenMP(dom, pars);      // calculate velocity y-axis
  sgn_ts3d_210_v2_OpenMP(dom, pars);      // calculate velocity z-axis
  …
}

[Figure: timeline. The GPU runs OpenCL vx while the OpenMP threads are IDLE; the OpenMP threads run p, vy and vz.]
Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! use the GPU to compute vx
! the CPU is idle while the GPU is running
! 42% improvement for 1 thread
! 25% improvement for 2 threads
! 9% improvement for 4 threads

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE
while(…) {                                              // main simulation loop
  sgn_ts3d_210_p012_OpenMP(dom, pars);                  // calculate pressure field p

  int num_threads = atoi(getenv("OMP_NUM_THREADS"));    // save the current number of OpenMP threads (assumes OMP_NUM_THREADS is set)
  omp_set_num_threads(2);                               // restrict the number of OpenMP threads to 2
  omp_set_nested(1);                                    // allow nested OpenMP threads

  #pragma omp parallel shared(…) private(…)             // start 2 OpenMP threads
  {
    switch ( omp_get_thread_num() ) {
      case 0:
        sgn_ts3d_210_v0_OpenCL(dom, pars);              // calculate velocity x-axis using OpenCL
        break;
      case 1:
        omp_set_num_threads(num_threads);               // increase number of OpenMP threads back
        sgn_ts3d_210_v1_OpenMP(dom, pars);              // calculate velocity y-axis
        sgn_ts3d_210_v2_OpenMP(dom, pars);              // calculate velocity z-axis
        break;
      default:
        break;
    }
  }                                                     // close OpenMP pragma
  …
}                                                       // close simulation while

[Figure: timeline. The GPU runs OpenCL vx while the OpenMP threads run vy and vz; p runs on the OpenMP threads beforehand.]
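The same overlap can also be expressed with OpenMP sections instead of a switch on the thread number. This is an alternative sketch, not the method shown on the slide: the three functions are hypothetical stand-ins for the OpenCL and OpenMP kernels, and the nested thread-count handling from the slide is left out.

```c
#include <assert.h>

/* Hypothetical stand-ins for the slide's kernels; each only records that it ran. */
static int ran_vx, ran_vy, ran_vz;
static void velocity_x_gpu(void) { ran_vx = 1; }  /* would launch the OpenCL kernel */
static void velocity_y_cpu(void) { ran_vy = 1; }  /* would run the OpenMP kernel */
static void velocity_z_cpu(void) { ran_vz = 1; }  /* would run the OpenMP kernel */

/* One velocity update: vx in one section, vy followed by vz in the other. */
void velocity_step(void) {
    ran_vx = ran_vy = ran_vz = 0;

    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        velocity_x_gpu();

        #pragma omp section
        {
            velocity_y_cpu();
            velocity_z_cpu();
        }
    }

    /* The implicit barrier at the end of the construct means all three
       updates have finished before the next pressure step can start. */
    assert(ran_vx && ran_vy && ran_vz);
}
```

Compiled without OpenMP support the pragmas are ignored and the three calls simply run in order, so the sketch stays correct either way.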

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! overlap vx and vy
! CPU not idle anymore
! 50% improvement for 1 thread
! 40% improvement for 2 threads
! 38% improvement for 4 threads
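Why overlapping helps can be seen from a simple time model: without overlap the block times add up, with overlap only the longer of the GPU branch and the CPU branch counts. The sketch below assumes exactly that model; the `step_times` type and the numbers in the usage note are hypothetical and are not measurements from the slides.

```c
#include <assert.h>

/* Per-iteration block times in milliseconds (hypothetical values). */
typedef struct {
    double p_ms;       /* pressure on the CPU */
    double vx_gpu_ms;  /* velocity x on the GPU, including its I/O */
    double vy_ms;      /* velocity y on the CPU */
    double vz_ms;      /* velocity z on the CPU */
} step_times;

/* GPU offload without overlap: the CPU waits while the GPU computes vx. */
double sequential_ms(step_times t) {
    assert(t.p_ms >= 0 && t.vx_gpu_ms >= 0 && t.vy_ms >= 0 && t.vz_ms >= 0);
    return t.p_ms + t.vx_gpu_ms + t.vy_ms + t.vz_ms;
}

/* GPU offload with overlap: vx on the GPU runs alongside vy and vz on the CPU. */
double overlapped_ms(step_times t) {
    assert(t.p_ms >= 0 && t.vx_gpu_ms >= 0 && t.vy_ms >= 0 && t.vz_ms >= 0);
    double cpu_branch = t.vy_ms + t.vz_ms;
    double longer = t.vx_gpu_ms > cpu_branch ? t.vx_gpu_ms : cpu_branch;
    return t.p_ms + longer;
}
```

With made-up times of 10, 30, 12 and 8 ms the step drops from 60 ms to 40 ms. The gain is largest when the two branches take about the same time.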

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

while(…) {                                // main simulation loop
  sgn_ts3d_210_p012_OpenCL(dom, pars);    // calculate pressure field
  sgn_ts3d_210_v0_OpenCL(dom, pars);      // calculate velocity x-axis
  sgn_ts3d_210_v1_OpenCL(dom, pars);      // calculate velocity y-axis
  sgn_ts3d_210_v2_OpenCL(dom, pars);      // calculate velocity z-axis
  …
}

bool sgn_ts3d_210_p012_OpenCL(RDOM* dom, void* pars) {
  …
  clEnqueueWriteBuffer(queue, buffer, …);                  // copy data from host to device
  clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);     // execute OpenCL kernel on device
  clEnqueueReadBuffer(queue, buffer, …);                   // copy data from device to host
  …
}

[Figure: timeline. OpenCL p, OpenCL vx, OpenCL vy and OpenCL vz run one after another on the GPU.]
Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! understand where performance gets lost
! 98% of time spent on I/O
! 2% of time spent on compute
! reduce I/O

OpenCL Upload       188ms
Kernel Execution    4ms
OpenCL Download     54ms

[Figure: timeline with OpenCL vx on the GPU and p, vy and vz on the OpenMP threads]
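The 98% and 2% split follows directly from the three timings on this slide. A minimal sketch of the arithmetic, with `opencl_call_timing` and `io_fraction` as illustrative names that are not part of IWAVE or OpenCL:

```c
#include <assert.h>

/* Timings of one OpenCL call in milliseconds. */
typedef struct {
    double upload_ms;    /* host to device copy */
    double kernel_ms;    /* kernel execution */
    double download_ms;  /* device to host copy */
} opencl_call_timing;

/* Fraction of the total call time that is spent moving data. */
double io_fraction(opencl_call_timing t) {
    double total = t.upload_ms + t.kernel_ms + t.download_ms;
    assert(total > 0.0);
    return (t.upload_ms + t.download_ms) / total;
}
```

With 188 ms upload, 4 ms kernel and 54 ms download, the I/O share is 242 ms out of 246 ms, or about 98%.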

Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! How does the speedup of an OpenCL application ($S_{\mathrm{OpenCL}}$) depend on the speedup of the OpenCL kernel ($S_{\mathrm{Kernel}}$) when the OpenCL I/O time is fixed?
! Fraction of OpenCL I/O time: $F_{\mathrm{I/O}}$
! An I/O fraction of 50% limits the maximum possible speedup to 2
! Minimize OpenCL I/O first, only then increase OpenCL kernel performance

$$ S_{\mathrm{OpenCL}} = \frac{S_{\mathrm{Kernel}}}{(S_{\mathrm{Kernel}} - 1)\, F_{\mathrm{I/O}} + 1} $$
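The speedup relation on this slide is Amdahl's law with the I/O time as the fixed part: the application speedup equals the kernel speedup divided by ((kernel speedup - 1) times the I/O fraction, plus 1). A short sketch that evaluates it; the function name `opencl_speedup` is illustrative only:

```c
#include <assert.h>

/* Application speedup for a given kernel speedup when a fraction
   io_fraction of the original run time is fixed OpenCL I/O. */
double opencl_speedup(double kernel_speedup, double io_fraction) {
    assert(kernel_speedup > 0.0);
    assert(io_fraction >= 0.0 && io_fraction <= 1.0);
    return kernel_speedup / ((kernel_speedup - 1.0) * io_fraction + 1.0);
}
```

With an I/O fraction of 0.5, even a kernel that is a million times faster leaves the application just under 2 times faster, which is the limit quoted on the slide.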

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

while(…) {                                // main simulation loop
  sgn_ts3d_210_ALL_OpenCL(dom, pars);     // combine all OpenCL calculations
  …
}

bool sgn_ts3d_210_ALL_OpenCL(RDOM* dom, void* pars) {
  …
  clEnqueueWriteBuffer(queue, buffer, …);                    // copy data from host to device

  while(…) {
    clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);     // execute OpenCL kernel for pressure
    clEnqueueNDRangeKernel(queue, kernel_V0, dims, …);       // execute OpenCL kernel for velocity x
    clEnqueueNDRangeKernel(queue, kernel_V1, dims, …);       // execute OpenCL kernel for velocity y
    clEnqueueNDRangeKernel(queue, kernel_V2, dims, …);       // execute OpenCL kernel for velocity z
  }
  clEnqueueReadBuffer(queue, buffer, …);                     // copy data from device to host
  …
}

[Figure: timeline. OpenCL p, OpenCL vx, OpenCL vy and OpenCL vz run back to back on the GPU.]
Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! eliminate all but essential I/O
! significant speedup over simple OpenCL
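A rough cost model shows why keeping the buffers on the device matters. The sketch below assumes, purely for illustration, that every kernel call has the same upload, execution and download times and that the combined version copies the data once in and once out; the type and function names are hypothetical.

```c
#include <assert.h>

/* Timings of one OpenCL call in milliseconds. */
typedef struct {
    double upload_ms;
    double kernel_ms;
    double download_ms;
} call_timing;

/* Simple OpenCL port: every kernel call copies its data in and out. */
double per_call_transfer_ms(int iterations, int kernels, call_timing t) {
    assert(iterations >= 0 && kernels >= 0);
    double calls = (double)iterations * kernels;
    return calls * (t.upload_ms + t.kernel_ms + t.download_ms);
}

/* Combined port: one upload, all kernel calls on the device, one download. */
double resident_buffer_ms(int iterations, int kernels, call_timing t) {
    assert(iterations >= 0 && kernels >= 0);
    double calls = (double)iterations * kernels;
    return t.upload_ms + calls * t.kernel_ms + t.download_ms;
}
```

Taking 188 ms, 4 ms and 54 ms as example timings for 100 iterations of 4 kernels, the model gives 98400 ms with per-call transfers against 1842 ms with resident buffers.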

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! measure real application performance
! 3000 iterations using a 97x405x389 simulation grid
! 8 GCN Compute Units achieve 70% more performance than 8 traditional OpenMP threads

[Figure: bar chart comparing CPU (8T) "Piledriver" with GPU (8CU) AMD S9000; axis ticks from 0 to 14]
Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! initial OpenCL performance measurements
! 89 Algorithms tested for image size of 4MP
! compare OpenCL I/O and execution time
! 28% of all algorithms are compute bound
! 72% of all algorithms are I/O bound

OpenCV Computer Vision Library Performance Tests v2.4
Ubuntu 12.10 x86_64
1 Piledriver CPU core @ 2.5GHz
6 GPU Compute Units @ 720MHz
16GB DDR3 1600MHz
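The compute-bound versus I/O-bound split can be expressed as a comparison of the two measured times per algorithm. The slides do not state the exact criterion, so the rule below (whichever time is larger decides) is an assumption, as are the names `bound_kind` and `classify`.

```c
#include <assert.h>

typedef enum { IO_BOUND, COMPUTE_BOUND } bound_kind;

/* Assumed rule: an algorithm is compute bound when its kernel execution
   time exceeds its OpenCL I/O time, and I/O bound otherwise. */
bound_kind classify(double io_ms, double exec_ms) {
    assert(io_ms >= 0.0 && exec_ms >= 0.0);
    return exec_ms > io_ms ? COMPUTE_BOUND : IO_BOUND;
}
```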
Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! compare OpenCL and single-threaded performance
! 89 Algorithms tested for image size of 4MP
! realistic timing that includes I/O over PCIe
! 59% of all algorithms execute faster on the GPU
! 41% of all algorithms execute faster on the CPU(1)
! significant speedup for only 15% of all algorithms

OpenCV Computer Vision Library Performance Tests v2.4
Ubuntu 12.10 x86_64
1 Piledriver CPU core @ 2.5GHz
6 GPU Compute Units @ 720MHz
16GB DDR3 1600MHz
Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! Task: Batch process a large amount of images using a single algorithm.
! OpenCL performance is algorithm and image size dependent
! Either the CPU will process data or the GPU, but not both
! How to choose which algorithm and device to use depending on image size?

Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! Better: create an input image queue that the CPU and GPU query for new image tasks until the queue is empty.
! all CPU cores are fully utilized at all times even for single-threaded algorithms
! all GPU compute units are fully utilized at all times
! combined performance for a single algorithm is the sum of GPU and CPU performance for that algorithm
! combined performance for multiple algorithms is better than the sum of device performance

$$ P_i^{\mathrm{APU}} = P_i^{\mathrm{CPU}} + P_i^{\mathrm{GPU}} $$

$$ P = \frac{1}{\sum_{i=1}^{N} \frac{1}{P_i}} $$
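The shared-queue idea can be checked with a small simulation: whichever device becomes free first takes the next image, until the queue is empty. This is a simplified model under stated assumptions (fixed per-image times, no queue overhead); `drain_queue_ms` and its parameters are hypothetical names.

```c
#include <assert.h>

/* Time in milliseconds until a queue of n_images is empty when a CPU worker
   and a GPU worker each pull the next image as soon as they are free.
   cpu_ms and gpu_ms are the fixed per-image processing times. */
double drain_queue_ms(int n_images, double cpu_ms, double gpu_ms) {
    assert(n_images >= 0 && cpu_ms > 0.0 && gpu_ms > 0.0);
    double cpu_free = 0.0;  /* time at which the CPU worker is next idle */
    double gpu_free = 0.0;  /* time at which the GPU worker is next idle */
    for (int i = 0; i < n_images; ++i) {
        if (gpu_free <= cpu_free) {
            gpu_free += gpu_ms;  /* GPU pulls the next image */
        } else {
            cpu_free += cpu_ms;  /* CPU pulls the next image */
        }
    }
    return cpu_free > gpu_free ? cpu_free : gpu_free;
}
```

With 20 ms per image on the CPU and 10 ms on the GPU, the individual rates are 50 and 100 images per second. The simulation drains 300 images in 2000 ms, which is 150 images per second, the sum of the two.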
Programming Strategies

Summary

! next generation hardware and legacy code require compromises
! OpenCL performance is tied to Amdahl’s Law regarding OpenCL I/O and OpenCL execution time
! application performance can be increased by overlapping OpenCL and OpenMP workloads
! removing all but necessary OpenCL I/O can have a dramatic influence on performance
! for loosely coupled high-throughput applications, OpenCL and OpenMP performance add up for single algorithms
! for multiple algorithms, the combined performance across all algorithms is better than the sum of the device performances
! APUs may provide greatest performance per Watt
! GPUs may provide greatest performance
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and
typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product
and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing
manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or
revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof
without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD
BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro
Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation
Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

MOVED: The challenge of SVE in QEMU - SFO17-103
MOVED: The challenge of SVE in QEMU - SFO17-103MOVED: The challenge of SVE in QEMU - SFO17-103
MOVED: The challenge of SVE in QEMU - SFO17-103
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読み
 
20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
 
Challenges in GPU compilers
Challenges in GPU compilersChallenges in GPU compilers
Challenges in GPU compilers
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
 
開放運算&GPU技術研究班
開放運算&GPU技術研究班開放運算&GPU技術研究班
開放運算&GPU技術研究班
 
The Green Lab - [04 B] [PWA] Experiment setup
The Green Lab - [04 B] [PWA] Experiment setupThe Green Lab - [04 B] [PWA] Experiment setup
The Green Lab - [04 B] [PWA] Experiment setup
 
Getting started with AMD GPUs
Getting started with AMD GPUsGetting started with AMD GPUs
Getting started with AMD GPUs
 
H2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt DowleH2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt Dowle
 
Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUs
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 

HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel

  • 1. Efficient Scheduling of OpenMP and OpenCL Workloads: Getting the most out of your APU
  • 2. Objective
      - Software has a long life span that exceeds the life span of hardware.
      - Software is very expensive to write and maintain.
      - Next-generation hardware also needs to run legacy software.
      - Example: IWAVE
          - procedural C code
          - no object orientation
          - tight integration between data structures and functions
      - What do I mean by efficient scheduling?
          - find ways to utilize GPU cores for code blocks
          - find ways to utilize all CPU cores and GPU units at the same time
  • 3. Historical Context: GPU Compute Timeline
      [Figure: timeline with year marks 2002, 2008, 2010, and 2012; labeled technologies include CUDA, Aparapi, and AMP C++]
  • 4. Accelerator Challenges: Technology Accessibility and Performance
      [Figure: chart of Performance against Ease-of-Use, positioning OpenCL & CUDA, CPU Multithread, and CPU Single Thread]
  • 5. APU Opportunities: One Die - Two Computational Devices
      Metric                 CPU                     APU
      Memory Size            large                   small
      Memory Bandwidth       small                   large
      Parallelism            small                   large
      General Purpose        yes                     no
      Performance            application dependent   application dependent
      Performance-per-Watt   application dependent   application dependent
      Programming            Traditional             OpenCL
  • 6. APU Opportunities: Performance and Performance-per-Watt
      Example: Luxmark OpenCL Benchmark
      Metric               CPU    GPU    APU
      Performance [Pts]    170    197    316
      Power [W]            50     37     58
      PPW [Pts/W]          3.4    5.3    5.4
      Combined [Pts²/W]    578    1049   1722
      - Similar CPU and GPU performance
      - Best performance by using the APU
      - GPU has best performance-per-Watt
      - APU provides outstanding value
      Test system: Luxmark OpenCL Benchmark, Ubuntu 12.10 x86_64, 4 Piledriver CPU cores @ 2.5GHz, 6 GPU Compute Units @ 720MHz, 16GB DDR3 1600MHz
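The two derived rows of the Luxmark table follow from the measured points and power. A minimal sketch of that arithmetic is given below; the function names `perf_per_watt` and `combined_metric` are hypothetical, and reading the combined metric as points squared divided by watts is inferred from the unit Pts²/W.

```c
#include <assert.h>

/* Performance-per-Watt: benchmark points divided by power draw in watts. */
double perf_per_watt(double points, double watts) {
    assert(watts > 0.0);
    return points / watts;
}

/* Combined metric in Pts^2/W: performance times performance-per-Watt.
 * This reading of the slide's "Combined" row is an assumption. */
double combined_metric(double points, double watts) {
    assert(watts > 0.0);
    return points * points / watts;
}
```

Calling `perf_per_watt(170, 50)` gives 3.4 for the CPU column, and `combined_metric(170, 50)` gives 578, matching the table.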
  • 7. Example: Luxmark Renderer (Performance and Performance-per-Watt)
      [Figure: chart annotated with +64% and +81%]
      Test system: Luxmark OpenCL Benchmark, Render “Sala” Scene, Ubuntu 12.10 x86_64, 4 Piledriver cores @ 2.5GHz, 6 GPU CUs @ 720MHz, 16GB DDR3 1600MHz
  • 8. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - Know the problem you are trying to solve.
          - staggered rectangular grid in 3D
          - coupled first order PDE
          - scalar pressure field p
          - vector velocity field v = {vx, vy, vz}
          - source term g
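For orientation, a coupled first-order acoustic system with these fields is commonly written in the pressure-velocity form below. The bulk modulus κ and the density ρ are assumptions here: the slide names only p, v, and g, and does not state the coefficients.

```latex
\begin{aligned}
\frac{\partial p}{\partial t} &= -\kappa \, \nabla \cdot \mathbf{v} + g, \\
\rho \, \frac{\partial \mathbf{v}}{\partial t} &= -\nabla p,
\qquad \mathbf{v} = (v_x, v_y, v_z).
\end{aligned}
```

In this form the pressure update needs derivatives of all three velocity components, while each velocity component needs only one pressure derivative.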
  • 9. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
        while(…) {                                  // main simulation loop
          sgn_ts3d_210_p012_OpenMP(dom, pars);      // calculate pressure field
          sgn_ts3d_210_v0_OpenMP(dom, pars);        // calculate velocity x-axis
          sgn_ts3d_210_v1_OpenMP(dom, pars);        // calculate velocity y-axis
          sgn_ts3d_210_v2_OpenMP(dom, pars);        // calculate velocity z-axis
          …
        }
      [Figure: timeline; OpenMP row: p, vx, vy, vz executed one after another]
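The loop on this slide calls one OpenMP routine per field. As a rough illustration of what the inside of such a routine might look like, here is a minimal 1D sketch. The function `update_velocity`, the lumped coefficient `c`, and the 1D layout are assumptions for illustration and are not taken from IWAVE.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical 1D stand-in for one velocity update on a staggered grid:
 *   v[i] -= c * (p[i + 1] - p[i])
 * where p has n samples, v has n - 1 samples, and c lumps together the
 * time step, grid spacing, and material coefficient.
 * Each iteration writes only v[i], so the loop can be split across threads.
 */
void update_velocity(double *v, const double *p, size_t n, double c) {
    assert(v != NULL && p != NULL);
    #pragma omp parallel for
    for (long i = 0; i < (long)n - 1; ++i) {
        v[i] -= c * (p[i + 1] - p[i]);
    }
}
```

Compiled without OpenMP support, the pragma is ignored and the loop runs on a single thread with the same result.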
  • 10. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - Measure the initial performance.
          - pressure and velocity fields simulated using OpenMP
          - average time T[ms] per iteration
          - OpenMP scales linearly with the number of threads
  • 11. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - find computational blocks
      - understand dependencies between blocks
      - identify sequential and parallel parts
      [Figure: causality diagram and timeline of the blocks OpenMP p, OpenMP vx, OpenMP vy, and OpenMP vz]
  • 12. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
        while(…) {                                  // main simulation loop
          sgn_ts3d_210_p012_OpenMP(dom, pars);      // calculate pressure field p
          sgn_ts3d_210_v0_OpenCL(dom, pars);        // calculate velocity x-axis
          sgn_ts3d_210_v1_OpenMP(dom, pars);        // calculate velocity y-axis
          sgn_ts3d_210_v2_OpenMP(dom, pars);        // calculate velocity z-axis
          …
        }
      [Figure: timeline; OpenCL row: vx; OpenMP row: p, IDLE, vy, vz]
  • 13. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - use the GPU to compute vx
      - the CPU is idle while the GPU is running
      - 42% improvement for 1 thread
      - 25% improvement for 2 threads
      - 9% improvement for 4 threads
  • 14. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
        while(…) {                                              // main simulation loop
          sgn_ts3d_210_p012_OpenMP(dom, pars);                  // calculate pressure field p

          int num_threads = atoi(getenv("OMP_NUM_THREADS"));    // save the current number of OpenMP threads
          omp_set_num_threads(2);                               // restrict the number of OpenMP threads to 2
          omp_set_nested(1);                                    // allow nested OpenMP threads
          #pragma omp parallel shared(…) private(…)             // start 2 OpenMP threads
          {
            switch ( omp_get_thread_num() ) {
              case 0:
                sgn_ts3d_210_v0_OpenCL(dom, pars);              // calculate velocity x-axis using OpenCL
                break;
              case 1:
                omp_set_num_threads(num_threads);               // increase number of OpenMP threads back
                sgn_ts3d_210_v1_OpenMP(dom, pars);              // calculate velocity y-axis
                sgn_ts3d_210_v2_OpenMP(dom, pars);              // calculate velocity z-axis
                break;
              default:
                break;
            }
          }                                                     // close OpenMP pragma
        }                                                       // close simulation while
      [Figure: timeline; OpenCL row: vx; OpenMP row: p, vy, vz, with vx overlapping vy]
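The slide dispatches work by thread number inside a two-thread parallel region. A different way to express the same overlap is OpenMP `sections`, sketched below as an alternative and not as the author's method. The type `sim_state` and the three placeholder update functions are hypothetical stand-ins for `RDOM` and the `sgn_ts3d_210_*` routines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the simulation state (RDOM in the slides). */
typedef struct {
    int vx_updates;
    int vy_updates;
    int vz_updates;
} sim_state;

/* Placeholder kernels; each one touches only its own field. */
static void update_vx_gpu(sim_state *s) { s->vx_updates++; }
static void update_vy_cpu(sim_state *s) { s->vy_updates++; }
static void update_vz_cpu(sim_state *s) { s->vz_updates++; }

/* Overlap the GPU block with the two CPU blocks using two sections. */
void velocity_step(sim_state *s) {
    assert(s != NULL);
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        update_vx_gpu(s);        /* would enqueue the OpenCL kernel and wait */

        #pragma omp section
        {
            update_vy_cpu(s);    /* these two run back to back on the CPU */
            update_vz_cpu(s);
        }
    }                            /* implicit barrier: all three are done here */
}
```

If the CPU blocks open their own parallel regions, as the slide's OpenMP routines do, nested parallelism still has to be enabled for them to use more than one thread.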
  • 15. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - overlap vx and vy
      - the CPU is no longer idle
      - 50% improvement for 1 thread
      - 40% improvement for 2 threads
      - 38% improvement for 4 threads
  • 16. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
        while(…) {                                  // main simulation loop
          sgn_ts3d_210_p012_OpenCL(dom, pars);      // calculate pressure field
          sgn_ts3d_210_v0_OpenCL(dom, pars);        // calculate velocity x-axis
          sgn_ts3d_210_v1_OpenCL(dom, pars);        // calculate velocity y-axis
          sgn_ts3d_210_v2_OpenCL(dom, pars);        // calculate velocity z-axis
          …
        }

        bool sgn_ts3d_210_p012_OpenCL(RDOM* dom, void* pars) {
          …
          clEnqueueWriteBuffer(queue, buffer, …);                   // copy data from host to device
          clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);      // execute OpenCL kernel on device
          clEnqueueReadBuffer(queue, buffer, …);                    // copy data from device to host
          …
        }
      [Figure: timeline; OpenCL row: p, vx, vy, vz executed one after another]
  • 17. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - understand where performance gets lost
          - 98% of time spent on I/O
          - 2% of time spent on compute
      - reduce I/O
      OpenCL Upload: 188ms
      Kernel Execution: 4ms
      OpenCL Download: 54ms
      [Figure: timeline; OpenCL row: vx; OpenMP row: p, vy, vz]
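The 98% and 2% figures can be checked against the three timings on this slide. The sketch below shows the arithmetic; the function name `io_fraction` is hypothetical, and counting upload plus download as the I/O time is an assumption about how the slide groups them.

```c
#include <assert.h>

/* Fraction of total OpenCL time spent on transfers (upload + download). */
double io_fraction(double upload_ms, double kernel_ms, double download_ms) {
    double total = upload_ms + kernel_ms + download_ms;
    assert(total > 0.0);
    return (upload_ms + download_ms) / total;
}
```

With 188 ms upload, 4 ms kernel execution, and 54 ms download, the transfers account for 242 of 246 ms, which rounds to the 98% quoted on the slide.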
  • 18. Programming Strategies. Example: High Throughput Computer Vision with OpenCV
      - How does the speedup of an OpenCL application ($S_{\mathrm{OpenCL}}$) depend on the speedup of the OpenCL kernel ($S_{\mathrm{Kernel}}$) when the OpenCL I/O time is fixed?
      - Fraction of OpenCL I/O time: $F_{\mathrm{I/O}}$
      $$ S_{\mathrm{OpenCL}} = \frac{S_{\mathrm{Kernel}}}{(S_{\mathrm{Kernel}} - 1)\,F_{\mathrm{I/O}} + 1} $$
      - An I/O time fraction of 50% limits the maximum possible speedup to 2.
      - Minimize OpenCL I/O first, and only then increase OpenCL kernel performance.
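The relation on this slide, read as S_OpenCL = S_Kernel / ((S_Kernel - 1) * F_I/O + 1), is easy to evaluate numerically. The sketch below does so; the function names `opencl_speedup` and `max_opencl_speedup` are hypothetical, and the limit 1 / F_I/O follows from letting the kernel speedup grow without bound.

```c
#include <assert.h>

/* Application speedup for a kernel speedup s_kernel when a fraction f_io
 * of the original run time is fixed OpenCL I/O. */
double opencl_speedup(double s_kernel, double f_io) {
    assert(s_kernel > 0.0);
    assert(f_io >= 0.0 && f_io <= 1.0);
    return s_kernel / ((s_kernel - 1.0) * f_io + 1.0);
}

/* Upper bound on the application speedup as s_kernel grows without limit. */
double max_opencl_speedup(double f_io) {
    assert(f_io > 0.0 && f_io <= 1.0);
    return 1.0 / f_io;
}
```

For an I/O fraction of 0.5, `max_opencl_speedup` returns 2, the limit stated on the slide. A tenfold kernel speedup at that fraction yields an application speedup of roughly 1.8.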
  • 19. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
        while(…) {                                   // main simulation loop
          sgn_ts3d_210_ALL_OpenCL(dom, pars);        // combine all OpenCL calculations
          …
        }

        bool sgn_ts3d_210_ALL_OpenCL(RDOM* dom, void* pars) {
          …
          clEnqueueWriteBuffer(queue, buffer, …);                     // copy data from host to device

          while(…) {
            clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);      // execute OpenCL kernel for pressure
            clEnqueueNDRangeKernel(queue, kernel_V0, dims, …);        // execute OpenCL kernel for velocity x
            clEnqueueNDRangeKernel(queue, kernel_V1, dims, …);        // execute OpenCL kernel for velocity y
            clEnqueueNDRangeKernel(queue, kernel_V2, dims, …);        // execute OpenCL kernel for velocity z
          }
          clEnqueueReadBuffer(queue, buffer, …);                      // copy data from device to host
          …
        }
      [Figure: timeline; OpenCL row: p, vx, vy, vz executed one after another]
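The benefit of moving the transfers out of the loop can be estimated with a simple cost model. The sketch below compares transferring on every kernel call with transferring once around all calls. The type `opencl_timing` and both function names are hypothetical, and applying one fixed set of upload, kernel, and download times to every call is a simplifying assumption, not a measurement from the talk.

```c
#include <assert.h>

/* Timings for one kernel call, in milliseconds. */
typedef struct {
    double upload_ms;
    double kernel_ms;
    double download_ms;
} opencl_timing;

/* Every kernel call uploads, executes, and downloads. */
double naive_total_ms(opencl_timing t, long calls) {
    assert(calls >= 0);
    return (double)calls * (t.upload_ms + t.kernel_ms + t.download_ms);
}

/* One upload, all kernel calls back to back, one download. */
double batched_total_ms(opencl_timing t, long calls) {
    assert(calls >= 0);
    return t.upload_ms + (double)calls * t.kernel_ms + t.download_ms;
}
```

With the timings from slide 17 (188 ms, 4 ms, 54 ms) and 100 kernel calls, the model gives 24600 ms for the per-call variant and 642 ms for the batched variant. These are model values only; real transfers and kernels vary with buffer size and device.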
  • 20. Programming Strategies Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      • eliminate all but essential I/O
      • significant speedup over simple OpenCL
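Why removing per-step transfers matters can be seen with a rough cost model. The sketch below is an assumption-laden simplification, not a measurement: it treats each time step as one upload, one kernel phase and one download, and it assumes intermediate results never have to return to the host. The function names are hypothetical.

```c
#include <assert.h>

/* Simple OpenCL port: every time step uploads the data,
 * runs the kernel phase and downloads the result again. */
double time_transfer_every_step(int steps, double upload_ms,
                                double kernel_ms, double download_ms) {
    assert(steps >= 0);
    return steps * (upload_ms + kernel_ms + download_ms);
}

/* Data stays on the device: upload once, run all time steps,
 * download once at the end. */
double time_transfer_once(int steps, double upload_ms,
                          double kernel_ms, double download_ms) {
    assert(steps >= 0);
    return upload_ms + steps * kernel_ms + download_ms;
}
```

Using the phase timings from the earlier slide (188 ms, 4 ms, 54 ms) and a hypothetical 100 time steps, the model gives 24600 ms with per-step transfers and 642 ms with a single upload and download. Real runs will differ, but the model shows why the fixed I/O cost dominates until it is amortized.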
  • 21. Programming Strategies Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      • measure real application performance
      • 3000 iterations using a 97x405x389 simulation grid
      • 8 GCN Compute Units achieve 70% more performance than 8 traditional OpenMP threads
      [Bar chart: CPU (8T) "Piledriver" vs. GPU (8CU) AMD S9000]
  • 22. Programming Strategies Example: High Throughput Computer Vision with OpenCV
      • initial OpenCL performance measurements
      • 89 algorithms tested for an image size of 4MP
      • compare OpenCL I/O and execution time
      • 28% of all algorithms are compute bound
      • 72% of all algorithms are I/O bound
      Test setup: OpenCV Computer Vision Library Performance Tests v2.4; Ubuntu 12.10 x86_64; 1 Piledriver CPU core @ 2.5GHz; 6 GPU Compute Units @ 720MHz; 16GB DDR3 1600MHz
  • 23. Programming Strategies Example: High Throughput Computer Vision with OpenCV
      • compare OpenCL and single-threaded performance
      • 89 algorithms tested for an image size of 4MP
      • realistic timing that includes I/O over PCIe
      • 59% of all algorithms execute faster on the GPU
      • 41% of all algorithms execute faster on the CPU(1)
      • significant speedup for only 15% of all algorithms
      Test setup: OpenCV Computer Vision Library Performance Tests v2.4; Ubuntu 12.10 x86_64; 1 Piledriver CPU core @ 2.5GHz; 6 GPU Compute Units @ 720MHz; 16GB DDR3 1600MHz
  • 24. Programming Strategies Example: High Throughput Computer Vision with OpenCV
      • Task: batch process a large number of images using a single algorithm.
      • OpenCL performance depends on the algorithm and the image size
      • Either the CPU or the GPU processes the data, but not both
      • How to choose which algorithm and device to use depending on image size?
  • 25. Programming Strategies Example: High Throughput Computer Vision with OpenCV
  • 26. Programming Strategies Example: High Throughput Computer Vision with OpenCV
      • Better: create an input image queue that the CPU and GPU query for new image tasks until the queue is empty.
      • all CPU cores are fully utilized at all times, even for single-threaded algorithms
      • all GPU compute units are fully utilized at all times
      • combined performance for a single algorithm is the sum of the GPU and CPU performance for that algorithm:
        $P^{i}_{APU} = P^{i}_{CPU} + P^{i}_{GPU}$
      • combined performance for multiple algorithms is better than the sum of the device performances, with the performance across N algorithms given by:
        $P = \frac{1}{\sum_{i=1}^{N} \frac{1}{P_{i}}}$
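The shared-queue idea can be illustrated with a small deterministic simulation instead of real threads. This is a sketch under stated assumptions: each device needs a fixed time per image, whichever device becomes idle first takes the next image, and ties go to the GPU. The type and function names are hypothetical and not taken from OpenCV or the slides.

```c
#include <assert.h>

/* Outcome of draining one shared image queue with two devices. */
typedef struct {
    long images_cpu;   /* images processed by the CPU worker    */
    long images_gpu;   /* images processed by the GPU worker    */
    long makespan_ms;  /* time at which the last image finishes */
} QueueResult;

/* Greedy model of a shared work queue: the next image always goes
 * to the device that becomes idle first (GPU wins ties). */
QueueResult simulate_shared_queue(long n_images, long cpu_ms, long gpu_ms) {
    assert(n_images >= 0 && cpu_ms > 0 && gpu_ms > 0);
    QueueResult r = {0, 0, 0};
    long cpu_free = 0;  /* time at which the CPU is idle again */
    long gpu_free = 0;  /* time at which the GPU is idle again */
    for (long i = 0; i < n_images; ++i) {
        if (gpu_free <= cpu_free) {
            gpu_free += gpu_ms;
            r.images_gpu++;
        } else {
            cpu_free += cpu_ms;
            r.images_cpu++;
        }
    }
    r.makespan_ms = (cpu_free > gpu_free) ? cpu_free : gpu_free;
    return r;
}
```

With 30 images, 200 ms per image on the CPU and 100 ms on the GPU, the simulation assigns 10 images to the CPU and 20 to the GPU and finishes after 2000 ms. That is 15 images per second, the sum of the individual rates of 5 and 10 images per second.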
  • 27. Programming Strategies Example: High Throughput Computer Vision with OpenCV
  • 28. Programming Strategies Summary
      • next generation hardware and legacy code require compromises
      • OpenCL performance is tied to Amdahl's Law regarding OpenCL I/O and OpenCL execution time
      • application performance can be increased by overlapping OpenCL and OpenMP workloads
      • removing all but necessary OpenCL I/O can have a dramatic influence on performance
      • for loosely coupled high-throughput applications, the OpenCL and OpenMP performances add up for single algorithms
      • for multiple algorithms, the combined performance across all algorithms is better than the sum of the device performances
      • APUs may provide the greatest performance per Watt
      • GPUs may provide the greatest performance
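The multi-algorithm claim can be checked numerically. The sketch below assumes that the overall performance across N algorithms is P = 1 / sum(1 / P_i), and that the APU performance for each algorithm is the sum of the CPU and GPU performance. Function names and the sample numbers in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Overall performance of one device across n algorithms:
 * P = 1 / sum_i (1 / p[i]). */
double overall_performance(const double *p, size_t n) {
    assert(n > 0);
    double total_time = 0.0;
    for (size_t i = 0; i < n; ++i) {
        assert(p[i] > 0.0);
        total_time += 1.0 / p[i];
    }
    return 1.0 / total_time;
}

/* Overall APU performance when CPU and GPU share every algorithm,
 * so the per-algorithm performance is cpu[i] + gpu[i]. */
double apu_overall_performance(const double *cpu, const double *gpu, size_t n) {
    assert(n > 0);
    double total_time = 0.0;
    for (size_t i = 0; i < n; ++i) {
        assert(cpu[i] > 0.0 && gpu[i] > 0.0);
        total_time += 1.0 / (cpu[i] + gpu[i]);
    }
    return 1.0 / total_time;
}
```

For two hypothetical algorithms with CPU performance {4, 1} and GPU performance {1, 4}, each device alone reaches an overall performance of 0.8, so the sum is 1.6, while the shared APU figure is 2.5. Under this model the gain comes from each device compensating for the other's slow algorithm.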
  • 29. DISCLAIMER & ATTRIBUTION
 The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
 The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
 AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
 AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
 ATTRIBUTION
 © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.