Harnessing Stream Processors:
massively parallel processing
Jesse Rosenzweig, CTO, [email protected]
April 21st, 2009
Elemental Technologies Incorporated Confidential
Agenda
• Company Background
• Story of a Startup
• The Elemental Video Engine
• Elemental Product Line
• CUDA introduction
• Conclusion
Company Background
- Our Mission:
- To create the fastest, highest quality video solutions by harnessing massively parallel, off-the-shelf hardware.
- Founded in 2006
- Team led display revolution at
- Headquartered in beautiful Portland, Oregon
- Profitable in first quarter of revenue (Q4 ‘08)
- Raised $7.1M Series A in June 2008
Story of a Startup
- Founded August 2006
- Focus was to build an ASIC
- Standalone transcoder / encoder
- Estimated cost $20M to revenue
- Funding sources limited
[Figure: ASIC block diagram; recoverable labels: VCU, 3D Comb, IR, TS Demux, POD Controller, Decryption, Encryption, TS Remux]
- Elemental 2.0: April 2007
- NVIDIA G80 had been released
- CUDA had been launched
- Powerful parallel engine available
- Switched to software model!
Disruptive Innovation
- Elemental’s video engine harnesses key GPU trends
- GPUs have become immensely powerful
- GPUs have become extremely programmable
- PCI-e bus allows fast CPU / GPU communication
Video Engine Pipeline
- Harnesses both the CPU and GPU strengths
- Achieves up to 10x the performance of CPU-only processing
- Efficient use of system resources is key
Elemental Video Engine
- Currently used by a variety of applications:
- Virtualization / Remote Video Distribution
- United States Intelligence Community
- Professional Video Editing
Elemental’s Product Line
- All powered by Elemental core technology
Product | Target | Features
Elemental Video Engine™ SDK | Developer | Flexible and extensible; supports a variety of codecs
Badaboom™ Media Converter | Consumer | Video on mobile devices; 1 million+ downloads
Elemental Accelerator for CS4 | Professional | Premiere Pro plug-in; bundled w/ NVIDIA Quadro CX
Release timeline (axis: Q3 ‘08, Q4 ‘08, Q1 ‘09, Q2 ‘09, Q3 ‘):
- Badaboom™ Media Converter available
- RapiHD™ Accelerator for Adobe Premiere Pro CS4 available
- RapiHD™ SDK available
CUDA Introduction
• What is CUDA?
- Compute Unified Device Architecture
- Parallel processing at a very low level
- Extensions to C
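As a minimal sketch of what "extensions to C" looks like in practice, the program below adds two vectors on the GPU. The kernel name `addVectors`, the wrapper `runAddVectors`, and the launch size of 256 threads per block are illustrative choices, not taken from the talk; error handling is reduced to a single check for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// __global__ is a CUDA extension: the function runs on the GPU and is
// launched from CPU code.
__global__ void addVectors(const float* a, const float* b, float* c, int n)
{
    // blockIdx, blockDim and threadIdx are built-in variables that give
    // each thread its own index into the data.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Illustrative host wrapper (hypothetical name). Returns true when the GPU
// result matches the sum computed on the CPU.
bool runAddVectors(int n)
{
    std::vector<float> a(n), b(n), c(n, 0.0f);
    for (int i = 0; i < n; ++i) {
        a[i] = float(i);
        b[i] = 2.0f * float(i);
    }

    size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    // <<<blocks, threads>>> is the kernel launch syntax CUDA adds to C.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    addVectors<<<blocks, threads>>>(dA, dB, dC, n);

    // This copy blocks until the kernel has finished.
    cudaError_t err = cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (c[i] != a[i] + b[i]) return false;
    }
    return true;
}

int main()
{
    printf("vector add: %s\n", runAddVectors(1000) ? "ok" : "failed");
    return 0;
}
```

Everything outside the `__global__` qualifier, the built-in index variables, and the `<<<...>>>` launch is ordinary C/C++ plus runtime library calls.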
GPU Hardware Introduction
- The GPU is built from a set of multiprocessors
- Each multiprocessor has sets of processors
- Each processor executes the same instruction on different data
- Each processor has access to shared memory
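The hardware figures described above can be read from the device at run time. The sketch below queries device 0 with `cudaGetDeviceProperties`; the helper `maxWarpsPerMultiprocessor` is a hypothetical name introduced here, and the values printed depend entirely on the installed GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Upper bound on resident warps per multiprocessor, derived from two
// fields of cudaDeviceProp (helper name is illustrative).
int maxWarpsPerMultiprocessor(const cudaDeviceProp& prop)
{
    return prop.maxThreadsPerMultiProcessor / prop.warpSize;
}

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA device found\n");
        return 1;
    }

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("Could not query device 0\n");
        return 1;
    }

    printf("Device:                  %s\n", prop.name);
    printf("Multiprocessors:         %d\n", prop.multiProcessorCount);
    printf("Warp size (threads):     %d\n", prop.warpSize);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers per block:     %d\n", prop.regsPerBlock);
    printf("Max warps per SM:        %d\n", maxWarpsPerMultiprocessor(prop));
    return 0;
}
```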
CUDA Introduction
- Memory types
- Global/Device: GPU’s DRAM. Slowest of all memory
- Constant: Cached global memory for constant read-only data
- Texture: 2D cache and hardware interpolation for global memory
- Shared: Fast memory (as fast as registers) available to a CUDA block
- Register: Set of general purpose registers available for the thread
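To show how these memory types appear in source code, here is a small sketch in which one kernel touches global, constant, shared, and register storage. The names `kScale`, `scaleThroughShared`, and `runScale` and the block size of 128 are assumptions made for illustration. Texture memory is left out because its API has changed considerably across CUDA releases.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 128

// Constant memory: cached, read-only inside kernels, written by the host.
__constant__ float kScale;

__global__ void scaleThroughShared(const float* in, float* out, int n)
{
    // Shared memory: one copy per thread block, visible to all its threads.
    __shared__ float tile[BLOCK];

    // Local scalars such as i and v normally live in registers.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // in and out point to global (device) memory, the GPU's DRAM.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    float v = tile[threadIdx.x] * kScale;
    if (i < n) {
        out[i] = v;
    }
}

// Illustrative host wrapper: returns true if every output equals in[i] * scale.
bool runScale(int n, float scale)
{
    std::vector<float> in(n), out(n, 0.0f);
    for (int i = 0; i < n; ++i) in[i] = float(i);

    size_t bytes = n * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(kScale, &scale, sizeof(float));

    scaleThroughShared<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(dIn, dOut, n);

    cudaError_t err = cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (out[i] != in[i] * scale) return false;
    }
    return true;
}

int main()
{
    printf("memory types demo: %s\n", runScale(1000, 0.5f) ? "ok" : "failed");
    return 0;
}
```

Staging through shared memory buys nothing for a pure scale operation; it is there only so that each memory space shows up once.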
CUDA Introduction
- Typical data flow
- CPU produces/captures data
- Copy data to GPU DRAM
- Kernel loads data from DRAM into shared memory
- Threads execute, in parallel, on data in shared memory
- Once threads are done (syncthreads), move data back into GPU DRAM
- Move results back to CPU
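The steps above can be sketched end to end with a toy kernel that reverses each block-sized chunk of an array. The kernel `reverseChunks`, the wrapper `runReverseChunks`, and the chunk size of 64 are hypothetical; the sketch also assumes the array length is an exact multiple of the block size. Here the barrier sits between the load and the use of shared memory, which is where this particular kernel needs it.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 64

__global__ void reverseChunks(const int* in, int* out)
{
    __shared__ int chunk[BLOCK];
    int base = blockIdx.x * BLOCK;

    // Kernel loads data from GPU DRAM into shared memory.
    chunk[threadIdx.x] = in[base + threadIdx.x];

    // Wait until every thread in the block has finished its load.
    __syncthreads();

    // Threads work in parallel on shared data and write back to GPU DRAM.
    out[base + threadIdx.x] = chunk[BLOCK - 1 - threadIdx.x];
}

// Illustrative host wrapper: returns true if each chunk came back reversed.
bool runReverseChunks(int numBlocks)
{
    int n = numBlocks * BLOCK;

    // CPU produces the data.
    std::vector<int> in(n), out(n, -1);
    for (int i = 0; i < n; ++i) in[i] = i;

    // Copy data to GPU DRAM.
    size_t bytes = n * sizeof(int);
    int *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);

    reverseChunks<<<numBlocks, BLOCK>>>(dIn, dOut);

    // Move results back to the CPU (blocks until the kernel is done).
    cudaError_t err = cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        int base = (i / BLOCK) * BLOCK;
        int expected = in[base + BLOCK - 1 - (i - base)];
        if (out[i] != expected) return false;
    }
    return true;
}

int main()
{
    printf("data flow demo: %s\n", runReverseChunks(4) ? "ok" : "failed");
    return 0;
}
```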
CUDA Introduction
- Occupancy
- The ratio of the number of active warps per multiprocessor to the maximum number of active warps
- Current NVIDIA GPUs support at most 32 active warps per multiprocessor
- Higher occupancy is not necessarily faster for any given algorithm, but is a measure of how much work can be done per clock.
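The ratio can be estimated by hand. The host-only sketch below does so under a deliberately simplified model: only the limit of 32 active warps comes from the slide, while the other limits (8 resident blocks, 16384 registers, 16384 bytes of shared memory per multiprocessor) are assumptions chosen for illustration, and allocation granularity is ignored. The struct and function names are hypothetical.

```cuda
#include <cstdio>
#include <algorithm>

// Per-multiprocessor limits. Only maxWarps = 32 comes from the slide; the
// other values used in main() are assumptions for illustration.
struct MultiprocessorLimits {
    int warpSize;
    int maxWarps;
    int maxBlocks;
    int registers;
    int sharedBytes;
};

// Simplified estimate: active warps / maximum active warps.
// Ignores register and shared memory allocation granularity.
double occupancy(const MultiprocessorLimits& sm, int threadsPerBlock,
                 int regsPerThread, int sharedBytesPerBlock)
{
    int warpsPerBlock = (threadsPerBlock + sm.warpSize - 1) / sm.warpSize;

    // Resident blocks are capped by whichever resource runs out first.
    int blocks = std::min(sm.maxBlocks, sm.maxWarps / warpsPerBlock);
    if (regsPerThread > 0) {
        blocks = std::min(blocks, sm.registers / (regsPerThread * threadsPerBlock));
    }
    if (sharedBytesPerBlock > 0) {
        blocks = std::min(blocks, sm.sharedBytes / sharedBytesPerBlock);
    }
    return double(blocks * warpsPerBlock) / double(sm.maxWarps);
}

int main()
{
    MultiprocessorLimits sm = {32, 32, 8, 16384, 16384};
    printf("256 threads, 16 regs, 4 KB shared: %.2f\n", occupancy(sm, 256, 16, 4096));
    printf("256 threads, 32 regs, 4 KB shared: %.2f\n", occupancy(sm, 256, 32, 4096));
    printf(" 64 threads, 16 regs, no shared:   %.2f\n", occupancy(sm, 64, 16, 0));
    return 0;
}
```

Under these assumed limits the three cases print 1.00, 0.50, and 0.50: doubling register use halves the resident blocks, and very small blocks hit the block-count cap before they fill the warp slots.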
CUDA Introduction
- Optimize kernels by
- Minimizing registers => simple algorithms
- Minimizing shared memory usage => resourceful memory management
- Maximizing warps per block => give the device enough work.
- Good memory access: coalesced global reads and writes, reduced bank conflicts on shared memory.
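A matrix transpose is a common way to illustrate the last point, since a naive version makes either its reads or its writes strided. The sketch below is one standard approach, not necessarily what Elemental used: it stages a tile in shared memory so that both global reads and global writes run along rows, and pads the tile by one column to spread column accesses across banks. The names `transposeTiled` and `runTranspose` and the tile size of 16 are assumptions; the exact bank layout depends on the hardware.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16

// in is height x width (row-major); out is width x height.
__global__ void transposeTiled(const float* in, float* out, int width, int height)
{
    // One extra column of padding shifts each row to a different bank,
    // which reduces conflicts when threads read down a column.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height) {
        // Consecutive threadIdx.x read consecutive addresses (coalesced).
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    }
    __syncthreads();

    // Swap the block coordinates so writes also run along rows of out.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width) {
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }
}

// Illustrative host wrapper: returns true if out is the transpose of in.
bool runTranspose(int width, int height)
{
    int n = width * height;
    std::vector<float> in(n), out(n, -1.0f);
    for (int i = 0; i < n; ++i) in[i] = float(i);

    size_t bytes = n * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(dIn, dOut, width, height);

    cudaError_t err = cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int r = 0; r < height; ++r) {
        for (int c = 0; c < width; ++c) {
            if (out[c * height + r] != in[r * width + c]) return false;
        }
    }
    return true;
}

int main()
{
    printf("transpose: %s\n", runTranspose(50, 30) ? "ok" : "failed");
    return 0;
}
```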
CUDA Introduction
Matrix Multiply
- Each thread block is responsible for computing one square sub-matrix Csub of C
- Each thread within the block is responsible for computing one element of Csub
CUDA Introduction
- GPU Side (part 2)
- Load shared memory with data
- Do matrix multiply in parallel
- Write result to global memory
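The three GPU-side steps can be sketched as a tiled kernel in which each block computes one TILE x TILE sub-matrix of C and each thread one element. The original slides' code did not survive extraction, so this is a reconstruction of the general technique rather than the presenter's kernel. The names `matMulTiled` and `runMatMul`, the tile size of 16, and the row-major layout are assumptions. The test case reads the slides' A[48,80] and B[128,48] as [width, height], giving M = 80, K = 48, N = 128, which is also an assumption.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16

// C (M x N) = A (M x K) * B (K x N), all row-major.
__global__ void matMulTiled(const float* A, const float* B, float* C,
                            int M, int K, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        // Load shared memory with one tile of A and one tile of B.
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        // Multiply the two tiles in parallel, one output element per thread.
        for (int k = 0; k < TILE; ++k) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();
    }

    // Write the result to global memory.
    if (row < M && col < N) {
        C[row * N + col] = sum;
    }
}

// Illustrative host wrapper: compares the GPU result with a CPU reference.
// Inputs are small integers so that float sums are exact.
bool runMatMul(int M, int K, int N)
{
    std::vector<float> A(M * K), B(K * N), C(M * N, 0.0f);
    for (int i = 0; i < M * K; ++i) A[i] = float(i % 7 - 3);
    for (int i = 0; i < K * N; ++i) B[i] = float(i % 5 - 2);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    matMulTiled<<<grid, block>>>(dA, dB, dC, M, K, N);

    cudaError_t err = cudaMemcpy(C.data(), dC, C.size() * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    if (err != cudaSuccess) return false;

    for (int r = 0; r < M; ++r) {
        for (int c = 0; c < N; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k) ref += A[r * K + k] * B[k * N + c];
            if (C[r * N + c] != ref) return false;
        }
    }
    return true;
}

int main()
{
    printf("matrix multiply: %s\n", runMatMul(80, 48, 128) ? "ok" : "failed");
    return 0;
}
```

The second `__syncthreads()` keeps a fast thread from overwriting a tile that slower threads in the block are still reading.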
CUDA Introduction
- Performance for A[48,80] * B[128,48] = C[128,80]
- GPU 10ms (5.4x faster)
- CPU 54ms
- 491k multiplies and 491k adds.
CUDA Introduction
- Performance for A[48,8000] * B[12800,48] = C[12800,8000]
- GPU 663ms (14.3x faster)
- CPU 9,483ms
- ~5 billion multiplies and adds.
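The operation counts and speedups on these two slides can be checked with a few lines of host code. The sketch assumes a plain multiply that performs one multiply and one add per (row, inner, column) triple, and it takes the dimensions and timings directly from the slides; the function names are hypothetical.

```cuda
#include <cstdio>

// Multiplies (and adds) needed for C = A * B with A of size M x K and
// B of size K x N: one of each per (row, inner index, column) triple.
long long multiplyCount(long long M, long long K, long long N)
{
    return M * K * N;
}

// Speedup of the GPU over the CPU as a ratio of elapsed times.
double speedup(double cpuMs, double gpuMs)
{
    return cpuMs / gpuMs;
}

int main()
{
    // Small case: 80 x 48 times 48 x 128, CPU 54 ms, GPU 10 ms.
    printf("small: %lld multiplies, %.1fx\n",
           multiplyCount(80, 48, 128), speedup(54.0, 10.0));

    // Large case: 8000 x 48 times 48 x 12800, CPU 9483 ms, GPU 663 ms.
    printf("large: %lld multiplies, %.1fx\n",
           multiplyCount(8000, 48, 12800), speedup(9483.0, 663.0));
    return 0;
}
```

The small case comes to 491,520 multiplies and the large case to 4,915,200,000, in line with the rounded figures on the slides. The speedup grows with problem size, which is consistent with fixed launch and transfer costs being spread over more work, though the slides do not break the timings down.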
Compute competition
- CUDA runs only on NVIDIA GPUs, but Mac, Linux and Windows are supported
- OpenCL (Apple) and DX11 (Microsoft) for all GPU and CPU platforms.
More information
- CUDA: www.nvidia.com/CUDA
- OpenCL: www.khronos.org/opencl/
- DX11 Compute: DirectX March 2009 release
- www.elementaltechnologies.com!