Friday, September 12, 2014

GPUs, CUDA, OpenCL and how to leverage them for Post

So, there's a lot of confusion, misinformation and half-information on the internets about GPU acceleration in your favorite NLE, coloring or conforming app. Unfortunately, the companies that make the software we all know and love are not particularly helpful in explaining what makes for a good graphics card for their systems, so I decided to break down what I know on the subject.

Quick Glossary

GPU is short for "Graphics Processing Unit" and is used interchangeably with Graphics Card and Video Card.

GPGPU, short for general-purpose computing on a GPU, refers to having a program do number crunching/processing using a GPU's stream processors.

Compute generally refers to the same thing as GPGPU.

Stream processors, Shaders, and CUDA Cores are all essentially the same thing: the little parallel processors that crunch numbers in a GPU.

Texture Units are for rendering textures onto 3D objects. They're important to gaming performance, but irrelevant to NLE performance.

Texture Fill Rate is, again, a 3D gaming spec that isn't relevant to us.

Graphics Clock is the speed (in MHz) at which the GPU's processor works. It's only really relevant when comparing two cards with the same chip.

Memory Clock is the speed at which the memory runs; this has a direct impact on Memory Bandwidth.

Memory Bus is the width of the pipe that data has to be shoved through to get in and out of the GPU's memory chips. This also impacts Memory Bandwidth.

Memory Bandwidth, usually expressed in GB/s, is how much data can move through the memory chips in a second.

ROP "render out processor", it's the final bit of hardware that is responsible for getting the image off of the card and onto the screen. Nvidia likes to use as few as possible to keep cost down and stop lower-end cards from competing with more expensive products. AMD is more generous with the number of ROPs they put on a chip.

OpenGL is an older API for rendering 2D and 3D graphics on a video card. It's generally considered out of date and inefficient compared to newer APIs like Mantle and DirectX 11. It is, however, cross-platform, and a variant of it (OpenGL ES) is even used to draw 3D graphics in iOS apps.

OpenCL refers to an API that programmers can use to do processing on a GPU instead of relying on a computer's CPU (central processing unit). OpenCL is cross-platform, meaning it works on most modern GPUs, including those from AMD/ATI and Nvidia, and in Windows, Mac and Linux environments.

CUDA is like OpenCL, but proprietary to Nvidia, and since it's been around longer, it's more mature and efficient. It also supports functions that programmers can leverage that OpenCL doesn't yet.
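To make the "compute" idea concrete, here's roughly what GPGPU code looks like. This is a minimal, hypothetical CUDA sketch (not taken from any NLE): a tiny function, called a kernel, is run simultaneously by thousands of stream processors, each one handling one pixel. An OpenCL version would be structured the same way, just with different API calls.

    #include <cstdio>
    #include <cuda_runtime.h>

    // The "kernel": the tiny function every CUDA core / stream processor runs
    // in parallel. Each thread brightens one pixel value -- the embarrassingly
    // parallel sort of work (color transforms, scaling, debayering) that NLEs
    // offload to the GPU.
    __global__ void brighten(float *pixels, int count, float gain)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's pixel
        if (i < count)
            pixels[i] *= gain;
    }

    int main()
    {
        const int count = 1920 * 1080;                  // one HD frame of values
        float *d_pixels;
        cudaMalloc(&d_pixels, count * sizeof(float));   // allocate on the card
        cudaMemset(d_pixels, 0, count * sizeof(float));

        int threads = 256;
        int blocks = (count + threads - 1) / threads;   // enough blocks to cover every pixel
        brighten<<<blocks, threads>>>(d_pixels, count, 1.2f);
        cudaDeviceSynchronize();                        // wait for the GPU to finish

        cudaFree(d_pixels);
        return 0;
    }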

General Notes

AMD/ATI cards generally have superior performance over Nvidia when using OpenCL, but AMD cards do not support CUDA, which many applications are optimized for. Applications that support both OpenCL and CUDA are usually better optimized for CUDA, so you'll see better performance with a CUDA-enabled card from Nvidia.

Avid Media Composer

There is currently no version of MC that fully leverages your video card. On a Windows machine, IF you have DX hardware AND an Nvidia Quadro card, your Quadro will help accelerate a small number of realtime effects, but honestly, it's not worth the extra investment required. When Avid states that only Quadro cards are "qualified" to work with MC, they mean that those are the only cards they've tested. 

For Windows, almost any modern Nvidia graphics card will work fine with Media Composer. If you're running multiple displays off one graphics card (for instance, one for bins, one for your timeline and composer windows and one for fullscreen playback), then I would advise a card with at least 2 GB of RAM. I don't recommend using AMD/ATI cards with MC in Windows, though: Avid doesn't recommend it, and while many users have had success with them, too many others have had problems.

For Mac, I have the same recommendation, but feel free to choose from AMD/ATI as well, as Avid relies on OS X's built-in graphics APIs, which work equally well with both vendors' cards.

On both platforms, Media Composer uses OpenGL to render certain timeline effects in realtime.

You'll see more of a performance boost from faster drives, a faster CPU or more RAM than from stuffing something like an Nvidia Titan into a Media Composer machine. If Full Screen Playback drops frames but MC plays back smoothly with it turned off, you might want to upgrade your card, but there's no need to go overboard.

Adobe Premiere

Premiere does support GPU acceleration for many of its rendering/exporting functions, though not for general playback or debayering of RAW footage. Although Adobe now supports OpenCL, it is more optimized for CUDA, so all things being equal, you will generally see better performance from a powerful Nvidia card than from a powerful AMD card.

Final Cut Pro X

FCPX supports OpenCL only (no CUDA), and since most Nvidia cards are outclassed by AMD cards when it comes to OpenCL, you will see much, much better performance with an AMD card than with the equivalent Nvidia card.

Note: Prior to FCP X, no version of Final Cut Pro had any sort of GPU acceleration, so if you're still using FCP7, it's pointless to upgrade your graphics card.

Lightworks

Lightworks leverages your graphics card in much the same manner as Avid, which is to say, not so much. Lightworks uses neither OpenCL nor CUDA, so any mid-range "gaming" graphics card will work fine. Lightworks does leverage your video card to render realtime effects, but in a similar manner to Avid, so putting a really powerful video card in your machine isn't likely to enhance things much. Although their site only lists workstation-class cards as tested, they've admitted as much in their support forums.

Davinci Resolve

Resolve now supports both OpenCL and CUDA for GPU acceleration, which means you're free to choose from both AMD and Nvidia cards. But Resolve is still more optimized for CUDA than OpenCL, so with equal specs, the Nvidia card will usually give better performance.

Resolve can also leverage multiple GPU cards at once, so long as they are from the same manufacturer; mixing AMD and Nvidia is not supported or recommended. Sometimes you can get better performance from Resolve by having a GUI-only card (meaning your monitors are hooked up to it) and a compute-only card (meaning your monitors are not hooked up to it) together. This allows the compute card to concentrate on crunching numbers without the extra burden of drawing to a display.
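For the curious, here's a minimal sketch of the mechanism (my own illustration, not Resolve's actual code): a CUDA application can enumerate every installed card and then pick which one does the number crunching, which is exactly what makes the GUI-card/compute-card split possible.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);         // how many CUDA-capable GPUs exist

        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("GPU %d: %s, %d multiprocessors, %.0f MB RAM\n",
                   i, prop.name, prop.multiProcessorCount,
                   prop.totalGlobalMem / (1024.0 * 1024.0));
        }

        // An app would then pick its compute card, e.g.:
        // cudaSetDevice(1);  // leave device 0 free to drive the displays
        return 0;
    }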

REDcine-X Pro

This year, RED finally released a version of RCP that is GPU-accelerated, so you no longer need to dump a ton of money into a Red Rocket card to get better performance. RCP supports both OpenCL and CUDA, so you can choose from either manufacturer, but they've also stated that CUDA cards will get a performance boost over OpenCL. RCP also supports multiple cards for GPU acceleration, but it's better not to mix and match AMD and Nvidia.

Workstation Graphics Cards vs "Gaming" Graphics Cards

A lot of people feel that they need a "pro" graphics card for their NLE system because they're doing pro work, but this is not the case. Everyone will happily sell you the latest, greatest, most expensive workstation graphics card for your NLE, but to be honest, they are generally not well suited to the kind of work we do.

The Workstation Card

This would be the "Quadro" series from Nvidia and the "FirePro" series from AMD. There are advantages to using these cards, but most of them are not leveraged by NLE work. The major differences between a workstation card and a gaming card are that workstation cards are optimized for double-precision floating point calculations, have ECC RAM, and are built with higher-quality components to higher standards. Unless you're doing a lot of 3D drafting or modeling work, you're not likely to see the benefit of these expensive upgrades. You can get similar floating point performance for a lot less money with a consumer card, and ECC RAM is only really needed in a CAD or 3D modeling environment, so for editors, it's really not worth the investment.

The Gaming Card

Gaming cards are the "Geforce" series from Nvidia and the "Radeon" series from AMD. They are optimized for pushing pixels and rendering textures in applications that don't need high floating point precision, like games. These are mostly fine for NLE work, as the kind of processing they are optimized for works well with most of our applications.

What to look for in a graphics card

For NLEs that are not GPU accelerated, most cards with 1 GB of RAM or more will be perfectly fine, unless, as I mentioned above, you're going to run 3 screens off a single card and have one screen doing full screen playback, in which case I would recommend 2 GB.

A rule of thumb would be to avoid cards that are under $200 brand new. Most of those cards have really slow memory, which can cause your NLE to drop frames on playback, especially with longer sequences.

For GPU accelerated apps, look for a combination of 400 or more stream processors/CUDA cores (same thing, different name) and high memory bandwidth. At this point in time, it is better to lean towards CUDA-enabled cards (unless you're using FCPX), as they will usually perform better in any app that supports both CUDA and OpenCL. You can also see a benefit from having 2 or more GB of RAM on the graphics card, especially if you're working with greater-than-HD resolutions.

For Final Cut Pro X, I would avoid Nvidia cards and instead get an AMD card, since AMD's OpenCL performance is vastly superior, unless you get a really, really high-end Nvidia card (think $800+).

Memory Bandwidth is probably the most important spec to look for in a graphics card. NLEs push a lot of data through the memory of a graphics card, and if that memory is slow, the whole system has to slow down and wait for it. Unfortunately, some card manufacturers don't publish bandwidth specs on their site.
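If the manufacturer won't tell you, you can measure it yourself. Here's a rough CUDA sketch of a device-to-device copy benchmark (my own illustration; a copy won't quite hit the theoretical peak, but it gets you in the ballpark):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        const size_t bytes = 256 * 1024 * 1024;   // 256 MB test buffer
        const int reps = 10;
        float *src, *dst;
        cudaMalloc(&src, bytes);
        cudaMalloc(&dst, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        for (int i = 0; i < reps; ++i)
            cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0;
        cudaEventElapsedTime(&ms, start, stop);

        // Each copy reads the buffer once and writes it once, hence the 2x.
        double gbps = 2.0 * reps * bytes / (ms / 1000.0) / 1e9;
        printf("Approximate memory bandwidth: %.1f GB/s\n", gbps);

        cudaFree(src);
        cudaFree(dst);
        return 0;
    }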

Avoid:
  • Cards that list DDR3 as the memory type; this is old, slow memory.
  • Cards with a memory bus of 128-bit or less.
Look for:
  • GDDR5 memory, preferably 2 GB or more.
  • A memory bus of 192-bit or greater.
  • CUDA cores/stream processors/shaders: more is better, but memory bandwidth is more important.
  • ROPs: more is better. A fast card will get bogged down if there aren't enough ROPs to keep up. I would consider 24 the bare minimum. The ROP count is also a rough guide to how much throughput the manufacturer thinks the card has, since a more powerful card needs more ROPs to keep up with the rest of the GPU.
  • Memory Clock (in MHz): higher is better, but memory bus width is more important. (Most of these specs can be read straight off the card; see the sketch below.)
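Here's a hypothetical sketch that queries those specs with the CUDA runtime and derives the theoretical bandwidth. The formula is effective memory clock × bus width ÷ 8 (bits to bytes) × 2 (double data rate); note the runtime reports the DDR data clock, which is double the base memory clock spec sheets usually list. Run it on a GTX 580 and you should land right around the 192.4 GB/s we'll see below.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // card 0

        // memoryClockRate is reported in kHz, memoryBusWidth in bits.
        double gbps = 2.0 * prop.memoryClockRate * 1000.0
                    * (prop.memoryBusWidth / 8.0) / 1e9;

        printf("%s\n", prop.name);
        printf("Memory clock:   %d MHz\n", prop.memoryClockRate / 1000);
        printf("Memory bus:     %d-bit\n", prop.memoryBusWidth);
        printf("Peak bandwidth: ~%.1f GB/s\n", gbps);
        return 0;
    }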
One of the more popular GPUs for Davinci Resolve is the rather dated Nvidia GTX 580. It was top shelf when it was released, but by today's standards it's pretty sluggish for gamers, so what gives?

A quick perusal of its specs reveals a lot:

Shaders/CUDA Cores: a mere 512
ROPs: 48, hmm interesting
Graphics Clock: 772MHz… kind of lethargic
Memory Clock: 1002MHz, also a bit slow
Memory Bus: 384 bit… say what?
Memory Bandwidth: 192.4 GB/s… wow!
RAM: 1.5 GB GDDR5

So what we have is a card that is no longer able to truly compete with current graphics cards for 3D gaming, but absolutely kicks butt at CUDA acceleration.

OK, let's look at the GTX 680, another popular GPU.

Shaders/CUDA Cores: 1536 Wow!
ROPs: 32 Not bad..
Graphics Clock: 1006 MHz Now that's more like it!
Memory Clock: 1502 MHz Yowza!
Memory Bus: 256 bit A step back, but not too bad, right?
Memory Bandwidth: 192.3 GB/s Same, same!
RAM: 2 GB GDDR5

So you're looking at this and probably thinking "wow, the GTX 680 must poop all over the 580!" Well, sadly, no, it doesn't. There were a couple of changes in architecture between the older GTX 580 "Fermi" chips and the newer "Kepler" chip in the 680, one of the most impactful to us being the deletion of the "Shader Clock". In the GTX 580, the shaders (CUDA cores) ran at twice the speed of the graphics clock, so even though the 580 has a third of the 680's CUDA cores, each of them ran at double the listed Graphics Clock. The other change was a drastic reduction in the number of 64-bit floating point (FP64) units, and between these two cards, this probably has the largest impact on the applications we use. The GTX 580 has an FP64 rating of 1/8 FP32, whereas the GTX 680 only has 1/24 FP32. What this means is that the Double Precision (FP64) performance of the GTX 580 is 1/8 of its Single Precision performance, while for the GTX 680 it's 1/24 of its Single Precision floating point performance. Pretty huge difference.
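You can put rough numbers on those ratios with the usual back-of-the-envelope formula of cores × clock × 2 operations per cycle (a multiply-add counts as two):

GTX 580: 512 cores × 1544 MHz (2× the 772 MHz graphics clock) × 2 ≈ 1581 GFLOPS FP32; divide by 8 and you get ≈ 198 GFLOPS FP64.
GTX 680: 1536 cores × 1006 MHz × 2 ≈ 3090 GFLOPS FP32; divide by 24 and you get ≈ 129 GFLOPS FP64.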

Now, a ratio to FP32 isn't particularly useful on its own, so here are a few cards' FP64 numbers (in GFLOPS) for comparison.

AMD Radeon 7990: 1946
Nvidia GTX Titan: 1523
AMD Radeon R9 280x: 1024
AMD Radeon R9 280: 836
Nvidia GTX 780ti: 223
Nvidia GTX 580: 197
Nvidia GTX 570: 175
AMD Radeon R9 270x: 168
Nvidia GTX 680: 129
AMD Radeon R7 260x: 123
Nvidia GTX 670: 102
Nvidia GTX 660ti: 102
Nvidia GTX 760: 94

So even though the GTX 580 has half the FP32 performance of the GTX 680 (1581 vs. 3090 GFLOPS), it still comes out on top when it comes to FP64 performance, and this is why the GTX 580 (and even the 570) outperforms the much newer GTX 680 in applications like Premiere, Resolve and REDCine-X Pro. For things like gaming, where FP32 performance is more important, the GTX 680 is much better than the GTX 580. The GTX Titan basically uses brute force to get good FP64 numbers, but you certainly pay for it with your wallet. If you do a quick price comparison online, you'll find that when it comes to FP64 and OpenCL performance, you get a lot more bang for your buck with AMD than Nvidia.
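If you want to see this on your own card, here's a hypothetical micro-benchmark sketch that times the exact same multiply-add loop in single and then double precision. On a card with a weak FP64 ratio (like the 680's 1/24), the double run will be dramatically slower.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Same arithmetic, two precisions. Writing the result out keeps the
    // compiler from optimizing the loop away.
    template <typename T>
    __global__ void fma_loop(T *out, int iters)
    {
        T a = (T)1.000001, c = (T)0.5;
        for (int i = 0; i < iters; ++i)
            c = a * c + (T)0.000001;  // one multiply-add per iteration
        out[blockIdx.x * blockDim.x + threadIdx.x] = c;
    }

    template <typename T>
    float time_run(T *buf, int n, int iters)
    {
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        cudaEventRecord(t0);
        fma_loop<T><<<n / 256, 256>>>(buf, iters);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms = 0;
        cudaEventElapsedTime(&ms, t0, t1);
        return ms;
    }

    int main()
    {
        const int n = 1 << 20, iters = 10000;
        float *f;
        double *d;
        cudaMalloc(&f, n * sizeof(float));
        cudaMalloc(&d, n * sizeof(double));

        float msF = time_run(f, n, iters);
        float msD = time_run(d, n, iters);
        printf("float: %.1f ms  double: %.1f ms  double is %.1fx slower\n",
               msF, msD, msD / msF);

        cudaFree(f);
        cudaFree(d);
        return 0;
    }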

All of this makes good business sense if you're Nvidia: it makes your chips less complicated, cheaper to manufacture and more energy efficient, all without sacrificing performance for the gamers who buy these cards. It also further separates your high-end workstation-class cards from your less expensive, consumer-oriented cards.

AMD seems to have more of a "go big or go home" attitude and has gone all-in on FP64 performance, which is why their cards are hugely popular with the Bitcoin-mining crowd.

So looking at all these numbers, you're probably questioning my assertion that a CUDA-enabled Nvidia card will be faster than an OpenCL AMD card in the same application. Unfortunately, as it stands right now, OpenCL is still not a very well optimized API compared to CUDA, and it's missing a lot of instructions that would let programmers really leverage all that FP64 power in video apps. CUDA has been around a lot longer, is much better optimized, and is what they call "low level", which means that programmers can write code that talks more directly to the hardware than they can with OpenCL. If you're looking at an OpenCL-only app, then AMD's better FP64 performance will mop the floor with any Nvidia card below the Titan. Final Cut Pro X, for instance, is OpenCL-centric, and as such, mid-level AMD cards will outperform high-end Nvidia cards with it.
