Choosing GPU for video editing in 2024

January 4, 2024

This is an update to the article published in September 2023.

In our previous article from 2018, we explored the impact of GPUs on noise reduction and video editing in general. Since then, the technology landscape has evolved significantly, and more powerful GPUs have emerged as the driving force behind lightning-fast video processing and enhanced editing capabilities. In this update, we'll delve into the latest breakthroughs in GPU technology that are revolutionizing the world of video editing. We will also update the list of key parameters to check when choosing the right workhorse for your video editing projects.

Let’s start with refreshing your (and our) memory on the things that matter in post-production.

1. Integrated and discrete GPUs

All GPUs can be divided into two large groups: discrete GPUs and integrated GPUs.

A discrete GPU is a physical card separate from the CPU. It has its own dedicated memory that is not shared with the CPU. As discrete GPUs are separate physical devices, it is possible to replace a discrete GPU with another one or add multiple discrete GPUs to the same computer.

An integrated GPU, on the other hand, is embedded alongside the CPU (often on the same silicon die) and has no memory of its own, sharing the main system RAM with the CPU. As integrated GPUs are fused with the CPUs, it is impossible to upgrade the integrated GPU separately from the main processor.

2. GPU processing power

The processing power of a GPU, typically measured in GFLOPS (giga floating-point operations per second) or TFLOPS (which is a thousand times more than GFLOPS), is determined by the number of operations with single-precision floating point numbers that can be executed by all of the GPU's cores within one second. Detailed information on the GPU processing power for AMD, NVIDIA and Apple GPUs can be found on the websites of the manufacturers. For easier comparison, you may want to check out the following wiki pages, but make sure you do a cross-check as well: AMD GPUs and NVIDIA GPUs.
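As a rough illustration of where such figures come from, peak single-precision throughput is commonly estimated as the number of GPU cores multiplied by the clock frequency and by two, since one fused multiply-add counts as two floating-point operations. The sketch below assumes that rule of thumb; the function name and the example GPU are made up for illustration and are not taken from any manufacturer's specification.

```python
def peak_tflops(cores: int, clock_ghz: float, flops_per_cycle: int = 2) -> float:
    """Estimate theoretical peak single-precision throughput in TFLOPS.

    flops_per_cycle=2 assumes one fused multiply-add per core per clock,
    which is a common rule of thumb rather than a guaranteed figure.
    """
    gflops = cores * clock_ghz * flops_per_cycle
    return gflops / 1000.0  # 1 TFLOPS = 1000 GFLOPS


# Hypothetical GPU: 10,000 cores running at 2.0 GHz
print(f"{peak_tflops(10_000, 2.0):.1f} TFLOPS")
```

Published numbers may differ from such an estimate depending on which clock (base or boost) the manufacturer uses, which is one more reason to cross-check the sources.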

Generally, a GPU with higher TFLOPS performs better compared to another GPU that has a lower TFLOPS number. Discrete GPUs usually (but not always) have much higher processing power compared to that of integrated GPUs, which is one of the reasons why Neat Video supports more discrete GPU models than integrated ones.

While the TFLOPS number is a useful indicator of GPU speed, it's important to avoid choosing a GPU solely by this parameter, as there are other equally or even more important factors affecting the GPU performance in Neat Video, so please read on.

3. GPU memory (VRAM)

The GPU relies on data stored in its random-access memory (VRAM, measured in GB) to perform calculations.

Discrete GPUs are equipped with their own dedicated VRAM chips. While VRAM of a discrete GPU is typically very fast, its amount is limited and can only be expanded by upgrading the whole GPU.

Integrated GPUs, on the other hand, use a part of system RAM as VRAM instead. It is often possible to increase the amount of system RAM by adding more memory modules to the motherboard. It is not always the case though: for example, Apple Silicon systems-on-a-chip (SoCs) combine CPU, GPU and memory, making it impossible to increase RAM size.

Regardless of whether a discrete or an integrated GPU is used, Neat Video may need up to 4 GB of VRAM to function optimally with 4K videos. It can work with less memory, but the speed may be lower. Keep in mind that most host applications also use GPU memory, so make sure there is enough VRAM left for them too to ensure stable system performance.

4. GPU memory bandwidth

While more VRAM can result in better performance, it’s not the only thing that you should take into account. Another thing to look at is the memory bandwidth (VRAM bandwidth, measured in GB/s), which represents the speed at which data can be read from and written to VRAM. GPUs with higher memory bandwidth can move data to and from GPU cores faster, enabling Neat Video to perform noise reduction more efficiently. So, while having more VRAM is beneficial, its bandwidth is also critical for performance. Furthermore, the larger the frame size you work with, the more important the bandwidth becomes, as the cache (more about that later) won’t be able to supply all the data required for processing. To be safe, follow the “more is better” rule.

The difference between discrete and integrated GPUs matters here as well. The system RAM used by integrated GPUs as working memory is typically much slower (usually 16–64 GB/s) than the VRAM of discrete GPUs (up to 1 TB/s or more). This is another reason why Neat Video can only use a limited number of integrated GPU models.

A notable exception here is the higher-tier integrated Apple M1 and M2 chips, which have fast (but unfortunately non-expandable) system memory. In particular, the M2 Ultra's memory has a bandwidth of 800 GB/s. In addition, unlike most other integrated GPUs, Apple Silicon chips are equipped with a substantial number of fast GPU cores, which makes them well suited to accelerating Neat Video processing.

5. CPU connection speed

While VRAM bandwidth is very important, the data needs to make its way into VRAM first before a GPU can access it.

With integrated GPUs, this is simply a matter of copying data from one region of system memory into another. Sometimes the data does not even need to be copied and can be accessed by the GPU in-place.

A discrete GPU, on the other hand, has no direct access to system memory and is connected to a computer through the internal PCI Express (PCIe) bus, an external eGPU connection like Thunderbolt, or other less common protocols (e.g. NVLink). Such a connection is usually much slower than the VRAM bandwidth, and not all links are created equal.

It's very important to connect discrete GPUs to a computer in the fastest way possible. If a GPU is connected via PCIe, it should operate at x16 speed to maximize performance, rather than dropping to x8 or x4 because of other cards installed in the system. To achieve that, make sure that your CPU, motherboard, and GPU all support x16 speed and are connected the right way (refer to the manual for your motherboard).

Additionally, both PCIe and Thunderbolt connections should be of the latest generations to enable high-speed transfer of large amounts of video data. For instance, each lane of PCI Express 4 is twice as fast as a PCI Express 3 lane. Failure to follow these recommendations may result in the connection between the GPU and the main system becoming a significant bottleneck, thereby slowing down the overall render speed.

For example, a PCIe 4.0 x16 connection has a theoretical bandwidth of about 32 GB/s, a PCIe 2.0 x8 connection has 4 GB/s, and a Thunderbolt 4 connection has about 5 GB/s (40 Gbit/s).
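The figures above can be reproduced with a small calculation. This is a minimal sketch that assumes the rounded rule of thumb of 0.25 GB/s per lane for PCIe 1.0, doubling with every generation, and ignores encoding and protocol overhead; the function names are ours.

```python
def pcie_bandwidth_gbs(generation: int, lanes: int) -> float:
    """Approximate one-direction PCIe bandwidth in GB/s.

    Assumes 0.25 GB/s per lane for PCIe 1.0, doubling with each
    generation; encoding and protocol overhead are ignored.
    """
    per_lane = 0.25 * 2 ** (generation - 1)
    return per_lane * lanes


def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a link rate from Gbit/s to GB/s (8 bits per byte)."""
    return gbit_per_s / 8


links = {
    "PCIe 4.0 x16": pcie_bandwidth_gbs(4, 16),
    "PCIe 4.0 x8": pcie_bandwidth_gbs(4, 8),
    "PCIe 2.0 x8": pcie_bandwidth_gbs(2, 8),
    "Thunderbolt 4": gbit_to_gbyte(40),
}

for name, gbs in links.items():
    print(f"{name}: {gbs:g} GB/s")
```

The PCIe 4.0 x8 line shows why a card that falls back from x16 to x8 loses half of its link bandwidth.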

6. L2 and L3 cache. The new power

L1, L2 and L3 cache

In order to provide faster data access to GPU cores, memory requests are served by cache. Cache is typically organized as a hierarchy of several types of fast memory called cache levels. The lower the number of the cache level, the faster the corresponding cache memory can provide data to GPU cores. However, as making fast memory chips of large capacity is difficult, the fastest cache levels are also the smallest in size. Therefore the fastest L1 cache, which is placed very close to processing cores, has a very limited capacity (typically 64–128 KB per cluster of GPU cores).

L2 cache is the second level of cache memory in a GPU. While located further away from processing cores than L1, it is still much closer than the main memory (VRAM). L2 cache is typically around 5 times faster than VRAM and its primary function is to store frequently accessed data that are not readily available in the L1 cache.

Since the L2 cache is closer to the processing cores than the main memory, it reduces the memory latency and access time, enabling faster data retrieval and better performance. It helps in reducing the number of memory requests to VRAM, thereby reducing the overall memory bandwidth usage and improving the efficiency of data processing.

L3 cache, on the other hand, is the third level of cache memory in a GPU. It is located even farther from the processing cores than the L2 cache. L3 cache is generally larger in size than both L1 and L2 caches, but is slower in terms of access time. However, it’s still much faster than the main VRAM. In fact, according to AMD’s papers, their L3 cache works 1.7x to 5.3x faster than VRAM. Not all GPUs have L3 cache: some models rely entirely on L1 and L2.

Both L2 and L3 caches play crucial roles in speeding up data access and computation in modern GPUs.
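To get a feel for how much a cache can help, consider a deliberately simplified single-level model: a fraction of memory requests (the hit rate) is served by the cache, and the rest goes to VRAM. The sketch below assumes that model; the hit rates are made-up inputs, and only the 5x speedup comes from the L2 figure mentioned above. Real GPUs have several cache levels and overlapping requests, so treat the result as an illustration only.

```python
def relative_access_time(hit_rate: float, cache_speedup: float) -> float:
    """Average memory access time relative to VRAM-only access (1.0).

    hit_rate      -- fraction of requests served by the cache (0..1)
    cache_speedup -- how many times faster the cache is than VRAM
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate / cache_speedup + (1.0 - hit_rate)


# Hypothetical hit rates; the 5x speedup is the approximate L2 figure.
for hit_rate in (0.0, 0.5, 0.9):
    t = relative_access_time(hit_rate, cache_speedup=5.0)
    print(f"hit rate {hit_rate:.0%}: {t:.2f} of the VRAM-only access time")
```

In this model, the benefit depends mostly on the hit rate, which in turn depends on how much of the working data fits into the cache.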

Why has this become a factor in choosing a GPU?

For a considerable period, both NVIDIA and AMD have incorporated L2 cache in their GPUs, although in relatively modest quantities. Over time, the size of the cache has seen gradual but moderate growth from one generation to another. This limited variation in cache size had little impact on the performance of Neat Video across different GPUs, until the arrival of AMD’s Radeon RX 6xxx series. The first GPU that caught our attention was the Radeon RX 6900 XT. Unlike its predecessors, the RX 6900 XT boasted an astonishing 128 MB of Infinity Cache (aka L3 cache). To put things into perspective, an earlier-generation AMD flagship, the Radeon RX 5700 XT, had only 4 MB of L2 cache and no L3 cache.

This trend has continued in the most recent GPU lineups from both AMD and NVIDIA. In particular, AMD's Radeon RX 7900 XTX has 96 MB of L3, while NVIDIA's GeForce RTX 4090 comes with 72 MB of L2 cache.

A large cache can hold a substantial amount of working data, resulting in faster computations in the Full HD and 4K tests. However, with an 8K clip, the advantage of GPUs with a large cache decreases, as neither a 72 MB nor a 96 MB cache is sufficient any longer.
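A back-of-the-envelope comparison of raw frame sizes with the cache sizes mentioned above hints at why 8K is different. The sketch below assumes 4 bytes per pixel, which is our assumption and not a documented property of Neat Video; the actual working set is likely larger, since noise reduction uses several frames and intermediate buffers, and it would be four times larger with 32-bit floating-point RGBA data (16 bytes per pixel).

```python
MB = 1_000_000  # decimal megabytes; the MB vs MiB distinction is ignored here


def frame_size_mb(width: int, height: int, bytes_per_pixel: int) -> float:
    """Size of one uncompressed frame in megabytes."""
    return width * height * bytes_per_pixel / MB


FRAMES = {"Full HD": (1920, 1080), "4K": (3840, 2160), "8K": (7680, 4320)}
CACHES_MB = {"GeForce RTX 4090 (L2)": 72, "Radeon RX 7900 XTX (L3)": 96}

BYTES_PER_PIXEL = 4  # assumption: 32 bits per pixel

for name, (width, height) in FRAMES.items():
    size = frame_size_mb(width, height, BYTES_PER_PIXEL)
    fits = [gpu for gpu, cache in CACHES_MB.items() if size <= cache]
    verdict = ", ".join(fits) if fits else "neither cache"
    print(f"{name}: {size:.1f} MB per frame, fits in: {verdict}")
```

Under this assumption, a single Full HD or 4K frame fits into either cache, while a single 8K frame exceeds both.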

Examining the new M3 laptops reveals that Apple might not be placing sufficient importance on L3 cache. While there's no official data, the estimated size of L3 cache in M3 and M3 Max remains the same as in M2 and M2 Max. Notably, M3 Pro's cache has decreased to 12 MB (compared to M2 Pro's 24 MB), contributing to a 20% decline in performance.

How do the latest GPU cards perform?

Most of these modern video cards boast remarkable processing power, VRAM, and memory bandwidth. Take a closer look at these products to discover one that strikes the right balance between performance and price, aligning with your specific needs and budget.

Please note that the Neat Video speeds provided are measured with the help of NeatBench and show the processing performance of the plug-in alone, without the overhead of a video editing application, which may be quite significant.

GPU	Processing power (TFLOPS, single precision)	VRAM (GB)	Memory Bandwidth (GB/s)	L2 Cache (MB)	L3 Cache (MB)	Neat Video Full HD Speed (FPS)*	Neat Video 4K Speed (FPS)*	Price (may vary from store to store)
NVIDIA RTX 6000 Ada	91.1	48	960	96	n/a	119	31.3	$8000
NVIDIA GeForce RTX 4090	73.1	24	1008	72	n/a	113	31.4	$1600
AMD Radeon RX 7900 XTX	46.7	24	960	6	96	103	27.4	$960
AMD Radeon RX 7800 XT	37.3	16	624	4	64	66	n/a	$500
Apple Silicon M2 Ultra (76 GPU cores)	27.2	64/128/192	800	64	96	56	19.6	The whole computer's price starts at $4000
Apple Silicon M3 Max (40 GPU cores)	14.1	48/64/128	400	32	48	35.4	9.3	Laptop. Starting at $3500

* Tested on 1920x1080 and 3840x2160 32-bit frames with default filter settings.
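One simple way to weigh performance against price is to compute how many frames per second each card delivers per $1000. The sketch below uses the Full HD speeds and prices from the table above; the Apple entries are left out because their prices cover a whole computer, and the metric itself is just one possible way of comparing, not an official recommendation.

```python
# (name, Full HD FPS, price in USD) copied from the table above
cards = [
    ("NVIDIA RTX 6000 Ada", 119, 8000),
    ("NVIDIA GeForce RTX 4090", 113, 1600),
    ("AMD Radeon RX 7900 XTX", 103, 960),
    ("AMD Radeon RX 7800 XT", 66, 500),
]


def fps_per_1000_usd(fps: float, price_usd: float) -> float:
    """Full HD frames per second obtained per $1000 spent."""
    return fps / price_usd * 1000


# Best value first
ranked = sorted(cards, key=lambda c: fps_per_1000_usd(c[1], c[2]), reverse=True)

for name, fps, price in ranked:
    print(f"{name}: {fps_per_1000_usd(fps, price):.1f} FPS per $1000")
```

Since prices vary from store to store, it is worth re-running such a comparison with current local prices.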

Is an upgrade truly necessary?

Keep in mind that some older GPU cards can still provide commendable performance. Before deciding on an upgrade, carefully compare the capabilities of your current GPU hardware to determine if the investment is truly justified.

Additionally, consider opting for a more affordable GPU option that still performs well. This way, you can save some money without compromising on performance.

GPU	Processing power (TFLOPS, single precision)	VRAM (GB)	Memory Bandwidth (GB/s)	L2 Cache (MB)	L3 Cache (MB)	Neat Video Full HD Speed (FPS)*	Neat Video 4K Speed (FPS)*
Apple Silicon M1 Ultra	21	64/128	800	48	96	52	19.5
NVIDIA GeForce RTX 3090	29.3	24	935	6	n/a	67	21.1
Apple Silicon M2 Max	13.6	32/96	400	32	48	42.5	11.8
AMD Radeon RX 6900 XT	18.7	16	512	4	128	80	18

* Tested on 1920x1080 and 3840x2160 32-bit frames with default filter settings.

Run your own tests

To assist you in gauging Neat Video's performance on your hardware, we present two valuable tools: NeatBench and Neat Optimizer.

NeatBench is a convenient and free standalone application that can be easily downloaded from the official Neat Video website. Running it is quick and straightforward: you can even download and run it at the store before purchasing a new PC to assess its performance. By doing so, you can identify any necessary system adjustments before making your purchase.

As for Neat Optimizer, it can be accessed through the Tools menu in Preferences of Neat Video. Both of these tools generate a comprehensive report on Neat Video's performance on your system, allowing you to determine whether upgrading CPU, GPU, or both components would be beneficial.

As we've noted above, please keep in mind that both NeatBench and Neat Optimizer report the speed of the Neat Video filter working alone, ignoring the overhead of a video editing application (such as decoding input frames and encoding the output frame). Therefore the actual speed of rendering clips with Neat Video applied in an NLE will likely be lower.
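To estimate how much lower, one can use a simple model in which the per-frame overhead of the host application is added to the per-frame filter time. This is a sketch under the assumption that the two steps run one after another; real applications may overlap them, and the overhead value below is hypothetical.

```python
def effective_fps(filter_fps: float, overhead_ms_per_frame: float) -> float:
    """Render speed when filter time and host overhead add up per frame.

    filter_fps            -- speed reported by NeatBench or Neat Optimizer
    overhead_ms_per_frame -- decoding, encoding and other host work (assumed)
    """
    filter_ms = 1000.0 / filter_fps
    return 1000.0 / (filter_ms + overhead_ms_per_frame)


# Hypothetical example: the filter alone runs at 25 FPS (40 ms per frame),
# and the host application adds 10 ms per frame.
print(f"{effective_fps(25.0, 10.0):.1f} FPS")
```

In this example the overall speed drops from 25 to 20 FPS, even though the filter itself is no slower.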