Intel Quick Sync: Examining Haswell Performance
In the recent release of 4th generation (Haswell) Intel Core integrated processor graphics (IPG), Intel placed significant focus on changes made to the Quick Sync transcoding technology included with the HD graphics portion of the chip. As the review developed, it quickly became evident that this aspect of the Intel Core i7-4770K warranted specific coverage outside of the more general platform, system, and performance characteristics that are usually covered. The detailed why and how of Quick Sync, and specifically what has changed versus the previous generation, is beyond the scope of this discussion. This is partly because Intel has already published a reasonably detailed whitepaper on the topic for those with academic interest, but mostly because results matter more than technical diagrams. To that end, the differences in Quick Sync speed and quality between 3rd and 4th generation Intel Core IPG will be detailed, as well as how Quick Sync compares with x264 when it matters the most – archiving high-bitrate material.
Getting Started
While a detailed examination of new Quick Sync features is best left to the whitepaper, a brief outline will help facilitate the discussion. First, for those who do not know, Intel Quick Sync is a feature first introduced with 2nd generation Core IPG (Sandy Bridge) which enables hardware accelerated video transcoding: the process of changing the video stream in one file to a different format (CODEC, size, bit rate, etc.) in another. The feature was originally intended for converting files for mobile use, but has advanced over the iterations to cover a larger set of use cases. What is different in the 4th generation Core release is that Intel has refocused on prioritizing quality over speed. This is important to point out because, for specific encoding settings, Quick Sync is actually slower on the newest hardware than on the last generation.
In addition to the traditional set of encoding “knobs-and-dials” Intel also exposes the concept of “Target Usage” (TU). This feature is intended to give simple access to a series of quality/speed gradations without forcing users to select a bit rate and rate control method. On previous iterations there were technically seven steps, but only three were really exposed for consumption: quality (TU1), balanced (TU4), and speed (TU7). This changed with the latest spin of Intel HD graphics, where each step is fully selectable, but as this is mostly a comparison between the 3rd and 4th generations the intermediate steps will not be included in the discussion.
All of the tests were conducted on a 4th generation “Haswell” (HSW) Intel Core i7-4770K with HD 4600 graphics and a 3rd generation “Ivy Bridge” (IVB) Intel Core i7-3770K with HD 4000 graphics. QSTranscode was used to transcode AVC/MPEG2/VC-1 to AVC at the source frame size and rate, with the “-bench” command line argument to bypass any additional file I/O or CPU utilization caused by transcoding or writing audio streams.
Lastly, a quick glossary of terms:
- Constant Bit Rate (CBR): Rate control selected to force a constant bit rate in the target file
- Variable Bit Rate (VBR): Rate control selected to allow the bit rate to vary in the target file, usually enabled to meet a specific file size goal
- Constant Quantization Parameter (CQP): Rate control selected to hold the quantization parameter constant, which targets a constant quality at a per-frame level
- Intel Media Software Development Kit (MSDK): The software interface that applications use to access Intel Quick Sync and enable its features
- Structural Similarity (SSIM): A method for measuring the similarity between two images, with “1” indicating a perfect match
Results
In this test four files of approximately thirty minutes each (AVC: 1080p, 1080i, 480i & MPEG2: 720p) were transcoded using the three target usages. Testing was conducted on both Microsoft Windows 8 and Microsoft Windows 7, but no statistically significant differences were noted between the results achieved on each operating system, so the results were aggregated into a single number per IPG/TU. As we can see, “Quality” takes a significant step backwards in speed (1.7x slower on average) going from the i7-3770K to the i7-4770K, with “Balanced” and “Speed” presenting a much closer, more mixed set of results. Taken on its own this would be disappointing, but we were warned that the shift in priorities was not free – let’s see if it paid off.
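For readers who want to derive a similar figure from their own runs, the following is a minimal sketch of how an average slowdown factor can be computed. It assumes the factor is the arithmetic mean of per-file time ratios, which is an assumption rather than the aggregation documented for this test. The function names and the timing values are placeholders for illustration, not measurements from this review.

```python
from statistics import fmean


def slowdown_factor(old_seconds, new_seconds):
    """Return how many times longer the new run took than the old one."""
    if old_seconds <= 0 or new_seconds <= 0:
        raise ValueError("timings must be positive")
    return new_seconds / old_seconds


def average_slowdown(timings):
    """Average the per-file slowdown over (old_seconds, new_seconds) pairs."""
    return fmean([slowdown_factor(old, new) for old, new in timings])


# Placeholder timings (seconds) for two hypothetical files; NOT review data.
example_timings = [(100.0, 150.0), (200.0, 400.0)]
print(f"average slowdown: {average_slowdown(example_timings):.2f}x")
```

Averaging per-file ratios weights every file equally regardless of length; dividing total time by total time would instead weight longer files more heavily.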
Before discussing the results, let’s briefly outline the methodology used to produce them. As mentioned before, each target scenario was run on the respective IPG using QSTranscode to produce an output file. Then a series of PNG frames were extracted from the input and output files using FFMPEG, visually aligned to ensure validity, compared using a slightly modified version of Chris Lomont’s SSIM implementation, and then averaged to provide the number presented. With SSIM, “1” is a perfect match, so in this and the subsequent graphs the value presented measures how closely the output frames match the input frames, and therefore the quality of the output.
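To make the metric concrete, here is a minimal sketch of the compare-and-average step. It is a simplification and not the implementation used for these tests: it computes SSIM once over the whole frame from global statistics, whereas typical implementations slide a window across the image. The function names, the flat list of luma samples as the frame representation, and the sample values are assumptions made for illustration; the constants 0.01 and 0.03 are the values commonly used in the SSIM formula.

```python
from statistics import fmean


def global_ssim(ref, out, max_value=255.0):
    """Single-window SSIM over two equal-length sequences of luma samples."""
    if len(ref) != len(out) or len(ref) == 0:
        raise ValueError("frames must be non-empty and the same size")
    c1 = (0.01 * max_value) ** 2  # stabilizes the luminance term
    c2 = (0.03 * max_value) ** 2  # stabilizes the contrast/structure term
    n = len(ref)
    mu_x = fmean(ref)
    mu_y = fmean(out)
    dx = [x - mu_x for x in ref]
    dy = [y - mu_y for y in out]
    var_x = sum(d * d for d in dx) / n
    var_y = sum(d * d for d in dy) / n
    cov = sum(a * b for a, b in zip(dx, dy)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


def mean_ssim(frame_pairs):
    """Average SSIM over (source_frame, output_frame) pairs."""
    return fmean([global_ssim(ref, out) for ref, out in frame_pairs])


# Tiny hypothetical "frames": one identical pair and one slightly altered pair.
source = [10, 50, 90, 130]
altered = [12, 48, 95, 120]
print(mean_ssim([(source, source), (source, altered)]))
```

In practice the sample lists would come from decoding the extracted PNG frames, and the source and output frames must be the same frames in the same order, which is why the alignment check described above matters.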
This test measured how quality changes when transcoding a 41Mbps 1080p AVC source file to a series of CBR targets with “Quality” (TU1) selected except where noted otherwise. It also covers what happens when a bit rate value is not provided to the MSDK (“Default”), which resulted in bit rates of 4763Kbps for TU1 and 3572Kbps for TU4. As we can see, in each case the HD 4600 implementation provided better quality than its HD 4000 counterpart, with the most marked difference appearing at the 3Mbps mark. It is also worth noting that the size of output files per bit rate was roughly equivalent, so the increase in quality did not require any more space on the disk.
I had seen this topic covered by other sites when Haswell launched, so these results were quite puzzling because they did not match what was reported at that time. There has been a driver release since then, so it is possible that these concerns were addressed as part of that, but it is also likely that performance varies with the type and quality of the input source.
Given the consistency of what was observed at different target bit rates, each content source was only tested using the “Quality” (TU1) setting, with the same methodology outlined before. Here again we see that the decrease in speed observed in the first test is rewarded with an increase in output quality, except in two cases. Oddly enough, these happen to be the two files referenced in the Quick Sync whitepaper mentioned before. If these were the input files used in other reviews, it could easily explain the results those reviewers noted.
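The link between bit rate and file size is simple arithmetic, which is why equal bit rates produce roughly equal files regardless of which IPG did the encoding. The sketch below estimates the video stream size for the two default bit rates quoted above. The thirty minute duration is an assumption, since the length of this particular source is not stated, and the function name is illustrative. The estimate ignores container overhead and any audio.

```python
def stream_size_bytes(bitrate_kbps, duration_seconds):
    """Estimate video stream size in bytes from an average bit rate.

    Uses 1 Kbps = 1000 bits per second and 8 bits per byte.
    """
    return bitrate_kbps * 1000 * duration_seconds / 8


ASSUMED_DURATION_SECONDS = 30 * 60  # assumption: a thirty minute clip

for label, kbps in (("TU1 default", 4763), ("TU4 default", 3572)):
    size_gb = stream_size_bytes(kbps, ASSUMED_DURATION_SECONDS) / 1e9
    print(f"{label}: {kbps} Kbps -> about {size_gb:.2f} GB")
```

Under that assumption the TU1 default works out to roughly 1.07 GB and the TU4 default to roughly 0.80 GB of video data.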
With Quick Sync initially targeted just at the mobile transcoding use case, when it came to archiving high quality sources, using a software encoder like x264 was the obvious choice. It did not matter how fast or power efficient the processing was, because the trade-off in output quality was simply not worth it. Now that the claimed coverage has expanded beyond that use case, it was time to reevaluate this belief and find out if Quick Sync could hold up to one of the best software encoders available.
To encode with x264, Handbrake’s “High Profile” was selected using CQ RF:16 and CQ RF:10, with the same target/rate control selected in QSTranscode to produce output files from a 40Mbps 1080p AVC source. Five times more frames than in the previous comparisons were extracted from the source and output files in this test, then the SSIM was generated and aggregated as above, with the result proving very interesting. Consistent with general consensus, it was not surprising to see that the HD 4000 (IVB) lagged what was produced by x264, but it was shocking to see that the HD 4600 (HSW) encoder produced a higher quality output at both targets, although just barely at QP 10.
UPDATE 08/23/13: It was brought to my attention that the setting used by Handbrake in this case is not CQP, but Constant Quality RF: X (CRF?) so the labeling has changed to correct this error.
Wrapping Up
Comparing Quick Sync across 3rd and 4th generation Intel Core IPG, it was not altogether surprising that Intel’s efforts to reprioritize the transcoding framework on output quality were generally validated, with the Core i7-4770K “Haswell” system consistently producing video that more closely matched the source than the Core i7-3770K “Ivy Bridge.” The advances in quality were paid for in speed however, with a 1.7x average performance hit recorded when selecting the “Quality” setting. Even this “sluggish” result is still much faster than real-time, and than what is possible using a CPU based encoder, so it is well worth it in my opinion, especially considering the additional quality versus speed gradations added to the MSDK. The results measured against x264’s output were unexpected though, with the “Haswell” implementation of Quick Sync notching wins at both of the CQP targets when transcoding high bit rate content. While it is still an open question whether these results are reproducible with the myriad of content available, they are very promising and demonstrate clearly that it is no longer safe to assume that only software based encoding should be considered when archiving high-value video. Lastly, it must be said that the Lookahead feature introduced with Haswell is not yet implemented in release drivers or the current version of the MSDK, so it was not enabled for this comparison. While the timing was not ideal, this holds significant promise for further improvements in output quality – something which I hope to revisit when Lookahead is available.
Thanks to Intel for providing the hardware samples and Michael Schmidt for running down some of the technical details.
What was the speed difference for the last chart? Was HSW lots faster than Handbrake’s software encoding?
Are these results dependent on the software used, which in this case is your app? Would I see the same results using the new version of Handbrake that supports QS?
I didn’t include the time v. Handbrake because it’s a minutes versus hours thing even on these CPUs (some of the fastest you can get).
Assuming the code path is the same (e.g. same VPP enabled, same settings, no external VPP) the results should be identical because it’s all the same MSDK/driver/hardware underneath.
That said, QSTranscode is faster than many commercial apps because there isn’t any superfluous processing going on; it’s just I/O & MSDK. I don’t know if it’s faster than Handbrake, but it does support all of the MSDK supported content types (MPEG2/VC-1/AVC) w/ full acceleration and should support anything ffmpeg can decode w/ encode acceleration, where Handbrake is just AVC->AVC right now for QS.
[quote=babgvant]Handbrake is just AVC->AVC right now for QS.
[/quote]
Oooh, I hadn’t noticed they had added QS functions to Handbrake. Time to play with it. =)