Performance and Deployments

As part of its briefing, Intel stated that it has integrated BF16 support into the usual array of frameworks and libraries that it groups under the ‘Intel DL Boost’ banner, including PyTorch, TensorFlow, oneAPI, OpenVINO, and ONNX. We spoke with Wei Li, who heads up Intel’s AI Software Group, and he confirmed that all of these libraries have already been updated for BF16. For high-level programmers, the libraries will accept FP32 data and perform the conversion to BF16 automatically, although the relevant functions still need to be told to use BF16 rather than INT8 or another format.
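To illustrate why that conversion is cheap: BF16 is simply the top 16 bits of an IEEE 754 FP32 value, keeping the sign bit and the full 8-bit exponent while cutting the mantissa from 23 bits to 7. A minimal sketch of the bit manipulation (this is purely illustrative, not Intel's library code):

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    # Reinterpret the float as its 32-bit pattern, then keep the top 16 bits:
    # the sign, the full 8-bit exponent, and the 7 highest mantissa bits.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_fp32(b: int) -> float:
    # Zero-pad the discarded mantissa bits to recover a valid FP32 value.
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

pi = 3.14159265
rounded = bf16_bits_to_fp32(fp32_to_bf16_bits(pi))
print(rounded)  # 3.140625 — FP32's dynamic range, but only ~2-3 decimal digits
```

Because the exponent field is untouched, BF16 covers the same numeric range as FP32, which is why a framework can down-convert FP32 tensors without the overflow concerns that a cast to FP16 would raise.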

Wei Li also confirmed that all the major CSPs that have taken delivery of Cooper Lake have been porting workloads to BF16 for quite some time. That isn’t to say that BF16 is suitable for every workload, but it offers a balance between the accuracy of FP32 and the computational speed of FP16. As noted in the slide above, Intel’s various CSP customers are achieving up to ~1.9x speedups over FP32 with BF16 implementations, in both training and inference.

Normally we don’t post many graphs of first-party performance numbers, but I did want to include this one.

Here we see Intel’s BF16 DL Boost at work on ResNet-50, in both training and inference. ResNet-50 is an old network at this point, but it is still used as a reference point for performance given its limited number of layers and convolutions. Here Intel is showing a 72% increase in training performance with Cooper Lake in BF16 mode versus Cooper Lake in FP32 mode.

Inference is a bit different, because inference can also take advantage of lower-precision, higher-throughput data types such as INT8 and INT4. Here we see BF16 still delivering 1.8x the performance of standard FP32 AVX-512, but INT8 retains the raw throughput advantage. It comes down to a balance of speed and accuracy.
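To make the accuracy side of that trade-off concrete, here is a small sketch comparing the relative error of BF16 truncation against a naive symmetric INT8 quantization when a tensor mixes large and small values. The activation values are hypothetical, chosen for illustration; real quantization schemes are more sophisticated than this:

```python
import struct

def to_bf16(x: float) -> float:
    # BF16 keeps FP32's 8-bit exponent but only 7 mantissa bits.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", (bits >> 16) << 16))[0]

def int8_roundtrip(values):
    # Naive symmetric quantization: one scale for the whole tensor,
    # mapping the largest magnitude to 127, then dequantize back.
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) * scale for v in values]

acts = [0.0013, 0.52, -1.7, 3.9]   # hypothetical activations, wide range
bf16_err = max(abs(v - to_bf16(v)) / abs(v) for v in acts)
int8_err = max(abs(v - q) / abs(v) for v, q in zip(acts, int8_roundtrip(acts)))
```

BF16's relative error stays below 2^-7 (~0.8%) regardless of magnitude, while the naive INT8 scheme flushes the smallest value to zero entirely, which is why INT8's extra throughput only pays off when the quantization is carefully calibrated for the workload.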

It should be noted that this graph also includes software optimizations over time, rather than just the raw performance of the same code across multiple platforms.

I would also like to point out the standard FP32 performance generation-on-generation. For AI training, Intel is showing a 1.82/1.64 = ~11% gain, while for inference we see a 2.04/1.95 = ~4.6% gain. Given that Cooper Lake uses the same cores underneath as Cascade Lake, this is mostly down to core frequency increases as well as memory bandwidth increases.
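That arithmetic can be checked directly from the FP32 numbers as read off Intel's first-party graph:

```python
# FP32 results read off Intel's graph: Cascade Lake vs Cooper Lake.
train_cascade, train_cooper = 1.64, 1.82   # relative training throughput
infer_cascade, infer_cooper = 1.95, 2.04   # relative inference throughput

train_gain = train_cooper / train_cascade - 1
infer_gain = infer_cooper / infer_cascade - 1
print(f"training: {train_gain:.1%}, inference: {infer_gain:.1%}")
# training: 11.0%, inference: 4.6%
```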

Deployments

A number of companies reached out to us in advance of the launch to tell us about their systems.

Lenovo will be announcing the launch of its ThinkSystem SR860 V2 and SR850 V2 servers with Cooper Lake and Optane DCPMM. The SR860 V2 will support up to four double-wide 300W GPUs in a dual socket configuration.

The fact that Lenovo is offering 2P variants of Cooper Lake is quite puzzling, especially as Intel said these were aimed at 4P systems and up. Hopefully we can get one in for testing.

GIGABYTE is also announcing its R292-4S0 and R292-4S1 servers, both quad-socket designs.

One of Intel’s partners stated to us that they were not expecting Cooper Lake to launch so soon – even within the next quarter. As a result, they were caught off guard and had to scramble to get materials for this announcement. It would appear that Intel had a need to pull in this announcement to now, perhaps because one of the major CSPs is ready to announce.

Comments

  • Deicidium369 - Thursday, June 18, 2020 - link

    Not in the case of BFloat they don't. Stick to desktops.
  • AlexDaum - Friday, June 19, 2020 - link

    AVX-512 is great for performance, but it requires good programmers to make use of it, since it is no easy task creating an AVX optimised codepath.
  • Deicidium369 - Saturday, June 20, 2020 - link

All programming needs good programmers... most of the organizations that will have a use for AVX-512 will be able to draw upon existing libraries to integrate into a custom program...
  • eek2121 - Thursday, June 18, 2020 - link

    It will be interesting if Milan supports bfloat16 as well. Unless AMD segregates enterprise chips (they will at some point), it would mean that bfloat16 trickles down to desktop chips.
  • Deicidium369 - Saturday, June 20, 2020 - link

    Well Intel comes out with a feature, and then AMD copies it - so pretty sure bfloat is coming to AMD and AVX-512...
  • Spunjji - Friday, June 19, 2020 - link

    There's no such thing as a "low priced" 4S Xeon system, though. The processors are expensive and so are the servers you put them into.

    That's been AMD's pitch the whole time - consolidate more cores into fewer sockets for a lower cost of ownership. If you want to do DL/ML you put in accelerators for those jobs, which again, will be cheaper than trying to reach the same performance with a Xeon.

    Intel are trying to find niches for their CPUs to occupy, but those niches are better occupied by products that cater to them specifically.
  • Deicidium369 - Saturday, June 20, 2020 - link

most of Cooper Lake will be for hyperscalers - not just, but most.
  • Deicidium369 - Thursday, June 18, 2020 - link

Smokes them in number shipped, market adoption, market share and revenues. From a business standpoint that is all that matters.
  • schujj07 - Thursday, June 18, 2020 - link

    That seems to be the only retort that you ever have for Intel. What you fail to realize is that given time those numbers will change if Intel continues to falter. Ice Lake is very late and when it comes out will be competing against 3rd Gen Epyc. We already know from laptops that Zen 2 competes quite well with Ice Lake on a performance side. We also know that Ice Lake has had frequency problems and has high power draw relative to Zen 2. Those things don't bode well for it in the data center. We already are seeing more companies moving to Epyc, especially those that have CTOs/CIOs that don't follow the Intel only mantra. In the enterprise world the tables are turning for Intel, much the same way it did for IBM years ago. It used to be that "no one was ever fired for buying IBM" now it is "no one was ever fired for buying Intel." That first statement eventually died and was replaced by the 2nd, and the 2nd is starting to die as well.
  • AlexDaum - Friday, June 19, 2020 - link

The main advantage of the Intel Xeons over Epyc is AVX-512 support, which can have a large performance impact in some software that can make use of it.
