Performance and Deployments

As part of the discussion, Intel stated that it has integrated BF16 support into its usual array of frameworks and utilities that it collectively brands as ‘Intel DL Boost’. This includes PyTorch, TensorFlow, oneAPI, OpenVINO, and ONNX. We spoke with Wei Li, who heads up Intel’s AI Software Group, and he confirmed to us that all of these libraries have already been updated for BF16. For high-level programmers, the libraries will accept FP32 data and convert it to BF16 automatically, although the functions will still require an explicit indication to use BF16 rather than INT8 or another data type.
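As a rough illustration of what that conversion involves (a sketch, not Intel's actual implementation): BF16 keeps FP32's sign bit and full 8-bit exponent but truncates the mantissa to 7 bits, so an FP32 value can be converted by keeping its top 16 bits, typically with round-to-nearest rounding. The function names below are illustrative.

```python
import numpy as np

def fp32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Convert FP32 values to BF16 bit patterns by keeping the top 16 bits
    (sign, 8-bit exponent, 7-bit mantissa), with round-to-nearest-even on
    the dropped bits. Sketch only; ignores NaN edge cases."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    rounding_bias = ((bits >> 16) & 1) + np.uint32(0x7FFF)
    return ((bits + rounding_bias) >> 16).astype(np.uint16)

def bf16_bits_to_fp32(b: np.ndarray) -> np.ndarray:
    """Widen BF16 bit patterns back to FP32 by zero-filling the low 16 bits."""
    return (b.astype(np.uint32) << 16).view(np.float32)
```

The round trip is exact for values like 1.0 whose mantissa fits in 7 bits, and loses only low-order mantissa bits otherwise, which is why BF16 keeps FP32's dynamic range while trading away precision.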

When speaking with Wei Li, he confirmed that all the major CSPs that have taken delivery of Cooper Lake are already porting workloads to BF16, and have been for quite some time. That isn’t to say that BF16 is suitable for every workload, but it offers a balance between the accuracy of FP32 and the computational speed of FP16. As noted in the slide above, BF16 implementations are achieving speedups of up to ~1.9x over FP32 in both training and inference across Intel’s various CSP customers.

Normally we don’t post many graphs of first-party performance numbers, but I did want to include this one.

Here we see Intel’s BF16 DL Boost at work on ResNet-50 for both training and inference. ResNet-50 is an old network at this point, but it is still used as a performance reference point given its limited number of layers and convolutions. Here Intel is showing a 72% increase in performance with Cooper Lake in BF16 mode versus Cooper Lake in FP32 mode when training the model.

Inference is a bit different, because inference can take advantage of lower-precision, higher-throughput data types such as INT8 and INT4. Here we see BF16 still delivering 1.8x the performance of standard FP32 AVX-512, but INT8 retains a raw throughput advantage. It comes down to a balance of speed and accuracy.
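The throughput advantage follows directly from the data widths: a 512-bit vector register holds twice as many BF16 lanes as FP32 lanes, and four times as many INT8 lanes, so peak elements-per-instruction scale accordingly. A back-of-the-envelope sketch (ignoring instruction mix and memory effects; the function name is illustrative):

```python
def lanes_per_register(bits_per_element: int, reg_bits: int = 512) -> int:
    """How many elements of a given width fit in one vector register."""
    return reg_bits // bits_per_element

# Elements per 512-bit (AVX-512) register, relative to FP32
for name, width in [("FP32", 32), ("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    lanes = lanes_per_register(width)
    print(f"{name:>4}: {lanes:3d} lanes, {lanes // lanes_per_register(32)}x vs FP32")
```

This is only the theoretical ceiling; the realized speedup depends on how well the workload tolerates the reduced precision.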

It should be noted that this graph also includes software optimizations over time, not only raw performance of the same code across multiple platforms.

I would also like to point out the standard FP32 performance generation-on-generation. For AI training, Intel is showing a 1.82/1.64 = 11% gain, while for inference we see a 2.04/1.95 = 4.6% gain. Given that Cooper Lake uses the same cores underneath as Cascade Lake, this is mostly down to increases in core frequency and memory bandwidth.
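Checking that arithmetic against the normalized scores quoted from Intel's slide (variable names are illustrative):

```python
# Normalized FP32 scores as read off Intel's slide (Cooper Lake vs Cascade Lake)
train_cooper, train_cascade = 1.82, 1.64
infer_cooper, infer_cascade = 2.04, 1.95

def gen_on_gen_gain(new: float, old: float) -> float:
    """Percentage gain of the new generation over the old."""
    return (new / old - 1) * 100

print(f"training:  {gen_on_gen_gain(train_cooper, train_cascade):.1f}%")  # ~11.0%
print(f"inference: {gen_on_gen_gain(infer_cooper, infer_cascade):.1f}%")  # ~4.6%
```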

Deployments

A number of companies reached out to us in advance of the launch to tell us about their systems.

Lenovo is announcing the launch of its ThinkSystem SR860 V2 and SR850 V2 servers with Cooper Lake and Optane DCPMM. The SR860 V2 will support up to four double-wide 300W GPUs in a dual-socket configuration.

The fact that Lenovo is offering 2P variants of Cooper Lake is quite puzzling, especially as Intel said these were aimed at 4P systems and up. Hopefully we can get one in for testing.

GIGABYTE is also announcing its R292-4S0 and R292-4S1 servers, both quad-socket designs.

One of Intel’s partners told us that they were not expecting Cooper Lake to launch so soon, even within the next quarter. As a result, they were caught off guard and had to scramble to get materials together for this announcement. It would appear that Intel needed to pull this announcement forward, perhaps because one of the major CSPs is ready to announce.

Comments

  • Deicidium369 - Saturday, June 20, 2020 - link

    Find one motherboard that is more than 2 sockets for AMD. Just 1.
  • azfacea - Thursday, June 18, 2020 - link

    I was kind of suspicious that a 4-socket Epyc might not exist when I said that, but I still don't think it makes much of a difference if you need commodity x86 compute: just buy two servers. It will still take less space and be more power efficient as long as it's TSMC 7nm vs Intel 14nm++.

    What would make a difference is max memory. If there is a server from Intel that has double the max memory of the biggest one from AMD, then I guess there would be a niche. But if such a customer exists, surely AMD can rectify that if they simply choose to.
  • schujj07 - Thursday, June 18, 2020 - link

    Unless you are using Optane DIMMs, Xeon cannot compete with AMD in terms of RAM capacity. For a non-Optane Xeon you would need a 4-socket host to surpass what Epyc can do in a single socket. However, 256GB LRDIMMs are INSANELY expensive, ~$5000/DIMM. Even 128GB LRDIMMs are still $1100/DIMM minimum, compared to $350/DIMM for 64GB RDIMMs.

    I can tell you from personal experience that running SAP HANA on Epyc does work, at least in a virtualized environment. It will even pass the SAP HANA PRD Performance test. Despite what SAP, probably Intel as well, says, you do not need Xeon to run HANA. The 8 channel RAM makes things a lot nicer in getting enough RAM for multiple HANA DBs or one massive DB as well.
  • kc77 - Thursday, June 18, 2020 - link

    Not to mention to use Optane you actually have to have your software written/configured around it. You can't just slap it in and experience wonderful performance.
  • schujj07 - Thursday, June 18, 2020 - link

    I've never used Optane, but I do know that VMware has 2 different modes for it.
    https://blogs.vmware.com/vsphere/2019/04/announcin...
    I don't know what the performance will be if the software isn't written for it, but hopefully the hypervisor can at least help.
  • Zibi - Thursday, June 18, 2020 - link

    Optane persistent memory is kind of a non-feature in the VMware world. Yes, you can use it, passing it through as either a very fast disk or a pmem device to VMs that understand it, but you lose HA with that. There is no mechanism to protect (replicate) Optane memory content in case of a node failure.
    For me the only viable scenario for Optane persistent memory is as the cache layer in an SDS.
  • Deicidium369 - Thursday, June 18, 2020 - link

    I have been told it was not a big lift to get accomplished. SAP already has been shipping with Optane DIMM support - we can move to Optane DIMMs with our SAP install if we want. Our install is small in comparison to the systems and installs at Fortune companies.

    Pretty sure Oracle support is already baked in as well.
  • Deicidium369 - Thursday, June 18, 2020 - link

    The advantages to using Intel on SAP HANA will be the reduced boot times when Optane DIMMs are used.
  • Zibi - Thursday, June 18, 2020 - link

    You are aware though that the disadvantage will be worse memory performance in any other operations ? Optane DIMMs have worse throughput and worse latency. I don't know how often SAP HANA environments are restarted. I'd be surprised if that would be more than once per quarter.
  • JayNor - Thursday, June 18, 2020 - link

    Worse performance than the database not fitting in memory? I don't think so...
