Performance and Deployments

As part of the discussion points, Intel stated that it has integrated BF16 support into the usual array of frameworks and utilities that it groups under ‘Intel DL Boost’. This includes PyTorch, TensorFlow, oneAPI, OpenVINO, and ONNX. We spoke with Wei Li, who heads up Intel’s AI Software Group, and he confirmed that all of these libraries have already been updated for use with BF16. For high-level programmers, these libraries will accept FP32 data and convert it to BF16 automatically; however, the functions still require an indication to use BF16 rather than INT8 or another format.
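
To give a feel for what that automatic FP32-to-BF16 conversion involves, here is a minimal sketch in plain Python. It is not Intel's or any framework's actual code: the function names are hypothetical, and it assumes the commonly used round-to-nearest-even behaviour. The key point is that BF16 is simply the top 16 bits of an FP32 value, so conversion amounts to rounding away the lower 16 bits.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern for x (hypothetical helper).

    Assumes round-to-nearest-even when dropping the low 16 bits of FP32.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # raw FP32 bit pattern
    is_nan = (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0
    if is_nan:
        return (bits >> 16) | 0x0040  # force a mantissa bit so NaN stays NaN
    # Bias of 0x7FFF, plus 1 when the kept LSB is odd, sends ties to even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF


def bf16_bits_to_fp32(b: int) -> float:
    """Widen a BF16 pattern back to FP32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def round_trip(x: float) -> float:
    """Value of x after being stored as BF16 and read back."""
    return bf16_bits_to_fp32(fp32_to_bf16_bits(x))


if __name__ == "__main__":
    for value in (1.0, 3.14159, 1.0 + 2.0 ** -8):
        print(f"{value!r} -> 0x{fp32_to_bf16_bits(value):04X} -> {round_trip(value)!r}")
```

In this sketch 3.14159 comes back as 3.140625, which shows how few significant digits survive the conversion.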

Wei Li also confirmed that all the major CSPs who have taken delivery of Cooper Lake are already porting workloads onto BF16, and have been for quite some time. That isn’t to say that BF16 is suitable for every workload, but it keeps the dynamic range of FP32 while offering the throughput of a 16-bit format like FP16, at the cost of some precision. As noted in the slide above, BF16 implementations are achieving up to ~1.9x speedups over FP32 on both training and inference with Intel’s various CSP customers.
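
The trade-off is easiest to see from the bit layouts of the three formats. The short sketch below is an illustration rather than anything from Intel's material, and the class and property names are made up for this example. It uses the standard layouts (FP32: 8 exponent and 23 fraction bits, FP16: 5 and 10, BF16: 8 and 7) to derive each format's largest finite value and its machine epsilon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatFormat:
    """Illustrative description of a binary floating-point layout."""

    name: str
    exponent_bits: int
    mantissa_bits: int  # explicit fraction bits, excluding the implicit leading 1

    @property
    def epsilon(self) -> float:
        """Gap between 1.0 and the next representable value."""
        return 2.0 ** -self.mantissa_bits

    @property
    def max_finite(self) -> float:
        """Largest finite value: all-ones fraction at the top normal exponent."""
        bias = 2 ** (self.exponent_bits - 1) - 1
        return (2.0 - self.epsilon) * 2.0 ** bias


FORMATS = (
    FloatFormat("FP32", exponent_bits=8, mantissa_bits=23),
    FloatFormat("FP16", exponent_bits=5, mantissa_bits=10),
    FloatFormat("BF16", exponent_bits=8, mantissa_bits=7),
)

if __name__ == "__main__":
    for fmt in FORMATS:
        print(f"{fmt.name}: max ~{fmt.max_finite:.4g}, epsilon {fmt.epsilon:.3g}")
```

BF16 ends up with nearly the same maximum value as FP32 but a coarser step size than FP16, which is why it tends to suit workloads that care more about range than about fine precision.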

Normally we don’t post too many graphs of first party performance numbers, however I did want to add this one.

Here we see Intel’s BF16 DL Boost at work for ResNet-50 in both training and inference. ResNet-50 is an old network at this point, but it is still used as a reference point for performance given its limited scope in layers and convolutions. Here Intel is showing a 72% increase in performance with Cooper Lake in BF16 mode vs Cooper Lake in FP32 mode when training the network.

Inference is a bit different, because inference can take advantage of lower-precision, higher-throughput data types such as INT8 and INT4. Here we see BF16 still giving 1.8x the performance of normal FP32 AVX512, but INT8 retains the throughput advantage. This is a balance of speed and accuracy.

It should be noted that this graph also includes software optimizations over time, not only raw performance of the same code across multiple platforms.

I would like to point out the standard FP32 performance generation on generation. For AI training, Intel is showing 1.82/1.64 ≈ 1.11, or an 11% gain, while for inference we see 2.04/1.95 ≈ 1.046, or a 4.6% gain in performance generation-on-generation. Given that Cooper Lake uses the same cores underneath as Cascade Lake, this is mostly due to core frequency increases as well as bandwidth increases.
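
For anyone who wants to check the generation-on-generation arithmetic, the calculation can be sketched in a few lines of Python. The four relative-performance figures are the ones quoted above from Intel's graph; the function and variable names are just for illustration.

```python
def percent_gain(new: float, old: float) -> float:
    """Percentage improvement of `new` over `old`."""
    return (new / old - 1.0) * 100.0


# Relative FP32 performance figures as quoted from Intel's graph.
training_gain = percent_gain(new=1.82, old=1.64)
inference_gain = percent_gain(new=2.04, old=1.95)

if __name__ == "__main__":
    print(f"FP32 training gain:  {training_gain:.1f}%")   # 11.0%
    print(f"FP32 inference gain: {inference_gain:.1f}%")  # 4.6%
```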

Deployments

A number of companies reached out to us in advance of the launch to tell us about their systems.

Lenovo will be announcing the launch of its ThinkSystem SR860 V2 and SR850 V2 servers with Cooper Lake and Optane DCPMM. The SR860 V2 will support up to four double-wide 300W GPUs in a dual socket configuration.

The fact that Lenovo is offering 2P variants of Cooper Lake is quite puzzling, especially as Intel said these were aimed at 4P systems and up. Hopefully we can get one in for testing.

Also, GIGABYTE is announcing its R292-4S0 and R292-4S1 servers, both quad socket.

One of Intel’s partners told us that they were not expecting Cooper Lake to launch so soon, or even within the next quarter. As a result, they were caught off guard and had to scramble to get materials for this announcement. It would appear that Intel needed to pull this announcement forward, perhaps because one of the major CSPs is ready to announce.

99 Comments

  • Deicidium369 - Saturday, June 20, 2020

    No one, even someone like me only buying 60, are paying any where near MSRP. And for big customers like FB, which would likely install hundreds if not thousands of these systems, the MSRP is irrelevant. ~$11K list - my Q60 order was less than $9K per.

    The only bad press is from the fanboys.. some of them are editors... So yes, was delayed - yet record revenue - so yeah not a bad deal. Companies like FB don't care what the manufacturing process is - they ask "can it do what I need it to do right now?" And apparently 14nm PCIe3 Cooper Lake does.
  • schujj07 - Saturday, June 20, 2020

    If you are buying 60 hosts @ $9k/host I see a lot of waste. At that cost you aren't getting much in a Xeon. You could save huge amounts of money by reducing your number of hosts and sockets.
  • Deicidium369 - Thursday, June 25, 2020

    60 CPUs purchased

    16 - Engineering workstations - dual socket - single CPU installed
    2 - my engineering workstation - dual socket with dual CPU installed

    16 - 4 dual node servers - 2 nodes x 2 sockets - primary datacenter (Colorado)
    16 - 4 dual node servers - 2 nodes x 2 sockets - secondary datacenter (Dallas)
    6 - 3 single node dual socket - flash arrays - 1 at primary, 1 at secondary, 3rd for engineering
    4 - dual node server, 2 nodes x 2 sockets - systems used by IT for testing new software

    2 are basically spares, today. The 8 CPUs for the SAP server were originally intended for a different purpose - for a possible replacement for my large SGI TP16000 array - which never materialized (the array is the only remaining IB system in the mix - a Mellanox SwitchX-2 SX6710G made the conversion between the 8 40Gb/s IB to 8 40Gb/s Ethernet).

    When we moved to SAP, we had no baseline whatsoever - so it went on its own physical server in our datacenter in Colorado - with a mirror at Level 3 in Dallas. After a year, we decided to virtualize - and after the move to virtual, added the 4th server (nodes 7&8) to the pool - changes made in Colorado are made in Dallas as well.

    When we replace the servers in the next ~12 months, the plan is to go back to 6 nodes - whether that is another 2U 2 node configuration, or as individual servers remains to be seen - will most likely be Ice Lake SP to be able to leverage PCIe4 to use dual 100Gb/s Ethernet for the planned network upgrade.

    So initially the SAP system was on a dual node, 4 socket total physical server.

    The CPUs were $9K per - not the hosts - hosts are servers. I can see why you say hosts - you missed the context.

    "The MSRP is irrelevant. ~$11K list - my Q60 order was less than $9K per"

    The $11K was in response to flgt's post "Unless you're a FB or Intel employee, no one has any idea what the real price they pay for these processors." which was a response to Duncan Macdonald's post about "A 16 core 4 socket 4.5TB Xeon (the 6328HL) has a list price of $4779"

    So talking about MSRP/List prices - Duncan made a claim about MSRP prices, flgt responded that very large customers pay less per unit - and I responded with my own experience with purchasing a very small number of CPUs compared to FB prices "even someone like me only buying 60, are paying any where near MSRP."

    My post needed to be edited to be "even someone like me only buying 60, are *NOT* paying any where near MSRP"

    so $9K per CPU - not $9K per host. You missed the context.

    You need to try to be more civil - the constant effort to refute everything I say is fine - but you also need to understand the context, rather than immediately sniping. I have no problem debating the merits of whatever - but the mindless / reactionary responses from you and people like Korguz need to stop. He never offers anything to the conversation and just lies in wait - with probable screen captures to try and make his point - which is "I suck".

    Sorry that I didn't choose your preference for my systems - sorry that you and others feel attacked whenever someone states the facts about AMD. I prefer Intel (along with 95% of the server market). When you are putting together a PO to buy the servers and switches, etc for your business - you can choose what you wish, and what your budget will allow.

    I have 10 people in my IT department, with over 200 years of experience between us. Hardware decisions are not made on the fly. Other than now having a 7th and 8th node that are not needed, I have been pretty happy with the decisions we have made. Business continuity and performance were our primary goals, and both were met. The opinions held by posters on a tech forum do not come into play.
  • schujj07 - Saturday, June 20, 2020

    How many of the $9k hosts are your SAP HANA hosts?
  • Deicidium369 - Saturday, June 20, 2020

    thing is 4 socket motherboards for Intel exist - they don't for Epyc.

    If you are buying a 4 socket Intel system - you would buy either an Inspur or a Supermicro - which would be the motherboard, case, those "special power supplies" etc...

    Those "special power supplies" are redundant and hot swap - something companies like Sun, SGI, Cisco and every single OEM have had for ages - they are considered STANDARD - not special.
  • brucethemoose - Thursday, June 18, 2020

    So I guess the use case is training on enormous datasets that don't fit into the VRAM of a GPU/AI accelerator?
  • xenol - Thursday, June 18, 2020

    It looks like that, plus using Optane offers very fast persistent storage which depending on bandwidth needs can replace DRAM. Either way, having a large amount of very fast storage vs. a split between DRAM and secondary storage seems to have a benefit if you believe Intel's marketing materials.
  • Deicidium369 - Thursday, June 18, 2020

    Ever been involved with a very large SAP install? Once the system is in production and needs to be restarted - the amount of time it takes to bring down and back up can take hours and hours - all the while it is not able to be used. Systems like SAP run entirely out of memory - and so during a reboot, a ton of data needs to be loaded from storage to memory - with NVDIMMs a lot of that data can be available with only a cursory check, rather than having to be loaded from relatively slow storage - allowing the system to come up much quicker - even saving a couple hours on a reboot means the business can be back up and running, saving hours of lost productivity. In most large companies - nothing happens without SAP.

    Intel's marketing materials are based on having 95% market share in the Datacenter and a long relationship with businesses and their needs. So not like they are trying to cram on more cores to convince businesses that is what they need - and making few sales.
  • Duncan Macdonald - Thursday, June 18, 2020

    A PCIe 4.0 NVMe drive can easily transfer over 250GB/Min so each terabyte of persistent (Optane or equivalent) memory gives a startup advantage of 4 minutes - hardly a massive advantage
  • Deicidium369 - Thursday, June 18, 2020

    Funny, admins of extremely large SAP and other ERP installs say otherwise.
