Cloudflare bets on ARM servers
By Thom Holwerda, submitted by Doeke on 2018-04-13 22:50:52

Cloudflare, which operates a content delivery network it also uses to provide DDoS protection services for websites, is in the middle of a push to vastly expand its global data center network. CDNs are usually made up of small-footprint nodes, but those nodes need to be in many places around the world.

As it expands, the company is making a big bet on ARM, the emerging alternative to Intel’s x86 processor architecture, which has dominated the data center market for decades.

The money quote from Cloudflare's CEO:

"We think we're now at a point where we can go one hundred percent to ARM. In our analysis, we found that even if Intel gave us the chips for free, it would still make sense to switch to ARM, because the power efficiency is so much better."

Intel and AMD ought to be worried about the future. Very worried. If I were them, I'd start work on serious ARM processors - because they're already missing out on mobile, and they're about to start missing out on desktops and servers, too.

.
RE[2]: The real problem...
By galvanash on 2018-04-15 18:37:06
> ARM cores != Xeon cores, so counting cores as a metric is rather useless. (In my experience, high-end AArch64 core performance is ~25-30% of a Broadwell/Skylake core, but YMMV.)


Atom cores are more or less in the same ballpark as ARM cores. Broadwell and Skylake are not. They pretty much disqualify themselves from this discussion by using too much power, and they perform much better single-threaded... ARM servers don't really compete with Broadwell/Skylake and don't even really try to (yet).

> Moreover, a Supermicro Big Twin (4 x 2S x Xeon Gold 6152) can pack 160 cores, 320 threads, and 6TB RAM in 2U (224 cores / 448 threads and 12TB RAM if you opt for the far more expensive Xeon Platinum 6176/6180Ms) and should be ~2-3x faster (again, YMMV) compared to a Cavium-based machine.

I was using Atom-based Xeons in my example. Why are you bringing up machines that literally cost 10x-15x as much and use many times as much power? My whole post was about competing with ARM - Atom-based Xeons compete with ARM (or at least try to). High-end Xeons cost way too much, use too much power, etc. - it isn't the same market at all.

So let me clarify... I thought the context was obvious in my post, but maybe not. Intel has nothing remotely as dense as Cavium/Centriq with competitive power/core and cost/core. My argument is simply that they could, if they wanted to, using Atom cores - they don't need to switch to ARM to compete...
Permalink - Score: 2
.
RE[6]: The real problem...
By tidux on 2018-04-16 04:47:52
FPGA toolchains are so proprietary they make Microsoft look like Richard Stallman. That has to change before they can get any real use in general computation.
Permalink - Score: 0
.
RE[7]: The real problem...
By Alfman on 2018-04-16 06:47:11
tidux,

> FPGA toolchains are so proprietary they make Microsoft look like Richard Stallman. That has to change before they can get any real use in general computation.

Yeah, I'm pretty sure this could be addressed by FOSS projects, but obviously we're not there yet. If the industry wants to fight FOSS, that would be a shame, and it might well hurt access, especially for smaller developers.
Permalink - Score: 2
.
RE[3]: The real problem...
By gilboa on 2018-04-16 06:52:57
> Tegra TX1 and very short time with ThunderX (which, as you point out, has very weak cores).

We plan to test the ThunderX2 when we have some free time (and when it's freely available).

Please note that our proprietary application is heavily CPU/cache/memory-bandwidth limited and has zero acceleration potential, so (even) the ThunderX2's limited inter-core/CPU interconnect bandwidth might be a major performance handicap.

- Gilboa

Edited 2018-04-16 06:53 UTC
Permalink - Score: 3
.
RE[3]: The real problem...
By gilboa on 2018-04-16 08:18:04
> Density.
>
> You can fit 4 dual-socket ThunderX (96 cores) with a terabyte of RAM each into a 2U rack. That's 384 cores and 4TB of RAM. Intel has nothing remotely this dense. The whole thing is probably sucking down 1000W fully loaded, but that is significantly better than most 4-node Xeon 2U servers, and you get 296 extra cores... Even if you take into account hyperthreading (which doesn't work on all workloads), you still have the ability to run about 200 more threads on a Cavium rack.


You talked about density, which usually translates to MIPS per U.
You claimed that Intel has nothing remotely close (your words, not mine) to ARM's density.
I proved otherwise.

A yet-to-be-released high-end Cavium ThunderX2-based solution can "shove" 2 x 48 x 4 (384 cores) into 2U and requires ~190W per socket.
An already-shipping Intel Xeon Platinum-based solution can pack 224 fast cores (448 threads) into 2U and requires ~165W per socket (205W if you go super-high-end).
An already-shipping AMD Epyc-based solution can pack 256 cores (512 threads) into 2U and requires 180W per socket.

As this product is still soft-launched, pricing information is not available, and if the ThunderX 1 is any indication, pricing will be ~40-50% of a comparable AMD/Intel-based solution (a far cry from your 10-15x claim).

- Gilboa

Edited 2018-04-16 08:18 UTC
Permalink - Score: 3
.
RE[4]: The real problem...
By galvanash on 2018-04-16 16:15:06
The Xeon 8180 is an $11k chip. The ThunderX2 is (at most) a $2k chip - pricing info is still hard to find, but it is likely about the same as the ThunderX (which was around $800).

https://www.anandtech.com/show/10...

That's $90k vs $12k on the CPUs alone. Cavium motherboards will obviously be far cheaper (it's an SoC, so they are far simpler), and cooling/power components will be cheaper as well. The rest of the components are irrelevant, as they are mostly not platform-specific.

10x-15x could be a bit of an overstatement, but it's still at least 5x-10x cheaper to go with Cavium (with far lower power usage on a per-thread basis), and if they really are pricing them the same as the ThunderX (say $1k), the difference really is 10x-15x...
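To make the arithmetic explicit, here is a quick back-of-the-envelope in C using only figures already quoted in this thread (8 sockets per 2U chassis, ~$11k per Xeon 8180, an assumed ~$1.5k per ThunderX2, 448 threads vs 384 cores). Every number is an assumption pulled from the discussion above, not a vendor quote:

/* Rough CPU-cost-per-thread comparison using the numbers quoted in this thread.
   All prices and counts are assumptions, not real quotes. */
#include <stdio.h>

int main(void) {
    /* 2U chassis, 4 nodes x 2 sockets = 8 sockets either way */
    double xeon_cpus   = 8 * 11000.0; /* Xeon 8180: ~$11k each -> ~$88k          */
    double cavium_cpus = 8 * 1500.0;  /* ThunderX2: assumed ~$1.5k each -> ~$12k */
    int xeon_threads   = 448;         /* 224 cores with hyperthreading           */
    int cavium_threads = 384;         /* 384 cores (no SMT figure quoted above)  */

    printf("Xeon 8180 chassis: $%.0f in CPUs, ~$%.0f per thread\n",
           xeon_cpus, xeon_cpus / xeon_threads);
    printf("ThunderX2 chassis: $%.0f in CPUs, ~$%.0f per thread\n",
           cavium_cpus, cavium_cpus / cavium_threads);
    return 0;
}

With those assumptions the gap works out to roughly 6x on CPU cost per thread, which is consistent with the 5x-10x range above.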

As far as performance goes, I think you're missing the point. If you're running a bunch of redis/memcache instances, you don't want all that performance - it's a waste of silicon. You just want a shit-ton of cores with a bunch of cheap memory hanging off of them, occupying as little rack space as possible and using minimal power... This is exactly the kind of thing ARM/Atom are good for.

Why on earth would anyone buy a Xeon Platinum to do this? I'm not arguing that high-end Xeons are bad (hell, they are awesome!) - I'm arguing that low-end Xeons (Atom-based ones) are bad. They are simply built the wrong way to compete in the market they would actually be competitive in. It's not because they are too slow, and it's not because they are too power hungry; it's because they are not dense enough for the market they should be targeting...

The market Cavium primarily targets doesn't care about MIPS/U; it cares about threads/U. Latency is all that matters...

Edited 2018-04-16 16:20 UTC
Permalink - Score: 3
.
RE[8]: The real problem...
By Lennie on 2018-04-16 17:29:34
1. I wonder if projects like RISC-V can help create a better & more open ecosystem for FPGAs.

2. Aren't FPGAs far too slow to replace a normal CPU or GPU?
Permalink - Score: 2
.
RE[9]: The real problem...
By tidux on 2018-04-16 19:02:05
FPGAs are slower than CPUs if you make them emulate full CPUs, but they can accelerate certain things faster than GPUs. Just look at crypto mining. It went CPU -> GPU -> FPGA -> ASIC. FPGAs are significantly cheaper than fabbing your own ASIC for everything.
Permalink - Score: 0
.
RE[9]: The real problem...
By Alfman on 2018-04-16 20:44:55
Lennie,

> 1. I wonder if projects like RISC-V can help create a better & more open ecosystem for FPGAs.

Maybe they could complement each other. I haven't gotten around to studying RISC-V yet.

> 2. Aren't FPGAs far too slow to replace a normal CPU or GPU?

It would depend on how you use them...

tidux,

> FPGAs are slower than CPUs if you make them emulate full CPUs, but they can accelerate certain things faster than GPUs. Just look at crypto mining. It went CPU -> GPU -> FPGA -> ASIC. FPGAs are significantly cheaper than fabbing your own ASIC for everything.

Not to overgeneralize, but I basically agree with this. GPUs are highly optimized for the vector tasks they encounter in graphics, such as simultaneously performing the exact same operation on every element. But real-world algorithms frequently have "if X then Y else Z" logic, in which case the GPU has to process the vector in two or three separate passes to handle X, Y, and Z. More complex algorithms can result in more GPU inefficiencies. There's still merit in using a GPU versus a CPU due to the sheer amount of parallelism. However, the multiple passes represent an inefficiency compared to an FPGA, which can be virtually rewired to handle the logic in one pass.
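As a rough illustration (plain C modeling the idea, not any particular GPU's actual scheduler, and with made-up function names), here is the single-pass branchy loop next to the masked multi-pass formulation a wide data-parallel device effectively ends up with:

/* One pass on a CPU: each element takes either the Y or the Z branch. */
void branchy(const float *in, float *out, long n) {
    for (long i = 0; i < n; i++) {
        if (in[i] > 0.0f)            /* X: the condition */
            out[i] = in[i] * 2.0f;   /* Y: "then" work   */
        else
            out[i] = -in[i];         /* Z: "else" work   */
    }
}

/* The GPU-style view: evaluate X for the whole vector, then run the Y pass and
   the Z pass over the whole vector, masking out the lanes that don't apply.
   An FPGA can instead lay out the Y and Z datapaths side by side and select
   per element in a single streaming pass. */
void masked_passes(const float *in, float *out, unsigned char *mask, long n) {
    for (long i = 0; i < n; i++) mask[i] = (in[i] > 0.0f);            /* pass 1: X */
    for (long i = 0; i < n; i++) if (mask[i])  out[i] = in[i] * 2.0f; /* pass 2: Y */
    for (long i = 0; i < n; i++) if (!mask[i]) out[i] = -in[i];       /* pass 3: Z */
}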

To elaborate on what you were saying, an FPGA that emulates a CPU architecture to run software is not going to perform as well as an ASIC dedicated to running that CPU architecture:

software -> machine code -> ASIC processor = faster
software -> machine code -> FPGA processor = slower


While an FPGA potentially gives us some interesting options for building processors at home, in that setup the software isn't taking advantage of the FPGA's programmable capabilities. In other words, the FPGA is being used in a way that isn't optimized for the software running on it. Consider how FPGAs are meant to be used:

software -> FPGA logic -> FPGA

Assuming the problem has a lot of parallelism and the compiler is any good, this should be significantly faster than a traditional processor stepping through sequential machine-code algorithms.
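For a concrete (if hedged) picture of that "software -> FPGA logic" step, this is roughly the flavor of C that a high-level-synthesis toolchain (Vivado/Vitis HLS, for example) turns into FPGA logic; the dot-product kernel and the fixed array size are purely illustrative, not something from this thread:

/* Illustrative HLS-style C. The tool maps the loop body onto a hardware
   pipeline, so a new pair of inputs can enter every clock cycle instead of
   a CPU fetching and executing one instruction at a time. */
int dot_product(const int a[1024], const int b[1024]) {
    int acc = 0;
    for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE   /* HLS directive: pipeline this loop */
        acc += a[i] * b[i];
    }
    return acc;
}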

An ASIC is always going to win any performance contest:

software -> ASIC logic -> ASIC

...but until we have fab technology that can somehow cheaply manufacture ASICs at home, FPGAs are the more interesting option for software developers :)

Edited 2018-04-16 20:51 UTC
Permalink - Score: 2
.
RE[4]: The real problem...
By viton on 2018-04-17 03:25:03
> Please note that our proprietary application is heavily CPU/cache/memory bandwidth limited and has zero acceleration potential
The Centriq 2460 has 60MB of L3 cache and 120GB/s of bandwidth.
The ThunderX2 has Haswell-level performance, 33MB(?) of L3, and 170GB/s (theoretical), which is higher than any Intel part.

What compiler do you use?

> so (even) ThunderX2 limited inter-core/CPU interconnect bandwidth might be major performance handicap.
This is definitely a sign of a non-multicore-friendly workload and/or programming practices.

Edited 2018-04-17 03:25 UTC
Permalink - Score: 2
