Cloudflare bets on ARM servers
By Thom Holwerda, submitted by Doeke on 2018-04-13 22:50:52

Cloudflare, which operates a content delivery network it also uses to provide DDoS protection services for websites, is in the middle of a push to vastly expand its global data center network. CDNs are usually made up of small-footprint nodes, but those nodes need to be in many places around the world.

As it expands, the company is making a big bet on ARM, the emerging alternative to Intel’s x86 processor architecture, which has dominated the data center market for decades.

The money quote from Cloudflare's CEO:

"We think we're now at a point where we can go one hundred percent to ARM. In our analysis, we found that even if Intel gave us the chips for free, it would still make sense to switch to ARM, because the power efficiency is so much better."

Intel and AMD ought to be worried about the future. Very worried. If I were them, I'd start work on serious ARM processors - because they're already missing out on mobile, and they're about to start missing out on desktops and servers, too.

.
The real problem...
By galvanash on 2018-04-13 23:37:19
It's honestly not an ARM vs x86 thing...

Really. Intel has made x86 cores (Atom) that are both performance- and power-competitive with ARM. Historically they mostly failed in the marketplace, but most of that failure was in mobile. The Xeon C3000 series is very price-, performance-, and power-competitive with anything coming to servers using ARM (on a per-core basis).

The real problem isn't Intel sticking with x86; it's that they still package their cores wrong, or at least haven't figured out how to package them right.

The latest top-end Atom-based Xeon C series part has 16 cores/16 threads and runs at 2.1GHz with a 32W TDP. The ARM-based Cavium ThunderX has 48 cores/48 threads at 2.5GHz with a 120W TDP. So the ThunderX has 3x the thread count at a little under 4x the TDP, while clocking a bit higher, i.e. it's mostly a wash from a power point of view.
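Worked through as a quick back-of-the-envelope (Python), using only the nameplate figures quoted above; treat it as a rough sanity check, not a benchmark:

    # Nameplate figures quoted above; TDP is not measured power draw.
    chips = {
        "Atom-based Xeon C": {"threads": 16, "ghz": 2.1, "tdp_w": 32},
        "Cavium ThunderX":   {"threads": 48, "ghz": 2.5, "tdp_w": 120},
    }

    for name, c in chips.items():
        threads_per_watt = c["threads"] / c["tdp_w"]
        ghz_threads_per_watt = c["threads"] * c["ghz"] / c["tdp_w"]
        print(f"{name}: {threads_per_watt:.2f} threads/W, "
              f"{ghz_threads_per_watt:.2f} GHz*threads/W")
    # Prints roughly 0.50 vs 0.40 threads/W and 1.05 vs 1.00 GHz*threads/W,
    # which is why "mostly a wash" is a fair summary.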

So why does no one use these chips and instead flock to Cavium?

Density.

You can fit 4 dual-socket ThunderX nodes (96 cores each) with a terabyte of RAM apiece into a 2U chassis. That's 384 cores and 4TB of RAM. Intel has nothing remotely this dense. The whole thing is probably sucking down 1000W fully loaded, but that is significantly better than most 4-node Xeon 2U servers, and you get 296 extra cores... Even if you take hyperthreading into account (which doesn't help every workload), you still have the ability to run about 200 more threads on a Cavium box.
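The same kind of rough arithmetic for that 2U box (Python). The Xeon core count here is a placeholder: it is simply the number that makes the "296 extra cores" figure above work out, since the exact 4-node Xeon configuration isn't specified.

    # Density math for the 2U Cavium configuration described above.
    nodes, sockets_per_node, cores_per_socket = 4, 2, 48
    ram_per_node_tb = 1
    est_power_w = 1000          # rough estimate from the comment, not measured

    cavium_cores = nodes * sockets_per_node * cores_per_socket
    cavium_ram_tb = nodes * ram_per_node_tb
    print(f"Cavium 2U: {cavium_cores} cores, {cavium_ram_tb} TB RAM, "
          f"{cavium_cores / est_power_w:.2f} cores/W")

    xeon_cores_per_2u = 88      # placeholder 4-node Xeon box; adjust per SKU
    print(f"Extra cores vs that Xeon box: {cavium_cores - xeon_cores_per_2u}")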

It's not ARM being more power efficient; it's that Intel isn't servicing the market Cavium is - the people who need the maximum number of threads in the minimum amount of space at low power. It doesn't matter too much that the Cavium machines are slower on a per-thread basis when you get almost double the number of cores per square inch of rack space (at similar power efficiency).

From a technical perspective I see no real reason why Intel couldn't build similarly dense Atom-based Xeons (and probably at a lower TDP to boot); they just don't. I haven't a clue why at this point.

If they can put 24 high-end cores running at 3.4GHz into a single chip, I don't understand why they can't put at least double that number of Atom cores into one (or even more).

Until they figure out how to do that, they are going to lose customers to ARM, not because of power efficiency, but because of density.

PS: Cloudflare seems to be going with Qualcomm Centriq-based ARM servers instead of Cavium, but the basic argument is exactly the same (both are 48 cores per CPU).

Edited 2018-04-13 23:49 UTC
Permalink - Score: 10
.
AMD Opteron A
By zdzichu on 2018-04-14 06:27:46
Opteron A isn't serious?
Permalink - Score: 2
.
RE: The real problem...
By Kochise on 2018-04-14 16:26:56
Perhaps going the ARM path also ensures better competition, rather than depending on Intel's dual lock on both x86 and the fabs. I think the x86 legacy cost is making us lag behind: however good an Atom/Xeon implementation may be, you still depend on Intel alone, or perhaps AMD, to deliver performance in a market segment that doesn't need to rely on Windows, since data-center servers can run on almost anything, provided they follow some standards.
Permalink - Score: 2
.
RE: The real problem...
By tidux on 2018-04-14 21:36:48
> From a technical perspective I see no real reason why Intel couldn't build similarly dense Atom-based Xeons

They can't get the SMP scale out on a single die to work well enough. Even AMD's Ryzen/EPYC line was a game changer for x86 due to how many threads it sticks on one chip. ARM chip vendors don't have coming up on 40 years of IBM PC history weighing them down with extra silicon, so they're free to build smaller cores in more novel configurations.
Permalink - Score: 3
.
RE[2]: The real problem...
By Treza on 2018-04-14 23:00:57
Except that, as the article indicates:
“Every request that comes in to Cloudflare is independent of every other request, so what we really need is as many cores per Watt as we can possibly get,”

It is not really SMP, or at most an easy form of it with very little data sharing between cores. Maintaining coherency between tens or hundreds of cores is power-hungry and inefficient: you need buses carrying lots of coherency traffic and large many-ported caches, and coherency adds latency...
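As a software-level illustration of that share-nothing style, here is a minimal sketch (Python); the request format and the per-request "work" are invented placeholders:

    # Each request is handled as a pure function of its input, so workers
    # share no mutable state and need no coordination between them.
    from hashlib import sha256
    from multiprocessing import Pool

    def handle_request(payload: bytes) -> str:
        # Stand-in for real per-request work (TLS, compression, lookups, ...).
        return sha256(payload).hexdigest()[:12]

    if __name__ == "__main__":
        requests = [f"GET /asset/{i}".encode() for i in range(8)]
        with Pool(processes=4) as pool:        # roughly one worker per core
            for req, resp in zip(requests, pool.map(handle_request, requests)):
                print(req.decode(), "->", resp)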

Of course, the arguably simpler ARM architecture compared to x86, and the many cores available (proprietary designs from Apple, Qualcomm and others, or from the ARM catalog), allow lots of flexibility.

Cloudflare may even one day ask for custom CPUs, with more networking interfaces, minimal floating point performance, some special accelerator for their niche...
Permalink - Score: 4
.
RE[3]: The real problem...
By Alfman on 2018-04-15 03:21:49
Treza,

> It is not really SMP, or at most an easy form of it with very little data sharing between cores. Maintaining coherency between tens or hundreds of cores is power-hungry and inefficient: you need buses carrying lots of coherency traffic and large many-ported caches, and coherency adds latency...


Obviously shared state is a bottleneck. SMP quickly reaches diminishing returns. NUMA is more scalable, but it is harder for software to use NUMA effectively if the software was designed with SMP in mind.

I think CPU architectures with explicit IO rather than implicit coherency could increase hardware performance, especially with good compiler support, but they would require new software algorithms and break compatibility, so they would be unlikely to succeed in the market.
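As a purely software-level analogy of what "explicit IO instead of implicit coherency" might look like to a programmer (this sketches the programming style, not the hardware):

    # Two processes exchange data only through explicit messages over a pipe;
    # nothing is shared behind the program's back.
    from multiprocessing import Pipe, Process

    def worker(conn):
        total = 0
        while True:
            msg = conn.recv()        # explicit IO: data arrives only when sent
            if msg is None:          # sentinel: no more work
                conn.send(total)
                conn.close()
                return
            total += msg

    if __name__ == "__main__":
        parent_conn, child_conn = Pipe()
        p = Process(target=worker, args=(child_conn,))
        p.start()
        for value in range(10):
            parent_conn.send(value)  # every transfer is visible in the code
        parent_conn.send(None)
        print("worker total:", parent_conn.recv())   # prints 45
        p.join()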

I think the way forward will be hundreds or thousands of independent cores that, as you say, function more like a cluster of nodes than like SMP cores with shared memory.

I can see such a system benefiting from a very high speed interconnect that serves a similar function to Ethernet but offers much faster and more efficient IO between nodes. Since a fully connected mesh becomes less feasible at high core counts, we'll likely see more software algorithms evolving to support mesh topologies natively (and with high performance). Most of these algorithms will be abstracted behind libraries. For example, we'll probably see sharded database servers that expose familiar interfaces but distribute and reconstruct data across the mesh at record speed.
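A toy sketch of that sharding idea (Python): a dict-like front end hashes each key to one of several nodes. The node names and the in-process dicts standing in for remote storage are made up, and a real system would add the interconnect transport, replication and rebalancing.

    import hashlib

    class ShardedStore:
        """Familiar get/put interface; data is spread across 'nodes'."""

        def __init__(self, node_names):
            self.node_names = list(node_names)
            # Local dicts stand in for storage on remote mesh nodes.
            self.nodes = [dict() for _ in self.node_names]

        def _shard(self, key: str) -> int:
            digest = hashlib.sha256(key.encode()).hexdigest()
            return int(digest, 16) % len(self.nodes)

        def put(self, key, value):
            self.nodes[self._shard(key)][key] = value

        def get(self, key):
            return self.nodes[self._shard(key)].get(key)

    store = ShardedStore(["node-a", "node-b", "node-c"])
    store.put("user:42", {"name": "alice"})
    print(store.get("user:42"), "on", store.node_names[store._shard("user:42")])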


I for one am excited by the prospects of such highly scalable servers!

Edited 2018-04-15 03:22 UTC
Permalink - Score: 3
.
RE: The real problem...
By gilboa on 2018-04-15 08:51:45
An ARM core != a Xeon core, so counting cores as a metric is rather useless. (In my experience, high-end AArch64 cores perform at ~25-30% of a Broadwell/Skylake core, but YMMV.)

Moreover, a Supermicro BigTwin (4 nodes x 2S Xeon Gold 6152) can pack 160 cores / 320 threads and 6TB of RAM into 2U (224 cores / 448 threads and 12TB of RAM if you opt for the far more expensive Xeon Platinum 8176/8180M), and should be ~2-3x faster (again, YMMV) than a Cavium-based machine.

Now, I've added the YMMV a couple of times, and for good reason.
ARM has two advantages (and density is *not* one of them); a rough cost sketch follows the list below.
1. Price per transaction. Intel's Xeon pricing, especially for the high-end parts and the M parts, is unreasonable. AMD might be able to pull another Opteron and force Intel to lower its prices, but that remains to be seen.
2. Power per transaction. ARM cores are more efficient. If your application requires a lot of slow threads and you have a limited power budget, ARM is definitely the answer.
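A rough sketch of how those two metrics combine into cost per transaction (Python). Every number below is made up purely for illustration; the point is only that the power term survives even when the hardware price drops to zero, which is the "even if Intel gave us the chips for free" argument from the article.

    def cost_per_million_requests(server_price_usd, avg_power_w, reqs_per_sec,
                                  years=3, usd_per_kwh=0.10, pue=1.5):
        seconds = years * 365 * 24 * 3600
        total_requests_m = reqs_per_sec * seconds / 1e6
        energy_kwh = avg_power_w / 1000 * (seconds / 3600) * pue
        return (server_price_usd + energy_kwh * usd_per_kwh) / total_requests_m

    # Hypothetical boxes; plug in real quotes and measured throughput instead.
    print(f"x86 box:         ${cost_per_million_requests(12000, 450, 20000):.4f} per 1M requests")
    print(f"ARM box:         ${cost_per_million_requests(9000, 300, 18000):.4f} per 1M requests")
    print(f"x86 at $0 capex: ${cost_per_million_requests(0, 450, 20000):.4f} per 1M requests")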

- Gilboa

Edited 2018-04-15 08:52 UTC
Permalink - Score: 3
.
RE[4]: The real problem...
By Treza on 2018-04-15 12:43:40
And let's call these massively parallel architectures, with huge memory bandwidth, hundreds of cores and multithreading to hide memory latency...

GPGPUs!!!
Permalink - Score: 3
.
RE[2]: The real problem...
By viton on 2018-04-15 13:36:06
> (In my experience, high-end AArch64 cores perform at ~25-30% of a Broadwell/Skylake core, but YMMV.)
So which “high-end” ARM did you test, and how?
Do you have any experience with Centriq or ThunderX2?
ThunderX was really weak.
Permalink - Score: 1
.
RE[5]: The real problem...
By Alfman on 2018-04-15 14:03:52
Treza,

> And let's call these massively parallel architectures, with huge memory bandwidth, hundreds of cores and multithreading to hide memory latency...

GPGPUs!!!



Obviously GPGPUs have their uses, but they target different kinds of problems. For Cloudflare's example of web hosting, a massively parallel GPGPU isn't very useful, but a massively parallel cluster is.

In the long term, FPGAs could eventually unify GPUs and CPUs so that we no longer have to consider them different beasts for different workloads. Instead of compiling down to a fixed instruction set architecture, software can be compiled directly into transistor logic.

I'm not happy with the price of GPUs these days, so I think there may be an opportunity for FPGAs to grow out of niche status and become more of a commodity. However, IMHO, it will be many years before software toolchains are actually ready to target FPGAs. What we have is a sort of chicken-and-egg problem.
Permalink - Score: 4
