AMD Threadripper reviews and benchmarks
By Thom Holwerda on 2017-08-11 19:46:32

In this review we've covered several important topics surrounding CPUs with large numbers of cores: power, frequency, and the need to feed the beast. Running a CPU is like the inverse of a diet - you need to put all the data in to get any data out. The more pie that can be fed in, the better the utilization of what you have under the hood.

AMD and Intel take different approaches to this. We have a multi-die solution compared to a monolithic solution. We have core complexes and Infinity Fabric compared to a MoDe-X based mesh. We have unified memory access compared to non-uniform memory access. Both are going hard against frequency and both are battling against power consumption. AMD supports ECC and more PCIe lanes, while Intel provides a more complete chipset and specialist AVX-512 instructions. Both are competing in the high-end prosumer and workstation markets, promoting high-throughput multi-tasking scenarios as the key to unlocking the potential of their processors.

As always, AnandTech's the only review you'll need, but there's also the Ars review and the Tom's Hardware review.

I really want to build a Threadripper machine, even though I just built a very expensive (custom watercooling is pricey) new machine a few months ago, and honestly, I have no need for a processor like this - but the little kid in me loves the idea of two dies fused together, providing all this power. Let's hope this renewed emphasis on high core and thread counts pushes operating system engineers and application developers to make more and better use of all the threads they're given.

.
RE[6]: Threads
By Alfman on 2017-08-13 05:51:31
kwan_e,

> But trying to be too general in your approach will mean getting the worst of both worlds. If the software doesn't require such a thing, they shouldn't pay the cost of the underlying implementation.

I'm not sure what your criticism is specifically, what is it you don't like?


> To me, that just means the OS should open up a way for a process to say "these bunch of threads/tasks/contexts should be clustered together" and the software can say "these work units are of type X" and the OS can schedule them appropriately. Something like Erlang's lightweight processes?

Sure, you could bundle some threads together, and then write code such that those threads avoid sharing memory or synchronization primitives with other bundles, and then make sure network sockets are only accessed by threads in the correct bundle associated with the remote client. This is all great, but it should also sound very familiar! We've basically reinvented the "process" :)
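To make the point concrete, here's a minimal sketch (all names are mine, not from the thread) of such a "bundle": threads that share state only with each other and talk to other bundles exclusively through message queues - the same isolation discipline an OS process enforces by construction.

```python
import queue
import threading

# Hypothetical sketch: a "bundle" of threads whose only cross-bundle
# channel is an explicit message queue, mirroring process isolation.
class Bundle:
    def __init__(self, name):
        self.name = name
        self.inbox = queue.Queue()       # sole channel into this bundle
        self.results = queue.Queue()
        self.worker = threading.Thread(target=self._loop, daemon=True)
        self.worker.start()

    def _loop(self):
        while True:
            msg = self.inbox.get()
            if msg is None:              # shutdown sentinel
                return
            # mutate only bundle-local state; never touch another bundle
            self.results.put(f"{self.name} handled {msg}")

    def send(self, msg):
        self.inbox.put(msg)

b = Bundle("bundle-A")
b.send("req-1")
print(b.results.get())   # "bundle-A handled req-1"
b.send(None)
```

Squint at it and the Bundle is a process in everything but name: private state, explicit IPC, nothing shared by accident.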

Edited 2017-08-13 06:10 UTC
Permalink - Score: 2
.
RE[7]: Threads
By kwan_e on 2017-08-13 08:55:12
> I'm not sure what your criticism is specifically, what is it you don't like?

Having programs that can be offloaded onto the network is fine, but it is not necessary. Taking advantage of it would affect a program's design in ways that make it substandard for its common use case.

> > Something like Erlang's lightweight processes?

> This is all great, but it should also sound very familiar! We've basically reinvented the "process" :)

Pretty sure lightweight processes a la Erlang aren't processes. Context switching between processes is much more expensive than those lightweight processes.

And also, why not have multiple levels of automated task management? The top level is the process, but why stop at one level? OS-level processes are there for security purposes, and one could argue that piling other responsibilities onto that one abstraction is inefficient.
Permalink - Score: 2
.
RE[8]: Threads
By Alfman on 2017-08-13 13:40:46
kwan_e,

> Having programs that can be offloaded onto the network is fine, but it is not necessary. To take advantage of that, it would affect a program's design in a way that would make it substandard for its common use case.

I'm going to ask you again to be more specific. I'm not saying being able to run a networking cluster needs to be a goal, however it is a nice side effect of having a design that's well optimized for NUMA. Only having explicit IPC between NUMA regions aligns very well with the network clustering too!

> Pretty sure lightweight processes a la Erlang aren't processes. Context switching between processes is much more expensive than those lightweight processes.

Not when they're running on physically separate cores and don't have to context switch.

> And also, why not have multiple levels of automated task management? The top level is the process, but why have one level? OS level processes are there for security purposes, and one could argue putting other responsibilities onto that one abstraction is inefficient.

It's true that you might need somewhat more memory to run multiple processes instead of one (to match the NUMA regions), but when you think about it carefully, this decoupling of address spaces is precisely what we need to avoid costly, performance-killing IO across NUMA boundaries.


From your original post:
> Programs should definitely be thread agnostic and thus structured (layered) for usage patterns like work-stealing queues etc.

I understand why you like this design pattern, and it would be perfectly fine to use MT across the cores within a NUMA region, but unfortunately this pattern doesn't scale well across NUMA boundaries. This is what I was getting at before, the more you optimize MT code to reduce the overhead on NUMA, the more you end up engineering something that acts like a process.

As you know, NUMA scalability comes by compromising equal access to global address space and resources. Designing software around NUMA locality is just a suggestion, but of course you are welcome to engineer things however you like :)

Edited 2017-08-13 13:50 UTC
Permalink - Score: 2
.
RE[9]: Threads
By kwan_e on 2017-08-13 16:33:06
> Not when they're running on physically separate cores and don't have to context switch.

If you have more processes than cores, you have to context switch to give all processes some fair run time. That's how multitasking works, and the OS has to get involved. That's not the same with lightweight processes.
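The cost difference is easy to see with cooperative tasks, which switch in user space without involving the kernel scheduler. This sketch (using Python's asyncio, not Erlang, so only an analogy - Erlang additionally preempts its processes and spreads them over scheduler threads) multiplexes ten thousand tasks onto one OS thread:

```python
import asyncio

# Thousands of "lightweight processes" on a single OS thread:
# switching between them is a user-space operation, not a kernel
# context switch.
async def task(i):
    await asyncio.sleep(0)    # yield point: user-space switch
    return i * 2

async def main():
    return await asyncio.gather(*(task(i) for i in range(10_000)))

results = asyncio.run(main())
print(len(results), results[:3])   # 10000 [0, 2, 4]
```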

> From your original post:
> Programs should definitely be thread agnostic and thus structured (layered) for usage patterns like work-stealing queues etc.

> I understand why you like this design pattern, and it would be perfectly fine to use MT across the cores within a NUMA region, but unfortunately this pattern doesn't scale well across NUMA boundaries.


Only with, as you keep saying, naive MT. But I'm not talking about naive MT. I'm talking about structuring a program into packaged tasks, and keeping threading concerns out of those tasks. The threading concerns should be handled by something else, which may include NUMA aware executors.
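That separation of concerns can be sketched with the stdlib executor (a stand-in here - a genuinely NUMA-aware executor is an assumption, not something Python ships): the work units are plain callables that contain no threading code at all, and the executor owns every placement and scheduling decision.

```python
from concurrent.futures import ThreadPoolExecutor

# The task knows nothing about threads, cores, or NUMA; it is just a
# packaged unit of work.
def work_unit(n):
    return n * n

# All threading concerns live in the executor, which a richer runtime
# could make NUMA-aware without touching the tasks.
with ThreadPoolExecutor(max_workers=4) as pool:
    squares = list(pool.map(work_unit, range(8)))

print(squares)   # [0, 1, 4, 9, 16, 25, 36, 49]
```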

> This is what I was getting at before, the more you optimize MT code to reduce the overhead on NUMA, the more you end up engineering something that acts like a process.

Yes. Something that acts like a process but is not one. Particularly not one which is as heavy. Just because something acts like something else doesn't mean they're the same or have the same costs. As I said before, the main reason for processes is security - making sure that processes don't step on each other's memory. Not every logical process requires that kind of separation, and thus not all of them need to map to an OS process.

And as I also said before, the intermediate level of abstraction should probably be provided by the OS or some architecture aware library. So programs don't need to reinvent the wheel because it's already done.
Permalink - Score: 2
.
RE[10]: Threads
By Alfman on 2017-08-14 00:43:11
kwan_e,

> If you have more processes than cores, you have to context switch to give all processes some fair run time. That's how multitasking works, and the OS has to get involved. That's not the same with lightweight processes.

Yes, however I don't suggest having more processes than cores, only one per NUMA region.

http://www.osnews.com/comments/2...


> Only with, as you keep saying, naive MT. But I'm not talking about naive MT. I'm talking about structuring a program into packaged tasks, and keeping threading concerns out of those tasks. The threading concerns should be handled by something else, which may include NUMA aware executors.

Of course you can make MT less naive, but the part you're overlooking is that once you solve the overhead between NUMA regions, you end up with a design pattern that mimics a process anyway, even if that wasn't the goal.


This is the reason I keep asking you to provide a specific objection. I wish you would so that we could talk about it.



> And as I also said before, the intermediate level of abstraction should probably be provided by the OS or some architecture aware library. So programs don't need to reinvent the wheel because it's already done.

Take a look at the clone syscall in linux.
https://linux.die.net/man/2/clone

Creating a new thread or a new process uses very similar kernel paths. The difference is that a new process gets its own address space, while a thread does not. In terms of keeping NUMA regions separate, using processes is not a hack or a shortcut; a separate address space is exactly the pattern we need to follow to scale well on NUMA.
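The address-space difference is observable from userland (this POSIX-only sketch uses fork rather than raw clone, but on Linux fork is built on the same machinery): a thread's write to shared state is visible to its parent, while a forked child's write stays in its own copy-on-write copy.

```python
import os
import threading

# Shared vs. separate address spaces in one demo.
counter = [0]

# A thread shares the address space: its write is visible afterwards.
t = threading.Thread(target=lambda: counter.__setitem__(0, 1))
t.start(); t.join()
assert counter[0] == 1

counter[0] = 0
pid = os.fork()               # POSIX only; child gets a CoW address space
if pid == 0:
    counter[0] = 99           # private to the child's copy
    os._exit(0)
os.waitpid(pid, 0)
print(counter[0])             # 0 -- the child's write never crossed over
```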

Again, I get that you don't want to agree with me here, but I'd like for you to put forward a specific objection that we can discuss.
Permalink - Score: 2
.
RE[7]: Threads
By tylerdurden on 2017-08-14 00:47:13
I think the problem is that you're seeing "threads" as full processes, not as the fine-grained streams that NUMA deals with.
Permalink - Score: 2
.
RE[8]: Threads
By Alfman on 2017-08-14 01:11:42
tylerdurden,

> I think the problem is that you're seeing "Threads" as full processes, not at the fine grained streams that NUMA deals with.


Haha, but there is no problem. It's not a matter of definition, it's strictly a matter of the software compromises necessary to make NUMA scale well. Oh well.
Permalink - Score: 2
.
RE[9]: Threads
By tylerdurden on 2017-08-14 05:04:30
NUMA systems have scaled to over 1K cores; I'd say that's good scalability.
Permalink - Score: 2
.
RE[10]: Threads
By Alfman on 2017-08-14 06:12:17
tylerdurden,

> NUMA systems have scaled to over 1K cores, I'd say that's a good scalability.


You can throw hundreds or thousands of CPUs on enormous buses. Fine, whatever, it's completely orthogonal to my point that NUMA scalability only works when we isolate threads from one another across NUMA boundaries. The scalability will be absolutely pitiful with ordinary MT algorithms and 1000 cores. To use NUMA effectively, we have to distance ourselves from the traps that conventional MT programmers can fall into since threads are not equal.

We should all be in full agreement that locality is of utmost importance for NUMA to scale, right?


Edit: This is why performance can get worse with more cores...

https://arstechnica.com/gadgets/2...
> ...AMD also claims that certain games, older ones for the most part, run better (to an average of four percent) when presented with less physical cores. These include Fallout 4, Dota 2, Heroes of the Storm, and Civilization VI.


https://www.reddit.com/r/hardware...
> In a nutshell, because of the way Threadripper is made from two dies, there's going to be a huge memory latency impact crossing from one to the other - measured on Epyc it's about twice as bad as the CCX penalty. So you'll want to avoid that....but half of your cores, lanes and memory is on the other die - so you're going to need to if you want to use all of it. A program can be carefully designed to take this into account, but I've never heard of a NUMA aware game. So how will games handle this? That depends on what "mode" you choose in the UEFI....

The latter link is a good read on the topic.

Edited 2017-08-14 06:31 UTC
Permalink - Score: 2
.
RE[11]: Threads
By tylerdurden on 2017-08-14 06:25:18
I think we keep missing each other's point. I'm referring to NUMA in the traditional architectural sense of the term, which deals with intra-process parallelism. You keep referring to supposed issues/solutions at the inter-process level.
Permalink - Score: 2


© OSNews LLC 1997-2007. All Rights Reserved.