The best hardware to build with Swift is not what you think
By Thom Holwerda on 2017-03-15 23:22:09

Some interesting figures from LinkedIn, who benchmarked the compile times of their Swift-based iOS application. You'd think the Mac Pro would deliver the fastest compiles, but as it turns out, that's not quite true.

As you can see, 12-core MacPro is indeed the slowest machine to build our code with Swift, and going from the default 24 jobs setting down to only 5 threads improves compilation time by 23%. Due to this, even a 2-core Mac Mini ($1,399.00) builds faster than the 12-cores Mac Pro ($6,999.00).

As Steven Troughton-Smith notes on Twitter - "People suggested that the Mac Pro is necessary because devs need more cores; maybe we just need better compilers? There's no point even theorizing about a 24-core iMac Pro if a 4-core MBP or mini will beat it at compiling."

.
RE[3]: Comment by Alfman
By CodeMonkey on 2017-03-16 14:41:00
> wouldn't it be marvelous to have on-chip addressable local memory under the control of the programmer, like the 256KB local storage in Cell (PS3)?

That's essentially what you have on the new Knights Landing Xeon Phi CPUs. There's 16GB of on-die RAM (MCDRAM). The chip can be configured so that the MCDRAM is transparently used as another cache level, in which case it's not under the control of the programmer but is instead managed by the memory controller, or you can configure it as a distinctly separate NUMA domain (or 4 domains), in which case it is directly addressable by the programmer in their application.
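
In flat mode an application can ask for MCDRAM explicitly. A minimal sketch, assuming the memkind library's hbwmalloc interface is installed and the MCDRAM is exposed as its own NUMA node (build with something like cc mcdram.c -lmemkind):

/* Hypothetical sketch: explicitly allocating from MCDRAM on a flat-mode
 * Knights Landing system via the memkind project's hbwmalloc API. */
#include <stdio.h>
#include <string.h>
#include <hbwmalloc.h>

int main(void)
{
    size_t n = 1u << 24;                       /* 16M doubles = 128 MB */
    double *buf = hbw_malloc(n * sizeof *buf); /* served from MCDRAM when present;
                                                  the default policy falls back to DRAM */
    if (buf == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    memset(buf, 0, n * sizeof *buf);
    /* ... bandwidth-hungry work on buf ... */
    hbw_free(buf);
    return 0;
}

(numactl --membind can do the same for a whole unmodified process by binding it to the MCDRAM node.)
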
Permalink - Score: 2
.
RE[2]: Comment by Alfman
By Alfman on 2017-03-16 14:44:03
Earl C Pottinger,

> I write code to run multi-threaded in Haiku, and one of the two things I have to struggle with is to get the inner loops to fit inside the L1 cache of each CPU so the threads run as fast as possible.

Wow, most people have just given up trying to optimize software at that level, but you are right it does make a big difference if you pull it off.
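
For anyone who hasn't fought that fight: the usual trick is loop blocking/tiling, so the working set of the inner loops stays inside L1. A generic C sketch (not Earl's Haiku code; the 32 KB L1 data cache here is just an assumed size):

/* Generic loop-tiling sketch: a blocked matrix transpose. Each 32x32 tile
 * of doubles is 8 KB, so the source and destination tiles together (16 KB)
 * fit comfortably in an assumed 32 KB L1 data cache. */
#include <stddef.h>
#include <stdlib.h>

#define N     4096
#define BLOCK 32

static void transpose_blocked(const double *restrict src, double *restrict dst)
{
    for (size_t ii = 0; ii < N; ii += BLOCK)
        for (size_t jj = 0; jj < N; jj += BLOCK)
            for (size_t i = ii; i < ii + BLOCK; i++)
                for (size_t j = jj; j < jj + BLOCK; j++)
                    /* Within a tile the column-wise writes to dst keep hitting
                     * the same few cache lines instead of striding across the
                     * whole matrix and evicting them between uses. */
                    dst[j * N + i] = src[i * N + j];
}

int main(void)
{
    double *src = malloc((size_t)N * N * sizeof *src);
    double *dst = malloc((size_t)N * N * sizeof *dst);
    if (src == NULL || dst == NULL)
        return 1;
    for (size_t i = 0; i < (size_t)N * N; i++)
        src[i] = (double)i;
    transpose_blocked(src, dst);
    free(src);
    free(dst);
    return 0;
}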

> What I find a harder problem, and sometimes unsolvable, is the data access of the different threads; it is rare that the threads want to work on the same data at the same time, resulting in each thread invalidating the L2 and L3 caches.

Once you leave the local cache, the higher-level caches and memory bandwidth obviously become the primary bottleneck. We can keep adding dozens of cores, but that just exacerbates contention on the shared bus between them, and the overhead largely negates the benefit of the new cores.

Even when threads are "embarrassingly parallel" and do not share the same data and have no fine-grained dependencies on one another, the shared memory bottleneck kills performance for them. So in the long run the only viable way to achieve scalability across hundreds of cores is with NUMA (non-uniform memory access) topologies. This way each independent process can run at full speed without any degradation from neighboring processes.

I think NUMA is great for VMs and running many daemon instances simultaneously, but a lot of modern multithreaded code is designed around the basic assumption that data structures can be efficiently shared by all threads, which is a bad assumption for NUMA systems intending to scale to thousands of cores. Long-term scalability requires us as developers to treat cores less like a shared memory architecture and more like a cluster.
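
A rough sketch of what "treating cores like a cluster" already looks like on a NUMA box: pin each worker to one node and give it only node-local memory, so workers never compete for a remote memory bus. This assumes Linux with libnuma (build with something like cc numa_workers.c -lnuma -lpthread):

/* Hypothetical sketch: one worker per NUMA node, each running on that node's
 * CPUs and touching only memory allocated from that node. */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <numa.h>

#define CHUNK ((size_t)64 * 1024 * 1024)   /* 64 MB of private data per worker */

static void *worker(void *arg)
{
    int node = (int)(long)arg;

    numa_run_on_node(node);                       /* run only on this node's CPUs */
    char *data = numa_alloc_onnode(CHUNK, node);  /* memory local to that node    */
    if (data == NULL)
        return NULL;

    for (size_t i = 0; i < CHUNK; i++)            /* purely node-local work */
        data[i] = (char)i;

    numa_free(data, CHUNK);
    return NULL;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }

    int nodes = numa_max_node() + 1;
    pthread_t tid[nodes];

    for (int n = 0; n < nodes; n++)
        pthread_create(&tid[n], NULL, worker, (void *)(long)n);
    for (int n = 0; n < nodes; n++)
        pthread_join(tid[n], NULL);
    return 0;
}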


Of course, the real question might be why we would need hundreds of cores anyway, haha. If we had them, we could each have our own personal IBM Watson (which ran on 90 nodes with 720 total physical cores); I'm sure there's some innovation to be had.

You could probably have a system that writes/directs/orchestrates/acts/renders a Hollywood-quality movie in real time. Instead of "searching" for a movie as one does on Netflix, a user could enter their personal preferences and a movie would be custom-generated to their specs in real time! Friends would end up sharing their favorite creations much the way they share Minecraft world seeds.
Permalink - Score: 2
.
RE[2]: Comment by Alfman
By osvil on 2017-03-16 16:14:55
Note that depending on the CPU, L2 may be per-core while sharing happens at L3.

In any case, if you can get everything optimized to run from L1, you get that extra speed. However, in many cases (at least the ones I am tackling) you still have the bottleneck of feeding/outputting the data.

People are just not aware of how fast the processors we use today actually are.
Permalink - Score: 1
.
RE[2]: Comment by joekiser
By henderson101 on 2017-03-16 17:22:12
Agreed, I was just looking down the comments to see if anyone else had noted that the Mac Pro is woefully underpowered. I believe the high-end iMacs have more processor power than the Mac Pro. Marco Arment is forever whining about it on ATP. I think he asserted this week that the current iPhone 7 has a better Geekbench score in single-threaded mode than a Mac Pro.
Permalink - Score: 2
.
RE: Comment by joekiser
By Bill Shooter of Bul on 2017-03-16 17:22:15
Interesting, I also don't have a LinkedIn account, but I can view the results just fine. Maybe geofencing?
Permalink - Score: 2
.
RE[3]: Comment by Alfman
By Earl C Pottinger on 2017-03-16 17:41:01
I wonder if we could make a machine that appears to the programmer to give access to all the memory, but in fact divides the memory into blocks so each CPU has its own local memory pool.

When accessing memory that is in its pool, the access is very fast; if the access is outside its pool, then a virtual memory system moves that block to the local pool and invalidates/exchanges the original block.

Of course, if memory access tends to be scattered then you would end up with a slow machine, but if each thread tends to work on its own data, the data will move to the CPU memory block with the fast access.

Darn, having thought about it some more, I see a real speed-up for some programs, but most programs would slow down.
Permalink - Score: 1
.
Try it on Linux?
By Bill Shooter of Bul on 2017-03-16 18:56:54
They have a Swift compiler for Linux; I wonder if it demonstrates the same issues. Obviously, you couldn't compile iOS/macOS applications, but if it's really Xcode and not the OS, you should be able to reproduce it with a good sample app.
Permalink - Score: 3
.
RE[4]: Comment by Alfman
By Megol on 2017-03-16 19:03:44
> I wonder if we could make a machine that appears to the programmer to give access to all the memory, but in fact divides the memory into blocks so each CPU has its own local memory pool.


AKA NUMA (Non-Uniform Memory Access). It has been standard for a long time now, even in consumer computers.

> When accessing memory that is in its pool, the access is very fast; if the access is outside its pool, then a virtual memory system moves that block to the local pool and invalidates/exchanges the original block.

You mean something like COMA (Cache-Only Memory Architecture)? Otherwise the standard cache coherency system does the equivalent: memory is either in a fixed location (DRAM), in one processor's cache, or in several processors' caches (shared, with only one having write access).

> Of course, if memory access tends to be scattered then you would end up with a slow machine, but if each thread tends to work on its own data, the data will move to the CPU memory block with the fast access.

Yes.

> Darn, having thought about it some more, I see a real speed-up for some programs, but most programs would slow down.

While you may be being ironic (hard to tell on the Internet), this is essentially how computers already do it.

But IMHO message-passing is the future. That's actually what the cache coherency protocols do, albeit hidden behind memory accesses.
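
A toy illustration of the difference: two threads exchanging copies of data over an explicit channel (here just a POSIX pipe) instead of mutating a shared structure. Purely a sketch; a real system would use MPI, sockets, or lock-free queues (build with something like cc msgpass.c -lpthread):

/* Minimal message-passing sketch between two threads in one process. */
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

static int channel[2];   /* channel[0] = read end, channel[1] = write end */

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10; i++)
        write(channel[1], &i, sizeof i);   /* "send": the message is copied */
    close(channel[1]);                     /* signal end-of-stream          */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    int msg;

    if (pipe(channel) != 0)
        return 1;
    pthread_create(&tid, NULL, producer, NULL);

    /* The consumer works on its own copy of each message, so the two threads
     * never contend for the same data structures or cache lines. */
    while (read(channel[0], &msg, sizeof msg) == (ssize_t)sizeof msg)
        printf("got %d\n", msg);

    pthread_join(tid, NULL);
    return 0;
}
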
Permalink - Score: 2
.
RE[4]: Comment by Alfman
By Alfman on 2017-03-16 19:04:52
Earl C Pottinger,

> I wonder if we could make a machine that appears to the programmer to give access to all the memory, but in fact divides the memory into blocks so each CPU has its own local memory pool.

> When accessing memory that is in its pool, the access is very fast; if the access is outside its pool, then a virtual memory system moves that block to the local pool and invalidates/exchanges the original block.

> Of course, if memory access tends to be scattered then you would end up with a slow machine, but if each thread tends to work on its own data, the data will move to the CPU memory block with the fast access.

> Darn, having thought about it some more, I see a real speed-up for some programs, but most programs would slow down.

I've thought about it before. It seems feasible to measure the access patterns to see exactly where the bottlenecks are, but actually compensating for them without correcting the underlying code seems infeasible to me.

At a fundamental level we need our data structures and logic paths to be more parallel-friendly throughout the program to remove synchronous dependencies, which is something that currently requires a dedicated programmer to do. Maybe some day we'll have code optimizers that see what the programmer was trying to do and automatically generate new code better suited for distributed NUMA topologies, but it doesn't seem we're there yet. The main difficulty is that the optimizer doesn't know how to differentiate between behaviors the programmer really intended and those that are merely coincidental.

Let's say there's a program that connects to a server, reads a file, transmits chunks of the file to the server in a loop, writes a file, and then exits, in that sequence. A computer has to assume all of these steps are deliberate and is not free to optimize anything that would change the semantics. Theoretically it might have been possible to get more parallelization by opening/reading/writing the file while the connection was being made, but the computer doesn't know that, because the sequential ordering is implied by the code even if it isn't critical. We can't know if the programmer intended for the file write to be absolutely dependent on/synchronized with the socket IO. Perhaps it's a deliberate log/audit record, in which case the synchronization is important, or perhaps it's an unrelated step that just happens to be specified sequentially because that was the easiest way for the programmer to implement it.
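
A human who knows those two steps are independent can express the overlap explicitly, which is exactly what a conservative optimizer cannot infer from the sequential code. A hypothetical sketch, with stub functions standing in for the real connect and file read:

/* Hypothetical sketch: overlap "connect to server" with "read the file",
 * something only the programmer knows is safe. connect_to_server() and
 * load_file() are made-up stubs that just sleep to simulate the latency. */
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

static int connect_to_server(void)    /* stub: pretend this takes a while */
{
    sleep(1);
    return 42;                        /* pretend socket descriptor */
}

static void *load_file(void *arg)     /* stub: pretend this reads the file */
{
    sleep(1);
    *(int *)arg = 1;                  /* mark the "file" as loaded */
    return NULL;
}

int main(void)
{
    int file_ready = 0;
    pthread_t tid;

    /* Sequential version: connect, then read, roughly 2 s.
     * Overlapped version: both at once, roughly 1 s. */
    pthread_create(&tid, NULL, load_file, &file_ready);
    int sock = connect_to_server();
    pthread_join(tid, NULL);

    printf("sock=%d file_ready=%d\n", sock, file_ready);
    /* ... transmit the file over sock, write results, then exit ... */
    return 0;
}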


A game daemon might send/receive messages with many UDP clients in a specific access pattern. In all likelihood the ordering is just coincidental based on the arbitrary algorithms chosen by the programmer, yet the optimizer is not free to replace the sequential algorithm with a new parallel algorithm with no inter-dependencies because it has no way to know whether the packet dependencies were intentional.

So even if we had a very smart optimizer that's capable of transforming serial programs to parallel automatically, it's going to be highly constrained to follow the unintentional semantics of the original programmer's code that specifies thousands of instances of inadvertent serialization, which is the enemy of scalability.

Now maybe there could be some AI that could use fuzzy analysis to determine whether a serial construction was intentional or not (i.e. execute A + B + C in parallel even though the original code said to execute them in order), but if the AI makes a wrong judgement then it opens up the possibility that subtle new bugs will creep in, like an application confirming a transaction while the transaction is still being processed in parallel.

For all these reasons, I think upgrading to massively parallel architectures is not going to be something we can solve just with hardware...we're going to have to rewrite the software too.

Edited 2017-03-16 19:09 UTC
Permalink - Score: 2
.
RE[5]: Comment by Alfman
By Alfman on 2017-03-16 19:17:04
Megol,

> But IMHO message-passing is the future. That's actually what the cache coherency protocols do however hidden as memory accesses.

I agree with you on message passing; it can work with both multi-core processors and extremely large clusters. Time will show it to be the most scalable solution.
Permalink - Score: 2
