The benefits and costs of writing a POSIX kernel in Go
By Thom Holwerda on 2018-10-08 23:51:05

This paper presents an evaluation of the use of a high-level language (HLL) with garbage collection to implement a monolithic POSIX-style kernel. The goal is to explore if it is reasonable to use an HLL instead of C for such kernels, by examining performance costs, implementation challenges, and programmability and safety benefits.

The paper contributes Biscuit, a kernel written in Go that implements enough of POSIX (virtual memory, mmap, TCP/IP sockets, a logging file system, poll, etc.) to execute significant applications. Biscuit makes liberal use of Go's HLL features (closures, channels, maps, interfaces, garbage collected heap allocation), which subjectively made programming easier. The most challenging puzzle was handling the possibility of running out of kernel heap memory; Biscuit benefited from the analyzability of Go source to address this challenge.

On a set of kernel-intensive benchmarks (including NGINX and Redis) the fraction of kernel CPU time Biscuit spends on HLL features (primarily garbage collection and thread stack expansion checks) ranges up to 13%. The longest single GC-related pause suffered by NGINX was 115 microseconds; the longest observed sum of GC delays to a complete NGINX client request was 600 microseconds. In experiments comparing nearly identical system call, page fault, and context switch code paths written in Go and C, the Go version was 5% to 15% slower.

Scientific papers about operating system experiments - who doesn't love them?

RE[2]: Comment by FlyingJester
By FlyingJester on 2018-10-09 20:40:56
What I mean is that what is idiomatic in Go is not idiomatic in C, and that will change the results.

In C, you very often tightly pack structures, whereas in a GC'ed language this is not nearly as natural (and in some cases not possible). You often rely on non-exhaustive logic in the small scale and on partial structure initialization, both of which are usually frowned upon in other languages.

I don't know a lot of Go, but many languages also provide easy-to-use features (interfaces, exhaustive disjunctions, compiler-verified memory region re-use) that are extremely verbose to express in C.

Making the C similar to Go will cause the benchmark to be "If I make C look like Go, which is faster?", which I very much suspect will not favor C as much as a comparison of idiomatic C against idiomatic Go would.
Permalink - Score: 4
.
RE: Excellent find Thom!
By christian on 2018-10-09 20:53:51
> We often debate this very topic without having good data to talk about. This should make for good points of discussion.


Wish this paper had come out a few weeks ago :)

> Personally I've been leaning towards achieving memory safety through static language analysis rather than garbage collection, but this is very interesting none-the-less. These performance tradeoffs may be worthwhile in some security contexts.



Not sure static analysis can handle cases such as multiple threads having references to a shared buffer. We can't statically determine which thread will last access the buffer, for example. Sure, reference counting can handle this, but it has its own, albeit minimal, overhead, and is prone to leak bugs in C (forgetting to release buffers leaks memory).

> Should one use HLL for a new kernel?
> The HLL worked well for kernel development
>
> * Performance is paramount ⇒ use C (up to 15%)
> * Minimize memory use ⇒ use C (↓ mem. budget, ↑ GC cost)
> * Safety is paramount ⇒ use HLL (40 CVEs stopped)
> * Performance merely important ⇒ use HLL (pay 15%, memory)



With garbage collection, additional memory usage is to be expected, but I do wonder if the performance could be improved upon with further optimization.


To be honest, given that the best case performance advantage of Linux over Biscuit was about 10%, and that was using RAM disk based I/O, and given that once an SSD based test was included the gap narrowed to within 2%, that says to me performance is perfectly adequate for most use cases.

We've paid a bigger performance hit than that with Spectre and Meltdown mitigation.
Permalink - Score: 2
.
RE[3]: Comment by FlyingJester
By christian on 2018-10-09 21:02:29
> Making the C similar to Go will cause the benchmark to be "If I make C look like Go, which is faster?", which I very much suspect will not favor C as much as a comparison of idiomatic C against idiomatic Go would.

Idiomatic C often results in things like leaked buffers, buffer overruns and buffers used after they've been freed.

I personally like C the language, but would happily dump many of its idioms, which are based around its limited runtime environment.

What I most miss from other languages, for example, are things like nice exception handling and nice standard library constructs for strings and data collections.

GC got me in hot water the other week, so I won't mention that :)
Permalink - Score: 2
.
RE[4]: Comment by FlyingJester
By FlyingJester on 2018-10-09 23:14:52
I never said it was better to use C. But I think you're close to the point I'm trying to make: C encourages different things than Go. Comparing C made to look like Go doesn't make as much sense to me as comparing "normal" C and "normal" Go. The results will be different. Go will probably have fewer bugs, but C will probably be faster and use less memory. I strongly suspect making C look like Go will reduce the gap in both cases.
Permalink - Score: 3
.
RE[5]: Comment by FlyingJester
By kwan_e on 2018-10-10 03:34:42
> I strongly suspect making C look like Go will reduce the gap in both cases.

Yes, when you do slow things in C, of course you'll reduce the gap. But a 15% improvement in important parts of an OS kernel is nothing to sneeze at. If a Go kernel requires a significant slowdown to be safe, then the kernel won't be used, meaning it has added nothing to the safety landscape.
Permalink - Score: 3
.
RE[6]: Comment by FlyingJester
By christian on 2018-10-10 09:13:30
> > I strongly suspect making C look like Go will reduce the gap in both cases.
>
> Yes, when you do slow things in C, of course you'll reduce the gap. But a 15% improvement in important parts of an OS kernel is nothing to sneeze at. If a Go kernel requires a significant slowdown to be safe, then the kernel won't be used, meaning it has added nothing to the safety landscape.


Very much depends on your use case. Desktop and HPC will be largely CPU bound, but CPU bound at user level, so not involving the kernel much.

Anything that is storage I/O bound will find the storage system is the bottleneck, and the kernel simply becomes an occasional arbiter of disk I/O.

The main use case I can think of where this Go kernel might be a problem is anything requiring very low latency. So, for example, high throughput network routing. A 15% dip in performance here might add 15% to the bottom line latencies, and a corresponding loss of throughput.

Anywhere else, that performance loss will be eaten by Moore's Law in a couple of months. Otherwise, it's just 15% on top of typically less than 15% kernel time for the other use cases.


I, personally, would happily pay that price on any of my (non low-latency critical) computers.
Permalink - Score: 1
.
RE[7]: Comment by FlyingJester
By Alfman on 2018-10-10 11:09:39
christian,

> To be honest, given that the best case performance advantage of Linux over Biscuit was about 10%, and that was using RAM disk based I/O, and given that once an SSD based test was included the gap narrowed to within 2%, that says to me performance is perfectly adequate for most use cases.

I agree with this; the overhead is probably reasonable for many applications that could benefit from safe pointers.


> Very much depends on your use case. Desktop and HPC will be largely CPU bound, but CPU bound at user level, so not involving the kernel much.
>
> Anything that is storage I/O bound will find the storage system is the bottleneck, and the kernel simply becomes an occasional arbiter of disk I/O.

You are right, the kernel overhead only affects the overall performance in proportion to the amount of time spent in the kernel. If the kernel is using 5% CPU, then 10-15% of 5% would represent less than 1% overhead overall. It obviously makes sense to have a fast kernel, but we also need to consider the overall context of what it's being used for.
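To put rough numbers on that, here's a trivial sketch (the 5% kernel share is just the made-up figure from my example above):

```go
package main

import "fmt"

func main() {
	// Hypothetical figures from the example above: the workload spends 5%
	// of its CPU time in the kernel, and the HLL kernel is 10-15% slower
	// on those kernel code paths.
	kernelShare := 0.05
	for _, slowdown := range []float64{0.10, 0.15} {
		overall := kernelShare * slowdown
		fmt.Printf("kernel slowdown %2.0f%% -> overall overhead %.2f%%\n",
			slowdown*100, overall*100)
	}
	// Prints roughly 0.50% and 0.75% - comfortably under 1% overall.
}
```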


> The main use case I can think of where this Go kernel might be a problem is anything requiring very low latency. So, for example, high throughput network routing. A 15% dip in performance here might add 15% to the bottom line latencies, and a corresponding loss of throughput.

The paper said this too. Obviously, if you never collect the garbage, there would be zero overhead; the main time overhead occurs during a collection cycle. I do wonder what can be done to address the latency problem. One idea is to use a separate CPU to perform the garbage collection asynchronously. I did a search for this...

https://stackoverflow.com/questio...

The main source of latency comes from "stop the world" GC and sync barriers. Just now I was giving some thought to how a GC could perform its work asynchronously, without stopping execution, by locking the pages so that program code can run concurrently and conflicts are detected and resolved the moment program code tries to access the objects being scanned. This should produce very little latency. Well, it turns out someone else has already thought of that. Here's a paper on a pauseless GC that tries to do this, which is very interesting!

http://www.usenix.org/events/vee...
> Table 1 shows the worst-case transaction times. The Pauseless
> algorithm's worst-case transaction time of 26ms is over 45 times
> better than the next JVM, BEA's parallel collector. Average
> transaction times are remarkably similar given the wide variation
> in hardware used.
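As a side note, Go's own runtime exposes its stop-the-world pause history, so it's easy to eyeball these delays on an ordinary Go program. A minimal sketch (this uses the stock runtime and runtime/debug APIs and has nothing to do with Biscuit's code):

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	// Make the collector run frequently so we see several cycles quickly.
	debug.SetGCPercent(10)

	// Churn the heap a bit to trigger collections.
	var keep [][]byte
	for i := 0; i < 1000; i++ {
		keep = append(keep, make([]byte, 64<<10))
		if len(keep) > 100 {
			keep = keep[1:]
		}
	}

	var stats runtime.MemStats
	runtime.ReadMemStats(&stats)
	fmt.Printf("GC cycles: %d\n", stats.NumGC)

	// PauseNs is a circular buffer of recent stop-the-world pause times.
	n := int(stats.NumGC)
	if n > len(stats.PauseNs) {
		n = len(stats.PauseNs)
	}
	for i := 0; i < n; i++ {
		fmt.Printf("pause %d: %d ns\n", i, stats.PauseNs[i])
	}
}
```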





> Anywhere else, that performance loss will be eaten by Moore's Law in a couple of months. Otherwise, it's just 15% on top of typically less than 15% kernel time for the other use cases.

Moore's Law died for single-threaded performance several years back. We keep adding more transistors, but in practice this translates into more cores, which might not do as much to counteract increased kernel overhead occurring on every core.


I've stated my opinion on Garbage Collectors before: I prefer the deterministic behavior of explicit frees, which often go hand in hand with destructors, but it's still an interesting problem to me and it's fun to have these kinds of discussions!
Permalink - Score: 3
.
RE[2]: Excellent find Thom!
By Alfman on 2018-10-10 11:25:52
christian,


> Not sure static analysis can handle cases such as multiple threads having references to a shared buffer. We can't statically determine which thread will last access the buffer, for example. Sure, reference counting can handle this, but it has its own, albeit minimal, overhead, and is prone to leak bugs in C (forgetting to release buffers leaks memory).



This is a hand-wavy answer, but the fact that you can manually prove, using logic, that two threads cannot deadlock or mishandle pointers means that a software algorithm could apply the same logical rules to prove the same conclusions about correctness. Most languages can't do this because the code is assumed to be correct and the programmer's intent about pointer ownership is not conveyed. But if you have this information, then in principle, static analysis can prove code correctness even across threads.

Rust has taken this concept and run with it, all with static compile-time analysis and no runtime overhead. It's one of the reasons I wish it would be used more.

https://blog.rust-lang.org/2015/0...
Permalink - Score: 3
.
RE[8]: Comment by FlyingJester
By christian on 2018-10-10 12:53:31
> > Anywhere else, that performance loss will be eaten by Moore's Law in a couple of months. Otherwise, it's just 15% on top of typically less than 15% kernel time for the other use cases.
>
> Moore's Law died for single-threaded performance several years back. We keep adding more transistors, but in practice this translates into more cores, which might not do as much to counteract increased kernel overhead occurring on every core.
>
> I've stated my opinion on Garbage Collectors before: I prefer the deterministic behavior of explicit frees, which often go hand in hand with destructors, but it's still an interesting problem to me and it's fun to have these kinds of discussions!


Those additional cores are what allows concurrent marking. My GC probably (I've not measured it yet, mind) spends most of its time marking live data, chasing pointers from the various roots. Being a slab based allocator, marking a block of memory is simply a matter of bit twiddling in the mark bitmap to mark the block as in use. Once everything is marked, the mark bitmap simply becomes the in-use bitmap used when allocating new memory from the heap. The sweep stage of my GC is simply replacing the in-use bitmap with the mark bitmap. If there are no finalizers declared for a slab type, that's it, no more to do. If a finalizer is defined, it has to be called with each block that is now determined to be dead, but I've only actually used the finalizer for testing.

This is work that can easily be done in parallel with other useful work, and adding more cores just makes it truly parallel. It can even operate concurrently with memory allocations in other threads/cores, as an allocation can simply spot the case of an in progress GC, and redo any marking of the newly allocated block in the mark bitmap.
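Roughly, the idea looks like this. To be clear, this is only an illustrative sketch of the bitmap trick, not the actual code from my allocator, and the block count and types are made up:

```go
package main

import "fmt"

const slabBlocks = 64 // hypothetical: one 64-bit word tracks one slab

// slab is a toy bitmap-tracked slab: inUse drives allocation, mark is
// filled in during the GC mark phase.
type slab struct {
	inUse uint64
	mark  uint64
}

// alloc finds a free block, sets its in-use bit and returns its index.
func (s *slab) alloc() (int, bool) {
	for i := 0; i < slabBlocks; i++ {
		if s.inUse&(1<<uint(i)) == 0 {
			s.inUse |= 1 << uint(i)
			return i, true
		}
	}
	return 0, false
}

// markBlock is called while chasing pointers from the roots.
func (s *slab) markBlock(i int) {
	s.mark |= 1 << uint(i)
}

// sweep just replaces the in-use bitmap with the mark bitmap: anything
// not reached during marking becomes free again.
func (s *slab) sweep() {
	s.inUse = s.mark
	s.mark = 0
}

func main() {
	var s slab
	a, _ := s.alloc()
	b, _ := s.alloc()
	s.markBlock(a) // only block a is still reachable
	s.sweep()
	fmt.Printf("block %d live: %v, block %d live: %v\n",
		a, s.inUse&(1<<uint(a)) != 0, b, s.inUse&(1<<uint(b)) != 0)
}
```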

For low latency work, there is nothing stopping you pre-allocating all the memory buffers you'll need, so there'll be no requirement to call the allocator and potentially hit any pauses. Your low latency, high priority task can simply preempt any in flight GC as required.

Perhaps this discussion would be best moved to forum.osdev.org?
Permalink - Score: 1
.
RE[9]: Comment by FlyingJester
By Alfman on 2018-10-10 14:54:45
christian,

> Those additional cores are what allows concurrent marking.

SMP core counts don't scale as well as we'd like unless our code is able to run mostly within a local CPU cache without general RAM and synchronization bottlenecks. Because garbage collection scans main RAM, IMHO it represents one of the worst-case scenarios for SMP efficiency.

One extra core can benefit concurrent marking, but I doubt additional cores would help because the process is most likely limited by memory bandwidth rather than the CPU. Heck, even just one GC core is liable to hurt the memory bandwidth available for normal tasks if the GC process is too aggressive. For this reason, I suspect that provisioning more than one core for GC tasks would not be beneficial.

I suspect the optimal configuration is probably no more than one garbage collecting core per NUMA memory region. But on a normal setup with a single shared memory bus, it takes very few cores to saturate all available memory bandwidth.



> Being a slab based allocator, marking a block of memory is simply a matter of bit twiddling in the mark bitmap to mark the block as in use. Once everything is marked, the mark bitmap simply becomes the in-use bitmap used when allocating new memory from the heap. The sweep stage of my GC is simply replacing the in-use bitmap with the mark bitmap. If there are no finalizers declared for a slab type, that's it, no more to do. If a finalizer is defined, it has to be called with each block that is now determined to be dead, but I've only actually used the finalizer for testing.

I'm curious: in your implementation, I assume you stop the other processes from running while you scan?
If so, then you may as well do the GC on the same cores as your program, since it cannot run anyway.


> This is work that can easily be done in parallel with other useful work, and adding more cores just makes it truly parallel.

I think it's possible, but I don't think it's easy because you can't scan objects while work is being done on them due to race conditions. Unless you do something clever like in the link earlier, access to objects will need to be protected by synchronization primitives. This adds complexity and overhead. Or am I misunderstanding something?

> It can even operate concurrently with memory allocations in other threads/cores, as an allocation can simply spot the case of an in progress GC, and redo any marking of the newly allocated block in the mark bitmap.

Allocation is much easier because there's nothing to verify. I have experience with my own allocator; using thread-local heaps pays off enormously!
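For what it's worth, the closest stock analogue in Go itself is probably sync.Pool, which keeps per-processor caches of reusable objects. A small sketch of the general idea of keeping hot-path allocations off the shared heap:

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool hands out 4 KiB scratch buffers; Get usually hits a
// processor-local cache, so goroutines rarely contend on a shared heap.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 4096) },
}

func handle(id int, wg *sync.WaitGroup) {
	defer wg.Done()
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf) // recycle the buffer instead of making it garbage
	buf[0] = byte(id)      // pretend to do some work with the buffer
	fmt.Printf("request %d used a pooled buffer\n", id)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go handle(i, &wg)
	}
	wg.Wait()
}
```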


> For low latency work, there is nothing stopping you pre-allocating all the memory buffers you'll need, so there'll be no requirement to call the allocator and potentially hit any pauses. Your low latency, high priority task can simply preempt any in flight GC as required.

Well sure, but I'd argue the allocation isn't the bottleneck to begin with. It's the garbage collecting phase that takes time and it's synchronization with the GC that results in programs slowing down.

> Perhaps this discussion would be best moved to forum.osdev.org?

Perhaps, I'm not sure I have enough motivation to though, haha.

Edited 2018-10-10 14:55 UTC
Permalink - Score: 3
