# Communication in Systems, a Review

Posted on January 5, 2016 by Gabe Parmer

Composite focuses on using fine-grained, isolated components to enable system policy customization, increased fault isolation, and decreased scope for privilege escalation. Composite leverages hardware page-tables to provide memory isolation between components. With dual-mode protection hardware, switching between page-tables is a privileged operation, and thus can only be performed by the kernel. Consequently, control flow transfer between components must be mediated by the kernel. This relationship between the hardware and software motivates the design of efficient communication (IPC) via kernel design. This article gives an overview of a few IPC designs from different systems, to lay the ground-work for a future article that summarizes the Composite design. It stays away from concrete implementation details and focuses on the high-level designs; those details, too, are the subject of future articles.

## IPC Designs

There are a few different factors that fracture the design space for system-provided IPC. The major dimensions are:

Control transfer:

• Synchronous. Is the execution in the two components performed in lock-step? This is analogous to function-call semantics where the caller only resumes execution when the callee (e.g. the function being called) completes execution. Correspondingly, the receiver’s execution is activated upon send. Synchronous semantics couple control flow transfer with event notification.
• Asynchronous. Is the execution of the two components decoupled to the extent that the sender does not block, regardless of the state of the receiver? Receivers generally need an event notification mechanism that is synchronous (i.e. blocking) to wait for a sender, though polling is sufficient for some systems. Asynchronous semantics decouple control flow transfers (e.g. via scheduling decisions) from event notification.

Data transfer: Data is often passed between components, which is a key aspect of message passing.

• Generic data movement. These techniques enable arbitrarily sized buffers to be transferred between components.
• Restricted data movement. The size of the data is bounded in these systems. This mechanism is often paired with a separate mechanism that provides the transfer of arbitrary data. Though this is, strictly speaking, a restriction, the observation is that most IPC consists of small messages, and if significant gains can be had from this simplification, it is a worthwhile optimization.

Multi-core:

• Intra-core. Is the communication between components on the same core? Components use separate page-tables that can only be switched by the kernel (e.g. by writing to %cr3 on x86), which implies that virtually tagged hardware structures, such as untagged TLBs, must be flushed on a switch. This IPC mechanism is often limited by mode-switch, page-table switching, and indirect cache overheads.

• Inter-core. Harnesses inter-core communication mechanisms such as cache-coherency, RDMA, or inter-processor interrupts (IPIs) to perform event notifications and pass data between cores. This IPC mechanism is often limited by the hardware latencies of these operations.

I do need to point out that the programming model that is exposed to application developers is often orthogonal to these dimensions. The same APIs and behaviors can be layered on top of many different IPC mechanisms that differ across the factors above. This document does not discuss the programming models.

I’ll start with where much of the research community ended up. Liedtke went against convention at the time, and made a strong argument for synchronous IPC implemented as rendezvous between threads in L4. Two threads (one per-component in our example) communicate by following a protocol. A client sends a notification to the server, and simultaneously blocks until the server completes processing on its behalf. The server simultaneously replies to its last client, and either blocks waiting for the next, or receives the next request if it is pending. Much of the cleverness and motivation behind this approach centered around the optimizations this design enables. These include lazy scheduling, direct switching, and thread-as-message-queue to avoid memory allocation. It also has breadth of design, as IPC was used as the primary mechanism for memory mapping, interrupt handling, and fault handling.

The original implementation focused on unconstrained data movement. Using some tricks, data could be copied between processes at the cost of a single copy (this should be surprising and impressive, if you think about it). However, it has the side-effect of significantly complicating the kernel. Consider that (1) copying arbitrary buffers can cause page-faults, and (2) page-faults in this system are serviced by threads in other processes. So an IPC can fail at some arbitrary intermediate processing step, and require a nested IPC. When this nested IPC completes, it must pick up the original IPC where it left off. This requires highly specialized processing in the page-fault handler, and some representation of continuations. We start to see why unconstrained data movement can complicate the design significantly. These (and other) problems motivated Shapiro to deem synchronous IPC vulnerable to attacks.

Modern implementations of this IPC mechanism (Fiasco and seL4) use constrained data movement by default. The amount of data transferred differs, but it is often around 64 words, and always smaller than the size of a page. In fact, each thread has a structure shared between kernel and user, and the data to be transferred is simply copied between these structures. As the only copy is between memory that is guaranteed to be mapped in, faults are averted.

Though Fiasco does implement transparent cross-core IPC, these systems are primarily designed around intra-core IPC.

A great paper recounts some of the changes to system design from the original L4 to seL4. This paper is fairly limited as it only covers one system, and it isn’t an introduction, so your mileage may vary, but it is a nice, 20-year war story.

K42 synchronously communicates between dispatchers, which are essentially kernel threads (i.e. threads that the kernel is aware of, and schedules). Execution in the server is performed via an upcall that then can decide to switch to a specific user-level thread to handle the request. This form of IPC is as related to scheduler activations in the kernel/user coordination, as it is to L4-style IPC in the kernel.

## Asynchronous Message Passing

Perhaps the most famous version of asynchronous IPC is UNIX pipes. Mach popularized microkernels (then disillusioned many about them) and is based on asynchronous IPC. It still lives in OSX, but the microkernel bits aren’t extensively used for their original purposes, as far as I know. Singularity uses language-based protection, and asynchronous message passing between threads. Xen uses asynchronous IPC between virtual machines. All modern L4 variants also, somewhat surprisingly, include some form of asynchronous IPC. It should be noted that any inter-core communication is almost necessarily asynchronous (for the same reasons that synchronous RPC is rarely used on the internet).

Clearly, the asynchronous IPC thing must be a solved problem by now.

Liedtke’s paper is a practical argument for synchronous IPC. It argues that it was (at the time) the only known design that enables an efficient IPC implementation. I’ll take the liberty of summarizing many disparate sources, and use my own judgment to answer the question of why asynchronous IPC can be inefficient. Then, I’ll make the opposite argument. As it turns out, in systems everything’s a trade-off. Go figure.

### Why is Asynchronous IPC Inefficient?

1. Memory copying. The data that is sent from the client will not be accessed by the server until some indeterminate time in the future. Regardless, we want to continue to allow the client to access and modify its own memory without worrying about whether the server has processed it yet. Thus, the typical implementation (e.g. pipes) copies data from the client into the kernel, and from the kernel into the server. This enables the client to modify and use its own buffers immediately after it sends the data. Memory copying trashes cache contents and is one of the fastest ways to kill your performance on modern systems, where memory is significantly slower than the CPU. Mach attempts to avoid the copy by using copy-on-write, which ends up not working very well, especially on multi-cores where page-permission modification is hugely expensive.
2. Memory allocation. The space in the kernel to store the message and its data must come from somewhere. The options are to either use a fixed size buffer (pipes), or allocate space (Mach). Allocation is expensive relative to the fast-path of synchronous IPC, and has very bad edge-case behavior. It can be very expensive, pushing up tail-end latencies, and can always fail. An additional complication is that a client in an infinite send loop, with a server that never receives, consumes an infinite amount of kernel memory. A convenient DoS. This is solved, as with pipes, by placing a finite upper bound on the asynchronous behavior before the system switches to synchronous. Thus, clients and servers must include logic that considers both synchronous and asynchronous behaviors.
3. Scheduling overheads. The system has to switch between threads at some point so that the server can process the client’s messages. This operation includes all of the overhead of scheduling and dispatching. These are efficient operations in most contexts; the kind that are so small they can be ignored. Relative to synchronous IPC latencies, however, these overheads are non-trivial (e.g. a thread switch in Linux is generally more expensive than IPC in Composite).
4. Separation of event notification from communication. Synchronization is often essential at some point. At the bare minimum, the server often wants to block if there are no more messages to process. The event notification APIs are separate from message reception (select vs. read), thus represent pure overhead on top of the message passing path.

If you’re banging your fist down while furiously calling foul at this point, take a deep breath. Many of these overheads are artifacts of specific implementations. However, they are, historically, major contributors to slow asynchronous IPC communication paths. It should be noted that IPC in many of these previous systems did not really need to be fast. For example, UNIX pipes really don’t need to be that speedy to be useful. “Slow” is not equivalent to “bad”. Pipes are a great API that are easy to use. Slow is often fast enough. In systems that liberally use protection domains for increased isolation (I’d argue that VMs fit into this category), we need something that is more efficient.

### Why is Asynchronous IPC Efficient?

Many systems have done significantly better through light-weight designs. These include:

• Megapipe uses asynchronous coordination between user-level and the kernel for network processing. Shared memory protocols for the transmission of messages and events are used along with batching of messages to remove much of the system’s communication overheads.
• FlexSC replaces the synchronous interface between user-level and kernel with asynchronous inter-core communication. This demonstrates that using cache-coherency for messages can be effective. Even then, batching is used to minimize the communication overheads. Barrelfish uses a similar communication mechanism.
• seL4 essentially restricts the capabilities of asynchronous messaging to event notification, thus removing many of the above overheads. Data is not passed (I’m deliberately ignoring the accumulating bitmask), and messages are coalesced (i.e. not queued) into a single event. The server, in this case, just receives a notification that one or multiple clients have activated it. This is a form of asynchronous IPC, but it is intentionally limited to event notification.

From these examples, we can informally extract some themes for making efficient, asynchronous IPC:

• Batching. The costs of scheduling and context switching are amortized by batching messages (for both requests and replies). Messages are typically queued in a shared memory region of fixed size (thus limiting the asynchrony, and batch size), with concurrency control managed using wait-free data-structures. Batching has a complex relationship with per-operation latency, especially tail-end and worst-case latency.
• Cross-core messages. Cores execute asynchronously with respect to each other, and an efficient communication medium will mirror this. If production and consumption occur at roughly the same rate, then synchronization is never required, and we use the parallelism of the hardware to achieve throughput greater than on a single core. The overheads of this messaging will often be proportional to the cost of cache-line movement between cores, which has a somewhat nuanced characterization.
• Decoupling data movement from messages. Many of the techniques above use asynchronous messaging as a medium for sending meta-data that includes a logical pointer to the data that accompanies the message. For example, FlexSC passes the registers that would typically be passed for a system call in a fixed size message. The server (kernel here) interprets these registers, and if they include a pointer to a buffer that must be copied, then the server performs that operation in a manner separate from the message passing medium. Note that the difficult part in the “memory copying” bullet is still present here. To handle this, FlexSC uses a lightweight threading model to ensure synchronous behavior for user-level threads, while using an asynchronous medium.
• Coupling event notification with message passing. Most of these techniques use polling with inter-core communication. seL4 essentially uses only event notification. In each of these cases, either event notification or message passing is used, but not both. In some cases, the message passing medium is used for event notification.
• Mechanism simplification. In all cases, the means for asynchronous coordination is stripped down to a mechanism that maps well to the hardware. In many cases, this means using simple, wait-free data-structures with fixed-size messages. In seL4’s case, this means using kernel data-structures that are simple enough to perform a single, simple operation well.

## Communication Design Sweet Spots

Given 35 years of research into these topics, it seems that the community has found two sweet spots in the design space for high-performance communication.

1. synchronous, intra-core, with restricted data movement
2. asynchronous, inter-core, with restricted data movement

One obvious observation is that it seems that restricted data movement has won. This implies that systems require an orthogonal mechanism for data transfer. A topic for another day.

Some might argue that VM IPC in systems such as Xen is asynchronous and intra-core, and is conspicuously absent in the sweet spot characterization above. This is true. Look at the communication latencies on such systems for a counter-argument.

Taking a dogmatic approach by choosing only one of these two designs has had complicated results. The fully synchronous L4 had a problem: in practice, they found that you need a thread per server, per application thread, if the server, or some server that it leverages, can block for long amounts of time. This was done to prevent one application thread from introducing arbitrary blocking latencies into another (possibly unrelated) application thread. This necessitates multi-threaded servers, and a lot of threads. On the other side, a fully asynchronous approach can cause resource management difficulties, especially when a single event handled by a server can cause many subsequent events in other servers (which can cause many subsequent events…). Who to run at any point, and how that decision impacts end-to-end latency, is not at all clear.

The interesting question is when and where should each communication mechanism be used? These sweet spots certainly give a fair amount of guidance, but they hide important details. System design, unfortunately, cannot be entirely reduced to sweet spots. Thus, I’ll leave this as the topic of another post.

## Composite IPC

This is a long history that is only superficially touched on above. It is usually not wise to do research in a space that has such a history, and that seems to have converged on a few designs. However, as with all research, when there is a very good motivation to change the underlying assumptions of the system, a new design is viable and interesting. The next article will summarize the design of IPC in Composite.

## Edits

• 1/13/16: Grammar edits, and added link for cache coherency. Thanks Robbert Gifford!