Posted on January 5, 2016 by Gabe Parmer
Composite focuses on using fine-grained, isolated components to enable system policy customization, increased fault isolation, and decreased scope for privilege escalation. Composite leverages hardware page tables to provide memory isolation between components. With dual-mode protection hardware, switching between page tables is a privileged operation that can only be performed by the kernel, so control-flow transfer between components must be mediated by the kernel. This relationship between hardware and software motivates the design of efficient communication (IPC) in the kernel. This article gives an overview of a few IPC designs from different systems, to lay the groundwork for a future article that summarizes the Composite design. It stays away from concrete implementation details and focuses on the high-level designs; the details are the focus of future articles.
There are a few different factors that fracture the design space for system-provided IPC. The major dimensions are:
Control transfer: Is the IPC synchronous, with the client blocking until the server replies, or asynchronous, with the client continuing execution after sending?
Data transfer: Data is often passed between components, which is a key aspect of message passing. Is the amount of data that can be moved per IPC unconstrained, or constrained to a small, bounded size?
Multi-core: Does the communication stay within a single core, or cross cores?
Intra-core. Is the communication between components on the same core? Components use separate page tables that can only be switched by the kernel (e.g. by writing to %cr3 on x86), so each IPC requires a kernel-mediated page-table switch (see the sketch after this list). That switch means that on-chip structures that are virtually tagged, such as untagged TLBs, must be flushed. This IPC mechanism is often limited by mode-switch, page-table-switch, and indirect cache overheads.
Inter-core. Harnesses inter-core communication mechanisms such as cache-coherency, RDMA, or inter-processor interrupts (IPIs) to perform event notifications and pass data between cores. This IPC mechanism is often limited by the hardware latencies of these operations.
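To make the intra-core costs concrete, here is a minimal sketch (assuming GCC-style inline assembly; this is not code from Composite or any of the systems discussed here) of the privileged page-table switch that every cross-component control transfer depends on:

```c
/* A minimal sketch (not any system's actual code) of the privileged
 * page-table switch at the heart of intra-core IPC on 32-bit x86.
 * Loading %cr3 can only be done in ring 0, which is why every
 * cross-component control transfer must first enter the kernel. */
static inline void
switch_page_table(unsigned long pgtbl_phys_addr)
{
	/* Loading %cr3 also flushes all non-global TLB entries on
	 * hardware without tagged TLBs -- a major indirect cost. */
	asm volatile("mov %0, %%cr3" : : "r" (pgtbl_phys_addr) : "memory");
}
```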
I do need to point out that the programming model that is exposed to application developers is often orthogonal to these dimensions. The same APIs and behaviors can be layered on top of many different IPC mechanisms that differ across the factors above. This document does not discuss the programming models.
I’ll start with where much of the research community ended up. Liedtke went against convention at the time, and made a strong argument for synchronous IPC implemented as rendezvous between threads in L4. Two threads (one per component, in our example) communicate by following a protocol. A client sends a notification to the server, and simultaneously blocks until the server completes processing on its behalf. The server simultaneously replies to its last client, and either blocks waiting for the next request, or receives it immediately if it is pending. Much of the cleverness and motivation behind this approach centered around the optimizations this design enables, including lazy scheduling, direct switching, and thread-as-message-queue to avoid memory allocation. The design also has breadth: IPC was used as the primary mechanism for memory mapping, interrupt handling, and fault handling.
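To make the rendezvous protocol concrete, here is a sketch in C. The ipc_call, ipc_wait, and ipc_reply_wait entry points are hypothetical names standing in for kernel system calls (the real L4 API differs); the key property is that the client’s send-and-block, and the server’s reply-and-receive, are each a single kernel operation:

```c
/* Sketch of synchronous rendezvous IPC.  The ipc_* entry points are
 * hypothetical stand-ins for kernel system calls, not the real L4 API. */
typedef int thread_t;
struct msg { unsigned long w[8]; };

extern void     ipc_call(thread_t server, struct msg *req, struct msg *resp);
extern thread_t ipc_wait(struct msg *req);
extern thread_t ipc_reply_wait(thread_t prev_client, const struct msg *resp,
                               struct msg *next_req);
extern void     process(const struct msg *req, struct msg *resp);

void client(thread_t server)
{
	struct msg req = { { 0 } }, resp;
	/* send the request and block for the reply: one kernel operation */
	ipc_call(server, &req, &resp);
}

void server_loop(void)
{
	struct msg req, resp;
	thread_t client = ipc_wait(&req);	/* block for the first request */

	for (;;) {
		process(&req, &resp);
		/* reply to the previous client and receive the next request,
		 * fused into a single kernel operation */
		client = ipc_reply_wait(client, &resp, &req);
	}
}
```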
The original implementation focused on unconstrained data movement. Using some tricks, data could be copied between processes at the cost of a single copy (this should be surprising and impressive, if you think about it). However, this has the side-effect of significantly complicating the kernel. Consider that (1) copying arbitrary buffers can cause page faults, and (2) page faults in this system are serviced by threads in other processes. So an IPC can fail at some arbitrary intermediate processing step, requiring a nested IPC; when that nested IPC completes, it must pick up the original IPC where it left off. This requires highly specialized processing in the page-fault handler, and some representation of continuations. We start to see why unconstrained data movement can complicate the design significantly. These (and other) problems motivated Shapiro to deem synchronous IPC vulnerable to attacks.
Modern implementations of this IPC mechanism (Fiasco and seL4) use constrained data movement by default. The amount of data that can be transferred differs between systems, but it is often around 64 words, and always smaller than a page. In fact, each thread has a structure shared between the kernel and user level, and the data to be transferred is simply copied between these structures. As the only copy is between memory that is guaranteed to be mapped in, faults are averted.
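A minimal sketch of this constrained transfer, with invented names (this is not the seL4 or Fiasco API): since both per-thread buffers are always mapped, the kernel’s copy can never fault.

```c
#include <string.h>

/* Sketch of constrained data movement.  Each thread owns a small,
 * always-mapped message structure shared between user level and the
 * kernel (akin in spirit to a UTCB/IPC buffer). */
#define MSG_WORDS 64	/* well under a page */

struct ipc_buffer {
	unsigned long mr[MSG_WORDS];	/* "message registers" */
};

/* The kernel-side copy on the IPC path.  Both buffers are guaranteed
 * to be mapped, so this copy cannot page-fault, and the nested-IPC
 * complexity of unconstrained transfer disappears. */
static void
ipc_transfer(struct ipc_buffer *dst, const struct ipc_buffer *src,
	     unsigned int nwords)
{
	if (nwords > MSG_WORDS)
		nwords = MSG_WORDS;	/* enforce the constraint */
	memcpy(dst->mr, src->mr, nwords * sizeof(unsigned long));
}
```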
Though Fiasco does implement transparent cross-core IPC, these systems are primarily designed around intra-core IPC.
A great paper recounts some of the changes to system design from the original L4 to seL4. This paper is fairly limited as it only covers one system, and it isn’t an introduction, so your mileage may vary, but it is a nice, 20-year war story.
K42 synchronously communicates between dispatchers, which are essentially kernel threads (i.e. threads that the kernel is aware of and schedules). Execution in the server is performed via an upcall that can then decide to switch to a specific user-level thread to handle the request. This form of IPC owes as much to scheduler activations, in its kernel/user coordination, as it does to L4-style IPC in the kernel.
Perhaps the most famous version of asynchronous IPC is UNIX pipes. Mach popularized microkernels (then disillusioned many about them) and is based on asynchronous IPC. It still lives in OSX, but the microkernel bits aren’t extensively used for their original purposes, as far as I know. Singularity uses language-based protection, and asynchronous message passing between threads. Xen uses asynchronous IPC between virtual machines. All modern L4 variants also, somewhat surprisingly, include some form of asynchronous IPC. It should be noted that any inter-core communication is almost necessarily asynchronous (for the same reasons that synchronous RPC is rarely used on the internet).
Clearly, the asynchronous IPC thing must be a solved problem by now.
Liedtke’s paper is a practical argument for synchronous IPC. It argues that synchronous IPC was (at the time) the only known design that enabled an efficient implementation. I’ll take the liberty of summarizing many disparate sources, and using my own judgment, to answer the question of why asynchronous IPC can be inefficient. Then, I’ll make the opposite argument. As it turns out, in systems everything’s a trade-off. Go figure.
Among the reasons: event notification on the receive path often requires separate system calls (e.g. select before read), which represent pure overhead on top of the message-passing path.

If you’re banging your fist down while furiously calling foul at this point, take a deep breath. Many of these overheads are artifacts of specific implementations. However, they are, historically, major contributors to slow asynchronous IPC communication paths. It should be noted that IPC in many of these previous systems did not really need to be fast. For example, UNIX pipes really don’t need to be that speedy to be useful. “Slow” is not equivalent to “bad”. Pipes are a great API that is easy to use. Slow is often fast enough. In systems that liberally use protection domains for increased isolation (I’d argue that VMs fit into this category), we need something more efficient.
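To make the select-versus-read point above concrete, here is a sketch of an asynchronous receive path using standard POSIX calls on a pipe or socket descriptor. Learning that a message has arrived and actually retrieving it are separate kernel entries, where a synchronous call-and-reply design pays for one:

```c
#include <sys/select.h>
#include <unistd.h>

/* Sketch of the extra kernel crossing on an asynchronous receive path:
 * one syscall to learn that data has arrived, a second to fetch it. */
static ssize_t
recv_msg(int fd, void *buf, size_t len)
{
	fd_set rfds;

	FD_ZERO(&rfds);
	FD_SET(fd, &rfds);
	/* syscall 1: block until the descriptor is readable */
	if (select(fd + 1, &rfds, NULL, NULL, NULL) < 0)
		return -1;
	/* syscall 2: actually copy the message out of the kernel */
	return read(fd, buf, len);
}
```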
Many systems have done significantly better through light-weight designs. These include:
From these examples, we can informally extract some themes for making efficient, asynchronous IPC:
Given 35 years of research into these topics, it seems that the community has found two sweet spots in the design space for high-performance communication: synchronous IPC with restricted data movement within a core, and asynchronous IPC across cores.
One obvious observation is that restricted data movement seems to have won. This implies that systems require an orthogonal mechanism for data transfer. A topic for another day.
Some might argue that VM IPC in systems such as Xen is asynchronous and intra-core, and is conspicuously absent in the sweet spot characterization above. This is true. Look at the communication latencies on such systems for a counter-argument.
Taking a dogmatic approach by choosing only one of these two designs has had complicated results. The fully synchronous L4 had a problem: in practice, a server needs a thread per application thread if that server, or some server it leverages, can block for long periods of time. This was done to prevent one application thread from introducing arbitrary blocking latencies into another (possibly unrelated) application thread. It necessitates multi-threaded servers, and a lot of threads. On the other side, a fully asynchronous approach can cause resource-management difficulties, especially when a single event handled by a server can cause many subsequent events in other servers (which can cause many subsequent events…). Whom to run at any point, and how that decision impacts end-to-end latency, is not at all clear.
The interesting question is when and where each communication mechanism should be used. These sweet spots certainly give a fair amount of guidance, but they hide important details. System design, unfortunately, cannot be entirely reduced to sweet spots. Thus, I’ll leave this as the topic of another post.
This is a long history that is only superficially touched on above. It is usually not wise to do research in a space that has such a history, and that seems to have converged on a few designs. However, as with all research, when there is very good motivation to change the underlying assumptions of the system, a new design is viable and interesting. The next article will summarize the design of IPC in Composite.