Posted on January 17, 2016 by Gabe Parmer
In the previous post I discussed some of the main dimensions that differentiate different system communication mechanisms. I concluded that most existing systems use either synchronous communication on the same core, or asynchronous communication across cores, both with limited data-movement capabilities (i.e. messages have a relatively small maximum size).
In this post, I’ll discuss how component invocations are implemented in Composite. Composite focuses on the fine-grained decomposition of system software into separate, hardware-isolated components, each exporting an interface of functions that other components can invoke. The invocation path must be fast, as inter-component interactions are frequent. This implementation falls firmly into the synchronous, intra-core, restricted-data-movement category. The dominant paradigm (since Liedtke) for this type of communication is Synchronous Rendezvous Between Threads (SRBT). This is a well-studied, well-optimized, and established paradigm for Inter-Process Communication (IPC)1.
The L4 family of microkernels is the main prototype for this communication paradigm. A client thread in the calling component invokes the kernel to both activate the server thread and block awaiting a reply. The server thread invokes the kernel to both reply to a client and block waiting for the next client. The entire call (from client to server) and return (from server to client) is implemented with only two system calls (compare to UNIX pipes). Given the history of this approach, one might assume that it is strictly better than the alternative implementations. As mentioned in the previous post, once you change some of the system’s goals or assumptions, a different design can be beneficial. To understand the design space around this mechanism, let’s list a number of the designs, and the consequences of those designs, inherent in SRBT systems (a sketch of the kernel fast path these imply follows the list).
In-kernel blocking. The kernel must understand thread blocking and wakeup. This seems obvious and universal, but it is not a necessary characteristic of all systems. When threads rendezvous, they block and switch. For IPC to be fast, this blocking must often avoid scheduling overheads, so threads are directly switched (blocked and woken up) in the cases where (a) the server thread is awaiting a request and a client makes one, or (b) the client is awaiting a reply and the server replies to it. If a client makes a request to a server that is not blocked (because it is computing on behalf of another client), then the client must block until the server becomes available to service its request. Thus contention is mediated by the kernel.
In-kernel scheduler. Switching between threads usually implies scheduling decisions. In the case of direct switching, scheduling is omitted for the sake of performance. When a server replies to a client, however, two threads can be activated: the client being replied to, and another client waiting for service from the server. Some policy (e.g. a scheduler) must decide which to run. Interrupts also activate threads (interrupts are treated as IPC activations of interrupt threads), again leaving two executable threads (the interrupt thread, and the preempted thread).
Wait-queue management. When a client thread requests service from a server thread, and the server is in the process of handling another client, we must queue this request until the server is available. This wait-queue management implies a policy: if there are multiple threads in the wait-queue when the server is available for processing them, the kernel must determine which to service. This is effectively a second dimension of scheduling policy in the kernel.
End-to-end timing must consider inter-thread dependencies. A simple question has a complex answer: if a client communicates with a server that is in turn a client of a further server (e.g. a client depends on a networking stack, which depends on a NIC), how long will the client’s request take to be serviced? Even with a good description of how long each thread takes to service a request, the answer is still complicated. Each server thread might be servicing a number of other requests from other client threads (i.e. the wait-queue for the server thread has threads in it). Each thread is separately scheduled, thus the scheduling latency of each impacts the end-to-end latency. So when dependencies exist between threads, introduced by the IPC relationships, the end-to-end latency of a client request depends on all of the previous policies. There are ways to control and analyze this latency, such as resource sharing protocols (e.g. priority inheritance and ceiling protocols), holistic analysis, and delay-compositional models. Regardless, the communication dependencies can significantly increase the end-to-end latency compared to the case where a single thread executes through the different layers of software.
Each of these is more completely and precisely characterized in a paper that compares implementations of synchronous IPC between threads with Composite’s IPC.
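To make these factors concrete, below is a minimal, hypothetical sketch of the kernel fast path that an SRBT design implies. None of the names correspond to an actual L4 (or other) kernel API; the structures and functions are stand-ins meant only to show where direct switching, in-kernel blocking, and wait-queue policy show up.

```c
/* A hypothetical SRBT fast path -- an illustration, not any real kernel's code. */
#include <stddef.h>

struct thread {
	int            blocked;     /* is this thread waiting in the kernel? */
	struct thread *wait_next;   /* linkage on a server's wait-queue */
	/* register save area, address space, etc. elided */
};

struct endpoint {
	struct thread *server;      /* the thread servicing requests on this endpoint */
	int            server_idle; /* is the server blocked in reply_wait? */
	struct thread *wait_head;   /* clients queued while the server is busy (FIFO) */
	struct thread *wait_tail;
};

/* Stubs standing in for the real mechanisms. */
static void switch_to(struct thread *t) { (void)t; /* direct context switch */ }
static void schedule(void)              { /* fall back to the in-kernel scheduler */ }

static void
enqueue_waiter(struct endpoint *ep, struct thread *t)
{
	/* Wait-queue management: FIFO here, but which order to use *is* a policy. */
	t->wait_next = NULL;
	if (ep->wait_tail) ep->wait_tail->wait_next = t;
	else               ep->wait_head = t;
	ep->wait_tail = t;
}

static struct thread *
dequeue_waiter(struct endpoint *ep)
{
	struct thread *t = ep->wait_head;
	if (t) {
		ep->wait_head = t->wait_next;
		if (!ep->wait_head) ep->wait_tail = NULL;
	}
	return t;
}

/* System call 1: call() -- send a request and block awaiting the reply. */
void
sys_call(struct thread *client, struct endpoint *ep)
{
	client->blocked = 1;                /* in-kernel blocking */
	if (ep->server_idle) {
		ep->server_idle = 0;
		switch_to(ep->server);      /* direct switch: the scheduler is bypassed */
	} else {
		enqueue_waiter(ep, client); /* server busy: queue the client and yield */
		schedule();
	}
}

/* System call 2: reply_wait() -- reply to the current client, await the next.
 * (The real kernel tracks which client is being serviced; it is passed
 * explicitly here for brevity.) */
void
sys_reply_wait(struct thread *server, struct endpoint *ep, struct thread *client)
{
	struct thread *next = dequeue_waiter(ep);

	client->blocked = 0;                /* the replied-to client is runnable again */
	if (next) {
		/* Two threads could now use the CPU: the replied-to client, or the
		 * server processing next's queued request.  Some policy must choose;
		 * here the server simply keeps running and services next. */
		return;
	}
	server->blocked = 1;                /* no pending requests: rendezvous-wait */
	ep->server_idle = 1;
	switch_to(client);                  /* direct switch back to the client */
}
```

A rough way to see the end-to-end timing point: if each server \(i\) along an invocation chain contributes execution time \(C_i\), wait-queue blocking \(B_i\), and scheduling latency \(S_i\), an illustrative (and simplified) bound on a request’s latency is \(\sum_i (C_i + B_i + S_i)\). The \(B_i\) and \(S_i\) terms are exactly what the in-kernel policies above determine, which is why protocols such as priority inheritance/ceiling or holistic analysis are needed to bound them.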
A number of pieces of research have investigated alterations to many of these factors. These include:
Credo and timeslice donation. These systems attempt to decouple the execution context of a thread from the scheduling context, which is the actual entity that the scheduler considers in its decisions. The key is that the scheduling context is passed to the server when the client rendezvouses with it. In this way, the server is scheduled with the scheduling parameters (e.g. priority) of the client (“timeslice donation”). This removes the scheduling policy from the set of system effects that must be considered when assessing end-to-end IPC latency. These techniques also implement a form of priority inheritance: when multiple clients are in the wait-queue for a server thread, the server is “helped” by receiving the priority of the highest-priority client.
Removing direct switching. Ruocco provides a fairly detailed analysis of the sources of unpredictable behavior in L4 IPC with direct switching and lazy scheduling, and the impact of removing these optimizations in the IPC path. Many L4 versions relax accurate scheduling and accounting with these optimizations. Removing them has a considerable impact on IPC performance.
The main emphasis of SRBT is on the efficiency of the IPC mechanism, and it does a very good job of this; the NOVA IPC path, in particular, takes this to a practical conclusion. However, any design makes trade-offs and assumptions, and a different design might alleviate some of the assumptions listed above.
Anyone familiar with modern microkernels is likely aware of SRBT designs; thread migration, maybe not. It has a long history, and is summarized in the paper on migrating threads in Mach 3.0 as:
…during [IPC], a single thread abstraction moves between tasks with the logical flow of control, and “server” code is passively executed. … The key element of our design is a decoupling of the thread abstraction into the execution context and the schedulable thread of control, consisting of a chain of contexts.
Note that this thread migration is different from the identical term that denotes the movement of a thread’s context between cores in a multicore system. Here we are talking about the IPC design sweet spot on a single core.
It should be noted that thread migration is pervasively used for communication between user-level and kernel-level in monolithic systems. A flow of execution at user-level, executing within an application, continues at kernel-level via a system call. There is no SRBT, thus no switching between threads, and we think of a single application thread that executes at some points in the application, and at others in the kernel. The Unix time utility reports these different execution times (user vs. system) along with the total. As pointed out in the quote, a key aspect is the decoupling of the execution context and the scheduling context (which probably sounds familiar from Credo).
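As a small, standard-POSIX illustration (nothing Composite-specific), a single thread can ask how much of its execution happened at user-level versus kernel-level:

```c
/* Print the user- and kernel-level CPU time consumed by this single thread of
 * control -- the same decoupling of execution context (user vs. kernel stack)
 * from scheduling context (one schedulable thread) discussed above. */
#include <stdio.h>
#include <sys/resource.h>

int
main(void)
{
	struct rusage ru;

	/* Do some work that splits time between user-level and the kernel... */
	for (int i = 0; i < 1000; i++) {
		FILE *f = fopen("/dev/null", "w");
		if (f) { fputs("hello", f); fclose(f); }      /* system calls: kernel time */
		volatile double x = 0.0;
		for (int j = 0; j < 10000; j++) x += j * 0.5; /* computation: user time */
	}

	getrusage(RUSAGE_SELF, &ru);
	printf("user   %ld.%06lds\n", (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
	printf("system %ld.%06lds\n", (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);

	return 0;
}
```

One schedulable thread, two execution contexts: exactly the decoupling that thread migration generalizes across components.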
Execution context. Isolation between user-level and kernel-level is necessary, and supported by dual-mode execution. An implication of this is that the stack used for execution at kernel-level must be isolated from the stack at user-level to avoid corruption and information leakage. Additionally, registers must be carefully managed when transitioning between modes. Calling conventions are protocols that specify how they should be managed (related: cdecl for function calls). So just as with SRBT, different execution contexts are required in different protection domains.
Scheduling context. The scheduler does not treat the executions at user-level and kernel-level as different entities. It only schedules the single thread that migrates between user- and kernel-levels.
This decoupling of the execution context and scheduling context is the key concept to thread migration. Note that this concept alone removes a few of the design assumptions on the previous list.
So what is the design of component invocation in Composite? We deviate from the dominant SRBT designs, and instead focus on thread migration. Let’s look at some of the benefits that it enables.
User-level scheduling. If threads no longer block and wake up in the kernel, then the kernel does not require an in-kernel scheduler. User-level schedulers determine which threads run when, and use the kernel only for dispatching, or switching, between threads. Importantly, that operation is independent of the IPC mechanism. When scheduling is required by a component in the system (e.g. a lock is contended), that component can invoke the scheduler component, which will mark the thread as blocked in its own (user-level) data-structures, and dispatch away from it (a sketch of this structure appears after this list of benefits). A good question is why we’d want user-level scheduling at all. The short answer is that it enables scheduler specialization and customization, hierarchical system structures, efficient and predictable parallel fork/join, and failure recovery. It does so with little system overhead (a thread switch is on par with, if not faster than, Linux’s). This factor makes Composite more of a microkernel than some of the popular microkernels, as it moves CPU management policy into user-level components.
IPC mechanism flexibility. Scheduling decisions are not necessary during IPC as execution logically traverses components. This enables the flexibility for invocations to be made via the kernel IPC path, or (given the proper configuration of protection domains) via direct function invocation. Mutable Protection Domains (MPD) leverage this fact to enable the dynamic collapsing of components into the same protection domain, and their separation into different protection domains. This lets the system navigate the trade-off between isolation and performance where that isolation is not necessary for security.
Predictable end-to-end latency. Invocations along long chains of components have the same execution properties as a single thread executing through each component’s software. This removes dependency considerations when assessing the latency of a client’s invocation. As embedded systems increase in complexity, and the layers of software and abstraction grow, this predictability is important.
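As promised above, here is a minimal sketch of what a user-level scheduler component’s block path might look like. The names (sched_block, kernel_thd_switch, thdcap_t, the run-queue) are hypothetical stand-ins rather than Composite’s actual API; the point is only that block/wakeup state lives in component-level data-structures, and the kernel is asked to do nothing more than dispatch.

```c
/* A hypothetical user-level scheduler component: blocking is bookkeeping in
 * component memory; the kernel is used only to dispatch between threads. */
#include <assert.h>
#include <stddef.h>

typedef unsigned int thdcap_t;           /* capability naming a kernel thread */

struct thd {
	thdcap_t    cap;                 /* used to ask the kernel to dispatch */
	int         blocked;
	int         prio;                /* smaller = higher priority */
	struct thd *next;                /* run-queue linkage */
};

static struct thd *runqueue;             /* this scheduler's threads */

/* Stand-in for the kernel's dispatch operation: switch to the thread named by
 * the capability.  No policy in the kernel -- it only switches. */
static void kernel_thd_switch(thdcap_t cap) { (void)cap; }

static struct thd *
highest_prio_runnable(void)
{
	struct thd *t, *best = NULL;

	for (t = runqueue; t; t = t->next) {
		if (!t->blocked && (!best || t->prio < best->prio)) best = t;
	}
	return best;
}

/* Called, e.g., by a lock component when the current thread must wait. */
void
sched_block(struct thd *current)
{
	current->blocked = 1;            /* user-level bookkeeping, no syscall */

	struct thd *next = highest_prio_runnable();
	assert(next);                    /* e.g. an idle thread is always runnable */
	kernel_thd_switch(next->cap);    /* kernel involvement: dispatch only */
}

void
sched_wakeup(struct thd *t)
{
	t->blocked = 0;                  /* chosen (or not) at the next dispatch */
}
```

A policy change (different priorities, hierarchical scheduling, and so on) is then just a change to this component, not to the kernel.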
Composite uses a capability system to track which kernel resources each component has access to. Their implementation (as resource tables) is detailed in the Speck paper. One such resource is a synchronous invocation gate (SInv). Each SInv denotes a function (address) to activate in the server. Each corresponds to a stub that calls a function in the server component’s interface. When the client activates its SInv capability, the stack and instruction pointers to return to in the client are saved on a stack (like the C execution stack) in the kernel. Returning back to the client simply restores these values. Note that neither the client nor the server need to trust each other as the kernel mediates control transfer into and out of each component.
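The following is a simplified sketch of that kernel-side path, assuming a per-thread stack of invocation frames. It is an illustration of the mechanism just described, not the actual Speck implementation; the names and details are hypothetical.

```c
/* A simplified sketch of the kernel-side synchronous invocation (SInv) path:
 * the kernel keeps a per-thread stack of return frames so the same schedulable
 * thread can migrate into the server and later return to the client. */
typedef unsigned long word_t;
struct component;                         /* a protection domain (page tables, etc.) */

struct sinv_cap {                         /* a synchronous invocation gate */
	struct component *server;         /* component to execute in */
	word_t            entry_ip;       /* address of the server-side stub */
};

struct inv_frame {                        /* what is needed to return to the client */
	struct component *comp;
	word_t            sp, ip;
};

#define MAX_INV_DEPTH 32
struct thread {
	struct inv_frame inv_stk[MAX_INV_DEPTH];
	int              inv_top;
};

/* Stand-in: switch the hardware protection domain (e.g. load page tables). */
static void switch_protection_domain(struct component *c) { (void)c; }

int
sinv_invoke(struct thread *t, struct sinv_cap *cap, struct component *cur,
            word_t client_sp, word_t client_ip, word_t *new_sp, word_t *new_ip)
{
	if (t->inv_top >= MAX_INV_DEPTH) return -1;

	/* Push the client's return point: like the C call stack, but in the kernel,
	 * so neither component needs to trust the other with it. */
	t->inv_stk[t->inv_top++] = (struct inv_frame){ cur, client_sp, client_ip };

	switch_protection_domain(cap->server);
	*new_sp = 0;                      /* the server stub acquires its own stack */
	*new_ip = cap->entry_ip;          /* begin execution at the server's stub */
	return 0;
}

int
sinv_return(struct thread *t, word_t *new_sp, word_t *new_ip)
{
	if (t->inv_top == 0) return -1;

	struct inv_frame f = t->inv_stk[--t->inv_top];

	switch_protection_domain(f.comp);
	*new_sp = f.sp;                   /* resume the client where it left off */
	*new_ip = f.ip;
	return 0;
}
```

Because the return frame lives in the kernel, a faulty or malicious server cannot redirect the client’s return, and a client cannot enter the server anywhere other than the stub named by the SInv capability.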
The calling convention is such that four registers’ worth of data is passed on call and return (on x86-32). The design explicitly separates generic data transfer from control flow and fixed-size message passing, as motivated by the previous analysis. Future articles will cover the data-movement APIs in Composite.
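For illustration, a client-side stub for an interface function might look roughly as follows. The sinv_call helper, the capability type, and the fs_write interface are hypothetical stand-ins for whatever actually performs the kernel invocation with up to four register-passed words.

```c
/* A hypothetical client-side stub: an interface function whose arguments fit
 * in the four register-passed words of the invocation calling convention. */
typedef unsigned long word_t;
typedef unsigned int  sinvcap_t;

/* Stand-in: trap to the kernel, passing up to four words in registers through
 * the SInv capability, and return the server's (register-passed) return value. */
word_t sinv_call(sinvcap_t cap, word_t a, word_t b, word_t c, word_t d);

#define FS_WRITE_CAP ((sinvcap_t)42)     /* capability id: purely illustrative */

/* The function a client sees in the server's interface... */
long
fs_write(long fd, word_t buf_id, unsigned long off, unsigned long len)
{
	/* ...is only a register-marshalling wrapper around the SInv capability.
	 * Bulk data is not copied here: buf_id names memory shared separately,
	 * via the data-movement APIs covered in future articles. */
	return (long)sinv_call(FS_WRITE_CAP, (word_t)fd, buf_id, off, len);
}
```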
Scheduling context. Note that the same schedulable thread executes through the entire IPC path much as a thread in a monolithic kernel traverses user- and kernel-mode. Composite’s IPC is an optimized version of thread migration. Note that the optimizations to SRBT made by Credo and timeslice donation move SRBT closer to thread migration by attempting to decouple scheduling and execution contexts.
Execution contexts. This is where it gets a little tricky. The question is: which stack should we use for execution within the server component? There are two straightforward options at the extremes:
A single stack per server. When a client triggers activation in the server, initial execution begins in a stub that attempts to retrieve the stack (sketched below). If another thread is already using it, then there is contention, and a third component in charge of managing stacks (unsurprisingly called the stack manager) is invoked. The stack manager’s job is to coordinate contention on the stacks. We’ve implemented the priority inheritance and priority ceiling protocols as means to predictably coordinate stack access. The ability to define our own code to mediate access to stacks is a benefit of this approach. The downside is that the invocation of the stack manager is overhead (~0.2 microseconds) over what would be an in-kernel operation with SRBT.
A stack per client thread, per server. This case is analogous to the monolithic kernel approach described above where each thread has a stack in both user- and kernel-level. As there is no contention on stacks, there are no dependencies to consider in end-to-end latency. The downside of this is the amount of memory required: \(\sum_{c^x \in C} M \times s^x\) where \(C\) is the set of components, \(s^x\) is the maximum size of the stack in component \(c^x\), and \(M\) is the number of threads in the system. That can be quite a bit of memory, much of it wasted, if many components have low levels of concurrency.
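For a rough sense of scale (numbers purely illustrative): with \(M = 100\) threads, \(|C| = 50\) components, and 4 KiB stacks, that is \(100 \times 50 \times 4\) KiB, roughly 20 MiB of stack memory, even if most components only ever see one or two concurrent invocations.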
Blocking waiting for an execution context (stack) in a server increases the latency for an invocation. We’ve investigated two different policies for assigning execution contexts throughout the system. We’ve considered the hard real-time, worst-case latency introduced by the blocking on execution context contention. Alternatively, we’ve investigated adapting the allocation of stacks throughout the system in response to measured latency. Such policies are possible due to the configurable user-level stack manager that monitors where and when contention occurs, and adapts execution context allocations accordingly.
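Finally, here is a rough sketch of the stack-acquisition path referred to above (the single-stack, or few-stacks, per-server case). The names and the freelist structure are hypothetical; the real stubs and protocol are more involved.

```c
/* Hypothetical server-side stub path: grab a stack from a component-local
 * freelist; if none is available, invoke the stack manager component, which
 * applies its policy (e.g. priority inheritance/ceiling) to resolve contention. */
#include <stdatomic.h>

struct stk {
	struct stk *next;           /* freelist linkage, stored at the stack's base */
	/* ...the rest of the page(s) is the execution stack itself... */
};

static _Atomic(struct stk *) stk_freelist;   /* stacks granted to this component */

/* Stand-in for an invocation of the stack manager component: blocks (per its
 * policy) until a stack is available for this component, then returns it. */
struct stk *stkmgr_get(void);

struct stk *
stub_get_stack(void)
{
	struct stk *s = atomic_load(&stk_freelist);

	while (s) {
		/* Lock-free pop: another thread may race us for the same stack
		 * (ABA and preemption subtleties are ignored in this sketch). */
		if (atomic_compare_exchange_weak(&stk_freelist, &s, s->next)) return s;
	}
	/* Contention: no stack available -- pay ~0.2us to ask the stack manager. */
	return stkmgr_get();
}

void
stub_put_stack(struct stk *s)
{
	s->next = atomic_load(&stk_freelist);
	while (!atomic_compare_exchange_weak(&stk_freelist, &s->next, s)) ;
}
```

The interesting part is the slow path: because stkmgr_get is a component invocation, the policy applied under contention (priority inheritance, ceiling, or adaptive stack allocation) is ordinary component code rather than kernel policy.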
Component invocation via thread migration is the means for IPC in Composite. It is an implementation of one of the design sweet spots for communication: same-core, synchronous IPC with restricted data movement. As opposed to the prevalent implementation that uses synchronous rendezvous between threads, thread migration affords a number of benefits when controlling end-to-end latency is a primary concern. Composite’s implementation also focuses on enabling the component-based implementation of policies for scheduling, contention, and thread block/wakeup. We’ve also seen that work has been done to move SRBT implementations toward thread migration to enjoy some of its benefits. In future posts, we’ll elaborate on the asynchronous IPC facilities in Composite to complete this high-level tour of IPC design.
This post will mainly discuss inter-component communication. However, we’ll still use the traditional terminology of IPC. The important similarity between components and processes is that they are mutually isolated, for example via page-table hardware.↩