Capability-based OS Design

Posted on April 6, 2016 by Gabe Parmer

A core of students is currently hacking on the lowest-level components of Composite. Most components in the system mainly rely on other components for most of the services they require. For example, a component that needs to access files would invoke the file-system component. The lowest-level components of the system depend on and use only the raw kernel API. This API is not easy to wrap your head around, so I want to provide an intuition about what is going on, and how to properly use it.

Update. I’ve added a post on capability based systems that starts from a more fundamental and introductory spot than this.

Capability Table-based Resource Access

First, it is important to realize that the kernel provides a very primitive means of accessing the most fundamental resources of the bare-metal system. These resource include

Pages of memory that are used both for user-level memory, and to construct kernel data-structures.
Components which are a simple collection of resource tables (to be defined momentarily).
Threads that execute inside of and between components.
Communication end-points that support synchronous communication, and asynchronous.

With a few exceptions, these are the only resources provided, protected, and modified by the kernel. They are the building blocks of the higher-level OS abstractions that are all defined in user-level components. All operating systems provide some abstraction of resources (for example, files, sockets, threads, processes in UNIX), but they also must define methods for accessing and modifying those resources. In UNIX, we have open, close, read, and write as the polymorphic API for modifying many system resources. With that API, a file-descriptor table maps file descriptors (fds, which are simply integers) to the resources, and the first argument of most methods is the file descriptor. Put another way, the API must decide how resources are named. Files are named via paths, and after an open file descriptor exists to the file, they are named via the descriptor.

Composite’s resources are named in a similar way to file descriptors – with capabilities. Capabilities are tokens whose possession designates the ability to access a specific resource. If a component has a capability to a resource, it can access that resource; if it does not, it cannot. A primary job of the kernel is to ensure that capabilities cannot be synthesized, and are only obtained through proper channels. Put another way, each component’s system calls operate on resources determined by $r = f_i(c)$ where $c$ is a specific capability, $r$ is the resource to perform an operation on, $i$ refers to component $c_i$ ’s capability table, and $f_i$ is the capability mapping using $i$ . Each system call can be seen as $op(f_i(c), args\ldots)$ such that it performs a specified operation ( $op$ ) with a set of arguments ( $args$ ), on a specific resource a component has access to.

An important question in operating system design is how to implement the mapping $f_i$ for a component. Many capability based systems such as KeyKOS, Eros, Fiasco, seL4, and barrelfish use capability tables to implement this mapping. Each component (or “process” in many of these systems) is associated with kernel data-structure that simply maps a capability (integer) to a pointer to the kernel resource. Many of these systems implement the table as a radix trie, or even as a large array in virtual memory. For the purposes of understanding the APIs that access a capability table, we’ll just think of them as an array where each index (capability, $c$ ) is a pointer to the resource ( $r$ ).

Resource Tables

Capabilities define a namespace for accessing resources for components. System calls need to specify which resource to operate on. However, there is another pervasive namespace for resource access that is provided by our hardware, our virtual address space of memory. Indeed, every time we execute a load or store instruction, we are effectively asking to read or write to the memory word at that address. Putting this into the previous terminology, we could think of a load as $load(g_i(vaddr), reg)$ where $vaddr$ is the virtual address we are loading from, $g_i$ is the current ( $i$ th) component’s mapping from virtual addresses to physical memory, and $reg$ is the register in which to put the value in memory. The function $g_i$ is required so that different processes are provided the virtual impression that they own all of memory. When one process accesses memory at address 0xdeadbeef, it will find a different value than another process accessing the same address.

The function $g_i$ is implemented using page tables on most major processors. page tables, like capability tables, can be viewed simply as a large array with an entry for each page (4096 byte region of memory, aligned on a 4096 boundary). $g_i$ is implemented by returning the physical address of the actual memory to access which is pointed to in index $vaddr/4096$ . Note that the array maps pages to physical memory regions also of 4096 bytes (called frames), so there is no reason for our array to have indices for any of the lowest 12 bits of any address (note that $2^{12} = 4096$ ).

It is a little strange to think about it this way, but $f_i$ and $g_i$ are both simply maps that enable a thread executing in a component to access resources that it has been given access to. They are both maps that in a way implement capability-based resource access. Lets look at the similarities:

They both define a namespace of capabilities, either $c$ or $vaddr$ , that are used to denote access to a resource.
If a capability has not been provided to a resource in a component, the resource is not accessible.

Now the differences:

The most obvious: to harness resources provided by $c$ , we make explicit, programmatic system calls. To access memory, we just issue loads and stores (or our compiler provides the code that does).
The resource accessed by the memory operations are limited to memory (with the exception of memory mapped devices). The specific memory operations to be performed are limited by the ISA. In contrast, the resources indexed by traditional capability tables, and the operations performed on them are whatever the kernel wants to define them as.
Hardware acceleration is provided for $g_i$ in many processors (e.g. in all major processors other than MIPS) via page table walking logic and TLB caches.

In some sense, both capability tables and page tables are instances of capability-based resource access. In Composite, we unify both under the term “resource tables”. That is, they are both simply tables that use some capability identifier to index to a specific resource. From now on, I’ll use the more generic “resource table” term to denote one of the tables that indexes resources, and “capability table” as the table that traditionally implements $f_i$ .

I’ve ignored one important aspect of capability-based systems: how resources are added to, and removed from the resource tables, and which capability ids are used for new items. This ends up being the aspect of capability-table design that is often the most difficult.

Integrating Capability Tables and Page Tables

Past systems have treated these two namespaces as fundamentally different abstractions. In Eros, the page table is simply a cache of capability-table entries that define the virtual address mappings of a process. In seL4 and Barrelfish, all memory ranges are referenced by capabilities, and they are attached into the virtual address space explicitly. Thus all operations are directly on capabilities, and the side-effect is that the virtual address space represented as page tables is modified (i.e. the memory is mapped into the virtual address space). This has the benefit that the capability table alone references all resources a component has access to, and memory is tracked in the same manner as capabilities (see the next section). The main down-side of this approach are that there are redundant references to pages between capability tables and page tables – which, on its own isn’t that bad, but could indicate the existence of a simpler abstraction, wastes memory, and, as we will see, complicates concurrent modifications.

Capability Creation and Destruction

The Traditional Approach

If a component has access to a resource in its own capability table, it can copy the capability (i.e. create another reference to the resource) across a communication end-point. In this way, a resource accessible in one component ( $A$ ), can be shared (or, in some cases, moved) to another component ( $B$ ). This is called delegation. This is similar to how file descriptors can be passed through a UNIX domain socket between processes. One question this raises is which capability id do we use for the resource? The capability id of the new capability in component $B$ can either be 1) decided using an in-kernel capability allocator, or 2) chosen by the other component when it asked for the new capability. Importantly, both component $A$ and $B$ consent to the creation of the capability, thus the sharing of the resource. This is necessary for $A$ to prevent the uncontrolled sharing of its resources, but also to avoid denial of service attacks on the capability id namespace of $B$ .

Destruction of a capability within a component is related to the concept of revocation. Most of the previously listed operating systems support the recursive revocation of capabilities in which the revocation of a capability causes any previously delegated capabilities, and any transitive delegations to be deleted. To support recursive revocation, the kernel must track and remember each delegation. This is tracked in the mapping data-base in L4 variants such as Fiasco, and in the derivation tree in seL4/Barrelfish¹. This data-structure – which I’ll call the delegation tree – is a simple tree (often linearized in concrete implementations). Revocation, then, is the simple action of walking a sub-tree and removing all of the corresponding capabilities.

This style of delegation and revocation is amazingly powerful and versatile, but has some down-sides. This delegation tree complicates system design. The revocation operation is not necessarily practically bounded in time as it is dependent on an unpredictably sized sub-tree. Care must be taken to either enable kernel preemption carefully (e.g. with preemption points in seL4) to avoid long latency spikes, and the parallel execution of revocations and delegations is difficult if not impossible to make scalable. From another viewpoint, you can look at recursive revocation as a specific policy for controlling the resource sharing throughout the system. Other policies might include preventing transitive delegation, enabling a single transitive delegation, prevent capabilities from being delegated into specific subsystems, preventing the revocation of a capability, or delegating and revoking atomically to multiple components (e.g. replicas). It represents a reasonable default policy that systems have used to great effect, but it is not the only reasonable policy.

Composite’s Resource Tables

Composite does not adopt the notions of delegation and recursive revocation of capabilities, thus eschews all of the benefits and detriments that come with it. This policy is replaced with a more primitive set of operations that can be used to implement delegation and recursive revocation, along with (all?) other policies for capability creation and deletion. These policies are implemented in user-level components that could, for example, track the delegation tree. The kernel mechanism for this exposes the ability to add and delete capabilities at specific slots in a resource table in a controlled manner. Only components that are trusted to define capability management policies have access to this ability. Specifically, Composite creates the notion of higher-order resource tables, or resource tables that refer to other resource tables. Only if a component has a capability to a resource table can it create and delete capabilities in it. Thus, the ability to modify a specific resource table is controlled using the very protection mechanisms of the resource tables themselves. Though there are two types of resource tables, they are modified using the same interface.

If a component, $C$ , is meant to provide delegation and revocation for a set of “client” components, then it will have a capability to each of their resource tables. The operation the kernel provides for revocation is simple:

$delete(d, d')$

$delete$ takes a capability referring to a resource table as its first argument, and the capability id to delete within that resource table. After this operation,

$f_{f_i(d)}(d') = \emptyset$ , which is to say that looking up

$d'$ in the resource table referred to by the resource table at capability

$d$ results in no resource. And the operation for delegation:

$copy(d, d', s, s')$ As with

$delete$ ,

$copy$ takes a destination and source capabilities to resource tables as first and third arguments, and the capability ids within those respective resource tables to copy to and from. After

$copy(d, d', s, s')$ ,

$f_{f_i(d)}(d') = f_{f_i(s)}(s')$ . Note that this does not copy the resource, only the capability to the resource. In other words, it manipulates the access rights of the destination by adding access to a resource. Importantly, both the

$delete$ and

$copy$ operations operate on generic resource tables – both capability tables and page tables.

Page tables are used to store virtual memory mappings, but also memory of types that are inaccessible to user-level. The present bit indicates if a mapping is visible to user-level. If the present bit is unset, the mapping can still hold a reference to a frame that can be used for kernel data-structures, or retyped into user-level-accessible memory (more in retyping in a future article). Page tables, then, are treated as a special shape of capability tables that contain references to all memory resources that a component has access to.

To create a kernel resource and a capability to reference it, the Composite kernel implements a series of activate calls for different types of resources. These take the form, $activate(f_i(c), T, args\ldots)$ where $T$ is the type of the kernel resource to create, and $args$ include capabilities to the necessary resources to build the new resource. For example, creating a new thread requires memory, so a capability to a page table $p$ and a specific page that can be used as kernel memory $p'$ is required.

$activate(f_i(c), THD, g_{f_i(p)}(p'))$ This is a constant refrain in the Composite kernel API: creating a kernel resource requires references to existing resources that are used in its construction. Examples include the memory required for threads and temporal capabilities, the resource table capabilities that are required to create a component, and the receive end-point that is activated when an asynchronous send end-point is used.

An important aspect of the Composite kernel API is that when a capability is created, the user must explicitly specify which capability id to use. This means that the kernel does not have to track a freelist (or freelist equivalent) of capabilities. Instead user-level components must track this manually. We provide a few libraries that ease the burden of this book-keeping (e.g. the cos_kernel_api that does allocate-only bump-pointer capability id allocation).

Why Resource Tables?

The unification in Composite of capability tables and page tables into the single abstraction of a resource table means that page tables are first-class access control primitives, rather than a cache of capability table information. Conceptually, this is different from existing systems, but not valuable on its own. Before we talk about why we do this, lets summarize the differences with previous systems.

The operations on resource tables are more primitive than delegate and revoke. They do not track delegation trees. Such higher-level abstractions must be provided by components.
The capability namespaces (i.e. the capability ids in both capability tables, and page tables) are component-managed, rather than kernel-managed.
Capability tables and page tables are simply two different capability namespaces for accessing different types of resources (abstract vs. memory). Page-tables are used explicitly as capability-indexed resource containers. It just so happens that we use hardware acceleration to perform direct access to those resources via load and store.
Access control for the modification of component permissions (i.e. the creation and deletion of capabilities) is performed using higher-order resource-tables. Only if a component has a capability to a resource table, can it modify the resources contained therein. Most components do not have such access to resource tables.
Thought not described here, we enforce liveness constraints on the reuse of capability ids in resource tables. This enables the implementation of scalable capability management.

What is the benefit of a world viewed through resource tables?

Avoiding delegation trees simplifies the kernel and enables the scalable implementation of these operations, especially when combined with the component-management of the capability id namespaces.
The primitives for capability management are scalable and predictable, thus enable the implementation of higher-level abstractions such as delegation and revocation that are also scalable and predictable (i.e. the scalability and predictability of those abstractions is solely a function of their own implementation). A significant enabler of the scalable implementation of these primitives is that there are few kernel data-structures that span multiple structures. The consistent modification of multiple pointers, or disparate data often requires locks that disable scalability.
Built on the resource table manipulation primitives, Composite has implemented a few different versions of capability tracking. For example, the parsec work (see the Eurosys paper, to be linked soon) provides a scalable version of delegation and revocation, while cbufs (see the ISMM paper, to be linked soon) provide controlled memory sharing with system-wide garbage-collection.
This implementation is consistent with the philosophy of $\mu$ -kernels – what can be reasonably implemented as a configurable component at user-level, should be. This enables the export of capability management policy to user level. This has the benefit of leveraging fault tolerance, parallelism, configurable scheduling, and customizability.

Resource Table Examples

So that’s the idea behind resource tables. Now lets dive into what this looks like in code. The lowest-level API is a C wrapper around the system call.

static inline int
call_cap_op(int cap_no, int op_code,
            int arg1, int arg2, int arg3, int arg4);

This is a generic call that performs an operation (defined by op_code) on a capability (cap_no which is really an enumerated type), passing up to four arguments. An example of this in use to create a new component follows.

call_cap_op(higher_order_captbl_cap, CAPTBL_OP_COMPACTIVATE, cap,
            (captbl_cap<<16) | pagetable_cap, liveness_id,
            entry_address)

A new component will be activated within the capability table referred to by the higher_order_captbl_cap capability at capability id cap. A component requires a capability table, and a page table to be created, along with a few other “bookkeeping” values. Thus, the capability table, and page table are passed in as captbl_cap and pagetable_cap. Note that we actually need more than 4 arguments, so we simply compress these capabilities into a single argument. On an architecture with more registers, this would not be necessary.

No-one wants to program directly with this API. We also see (in the cap argument), that we are in charge of deciding which capability id to use for the new capability, which implies that we track which capability ids are used and not used. It is all very low-level and error prone. Thus, the first layer of abstraction around this primitive API, is provided by the cos_kernel_api.

The main abstraction this layer provides is that of a component that tracks capability ids. This is provided by the cos_compinfo structure. This structure notably tracks which capabilities refer to the component’s page table, capability table, and component capability. These capabilities to resource tables are higher-order. We pass this structure into all aspects of the API that modify resource tables, to identify which higher-order resource tables to operate on. This structure also has a primitive allocator for capability ids, thus removing that tedious job from the user. This allocator does not handle deallocation, thus does not re-use capability ids. Another layer of abstraction top of that (provided by parsec) will handle capability id reuse (including virtual address re-use). The cos_compinfo structure, and the function to initiate it:

/* Component captbl/pgtbl allocation information */
struct cos_compinfo {
       /* capabilities to higher-order capability tables */
       capid_t pgtbl_cap, captbl_cap, comp_cap;
       /* ... */
};
void cos_compinfo_init(struct cos_compinfo *ci, captblcap_t
        pgtbl_cap, pgtblcap_t captbl_cap, compcap_t comp_cap,
        vaddr_t heap_ptr, capid_t cap_frontier,
        struct cos_compinfo *ci_resources);

Note that the ci_resources argument is somewhat mysterious. Whenever memory is required to implement one of the cos_kernel_api’s functions (recall the previous thread example where we need to provide kernel memory to activate a thread), it is allocated from the resource tables of ci_resources. This is lightweight support that enables one component to control the resources, and simply allocate them out to other components when necessary.

The functions that wrap the kernel system calls are all pre-pended with cos. Examples follow:

captblcap_t cos_captbl_alloc(struct cos_compinfo *ci);
pgtblcap_t cos_pgtbl_alloc(struct cos_compinfo *ci);
compcap_t cos_comp_alloc(struct cos_compinfo *ci,
            captblcap_t ctc, pgtblcap_t ptc, vaddr_t entry);

You can look at the structure of which functions take arguments of which types to understand which resources are required to create other resources. In this case, to create a component, you need to provide the capability table and page table capabilities. cos_comp_alloc wraps the raw system call shown above. Note that we not longer specify the capability id to use for the new component, and instead it is returned from the function. We also don’t specify which higher-order capability refers to the capability table to create the new component in as that is already encoded in the ci argument. Digging into the implementation of that function, we see the mechanics of capability id allocation (calling __capid_bump_alloc), and the use of the captbl_cap in the corresponding cos_compinfo.

compcap_t
cos_comp_alloc(struct cos_compinfo *ci, captblcap_t ctc,
               pgtblcap_t ptc, vaddr_t entry)
{
	capid_t cap;
	int     lid = livenessid_bump_alloc();

	cap = __capid_bump_alloc(ci, CAP_COMP);
	if (!cap) return 0;
	if (call_cap_op(ci->captbl_cap, CAPTBL_OP_COMPACTIVATE,
                        cap, (ctc<<16) | ptc, lid, entry)) BUG();

	return cap;
}

Another example were memory is required for the resource to be activated is for threads. In this case __alloc_mem_cap allocates both a new frame that is typed to be used as a kernel data-structure, and a capability id.

static thdcap_t
__cos_thd_alloc(struct cos_compinfo *ci, compcap_t comp,
                int init_data)
{
	vaddr_t kmem;
	capid_t cap;

	if (__alloc_mem_cap(ci, CAP_THD, &kmem, &cap)) return 0;
	if (call_cap_op(ci->captbl_cap, CAPTBL_OP_THDACTIVATE,
                        (init_data << 16) | cap, ci->pgtbl_cap,
                        kmem, comp)) BUG();

	return cap;
}

Using Resources

At the beginning of this post, we discussed all of the kernel resources. There aren’t that many of them, and we haven’t touched much on how to use them. Using a thread means dispatching to it (i.e. performing a context switch to it). Using an asynchronous send end-point means activating a thread connected to the associated receive end-point. Using resources is much simpler, represent the common case for most components – the rest of the cos_kernel_api requires higher-order resource tables that most components don’t have access to. We can see these examples in the API as well.

int cos_thd_switch(thdcap_t c);
int cos_asnd(asndcap_t snd);

Note that there are important differences between a derivation tree that relates to the retyping operation in seL4 and Barrelfish, and the mapping tree in Lietke’s L4 (e.g. L4 Pistacio) and Fiasco. The concrete difference between the two approach is if delegation is performed by referencing a capability, or a virtual address (which is viewed as a capability). As these designs must consider two address spaces (capabilities and virtual addresses), they differ with respect to how memory is represented in each, and how it moves between each. I’ll generalize both here and focus only on the tree-based structure of how the capabilities are delegated between components.↩