Module allocator

§The Allocator

The allocator in Fxfs is a filesystem-wide entity responsible for managing the allocation of regions of the device to “owners” (which are ObjectStores).

Allocations are tracked in an LSMTree, with coalescing used to merge neighboring allocations that share the same properties (owner and reference count). At the time of writing, reference counting is not used; reference counts are intended for future use if/when snapshot support is implemented.
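
As a rough illustration, here is a minimal sketch of the coalescing rule; AllocRecord and try_coalesce are hypothetical stand-ins rather than the real Fxfs types (see AllocatorKey/AllocatorValue below). Two records merge only if they are adjacent and agree on both owner and reference count:

```rust
use std::ops::Range;

// Hypothetical record shape; the real keys/values are the AllocatorKey and
// AllocatorValue types documented in this module.
#[derive(Debug, Clone, PartialEq, Eq)]
struct AllocRecord {
    device_range: Range<u64>, // byte range on the device
    owner_object_id: u64,     // the owning ObjectStore
    refs: u64,                // reference count (currently always 1)
}

/// Merge `b` into `a` when they are adjacent with identical properties.
fn try_coalesce(a: &AllocRecord, b: &AllocRecord) -> Option<AllocRecord> {
    if a.device_range.end == b.device_range.start
        && a.owner_object_id == b.owner_object_id
        && a.refs == b.refs
    {
        Some(AllocRecord {
            device_range: a.device_range.start..b.device_range.end,
            owner_object_id: a.owner_object_id,
            refs: a.refs,
        })
    } else {
        None
    }
}

fn main() {
    let a = AllocRecord { device_range: 0..4096, owner_object_id: 7, refs: 1 };
    let b = AllocRecord { device_range: 4096..8192, owner_object_id: 7, refs: 1 };
    assert_eq!(try_coalesce(&a, &b).unwrap().device_range, 0..8192);
}
```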

There are a number of important implementation features of the allocator that warrant further documentation.

§Byte limits

Fxfs is a multi-volume filesystem. Fuchsia with fxblob currently uses two primary volumes: an unencrypted volume for blob storage and an encrypted volume for data storage. Byte limits ensure that no single volume can consume all available storage. This is important because software updates must always be possible (blobs) and, conversely, configuration data must always be writable (data).
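
The invariant is simple enough to sketch; ByteLimits below is a hypothetical bookkeeping type (the real accounting lives inside the Allocator), but the check is the same: an owner’s allocated bytes plus any new request must stay within its configured limit.

```rust
use std::collections::HashMap;

// Hypothetical bookkeeping, keyed by owner_object_id.
struct ByteLimits {
    limits: HashMap<u64, u64>,    // owner -> byte limit
    allocated: HashMap<u64, u64>, // owner -> bytes currently allocated
}

impl ByteLimits {
    /// Can `owner` allocate `bytes` more without exceeding its limit?
    /// Owners without a configured limit are unconstrained.
    fn can_allocate(&self, owner: u64, bytes: u64) -> bool {
        let used = self.allocated.get(&owner).copied().unwrap_or(0);
        match self.limits.get(&owner) {
            Some(&limit) => used.saturating_add(bytes) <= limit,
            None => true,
        }
    }
}

fn main() {
    let bl = ByteLimits {
        limits: HashMap::from([(1, 1 << 20)]),       // 1 MiB limit
        allocated: HashMap::from([(1, 900 * 1024)]), // 900 KiB used
    };
    assert!(bl.can_allocate(1, 100 * 1024));  // fits within the limit
    assert!(!bl.can_allocate(1, 200 * 1024)); // would exceed it
}
```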

§Reservation tracking

Fxfs on Fuchsia leverages write-back caching, which allows us to buffer writes in RAM until memory pressure, an explicit flush, or a file close requires us to persist data to storage.

To ensure that we do not over-commit in such cases (e.g. by writing many files to RAM only to find out that there is insufficient disk space to store them all), the allocator includes a simple reservation tracking mechanism.

Reservation tracking is implemented hierarchically such that a reservation can portion out sub-reservations, each of which may be “reserved” or “forgotten” when it comes time to actually allocate space.
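
The sketch below shows the idea with simplified, single-threaded types; Reservation and SubReservation here are hypothetical (the real ReservationImpl documented below is thread-safe and more involved). A sub-reservation draws from its parent and is later either committed (consumed by a real allocation) or forgotten (returned to the parent):

```rust
struct Reservation {
    remaining: u64,
}

struct SubReservation<'a> {
    parent: &'a mut Reservation,
    amount: u64,
}

impl Reservation {
    /// Portion out a sub-reservation, failing rather than over-committing.
    fn reserve(&mut self, amount: u64) -> Option<SubReservation<'_>> {
        if self.remaining >= amount {
            self.remaining -= amount;
            Some(SubReservation { parent: self, amount })
        } else {
            None
        }
    }
}

impl SubReservation<'_> {
    /// The space was actually allocated; the hold is consumed for good.
    fn commit(self) {}

    /// The space is no longer needed; return it to the parent.
    fn forget(self) {
        self.parent.remaining += self.amount;
    }
}

fn main() {
    let mut r = Reservation { remaining: 4096 };

    let sub = r.reserve(1024).expect("enough space");
    sub.forget(); // not needed after all
    assert_eq!(r.remaining, 4096);

    let sub = r.reserve(1024).expect("enough space");
    sub.commit(); // consumed by a real allocation
    assert_eq!(r.remaining, 3072);
}
```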

§Fixed locations

The only fixed-location files in Fxfs are the initial 512 KiB extents of the two superblocks, which sit at the very start of the device (i.e. the first 1 MiB). The allocator manages these locations, but does so using a ‘mark_allocated’ method; this is distinct from all other allocations, for which the location is left up to the allocator.
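
The contrast can be sketched with illustrative signatures (a hypothetical AllocatorApi trait; the offsets follow the text above, but none of this is the real Fxfs API):

```rust
use std::ops::Range;

// Hypothetical fixed layout: two superblocks, each with a 512 KiB initial
// extent, occupying the first 1 MiB of the device.
const SUPER_BLOCK_A_EXTENT: Range<u64> = 0..512 * 1024;
const SUPER_BLOCK_B_EXTENT: Range<u64> = 512 * 1024..1024 * 1024;

trait AllocatorApi {
    /// Normal path: the allocator chooses where `len` bytes should live.
    fn allocate(&mut self, owner_object_id: u64, len: u64) -> Option<Range<u64>>;

    /// Fixed-location path: the caller names the exact device range,
    /// e.g. SUPER_BLOCK_A_EXTENT or SUPER_BLOCK_B_EXTENT.
    fn mark_allocated(&mut self, owner_object_id: u64, device_range: Range<u64>);
}
```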

§Deallocated unusable regions

It is not legal to reuse a deallocated disk region until after a flush. Transactions are not guaranteed to be persisted until a successful flush, so any reuse of a region before then, followed by power loss, may lead to corruption.

e.g. Consider if we deallocate a file, reuse its extent for another file, then crash after writing to the new file but before flushing the journal. At the next mount the transaction will have been lost, yet the original file’s data has already been overwritten, leading to inconsistency errors (data corruption).

These regions are currently tracked in RAM in the allocator.
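
A sketch of that in-RAM tracking with a hypothetical DeallocTracker (the real structure differs): regions deallocated since the last flush are quarantined and only become allocatable again once a flush completes.

```rust
use std::ops::Range;

#[derive(Default)]
struct DeallocTracker {
    /// Regions deallocated since the last flush; not available for reuse.
    unusable: Vec<Range<u64>>,
}

impl DeallocTracker {
    fn on_deallocate(&mut self, range: Range<u64>) {
        self.unusable.push(range);
    }

    /// Called after a successful journal flush: everything deallocated
    /// before the flush point is now safe to reuse.
    fn on_flush_complete(&mut self) -> Vec<Range<u64>> {
        std::mem::take(&mut self.unusable)
    }

    /// A candidate region is usable only if it overlaps no quarantined region.
    fn is_usable(&self, candidate: &Range<u64>) -> bool {
        !self
            .unusable
            .iter()
            .any(|r| candidate.start < r.end && r.start < candidate.end)
    }
}

fn main() {
    let mut t = DeallocTracker::default();
    t.on_deallocate(4096..8192);
    assert!(!t.is_usable(&(6144..7168))); // overlaps a freed-but-unflushed region
    t.on_flush_complete();
    assert!(t.is_usable(&(6144..7168))); // safe after the flush
}
```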

§Allocated but uncommitted regions

These are allocated regions that have not (yet) been persisted to disk. They are regular file allocations, but they do not reach persistent storage until their transaction is committed and safely flushed to disk.
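
One way to picture the lifecycle is as a small state machine (hypothetical states, not the real Fxfs representation): an allocation starts out uncommitted in RAM, becomes committed with its transaction, and is durable only after a journal flush.

```rust
// Hypothetical lifecycle states for a region; illustrative only.
#[derive(Debug, PartialEq)]
enum RegionState {
    /// Allocated in RAM; the owning transaction has not yet committed.
    AllocatedUncommitted,
    /// The transaction committed; awaiting a journal flush.
    Committed,
    /// Flushed; the allocation is durably recorded on disk.
    Persisted,
}

fn main() {
    let mut state = RegionState::AllocatedUncommitted;
    assert_eq!(state, RegionState::AllocatedUncommitted);
    state = RegionState::Committed; // transaction commit
    assert_eq!(state, RegionState::Committed);
    state = RegionState::Persisted; // successful journal flush
    assert_eq!(state, RegionState::Persisted);
}
```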

§TRIMed unusable regions

We periodically TRIM unallocated regions to give the SSD controller insight into which parts of the device contain data. While these operations are in flight, the affected regions are held as temporary TRIM allocations, and we must avoid allocating from them until the TRIM completes.
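
A sketch in the spirit of take_for_trimming and TrimmableExtents (the types below are hypothetical simplifications): free extents are pulled out of the pool for the duration of the TRIM and are returned when the hold is dropped.

```rust
use std::ops::Range;

struct FreePool {
    free: Vec<Range<u64>>,
}

/// Holds extents out of circulation while a TRIM is in flight.
struct TrimHold<'a> {
    pool: &'a mut FreePool,
    held: Vec<Range<u64>>,
}

impl FreePool {
    /// Remove all free extents from the pool and hold them for trimming.
    fn take_for_trimming(&mut self) -> TrimHold<'_> {
        let held = std::mem::take(&mut self.free);
        TrimHold { pool: self, held }
    }
}

impl Drop for TrimHold<'_> {
    /// When the TRIM completes (or fails), the extents become free again.
    fn drop(&mut self) {
        self.pool.free.append(&mut self.held);
    }
}

fn main() {
    let mut pool = FreePool { free: vec![0..4096, 8192..16384] };
    {
        let hold = pool.take_for_trimming();
        assert_eq!(hold.held.len(), 2); // issue the TRIM for these extents here
    } // hold dropped: extents return to the free pool
    assert_eq!(pool.free.len(), 2);
}
```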

§Volume deletion

We make use of an optimisation in the case where an entire volume is deleted. In such cases, rather than individually deallocating every disk region associated with that volume, we make note of the deletion and perform a special merge operation on the next LSMTree compaction that filters out allocations belonging to the deleted volume.

This is designed to make dropping volumes significantly cheaper, but it does add some complexity for any allocator implementation that maintains data structures tracking free space (rather than just allocated space).
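
A sketch of the compaction-time filtering (reusing the hypothetical AllocRecord shape from the earlier sketch): records owned by a deleted volume are simply dropped as the tree is rewritten.

```rust
use std::collections::HashSet;
use std::ops::Range;

struct AllocRecord {
    device_range: Range<u64>,
    owner_object_id: u64,
}

/// During compaction, drop every record owned by a deleted volume.
fn compact(records: Vec<AllocRecord>, deleted_owners: &HashSet<u64>) -> Vec<AllocRecord> {
    records
        .into_iter()
        .filter(|r| !deleted_owners.contains(&r.owner_object_id))
        .collect()
}

fn main() {
    let records = vec![
        AllocRecord { device_range: 0..4096, owner_object_id: 1 },
        AllocRecord { device_range: 4096..8192, owner_object_id: 2 },
    ];
    let deleted = HashSet::from([2]); // volume 2 has been deleted
    let kept = compact(records, &deleted);
    assert_eq!(kept.len(), 1);
    assert_eq!(kept[0].owner_object_id, 1);
    assert_eq!(kept[0].device_range, 0..4096);
}
```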

§Image generation

The Fuchsia build process requires building an initial filesystem image. In the case of fxblob-based boards, this is an Fxfs filesystem containing a volume with the base set of blobs required to bootstrap the system. When we build such an image, we want it to be as compact as possible, as we are potentially packaging it up for distribution. To that end, our allocation strategy (or at least the strategy used for image generation) should prefer to allocate from the start of the device wherever possible.
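
One plausible sketch of such a strategy is plain first-fit from the lowest device offset (allocate_first_fit is hypothetical; the real strategy module is more involved):

```rust
use std::ops::Range;

/// Pick the lowest-offset free extent that can satisfy `len` bytes.
fn allocate_first_fit(free: &mut Vec<Range<u64>>, len: u64) -> Option<Range<u64>> {
    free.sort_by_key(|r| r.start);
    let idx = free.iter().position(|r| r.end - r.start >= len)?;
    let start = free[idx].start;
    if free[idx].end - free[idx].start == len {
        free.remove(idx); // the extent is consumed entirely
    } else {
        free[idx].start += len; // shrink the extent from the front
    }
    Some(start..start + len)
}

fn main() {
    let mut free = vec![(1 << 20)..(2 << 20), 0..4096];
    // Prefers the extent at the start of the device.
    assert_eq!(allocate_first_fit(&mut free, 4096), Some(0..4096));
}
```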

Modules§

merge
strategy
A strategy tracks free space and decides where allocations are to be made.

Structs§

Allocator
AllocatorInfoV32
AllocatorKeyPartitionIterator
AllocatorKeyV32
CoalescingIterator
ReservationImpl
A reservation guarantees that when it comes time to actually allocate, it will not fail due to lack of space. Sub-reservations (a.k.a. holds) are possible, which effectively allows part of a reservation to be set aside until it’s time to commit. Reservations do offer some thread-safety, but some responsibility is borne by the caller: e.g. calling forget and reserve at the same time from different threads is unsafe. Reservations have an |owner_object_id| which associates them with an object under the root object store that the reservation is accounted against.
TrimmableExtents
A container for a set of extents which are known to be free and can be trimmed. Returned by take_for_trimming.

Enums§

AllocatorValueV32

Traits§

ReservationOwner
This trait is implemented by things that own reservations.

Functions§

max_extent_size_for_block_size
Computes the target maximum extent size based on the block size of the allocator.

Type Aliases§

AllocatorInfo
Serialized information about the allocator.
AllocatorItem
AllocatorKey
Our allocator implementation tracks extents with a reference count. At time of writing, these reference counts should never exceed 1, but that might change with snapshots and clones.
AllocatorValue
Allocations are “owned” by a single ObjectStore and are reference counted (for future snapshot/clone support).
Hold
Reservation