§The Allocator
The allocator in Fxfs is a filesystem-wide entity responsible for managing the allocation of regions of the device to “owners” (which are ObjectStores).
Allocations are tracked in an LSMTree with coalescing used to merge neighboring allocations with the same properties (owner and reference count). As of writing, reference counting is not used. (Reference counts are intended for future use if/when snapshot support is implemented.)
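As a rough illustration of the coalescing rule, the sketch below merges two adjacent allocation records when both the owner and the reference count match. The `AllocationRecord` type and `coalesce` function are illustrative stand-ins, not the real Fxfs `AllocatorKey`/`AllocatorValue` types or merge logic.

```rust
use std::ops::Range;

// Illustrative stand-in for an allocation record; the real Fxfs types are
// AllocatorKey/AllocatorValue stored in an LSM tree.
#[derive(Debug, PartialEq)]
struct AllocationRecord {
    device_range: Range<u64>, // byte range on the device
    owner_object_id: u64,     // the owning ObjectStore
    refs: u64,                // reference count (currently always 1)
}

// Merge two neighboring records if they share the same properties.
fn coalesce(a: &AllocationRecord, b: &AllocationRecord) -> Option<AllocationRecord> {
    if a.device_range.end == b.device_range.start
        && a.owner_object_id == b.owner_object_id
        && a.refs == b.refs
    {
        Some(AllocationRecord {
            device_range: a.device_range.start..b.device_range.end,
            owner_object_id: a.owner_object_id,
            refs: a.refs,
        })
    } else {
        None
    }
}

fn main() {
    let a = AllocationRecord { device_range: 0..4096, owner_object_id: 3, refs: 1 };
    let b = AllocationRecord { device_range: 4096..8192, owner_object_id: 3, refs: 1 };
    assert_eq!(coalesce(&a, &b).unwrap().device_range, 0..8192);
}
```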
There are a number of important implementation features of the allocator that warrant further documentation.
§Byte limits
Fxfs is a multi-volume filesystem. Fuchsia with fxblob currently uses two primary volumes - an unencrypted volume for blob storage and an encrypted volume for data storage. Byte limits ensure that no one volume can consume all available storage. This is important as software updates must always be possible (blobs) and conversely configuration data should always be writable (data).
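A minimal sketch of what a per-owner byte-limit check could look like; the `ByteLimits` type and its fields are hypothetical, not the actual Fxfs implementation or API.

```rust
use std::collections::HashMap;

// Hypothetical sketch: enforce per-owner byte limits before allocating.
struct ByteLimits {
    limits: HashMap<u64, u64>,    // owner_object_id -> maximum bytes allowed
    allocated: HashMap<u64, u64>, // owner_object_id -> bytes currently allocated
}

impl ByteLimits {
    fn can_allocate(&self, owner_object_id: u64, bytes: u64) -> bool {
        let used = self.allocated.get(&owner_object_id).copied().unwrap_or(0);
        match self.limits.get(&owner_object_id) {
            Some(&limit) => used.saturating_add(bytes) <= limit,
            None => true, // no limit configured for this owner
        }
    }
}

fn main() {
    let mut bl = ByteLimits { limits: HashMap::new(), allocated: HashMap::new() };
    bl.limits.insert(2, 1 << 20); // illustrative: cap this volume at 1 MiB
    bl.allocated.insert(2, 900_000);
    assert!(bl.can_allocate(2, 100_000));   // stays under the limit
    assert!(!bl.can_allocate(2, 200_000));  // would exceed the limit
}
```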
§Reservation tracking
Fxfs on Fuchsia leverages write-back caching which allows us to buffer writes in RAM until memory pressure, an explicit flush or a file close requires us to persist data to storage.
To ensure that we do not over-commit in such cases (e.g. by writing many files to RAM only to find out that there is insufficient disk to store them all), the allocator includes a simple reservation tracking mechanism.
Reservation tracking is implemented hierarchically such that a reservation can portion out sub-reservations, each of which may be “reserved” or “forgotten” when it comes time to actually allocate space.
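The sketch below shows the general shape of such hierarchical tracking, using illustrative `Reservation`/`SubReservation` types; the real types are `Reservation`, `ReservationImpl` and `Hold`, with different signatures.

```rust
// Hypothetical sketch of hierarchical reservation tracking.
struct Reservation {
    remaining: u64, // bytes reserved but not yet handed out to sub-reservations
}

struct SubReservation {
    amount: u64,
}

impl Reservation {
    fn new(bytes: u64) -> Self {
        Reservation { remaining: bytes }
    }

    // Carve out a sub-reservation, failing if not enough is left.
    fn sub_reserve(&mut self, bytes: u64) -> Option<SubReservation> {
        if bytes <= self.remaining {
            self.remaining -= bytes;
            Some(SubReservation { amount: bytes })
        } else {
            None
        }
    }

    // "Forget" a sub-reservation: return its bytes to the parent unused.
    fn forget(&mut self, sub: SubReservation) {
        self.remaining += sub.amount;
    }
}

fn main() {
    let mut r = Reservation::new(8192);
    let hold = r.sub_reserve(4096).expect("enough space was reserved");
    assert_eq!(r.remaining, 4096);
    // The buffered write was dropped before being persisted, so give the bytes back.
    r.forget(hold);
    assert_eq!(r.remaining, 8192);
}
```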
§Fixed locations
The only fixed-location files in Fxfs are the first 512kiB extents of the two superblocks that sit at the start of the disk (i.e. the first 1MiB). The allocator manages these locations, but does so via a ‘mark_allocated’ method; all other allocations leave the choice of location up to the allocator.
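To illustrate the distinction, here is a sketch with a hypothetical `MockAllocator`; the names and signatures are illustrative, not the real `Allocator` API.

```rust
use std::ops::Range;

// Hypothetical sketch contrasting the two allocation paths.
struct MockAllocator {
    allocated: Vec<Range<u64>>,
    next_free: u64, // next offset the allocator would choose itself
}

impl MockAllocator {
    // Fixed-location path: the caller dictates the range. In Fxfs this is only
    // used for the superblock extents in the first 1 MiB of the device.
    fn mark_allocated(&mut self, device_range: Range<u64>) {
        self.next_free = self.next_free.max(device_range.end);
        self.allocated.push(device_range);
    }

    // Normal path: the allocator chooses where the bytes live.
    fn allocate(&mut self, len: u64) -> Range<u64> {
        let range = self.next_free..self.next_free + len;
        self.next_free = range.end;
        self.allocated.push(range.clone());
        range
    }
}

fn main() {
    let mut a = MockAllocator { allocated: Vec::new(), next_free: 0 };
    // The two 512 KiB superblock extents occupy the first 1 MiB.
    a.mark_allocated(0..512 * 1024);
    a.mark_allocated(512 * 1024..1024 * 1024);
    // All other allocations are placed by the allocator itself.
    assert_eq!(a.allocate(4096), (1024 * 1024)..(1024 * 1024 + 4096));
}
```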
§Deallocated unusable regions
It is not legal to reuse a deallocated disk region until after a flush. Transactions are not guaranteed to be persisted until after a successful flush, so reusing a region before then and losing power may lead to corruption.
For example, consider deallocating a file, reusing its extent for another file, then crashing after writing to the new file but before flushing the journal. At the next mount the deallocating transaction will have been lost even though the original file’s data was overwritten, leading to inconsistency errors (data corruption).
These regions are currently tracked in RAM in the allocator.
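A minimal sketch of that in-RAM tracking, with illustrative `DeallocatedRegions` bookkeeping rather than the real data structures:

```rust
use std::ops::Range;

// Hypothetical sketch: regions freed in recent transactions are kept in an
// in-memory list and only become allocatable again after the journal flushes.
struct DeallocatedRegions {
    pending: Vec<Range<u64>>, // freed, but the freeing transaction may not be persisted yet
    free: Vec<Range<u64>>,    // safe to reuse
}

impl DeallocatedRegions {
    fn deallocate(&mut self, range: Range<u64>) {
        // Not reusable yet: a crash before the next flush could leave state in
        // which the old file still owns this range.
        self.pending.push(range);
    }

    fn did_flush(&mut self) {
        // The deallocation is now durable, so the space may be reused.
        self.free.append(&mut self.pending);
    }
}

fn main() {
    let mut d = DeallocatedRegions { pending: Vec::new(), free: Vec::new() };
    d.deallocate(8192..16384);
    assert!(d.free.is_empty()); // still unusable
    d.did_flush();
    assert_eq!(d.free, vec![8192..16384]); // reusable after the flush
}
```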
§Allocated but uncommitted regions
These are allocated regions that are not (yet) persisted to disk. They are regular file allocations, but are not stored on persistent storage until their transaction is committed and safely flushed to disk.
§TRIMed unusable regions
We periodically TRIM unallocated regions to give the SSD controller insight into which parts of the device contain data. While such an operation is in flight, the regions being trimmed are held as temporary allocations so that they cannot be handed out for reuse.
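The sketch below shows the general idea, with hypothetical `FreeSpace`/`TrimHold` types standing in for the real machinery; the actual container returned by `take_for_trimming` is `TrimmableExtents`.

```rust
use std::ops::Range;

// Hypothetical sketch: hold free extents as temporary allocations while they
// are being TRIMed, then release them once the operation completes.
struct TrimHold {
    extents: Vec<Range<u64>>,
}

struct FreeSpace {
    free: Vec<Range<u64>>,
}

impl FreeSpace {
    // Take some free extents out of circulation for the duration of the TRIM.
    fn take_for_trimming(&mut self, max: usize) -> TrimHold {
        let extents = self.free.split_off(self.free.len().saturating_sub(max));
        TrimHold { extents }
    }

    // The TRIM finished; the extents become allocatable again.
    fn release(&mut self, hold: TrimHold) {
        self.free.extend(hold.extents);
    }
}

fn main() {
    let mut space = FreeSpace { free: vec![0..4096, 8192..16384] };
    let hold = space.take_for_trimming(1);
    assert_eq!(space.free.len(), 1); // the held extent is temporarily unusable
    // ... issue the TRIM command for the held extents here ...
    space.release(hold);
    assert_eq!(space.free.len(), 2);
}
```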
§Volume deletion
We make use of an optimisation in the case where an entire volume is deleted. In such cases, rather than individually deallocate all disk regions associated with that volume, we make note of the deletion and perform a special merge operation on the next LSMTree compaction that filters out allocations for the deleted volume.
This makes dropping a volume significantly cheaper, but it does add some complexity for any allocator implementation that maintains data structures tracking free space (rather than just allocated space).
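A sketch of what such a compaction-time filter might look like, using an illustrative `AllocationRecord` type and a plain function rather than the real merge iterator:

```rust
use std::collections::HashSet;
use std::ops::Range;

// Hypothetical sketch: records owned by deleted volumes are simply dropped as
// the LSM tree is compacted.
#[derive(Debug, PartialEq)]
struct AllocationRecord {
    device_range: Range<u64>,
    owner_object_id: u64,
}

fn filter_deleted_owners(
    records: Vec<AllocationRecord>,
    deleted_owners: &HashSet<u64>,
) -> Vec<AllocationRecord> {
    records
        .into_iter()
        .filter(|r| !deleted_owners.contains(&r.owner_object_id))
        .collect()
}

fn main() {
    let records = vec![
        AllocationRecord { device_range: 0..4096, owner_object_id: 2 },
        AllocationRecord { device_range: 4096..8192, owner_object_id: 7 },
    ];
    let deleted: HashSet<u64> = [7].into_iter().collect();
    let kept = filter_deleted_owners(records, &deleted);
    assert_eq!(kept.len(), 1);
    assert_eq!(kept[0].owner_object_id, 2);
}
```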
§Image generation
The Fuchsia build process requires building an initial filesystem image. In the case of fxblob-based boards, this is an Fxfs filesystem containing a volume with the base set of blobs required to bootstrap the system. When we build such an image, we want it to be as compact as possible as we’re potentially packaging it up for distribution. To that end, our allocation strategy (or at least the strategy used for image generation) should prefer to allocate from the start of the device wherever possible.
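A minimal sketch of a "lowest offset first" placement policy of the kind described; the `allocate_lowest` function is hypothetical, not the actual allocation-strategy code.

```rust
use std::ops::Range;

// Hypothetical sketch: always satisfy a request from the lowest-offset free
// extent that fits, keeping the image dense at the start of the device.
fn allocate_lowest(free: &mut Vec<Range<u64>>, len: u64) -> Option<Range<u64>> {
    // Free extents are kept sorted by start offset; pick the first that fits.
    let idx = free.iter().position(|r| r.end - r.start >= len)?;
    let start = free[idx].start;
    let allocated = start..start + len;
    if free[idx].end - free[idx].start == len {
        free.remove(idx);
    } else {
        free[idx].start += len;
    }
    Some(allocated)
}

fn main() {
    let mut free = vec![0..4096, 1_048_576..2_097_152];
    assert_eq!(allocate_lowest(&mut free, 4096), Some(0..4096));
    assert_eq!(allocate_lowest(&mut free, 4096), Some(1_048_576..1_052_672));
}
```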
Modules§
Structs§
- Allocator
- AllocatorInfoV32
- AllocatorKeyPartitionIterator
- AllocatorKeyV32
- CoalescingIterator
- ReservationImpl - A reservation guarantees that when it comes time to actually allocate, it will not fail due to lack of space. Sub-reservations (a.k.a. holds) are possible, which effectively allows part of a reservation to be set aside until it’s time to commit. Reservations do offer some thread-safety, but some responsibility is borne by the caller: e.g. calling forget and reserve at the same time from different threads is unsafe. Reservations have an |owner_object_id| which associates them with an object under the root object store against which the reservation is accounted.
- TrimmableExtents - A container for a set of extents which are known to be free and can be trimmed. Returned by take_for_trimming.
Enums§
Traits§
- ReservationOwner - This trait is implemented by things that own reservations.
Functions§
- max_extent_size_for_block_size - Computes the target maximum extent size based on the block size of the allocator.
Type Aliases§
- AllocatorInfo - Serialized information about the allocator.
- AllocatorItem
- AllocatorKey - Our allocator implementation tracks extents with a reference count. At time of writing, these reference counts should never exceed 1, but that might change with snapshots and clones.
- AllocatorValue - Allocations are “owned” by a single ObjectStore and are reference counted (for future snapshot/clone support).
- Hold
- Reservation