[RFC,00/36] xfs: more work towards shrinking.

Message ID: 20211203000111.2800982-1-david@fromorbit.com

Dave Chinner Dec. 3, 2021, midnight UTC
Note: this is a "heads up" at this point so that people can see what
is coming down the line and make early comments, not a request to
consider these for merging soon.

This series continues the work towards making shrinking a filesystem
possible.  We need to be able to stop operations from taking place
on AGs that need to be removed by a shrink, so before shrink can be
implemented we need to have the infrastructure in place to prevent
incursion into AGs that are going to be, or are in the process of,
being removed from active duty.

The focus of this is to make operations that depend on access to an
AG use the perag to access and pin that AG in active use, thereby
creating a barrier we can use to delay shrink until all active uses
have been drained and new uses are prevented.

This series starts by driving the perag down into the AGI, AGF and
AGFL access routines and unifies the perag structure initialisation
with the high level AG header read functions. This largely replaces
the xfs_mount/agno pair that is passed to all these functions with a
perag, and in most places we already have a perag ready to pass in.
There are a few places where perags need to be grabbed before
reading the AG header buffers - some of these will need to be driven
to higher layers to ensure we can run operations on AGs without
getting stuck part way through waiting on a perag reference.
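
As a rough illustration of the shape of this change (the exact
prototypes in the series may differ): where the AG header read
functions used to take an xfs_mount/agno pair, the perag-based
versions take a perag the caller already holds a reference to, e.g.
for the AGF read path:

/*
 * Illustrative sketch only - the exact prototypes in the series may
 * differ.  The old interface took an (xfs_mount, agno) pair:
 *
 *      int xfs_alloc_read_agf(struct xfs_mount *mp, struct xfs_trans *tp,
 *                             xfs_agnumber_t agno, int flags,
 *                             struct xfs_buf **agfbpp);
 *
 * while the perag-based interface looks something like:
 */
struct xfs_perag;
struct xfs_trans;
struct xfs_buf;

int xfs_alloc_read_agf(struct xfs_perag *pag, struct xfs_trans *tp,
                       int flags, struct xfs_buf **agfbpp);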

The next section of this patchset moves some of the AG geometry
information from the xfs_mount to the xfs_perag, and starts
converting code that requires geometry validation to use a perag
instead of a mount, avoiding the need to extract the AGNO from the
object's location. This also allows us to store the AG size in the
perag, so we no longer have to compare the agno against sb_agcount
to determine whether the AG is the last, and hence possibly
runt-sized, AG. This
greatly simplifies some of the type validity checking we do and
substantially reduces the CPU overhead of type validity checking. It
also cuts over 1.2kB out of the binary size.
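
As a rough model of why this helps (simplified userspace C, not the
kernel code): once the perag caches its own block count at init time,
including the runt size of the last AG, a bounds check becomes a
single comparison instead of re-deriving the AG size on every call:

/*
 * Toy model: cache each AG's block count in the per-AG structure at
 * init time so validity checks no longer need to work out whether
 * this is the (possibly runt) last AG on every call.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t xfs_agnumber_t;
typedef uint32_t xfs_agblock_t;

struct mount_geom {
        xfs_agnumber_t  agcount;        /* number of AGs (sb_agcount) */
        xfs_agblock_t   agblocks;       /* size of a full AG (sb_agblocks) */
        xfs_agblock_t   last_agblocks;  /* size of the (possibly runt) last AG */
};

struct perag_geom {
        xfs_agnumber_t  agno;
        xfs_agblock_t   block_count;    /* this AG's size, cached at init */
};

/* Old style: every check re-derives whether this is the runt last AG. */
static bool agbno_valid_old(const struct mount_geom *mg, xfs_agnumber_t agno,
                            xfs_agblock_t agbno)
{
        xfs_agblock_t eoag = (agno == mg->agcount - 1) ?
                                mg->last_agblocks : mg->agblocks;

        return agbno < eoag;
}

/* New style: the AG size is computed once when the perag is set up. */
static void perag_geom_init(struct perag_geom *pg, const struct mount_geom *mg,
                            xfs_agnumber_t agno)
{
        pg->agno = agno;
        pg->block_count = (agno == mg->agcount - 1) ?
                                mg->last_agblocks : mg->agblocks;
}

static bool agbno_valid_new(const struct perag_geom *pg, xfs_agblock_t agbno)
{
        return agbno < pg->block_count;
}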

The series then starts converting the code to use active references.
Active reference counts are used by high level code that needs to
prevent the AG from being taken out from under it by a shrink
operation. The high level code needs to be able to handle not
getting an active reference gracefully, and the shrink code will
need to wait for active references to drain before continuing.

Active references are implemented just as reference counts right now
- an active reference is taken at perag init during mount, and all
other active references are dependent on the active reference count
being greater than zero. This gives us an initial method of stopping
new active references without needing other infrastructure; just
drop the reference taken at filesystem mount time, and once the
refcount falls to zero no new references can be taken.
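
A minimal model of that mechanism, using C11 atomics rather than the
kernel's atomic_t (the names and the spinning wait are illustrative
only):

/*
 * Minimal model of the active reference barrier.
 */
#include <stdatomic.h>
#include <stdbool.h>

struct perag_model {
        atomic_int      active_ref;     /* 1 reference held by the mount */
};

/* Take an active reference, but only if the AG still accepts them. */
static bool perag_model_grab(struct perag_model *pag)
{
        int old = atomic_load(&pag->active_ref);

        while (old > 0) {
                if (atomic_compare_exchange_weak(&pag->active_ref, &old,
                                                 old + 1))
                        return true;
        }
        /* Count fell to zero: AG is being taken out of service. */
        return false;
}

static void perag_model_rele(struct perag_model *pag)
{
        atomic_fetch_sub(&pag->active_ref, 1);
}

/*
 * Shrink-side barrier: drop the mount-time reference so that
 * perag_model_grab() starts failing, then wait for existing active
 * users to drain.  A real implementation would sleep on a waitqueue
 * instead of spinning.
 */
static void perag_model_wait_inactive(struct perag_model *pag)
{
        atomic_fetch_sub(&pag->active_ref, 1);
        while (atomic_load(&pag->active_ref) > 0)
                ;       /* placeholder for a proper wait */
}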

In future, this will need to take into account AG control state
(e.g. offline, no alloc, etc) as well as the reference count, but
right now we can implement a basic barrier for shrink with just
reference count manipulations. There are patches to convert the
perag state to atomic opstate fields similar to the xfs_mount and
xlog opstate fields in preparation for this.
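
For reference, the opstate pattern being mirrored is a word of flag
bits manipulated atomically; a toy version with hypothetical per-AG
flags might look like:

/*
 * Toy version of an atomic "opstate" flag word for the perag; the
 * flag names are hypothetical, and the kernel code would use
 * set_bit()/test_bit() on an unsigned long rather than C11 atomics.
 */
#include <stdatomic.h>
#include <stdbool.h>

enum {
        PAG_MODEL_OFFLINE = 0,  /* hypothetical: AG is offline */
        PAG_MODEL_NOALLOC,      /* hypothetical: no new allocations */
};

struct perag_opstate {
        atomic_ulong    opstate;
};

static void perag_model_set_offline(struct perag_opstate *ps)
{
        atomic_fetch_or(&ps->opstate, 1UL << PAG_MODEL_OFFLINE);
}

static bool perag_model_is_offline(struct perag_opstate *ps)
{
        return atomic_load(&ps->opstate) & (1UL << PAG_MODEL_OFFLINE);
}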

The first target for active reference conversion is the
for_each_perag*() iterators. This captures a lot of high level code
that should skip offline AGs, and introduces the ability to
differentiate between a lookup that didn't find an online AG and the
end of the AG iteration range.

From there, the inode allocation AG selection is converted to active
references, and the perag is driven deeper into the inode allocation
and btree code to replace the xfs_mount. Most of the inode
allocation code operates on a single AG once it is selected, hence
it should pass the perag around as the primary referenced object for
the allocation, not the xfs_mount. There is a bit of churn here, but it
emphasises that inode allocation is inherently an allocation group
based operation.

Next the bmap/alloc interface undergoes a major untangling,
reworking xfs_bmap_btalloc() into separate allocation operations for
different contexts and failure handling behaviours. This then allows
us to completely remove the xfs_alloc_vextent() layer by
restructuring xfs_alloc_vextent()/xfs_alloc_ag_vextent() into a set
of relatively simple helper functions that describe the allocation
they are doing, e.g. xfs_alloc_vextent_exact_bno().
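
The sort of interface this is heading towards is sketched below;
xfs_alloc_vextent_exact_bno() is named above, but the prototypes and
the companion helper here are assumptions about the shape of the API,
not its final form:

/*
 * Illustrative only: these prototypes and the companion helper are
 * assumptions about the shape of the interface, not the final API.
 */
struct xfs_alloc_arg;
typedef unsigned long long xfs_fsblock_t;       /* stand-in typedef */

/* Allocate the extent exactly at @target within its AG, or fail. */
int xfs_alloc_vextent_exact_bno(struct xfs_alloc_arg *args,
                                xfs_fsblock_t target);

/* Allocate as close to @target as possible within @target's AG. */
int xfs_alloc_vextent_near_bno(struct xfs_alloc_arg *args,
                               xfs_fsblock_t target);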

This allows the requirements for accessing AGs to be allocation
context dependent. The allocations that require operation on a
single AG generally can't tolerate failure after the allocation
method and AG have been decided on, and hence the caller needs to
manage the active references to ensure the allocation does not race
with shrink removing the selected AG for the duration of the
operation that requires access to that allocation group.
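
In terms of the toy active-reference helpers sketched earlier, the
caller-side pattern for a single-AG operation is roughly: grab an
active reference up front, fall back to another AG if that is
refused, and hold the reference across the whole operation so that
shrink has to wait:

/*
 * Caller-side pattern, in terms of the toy helpers sketched earlier.
 * do_single_ag_allocation() is a hypothetical stand-in for the actual
 * allocation work.
 */
#include <stdbool.h>
#include <errno.h>

struct perag_model;
bool perag_model_grab(struct perag_model *pag);
void perag_model_rele(struct perag_model *pag);
int do_single_ag_allocation(struct perag_model *pag);

static int alloc_in_this_ag(struct perag_model *pag)
{
        int error;

        if (!perag_model_grab(pag))
                return -EAGAIN; /* AG going away; caller picks another AG */

        /* From here until perag_model_rele(), shrink has to wait for us. */
        error = do_single_ag_allocation(pag);
        perag_model_rele(pag);
        return error;
}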

Other allocations iterate AGs and so the first AG is just a hint -
these do not need to pin a perag first as they can tolerate not
being able to access an AG by simply skipping over it. These require
new perag iteration functions that can start at arbitrary AGs and
wrap around at arbitrary AGs, hence a new set of
for_each_perag_wrap*() helpers to do this.
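
A toy model of the wrap-around walk: start at an arbitrary AG, wrap
to AG 0 at the end of the filesystem, stop before revisiting the
start AG, and simply skip any AG that refuses an active reference:

/*
 * Toy model of the wrap-around walk (plain userspace C, not the new
 * iterator macros).  model_grab_ag() stands in for taking an active
 * reference and pretends the odd-numbered AGs are being taken offline.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t xfs_agnumber_t;

static bool model_grab_ag(xfs_agnumber_t agno)
{
        return (agno % 2) == 0;
}

static void iterate_ags_wrapped(xfs_agnumber_t start_agno,
                                xfs_agnumber_t agcount)
{
        xfs_agnumber_t agno = start_agno;
        uint32_t visited;

        for (visited = 0; visited < agcount; visited++) {
                if (model_grab_ag(agno))
                        printf("visiting AG %u\n", agno); /* work, then release */

                if (++agno >= agcount)
                        agno = 0;       /* wrap back to AG 0 */
        }
}

int main(void)
{
        /* Walks AGs 3..7 then 0..2, visiting only the even-numbered ones. */
        iterate_ags_wrapped(3, 8);
        return 0;
}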

So far this smoke tests OK - there's a problem with AGF locking
deadlocks as a result of converting xfs_alloc_vextent_iterate_ags()
to use for_each_perag_wrap_range() that shows up in stress tests, but
it passes everything in the quick group.

There's more to come:
- the bmapi layer needs to handle active AG references for exact and
  near allocation
- filestreams allocator AG selection needs a significant rework to
  simplify and use active references
- converting the allocation "firstblock" restrictions to hold an
  actively referenced perag, not a filesystem block address.
- inode cache lookups need to be converted to active references
- audits needed to find and convert all the places that we use
  bp->b_pag instead of active references passed from high level
  code.
- addition of a "going offline" opstate and state machine to use for
  rejecting new active references as well as blocking shrink from
  making progress until all active references are gone
- ioctls for changing AG state from userspace
- audit of the freeing code to determine whether it can use passive
  references to allow freeing of blocks (which may require
  allocation!) whilst new allocations are prevented from being run
  on "going offline" AGs. This will allow userspace to stop new
  allocations in AGs that are to be shrunk before it starts emptying
  them and freeing the space that they have in use.
- the physical shrink code.

This current patchset is based on 5.16-rc3.

-Dave.