mbox series

[PATCHSET,v2,0/2] fstests: fix cpu hotplug mess

Message ID 169335047228.3523635.3342369338633870707.stgit@frogsfrogsfrogs (mailing list archive)
Headers show
Series fstests: fix cpu hotplug mess | expand

Message

Darrick J. Wong Aug. 29, 2023, 11:07 p.m. UTC
Hi all,

Ritesh and Eric separately reported crashes in XFS's hook function for
CPU hot remove if the remove event races with a filesystem being
mounted.  I also noticed via generic/650 that once in a while the log
will shut down over an apparent overrun of a transaction reservation;
this turned out to be due to CIL percpu list aggregation failing to pick
up the percpu list items from a dying CPU.

Either way, the solution here is to eliminate the need for a CPU dying
hook by using a private cpumask to track which CPUs have added to their
percpu lists directly, and iterating with that mask.  This fixes the log
problems and (I think) solves a theoretical UAF bug in the inodegc code
too.

v2: fix a few put_cpu uses, add necessary memory barriers, and use
    atomic cpumask operations

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fix-percpu-lists
---
 tests/generic/650 |   18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

Comments

Dave Chinner Aug. 30, 2023, 10:57 p.m. UTC | #1
On Tue, Aug 29, 2023 at 04:07:52PM -0700, Darrick J. Wong wrote:
> Hi all,
> 
> Ritesh and Eric separately reported crashes in XFS's hook function for
> CPU hot remove if the remove event races with a filesystem being
> mounted.  I also noticed via generic/650 that once in a while the log
> will shut down over an apparent overrun of a transaction reservation;
> this turned out to be due to CIL percpu list aggregation failing to pick
> up the percpu list items from a dying CPU.
> 
> Either way, the solution here is to eliminate the need for a CPU dying
> hook by using a private cpumask to track which CPUs have added to their
> percpu lists directly, and iterating with that mask.  This fixes the log
> problems and (I think) solves a theoretical UAF bug in the inodegc code
> too.
> 
> v2: fix a few put_cpu uses, add necessary memory barriers, and use
>     atomic cpumask operations
> 
> If you're going to start using this code, I strongly recommend pulling
> from my git trees, which are linked below.
> 
> This has been running on the djcloud for months with no problems.  Enjoy!
> Comments and questions are, as always, welcome.

These changes look good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>