mbox series

[00/10,v2] xfs: LARP - clean up xfs_attr_set_iter state machine

Message ID 20220414094434.2508781-1-david@fromorbit.com (mailing list archive)
Headers show
Series xfs: LARP - clean up xfs_attr_set_iter state machine | expand

Message

Dave Chinner April 14, 2022, 9:44 a.m. UTC
HI folks,

Version 2 is here and working better. The first 11 patches are
largely the same as the first version, just we a couple of bugs
fixes which caused a couple patches to be reworked to fix conflicts.
These patches set up the xfs_attr_set_iter() state machine to be
more readable and maintainable. It largely jsut worked with LARP=0,
but failed on the first recovery when LARP was enabled.

I realised that recovery wasn't setting the initial state properly,
and then started digging into the recovery code. At this point, I
realised the recovery code didn't work correctly in all cases and
could often leave unremovable incomplete attrs sitting around. The
issues are larger documented ini patch 16 so I won't go over them
here, just read that patch.

However, to get, the whole replace operation for LARP=1 needed to
change. Luckily, that turned out to be pretty simple because it was
largely already broken down into component operations in the state
machine. hence I just needed to add new "remove" initial states to
the set_iter state machine, and that allowed the new algorithm to
function.

Then I realised that I'd just implemented the remove_iter algorithm
in the set_iter state machine, so I removed the remove_iter code and
it's states altogether and just pointed remove ops at the set-iter
remove initial states. The code now uses the XFS_DA_OP_RENAME flag
to determine if it should follow up an add or remove with a remove
or add, and it all largely jsut works. All runtime algorithms run
throught he same state machine just with different initial states
and state progressions.

And with patch 16, attr item log recovery uses that same state
machine, too. It has a few quirks that need to be handled, so I
added the XFS_DA_OP_RECOVERY flag to allow the right thing to be
done with the INCOMPLETE flag deep in the guts of the attr lookup
code. And so recovery with LARP=1 now seems to mostly work.

The current state is that larp=0 seems to pass attr group and
recoverloop group testing just fine. My current build is about 10
loops in without a failure. With larp=1, attr group passes jsut
fine, but I'm getting random recovery failures with recoveryloop
testing that aren't immediately obvious as being attribute intent
recvoery failures.  (e.g. things like unlinked inode recovery
finding inodes on the unlinked list that have nlink != 0). Hence
while it mostly works, I'm sure there are still some bugs lurking
there.

There's also a fair bit of cleanup needed of the last couple of
patches. There's some quirks in the state change code when
completing an op and determining whether we need to run the second
half of a replace op that need cleaning up, and the last patch needs
further splitting up and it changes stuff all over the place.

However, if you want to get an idea of what the code looks like,
see if it runs and breaks on your machines and maybe find some bugs
in it, help is welcome!

Cheers,

Dave.