diff mbox series

[v1,01/17] xfs: Add larp state XFS_DAS_CREATE_FORK

Message ID 20220611094200.129502-2-allison.henderson@oracle.com (mailing list archive)
State New, archived
Headers show
Series Return of the Parent Pointers | expand

Commit Message

Allison Henderson June 11, 2022, 9:41 a.m. UTC
Recent parent pointer testing has exposed a bug in the underlying
larp state machine.  A replace operation may remove an old attr
before adding the new one, but if it is the only attr in the fork,
then the fork is removed.  This later causes a null pointer in
xfs_attr_try_sf_addname which expects the fork present.  This
patch adds an extra state to create the fork.

Additionally the new state will be used by parent pointers which
need to add attributes to newly created inodes that do not yet
have a fork.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
---
 fs/xfs/libxfs/xfs_attr.c | 22 ++++++++++++++++++++--
 fs/xfs/libxfs/xfs_attr.h |  2 +-
 fs/xfs/libxfs/xfs_bmap.c |  2 +-
 fs/xfs/libxfs/xfs_bmap.h |  1 +
 fs/xfs/xfs_attr_item.c   |  4 +++-
 5 files changed, 26 insertions(+), 5 deletions(-)

Comments

Dave Chinner June 15, 2022, 1:09 a.m. UTC | #1
On Sat, Jun 11, 2022 at 02:41:44AM -0700, Allison Henderson wrote:
> Recent parent pointer testing has exposed a bug in the underlying
> larp state machine.  A replace operation may remove an old attr
> before adding the new one, but if it is the only attr in the fork,
> then the fork is removed.  This later causes a null pointer in
> xfs_attr_try_sf_addname which expects the fork present.  This
> patch adds an extra state to create the fork.

Hmmmm.

I thought I fixed those problems - in xfs_attr_sf_removename() there
is this code:

        if (totsize == sizeof(xfs_attr_sf_hdr_t) && xfs_has_attr2(mp) &&
            (dp->i_df.if_format != XFS_DINODE_FMT_BTREE) &&
            !(args->op_flags & (XFS_DA_OP_ADDNAME | XFS_DA_OP_REPLACE))) {
                xfs_attr_fork_remove(dp, args->trans);

A replace operation will have XFS_DA_OP_REPLACE set, and so the
final remove from a sf directory will not remove the attr fork in
this case. There is equivalent checks in the leaf/node remove name
paths to avoid removing the attr fork if the last attr is removed
while the attr fork is in those formats.

How do you reproduce this issue?

> Additionally the new state will be used by parent pointers which
> need to add attributes to newly created inodes that do not yet
> have a fork.

We already have the capability of doing that in xfs_init_new_inode()
by passing in init_xattrs == true. So when we are creating a new
inode with parent pointers enabled, we know that we are going to be
creating an xattr on the inode and so we should always set
init_xattrs in that case.

This should avoid the need for parent pointers to ever need to run
an extra transaction to create the attr fork. Hence, AFAICT, this
new state to handle attr fork creation shouldn't ever be needed for
parent pointers....

What am I missing?

Cheers,

Dave.
Allison Henderson June 15, 2022, 11:40 p.m. UTC | #2
On Wed, 2022-06-15 at 11:09 +1000, Dave Chinner wrote:
> On Sat, Jun 11, 2022 at 02:41:44AM -0700, Allison Henderson wrote:
> > Recent parent pointer testing has exposed a bug in the underlying
> > larp state machine.  A replace operation may remove an old attr
> > before adding the new one, but if it is the only attr in the fork,
> > then the fork is removed.  This later causes a null pointer in
> > xfs_attr_try_sf_addname which expects the fork present.  This
> > patch adds an extra state to create the fork.
> 
> Hmmmm.
> 
> I thought I fixed those problems - in xfs_attr_sf_removename() there
> is this code:
> 
>         if (totsize == sizeof(xfs_attr_sf_hdr_t) && xfs_has_attr2(mp)
> &&
>             (dp->i_df.if_format != XFS_DINODE_FMT_BTREE) &&
>             !(args->op_flags & (XFS_DA_OP_ADDNAME |
> XFS_DA_OP_REPLACE))) {
>                 xfs_attr_fork_remove(dp, args->trans);
Hmm, ok, let me shuffle in some traces around there to see where things
fall off the rails

> 
> A replace operation will have XFS_DA_OP_REPLACE set, and so the
> final remove from a sf directory will not remove the attr fork in
> this case. There is equivalent checks in the leaf/node remove name
> paths to avoid removing the attr fork if the last attr is removed
> while the attr fork is in those formats.
> 
> How do you reproduce this issue?
> 

Sure, you can apply this kernel set or download it here:
https://github.com/allisonhenderson/xfs/tree/xfs_new_pptrs

Next you'll need this xfsprogs that has the neccassary updates to run
parent pointers
https://github.com/allisonhenderson/xfsprogs/tree/xfsprogs_new_pptrs


To reproduce the bug, you'll need to apply a quick patch on the kernel
side:
diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index b86188b63897..f279afd43462 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -741,8 +741,8 @@ xfs_attr_set_iter(
 		fallthrough;
 	case XFS_DAS_SF_ADD:
 		if (!args->dp->i_afp) {
-			attr->xattri_dela_state = XFS_DAS_CREATE_FORK;
-			goto next_state;
+//			attr->xattri_dela_state = XFS_DAS_CREATE_FORK;
+//			goto next_state;
 		}
 		return xfs_attr_sf_addname(attr);
 	case XFS_DAS_LEAF_ADD:



Lastly, you'll need Catherines parent pointer tests that she sent out a
day or so ago.  Once you have that, just run the parent pointers test:
echo 1 > /sys/fs/xfs/debug/larp; ./check xfs/549

Dmesg below:


==================================================================
[  365.288788] BUG: KASAN: null-ptr-deref in
xfs_attr_try_sf_addname+0x2a/0xd0 [xfs]
[  365.289170] Read of size 1 at addr 000000000000002a by task
mount/23669

[  365.289182] CPU: 10 PID: 23669 Comm: mount Tainted:
G            E     5.18.0-rc2 #84
[  365.289196] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
VirtualBox 12/01/2006
[  365.289202] Call Trace:
[  365.289206]  <TASK>
[  365.289211]  dump_stack_lvl+0x49/0x5f
[  365.289232]  print_report.cold+0x494/0x6b4
[  365.289242]  ? path_mount+0x641/0xfd0
[  365.289258]  ? __x64_sys_mount+0xca/0x110
[  365.289265]  ? do_syscall_64+0x3b/0x90
[  365.289274]  ? xfs_attr_try_sf_addname+0x2a/0xd0 [xfs]
[  365.289649]  kasan_report+0xa7/0x120
[  365.289660]  ? xfs_attr_try_sf_addname+0x2a/0xd0 [xfs]
[  365.290034]  __asan_load1+0x6a/0x70
[  365.290048]  xfs_attr_try_sf_addname+0x2a/0xd0 [xfs]
[  365.290423]  xfs_attr_set_iter+0x2f9/0x1510 [xfs]
[  365.290801]  ? xfs_init_attr_trans+0x130/0x130 [xfs]
[  365.291178]  ? kasan_poison+0x3c/0x50
[  365.291187]  ? kasan_unpoison+0x28/0x50
[  365.291197]  ? xfs_errortag_test+0x57/0x120 [xfs]
[  365.291592]  xfs_xattri_finish_update+0x66/0xd0 [xfs]
[  365.292008]  xfs_attr_finish_item+0x43/0x120 [xfs]
[  365.292410]  xfs_defer_finish_noroll+0x3c2/0xcc0 [xfs]
[  365.292804]  ? xfs_defer_cancel+0xc0/0xc0 [xfs]
[  365.293184]  ? kasan_quarantine_put+0x57/0x180
[  365.293196]  __xfs_trans_commit+0x333/0x610 [xfs]
[  365.293599]  ? xfs_trans_free_items+0x150/0x150 [xfs]
[  365.293995]  ? kvfree+0x28/0x30
[  365.294004]  ? kvfree+0x28/0x30
[  365.294017]  ? xfs_defer_ops_continue+0x1c5/0x280 [xfs]
[  365.294401]  xfs_trans_commit+0x10/0x20 [xfs]
[  365.294797]  xlog_finish_defer_ops+0x133/0x270 [xfs]
[  365.295203]  ? xlog_recover_free_trans+0x1c0/0x1c0 [xfs]
[  365.295609]  ? xfs_attr_finish_item+0x120/0x120 [xfs]
[  365.296036]  ? _raw_spin_lock+0x88/0xd7
[  365.296044]  ? _raw_spin_lock_irqsave+0xf0/0xf0
[  365.296054]  xlog_recover_process_intents+0x1f7/0x3e0 [xfs]
[  365.296469]  ? xlog_do_recover+0x290/0x290 [xfs]
[  365.296757]  ? __queue_delayed_work+0xdc/0x140
[  365.296766]  xlog_recover_finish+0x18/0x150 [xfs]
[  365.296949]  xfs_log_mount_finish+0x194/0x310 [xfs]
[  365.297132]  xfs_mountfs+0x957/0xeb0 [xfs]
[  365.297313]  ? xfs_mount_reset_sbqflags+0xa0/0xa0 [xfs]
[  365.297494]  ? xfs_filestream_put_ag+0x40/0x40 [xfs]
[  365.297674]  ? xfs_mru_cache_create+0x226/0x280 [xfs]
[  365.297855]  xfs_fs_fill_super+0x7f0/0xd20 [xfs]
[  365.298034]  get_tree_bdev+0x22e/0x360
[  365.298041]  ? xfs_fs_sync_fs+0x150/0x150 [xfs]
[  365.298223]  xfs_fs_get_tree+0x15/0x20 [xfs]
[  365.298401]  vfs_get_tree+0x4c/0x120
[  365.298408]  path_mount+0x641/0xfd0
[  365.298411]  ? putname+0x7c/0x90
[  365.298416]  ? finish_automount+0x2e0/0x2e0
[  365.298419]  ? kmem_cache_free+0x104/0x4d0
[  365.298422]  ? putname+0x7c/0x90
[  365.298426]  ? putname+0x7c/0x90
[  365.298430]  do_mount+0xd2/0xf0
[  365.298433]  ? path_mount+0xfd0/0xfd0
[  365.298436]  ? memdup_user+0x52/0x90
[  365.298440]  __x64_sys_mount+0xca/0x110
[  365.298444]  do_syscall_64+0x3b/0x90
[  365.298448]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  365.298451] RIP: 0033:0x7f7e4e213cae
[  365.298455] Code: 48 8b 0d e5 c1 0c 00 f7 d8 64 89 01 48 83 c8 ff c3
66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00 00 0f
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b2 c1 0c 00 f7 d8 64 89 01 48
[  365.298459] RSP: 002b:00007ffee6811418 EFLAGS: 00000246 ORIG_RAX:
00000000000000a5
[  365.298467] RAX: ffffffffffffffda RBX: 00007f7e4e345204 RCX:
00007f7e4e213cae
[  365.298470] RDX: 000056006007db70 RSI: 000056006007dbb0 RDI:
000056006007db90
[  365.298472] RBP: 000056006007d960 R08: 0000000000000000 R09:
00007ffee6810190
[  365.298475] R10: 0000000000000000 R11: 0000000000000246 R12:
0000000000000000
[  365.298477] R13: 000056006007db90 R14: 000056006007db70 R15:
000056006007d960




> > Additionally the new state will be used by parent pointers which
> > need to add attributes to newly created inodes that do not yet
> > have a fork.
> 
> We already have the capability of doing that in xfs_init_new_inode()
> by passing in init_xattrs == true. So when we are creating a new
> inode with parent pointers enabled, we know that we are going to be
> creating an xattr on the inode and so we should always set
> init_xattrs in that case.
Hmm, ok.  I'll add some tracing around in there too, if I back out the
entire first patch, we crash out earlier in recovery path because no
state is set.  If we enter xfs_attri_item_recover with no fork, we end
up in the following switch:


        case
XFS_ATTRI_OP_FLAGS_REPLACE:                                        
                args->value = nv-
>value.i_addr;                                 
                args->valuelen = nv-
>value.i_len;                               
                args->total = xfs_attr_calc_size(args,
&local);                 
                if (xfs_inode_hasattr(args-
>dp))                                
                        attr->xattri_dela_state =
xfs_attr_init_replace_state(args);
                else                                                   
         
                        attr->xattri_dela_state =
xfs_attr_init_add_state(args);
                break;     

Which will leave the state unset if the fork is absent.
> 
> This should avoid the need for parent pointers to ever need to run
> an extra transaction to create the attr fork. Hence, AFAICT, this
> new state to handle attr fork creation shouldn't ever be needed for
> parent pointers....
> 
> What am I missing?
> 
I hope the description helped?  I'll do some more poking around too and
post back if I find anything else.

Allison

> Cheers,
> 
> Dave.
Dave Chinner June 16, 2022, 2:08 a.m. UTC | #3
On Wed, Jun 15, 2022 at 04:40:07PM -0700, Alli wrote:
> On Wed, 2022-06-15 at 11:09 +1000, Dave Chinner wrote:
> > On Sat, Jun 11, 2022 at 02:41:44AM -0700, Allison Henderson wrote:
> > > Recent parent pointer testing has exposed a bug in the underlying
> > > larp state machine.  A replace operation may remove an old attr
> > > before adding the new one, but if it is the only attr in the fork,
> > > then the fork is removed.  This later causes a null pointer in
> > > xfs_attr_try_sf_addname which expects the fork present.  This
> > > patch adds an extra state to create the fork.
> > 
> > Hmmmm.
> > 
> > I thought I fixed those problems - in xfs_attr_sf_removename() there
> > is this code:
> > 
> >         if (totsize == sizeof(xfs_attr_sf_hdr_t) && xfs_has_attr2(mp)
> > &&
> >             (dp->i_df.if_format != XFS_DINODE_FMT_BTREE) &&
> >             !(args->op_flags & (XFS_DA_OP_ADDNAME |
> > XFS_DA_OP_REPLACE))) {
> >                 xfs_attr_fork_remove(dp, args->trans);
> Hmm, ok, let me shuffle in some traces around there to see where things
> fall off the rails
> 
> > 
> > A replace operation will have XFS_DA_OP_REPLACE set, and so the
> > final remove from a sf directory will not remove the attr fork in
> > this case. There is equivalent checks in the leaf/node remove name
> > paths to avoid removing the attr fork if the last attr is removed
> > while the attr fork is in those formats.
> > 
> > How do you reproduce this issue?
> > 
> 
> Sure, you can apply this kernel set or download it here:
> https://github.com/allisonhenderson/xfs/tree/xfs_new_pptrs
> 
> Next you'll need this xfsprogs that has the neccassary updates to run
> parent pointers
> https://github.com/allisonhenderson/xfsprogs/tree/xfsprogs_new_pptrs
> 
> 
> To reproduce the bug, you'll need to apply a quick patch on the kernel
> side:
> diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> index b86188b63897..f279afd43462 100644
> --- a/fs/xfs/libxfs/xfs_attr.c
> +++ b/fs/xfs/libxfs/xfs_attr.c
> @@ -741,8 +741,8 @@ xfs_attr_set_iter(
>  		fallthrough;
>  	case XFS_DAS_SF_ADD:
>  		if (!args->dp->i_afp) {
> -			attr->xattri_dela_state = XFS_DAS_CREATE_FORK;
> -			goto next_state;
> +//			attr->xattri_dela_state = XFS_DAS_CREATE_FORK;
> +//			goto next_state;
>  		}

Ah, so it's recovery that trips this....

> [  365.290048]  xfs_attr_try_sf_addname+0x2a/0xd0 [xfs]
> [  365.290423]  xfs_attr_set_iter+0x2f9/0x1510 [xfs]
> [  365.291592]  xfs_xattri_finish_update+0x66/0xd0 [xfs]
> [  365.292008]  xfs_attr_finish_item+0x43/0x120 [xfs]
> [  365.292410]  xfs_defer_finish_noroll+0x3c2/0xcc0 [xfs]
> [  365.293196]  __xfs_trans_commit+0x333/0x610 [xfs]
> [  365.294401]  xfs_trans_commit+0x10/0x20 [xfs]
> [  365.294797]  xlog_finish_defer_ops+0x133/0x270 [xfs]
> [  365.296054]  xlog_recover_process_intents+0x1f7/0x3e0 [xfs]

ayup.

> > > Additionally the new state will be used by parent pointers which
> > > need to add attributes to newly created inodes that do not yet
> > > have a fork.
> > 
> > We already have the capability of doing that in xfs_init_new_inode()
> > by passing in init_xattrs == true. So when we are creating a new
> > inode with parent pointers enabled, we know that we are going to be
> > creating an xattr on the inode and so we should always set
> > init_xattrs in that case.
> Hmm, ok.  I'll add some tracing around in there too, if I back out the
> entire first patch, we crash out earlier in recovery path because no
> state is set.  If we enter xfs_attri_item_recover with no fork, we end
> up in the following switch:
> 
> 
>         case XFS_ATTRI_OP_FLAGS_REPLACE:
>                 args->value = nv- >value.i_addr;
>                 args->valuelen = nv- >value.i_len;
>                 args->total = xfs_attr_calc_size(args, &local);
>                 if (xfs_inode_hasattr(args- >dp))
>                         attr->xattri_dela_state = xfs_attr_init_replace_state(args);
>                 else
>                         attr->xattri_dela_state = xfs_attr_init_add_state(args);
>                 break;
> 
> Which will leave the state unset if the fork is absent.

Yeah, OK, I think this is because we are combining attribute
creation with inode creation. When log recovery replays inode core
modifications, it replays the inode state into the cluster buffer
and writes it. Then when we go to replay the attr intent at the end
of recovery, the inode is read from disk via xlog_recover_iget(),
but we don't initialise the attr fork because ip->i_forkoff is zero.
i.e. it has no attrs at this point.

I suspect that we could catch that in xlog_recover_iget() when it is
called from attr recovery. i.e. we detect newly created inodes and
initialise the attr fork similar to what we do in
xfs_init_new_inode(). I was thinking something like this:

	if (init_xattrs && xfs_has_attr(mp)) {
		if (!ip->i_forkoff && !ip->i_nextents) {
			ip->i_forkoff = xfs_default_attroffset(ip) >> 3;
			ip->i_afp = xfs_ifork_alloc(XFS_DINODE_FMT_EXTENTS, 0);
		} else {
			ASSERT(ip->i_afp);
		}
	}

Would do the trick, but then I realised that the timing/ordering is
very different to runtime: we don't replay the attr intent until the
end of log recovery and all the inode changes have been replayed
into the inode cluster buffer. That means we could have already
replayed a bunch of data fork extent modifications into the inode,
and so the default attr offset is almost certainly not a safe thing
to be using here. Indeed, there might not be space in the inode for
the attr we want to insert and so we might need to convert the data
fork to a different format before we run the attr intent replay.

That does indeed take us down the path of needing to run a full attr
fork creation operation, because we aren't creating the parent attr
when the data fork is empty in recovery. i.e. recovery is changing
the temporal order of data fork vs attr operations because intent
recovery uses an eventual consistency model rather than the
immediate consistency model that runtime uses.

Hmmmm. I'm going to have to have a think about the implications of
that relevation. I think it does mean we need recovery to be able to
run an attr fork initialisation, but I suspect it also means that
the runtime fork initialisation might also need to include it, too.

I'll need to think on this a bit.

> > This should avoid the need for parent pointers to ever need to run
> > an extra transaction to create the attr fork. Hence, AFAICT, this
> > new state to handle attr fork creation shouldn't ever be needed for
> > parent pointers....
> > 
> > What am I missing?
> > 
> I hope the description helped?  I'll do some more poking around too and
> post back if I find anything else.

Yup, it most definitely helped. :)

You've pointed out something I had completely missed w.r.t. attr
intent replay ordering against replay of data fork modifications.
There's definitely an issue here, I think it might be a fundamental
issue with the recovery mechanism (and not parent pointers), and I
think we'll end up needing  something like this patch to fix it.
Let me bounce this around my head for a bit...

Cheers,

Dave.
Dave Chinner June 16, 2022, 5:32 a.m. UTC | #4
On Thu, Jun 16, 2022 at 12:08:43PM +1000, Dave Chinner wrote:
> On Wed, Jun 15, 2022 at 04:40:07PM -0700, Alli wrote:
> > On Wed, 2022-06-15 at 11:09 +1000, Dave Chinner wrote:
> > > On Sat, Jun 11, 2022 at 02:41:44AM -0700, Allison Henderson wrote:
> > > > Recent parent pointer testing has exposed a bug in the underlying
> > > > larp state machine.  A replace operation may remove an old attr
> > > > before adding the new one, but if it is the only attr in the fork,
> > > > then the fork is removed.  This later causes a null pointer in
> > > > xfs_attr_try_sf_addname which expects the fork present.  This
> > > > patch adds an extra state to create the fork.
> > > 
> > > Hmmmm.
> > > 
> > > I thought I fixed those problems - in xfs_attr_sf_removename() there
> > > is this code:
> > > 
> > >         if (totsize == sizeof(xfs_attr_sf_hdr_t) && xfs_has_attr2(mp)
> > > &&
> > >             (dp->i_df.if_format != XFS_DINODE_FMT_BTREE) &&
> > >             !(args->op_flags & (XFS_DA_OP_ADDNAME |
> > > XFS_DA_OP_REPLACE))) {
> > >                 xfs_attr_fork_remove(dp, args->trans);
> > Hmm, ok, let me shuffle in some traces around there to see where things
> > fall off the rails
> > 
> > > 
> > > A replace operation will have XFS_DA_OP_REPLACE set, and so the
> > > final remove from a sf directory will not remove the attr fork in
> > > this case. There is equivalent checks in the leaf/node remove name
> > > paths to avoid removing the attr fork if the last attr is removed
> > > while the attr fork is in those formats.
> > > 
> > > How do you reproduce this issue?
> > > 
> > 
> > Sure, you can apply this kernel set or download it here:
> > https://github.com/allisonhenderson/xfs/tree/xfs_new_pptrs
> > 
> > Next you'll need this xfsprogs that has the neccassary updates to run
> > parent pointers
> > https://github.com/allisonhenderson/xfsprogs/tree/xfsprogs_new_pptrs
> > 
> > 
> > To reproduce the bug, you'll need to apply a quick patch on the kernel
> > side:
> > diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> > index b86188b63897..f279afd43462 100644
> > --- a/fs/xfs/libxfs/xfs_attr.c
> > +++ b/fs/xfs/libxfs/xfs_attr.c
> > @@ -741,8 +741,8 @@ xfs_attr_set_iter(
> >  		fallthrough;
> >  	case XFS_DAS_SF_ADD:
> >  		if (!args->dp->i_afp) {
> > -			attr->xattri_dela_state = XFS_DAS_CREATE_FORK;
> > -			goto next_state;
> > +//			attr->xattri_dela_state = XFS_DAS_CREATE_FORK;
> > +//			goto next_state;
> >  		}
> 
> Ah, so it's recovery that trips this....
> 
> > [  365.290048]  xfs_attr_try_sf_addname+0x2a/0xd0 [xfs]
> > [  365.290423]  xfs_attr_set_iter+0x2f9/0x1510 [xfs]
> > [  365.291592]  xfs_xattri_finish_update+0x66/0xd0 [xfs]
> > [  365.292008]  xfs_attr_finish_item+0x43/0x120 [xfs]
> > [  365.292410]  xfs_defer_finish_noroll+0x3c2/0xcc0 [xfs]
> > [  365.293196]  __xfs_trans_commit+0x333/0x610 [xfs]
> > [  365.294401]  xfs_trans_commit+0x10/0x20 [xfs]
> > [  365.294797]  xlog_finish_defer_ops+0x133/0x270 [xfs]
> > [  365.296054]  xlog_recover_process_intents+0x1f7/0x3e0 [xfs]
> 
> ayup.
> 
> > > > Additionally the new state will be used by parent pointers which
> > > > need to add attributes to newly created inodes that do not yet
> > > > have a fork.
> > > 
> > > We already have the capability of doing that in xfs_init_new_inode()
> > > by passing in init_xattrs == true. So when we are creating a new
> > > inode with parent pointers enabled, we know that we are going to be
> > > creating an xattr on the inode and so we should always set
> > > init_xattrs in that case.
> > Hmm, ok.  I'll add some tracing around in there too, if I back out the
> > entire first patch, we crash out earlier in recovery path because no
> > state is set.  If we enter xfs_attri_item_recover with no fork, we end
> > up in the following switch:
> > 
> > 
> >         case XFS_ATTRI_OP_FLAGS_REPLACE:
> >                 args->value = nv- >value.i_addr;
> >                 args->valuelen = nv- >value.i_len;
> >                 args->total = xfs_attr_calc_size(args, &local);
> >                 if (xfs_inode_hasattr(args- >dp))
> >                         attr->xattri_dela_state = xfs_attr_init_replace_state(args);
> >                 else
> >                         attr->xattri_dela_state = xfs_attr_init_add_state(args);
> >                 break;
> > 
> > Which will leave the state unset if the fork is absent.
> 
> Yeah, OK, I think this is because we are combining attribute
> creation with inode creation. When log recovery replays inode core
> modifications, it replays the inode state into the cluster buffer
> and writes it. Then when we go to replay the attr intent at the end
> of recovery, the inode is read from disk via xlog_recover_iget(),
> but we don't initialise the attr fork because ip->i_forkoff is zero.
> i.e. it has no attrs at this point.
> 
> I suspect that we could catch that in xlog_recover_iget() when it is
> called from attr recovery. i.e. we detect newly created inodes and
> initialise the attr fork similar to what we do in
> xfs_init_new_inode(). I was thinking something like this:
> 
> 	if (init_xattrs && xfs_has_attr(mp)) {
> 		if (!ip->i_forkoff && !ip->i_nextents) {
> 			ip->i_forkoff = xfs_default_attroffset(ip) >> 3;
> 			ip->i_afp = xfs_ifork_alloc(XFS_DINODE_FMT_EXTENTS, 0);
> 		} else {
> 			ASSERT(ip->i_afp);
> 		}
> 	}
> 
> Would do the trick, but then I realised that the timing/ordering is
> very different to runtime: we don't replay the attr intent until the
> end of log recovery and all the inode changes have been replayed
> into the inode cluster buffer. That means we could have already
> replayed a bunch of data fork extent modifications into the inode,
> and so the default attr offset is almost certainly not a safe thing
> to be using here. Indeed, there might not be space in the inode for
> the attr we want to insert and so we might need to convert the data
> fork to a different format before we run the attr intent replay.

Ok, so after further thought, I don't think this can happen. If we
are replaying an attr intent, it means we crashed before the intent
done was recorded in the log. At this point in time the inode was
locked and so there could be no racing changes to the inode data
fork in the log. i.e. because of log item ordering, if the intent
done is not in the log, none of the future changes that occurred
after the intent done will be in the log, either. The inode on disk
will not contain them either because the intent done must be in the
log before the inode gets unpinned and is able to be written to
disk.

Hence if we've got an attr intent to replay, it must be the last
active modification to that inode that must be replayed, and the
state of the inode on disk at the time of recovering the attr intent
should match the state of the inode in memory at the time the attr
intent was started.

Hence there isn't a consistency model coherency problem here, and
that means if there's no attr fork at the time the attr recovery is
started, it *must* be a newly created inode. If the inode already
existed and a transaction had to be run to create the attr fork
(i.e. xfs_bmap_add_attrfork() had to be run) then that transaction
would have been recovered from the log before attr replay started,
and so xlog_recovery_iget() should see a non-zero ip->i_forkoff and
initialise the attr fork correctly.

But this makes me wonder further. If the attr intent is logged in
the same transaction as the inode is allocated, then the setting of
ip->i_forkoff in xfs_init_new_inode() should also be logged in that
transaction (because XFS_ILOG_CORE is used) and hence be replayed
into the on-disk inode by recovery before the attr intent recovery
starts. Hence xlog_recover_iget() should be initialising the attr
fork through this path:

xlog_recover_iget
  xfs_iget
    xfs_iget_cache_miss
      xfs_inode_from_disk
	if (ip->i_forkoff)
	  xfs_iformat_attr_fork()

This means the newly allocated inode would have an attr fork
allocated to it, in extent format with zero extents. If we then look
at xfs_inode_hasattr():

int
xfs_inode_hasattr(
        struct xfs_inode        *ip)
{
        if (!XFS_IFORK_Q(ip))
                return 0;
        if (!ip->i_afp)
                return 0;
>>>>>   if (ip->i_afp->if_format == XFS_DINODE_FMT_EXTENTS &&
>>>>>       ip->i_afp->if_nextents == 0)
>>>>>           return 0;
        return 1;
}

It would still say that it has no attrs even though the fork has
been initialised and we go down the xfs_attr_init_add_state()
branch.  This does:

static inline enum xfs_delattr_state
xfs_attr_init_add_state(struct xfs_da_args *args)
{
        /*
         * When called from the completion of a attr remove to determine the
         * next state, the attribute fork may be null. This can occur only occur
         * on a pure remove, but we grab the next state before we check if a
         * replace operation is being performed. If we are called from any other
         * context, i_afp is guaranteed to exist. Hence if the attr fork is
         * null, we were called from a pure remove operation and so we are done.
         */
>>>>    if (!args->dp->i_afp)
>>>>            return XFS_DAS_DONE;

        args->op_flags |= XFS_DA_OP_ADDNAME;
        if (xfs_attr_is_shortform(args->dp))
                return XFS_DAS_SF_ADD;
        if (xfs_attr_is_leaf(args->dp))
                return XFS_DAS_LEAF_ADD;
        return XFS_DAS_NODE_ADD;
}

Which would return XFS_DAS_DONE if there was no attr fork
initialised and recovery would be skipped completely. Hence for this
change to be required to trigger failures:

> > @@ -741,8 +741,8 @@ xfs_attr_set_iter(
> >             fallthrough;
> >     case XFS_DAS_SF_ADD:
> >             if (!args->dp->i_afp) {
> > -                   attr->xattri_dela_state = XFS_DAS_CREATE_FORK;
> > -                   goto next_state;
> > +//                 attr->xattri_dela_state = XFS_DAS_CREATE_FORK;
> > +//                 goto next_state;
> >             }

then it implies the only way we can get here is via a replace
operation that has removed the attr fork during the remove and we
hit this on the attr add. Yet, AFAICT, the attr fork does not get
removed when a replace operation is in progress.

Maybe there's a new bug introduced in the PP patchset that triggers
this - I'll do some more looking...

> > > This should avoid the need for parent pointers to ever need to run
> > > an extra transaction to create the attr fork. Hence, AFAICT, this
> > > new state to handle attr fork creation shouldn't ever be needed for
> > > parent pointers....
> > > 
> > > What am I missing?
> > > 
> > I hope the description helped?  I'll do some more poking around too and
> > post back if I find anything else.
> 
> Yup, it most definitely helped. :)
> 
> You've pointed out something I had completely missed w.r.t. attr
> intent replay ordering against replay of data fork modifications.
> There's definitely an issue here, I think it might be a fundamental
> issue with the recovery mechanism (and not parent pointers), and I
> think we'll end up needing  something like this patch to fix it.
> Let me bounce this around my head for a bit...

In summary, after further thought this turns out not to be an issue
at all, so I'm back to "replace doesn't remove the attr fork, so
how does this happen?"....

Cheers,

Dave.
Allison Henderson June 29, 2022, 6:33 a.m. UTC | #5
On Thu, 2022-06-16 at 15:32 +1000, Dave Chinner wrote:
> On Thu, Jun 16, 2022 at 12:08:43PM +1000, Dave Chinner wrote:
> > On Wed, Jun 15, 2022 at 04:40:07PM -0700, Alli wrote:
> > > On Wed, 2022-06-15 at 11:09 +1000, Dave Chinner wrote:
> > > > On Sat, Jun 11, 2022 at 02:41:44AM -0700, Allison Henderson
> > > > wrote:
> > > > > Recent parent pointer testing has exposed a bug in the
> > > > > underlying
> > > > > larp state machine.  A replace operation may remove an old
> > > > > attr
> > > > > before adding the new one, but if it is the only attr in the
> > > > > fork,
> > > > > then the fork is removed.  This later causes a null pointer
> > > > > in
> > > > > xfs_attr_try_sf_addname which expects the fork present.  This
> > > > > patch adds an extra state to create the fork.
> > > > 
> > > > Hmmmm.
> > > > 
> > > > I thought I fixed those problems - in xfs_attr_sf_removename()
> > > > there
> > > > is this code:
> > > > 
> > > >         if (totsize == sizeof(xfs_attr_sf_hdr_t) &&
> > > > xfs_has_attr2(mp)
> > > > &&
> > > >             (dp->i_df.if_format != XFS_DINODE_FMT_BTREE) &&
> > > >             !(args->op_flags & (XFS_DA_OP_ADDNAME |
> > > > XFS_DA_OP_REPLACE))) {
> > > >                 xfs_attr_fork_remove(dp, args->trans);
> > > Hmm, ok, let me shuffle in some traces around there to see where
> > > things
> > > fall off the rails
> > > 
> > > > A replace operation will have XFS_DA_OP_REPLACE set, and so the
> > > > final remove from a sf directory will not remove the attr fork
> > > > in
> > > > this case. There is equivalent checks in the leaf/node remove
> > > > name
> > > > paths to avoid removing the attr fork if the last attr is
> > > > removed
> > > > while the attr fork is in those formats.
> > > > 
> > > > How do you reproduce this issue?
> > > > 
> > > 
> > > Sure, you can apply this kernel set or download it here:
> > > https://urldefense.com/v3/__https://github.com/allisonhenderson/xfs/tree/xfs_new_pptrs__;!!ACWV5N9M2RV99hQ!O7bhq9bR_z2xjlhlwWb78ZXwzigOh6P8V3_EkeL9AHFOpdYdr_irAwUocygT_G8LI-lKDL5_01Df49lTy27a$ 
> > > 
> > > Next you'll need this xfsprogs that has the neccassary updates to
> > > run
> > > parent pointers
> > > https://urldefense.com/v3/__https://github.com/allisonhenderson/xfsprogs/tree/xfsprogs_new_pptrs__;!!ACWV5N9M2RV99hQ!O7bhq9bR_z2xjlhlwWb78ZXwzigOh6P8V3_EkeL9AHFOpdYdr_irAwUocygT_G8LI-lKDL5_01Df4w3Mct8F$ 
> > > 
> > > 
> > > To reproduce the bug, you'll need to apply a quick patch on the
> > > kernel
> > > side:
> > > diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> > > index b86188b63897..f279afd43462 100644
> > > --- a/fs/xfs/libxfs/xfs_attr.c
> > > +++ b/fs/xfs/libxfs/xfs_attr.c
> > > @@ -741,8 +741,8 @@ xfs_attr_set_iter(
> > >  		fallthrough;
> > >  	case XFS_DAS_SF_ADD:
> > >  		if (!args->dp->i_afp) {
> > > -			attr->xattri_dela_state = XFS_DAS_CREATE_FORK;
> > > -			goto next_state;
> > > +//			attr->xattri_dela_state =
> > > XFS_DAS_CREATE_FORK;
> > > +//			goto next_state;
> > >  		}
> > 
> > Ah, so it's recovery that trips this....
> > 
> > > [  365.290048]  xfs_attr_try_sf_addname+0x2a/0xd0 [xfs]
> > > [  365.290423]  xfs_attr_set_iter+0x2f9/0x1510 [xfs]
> > > [  365.291592]  xfs_xattri_finish_update+0x66/0xd0 [xfs]
> > > [  365.292008]  xfs_attr_finish_item+0x43/0x120 [xfs]
> > > [  365.292410]  xfs_defer_finish_noroll+0x3c2/0xcc0 [xfs]
> > > [  365.293196]  __xfs_trans_commit+0x333/0x610 [xfs]
> > > [  365.294401]  xfs_trans_commit+0x10/0x20 [xfs]
> > > [  365.294797]  xlog_finish_defer_ops+0x133/0x270 [xfs]
> > > [  365.296054]  xlog_recover_process_intents+0x1f7/0x3e0 [xfs]
> > 
> > ayup.
> > 
> > > > > Additionally the new state will be used by parent pointers
> > > > > which
> > > > > need to add attributes to newly created inodes that do not
> > > > > yet
> > > > > have a fork.
> > > > 
> > > > We already have the capability of doing that in
> > > > xfs_init_new_inode()
> > > > by passing in init_xattrs == true. So when we are creating a
> > > > new
> > > > inode with parent pointers enabled, we know that we are going
> > > > to be
> > > > creating an xattr on the inode and so we should always set
> > > > init_xattrs in that case.
> > > Hmm, ok.  I'll add some tracing around in there too, if I back
> > > out the
> > > entire first patch, we crash out earlier in recovery path because
> > > no
> > > state is set.  If we enter xfs_attri_item_recover with no fork,
> > > we end
> > > up in the following switch:
> > > 
> > > 
> > >         case XFS_ATTRI_OP_FLAGS_REPLACE:
> > >                 args->value = nv- >value.i_addr;
> > >                 args->valuelen = nv- >value.i_len;
> > >                 args->total = xfs_attr_calc_size(args, &local);
> > >                 if (xfs_inode_hasattr(args- >dp))
> > >                         attr->xattri_dela_state =
> > > xfs_attr_init_replace_state(args);
> > >                 else
> > >                         attr->xattri_dela_state =
> > > xfs_attr_init_add_state(args);
> > >                 break;
> > > 
> > > Which will leave the state unset if the fork is absent.
> > 
> > Yeah, OK, I think this is because we are combining attribute
> > creation with inode creation. When log recovery replays inode core
> > modifications, it replays the inode state into the cluster buffer
> > and writes it. Then when we go to replay the attr intent at the end
> > of recovery, the inode is read from disk via xlog_recover_iget(),
> > but we don't initialise the attr fork because ip->i_forkoff is
> > zero.
> > i.e. it has no attrs at this point.
> > 
> > I suspect that we could catch that in xlog_recover_iget() when it
> > is
> > called from attr recovery. i.e. we detect newly created inodes and
> > initialise the attr fork similar to what we do in
> > xfs_init_new_inode(). I was thinking something like this:
> > 
> > 	if (init_xattrs && xfs_has_attr(mp)) {
> > 		if (!ip->i_forkoff && !ip->i_nextents) {
> > 			ip->i_forkoff = xfs_default_attroffset(ip) >>
> > 3;
> > 			ip->i_afp =
> > xfs_ifork_alloc(XFS_DINODE_FMT_EXTENTS, 0);
> > 		} else {
> > 			ASSERT(ip->i_afp);
> > 		}
> > 	}
> > 
> > Would do the trick, but then I realised that the timing/ordering is
> > very different to runtime: we don't replay the attr intent until
> > the
> > end of log recovery and all the inode changes have been replayed
> > into the inode cluster buffer. That means we could have already
> > replayed a bunch of data fork extent modifications into the inode,
> > and so the default attr offset is almost certainly not a safe thing
> > to be using here. Indeed, there might not be space in the inode for
> > the attr we want to insert and so we might need to convert the data
> > fork to a different format before we run the attr intent replay.
> 
> Ok, so after further thought, I don't think this can happen. If we
> are replaying an attr intent, it means we crashed before the intent
> done was recorded in the log. At this point in time the inode was
> locked and so there could be no racing changes to the inode data
> fork in the log. i.e. because of log item ordering, if the intent
> done is not in the log, none of the future changes that occurred
> after the intent done will be in the log, either. The inode on disk
> will not contain them either because the intent done must be in the
> log before the inode gets unpinned and is able to be written to
> disk.
> 
> Hence if we've got an attr intent to replay, it must be the last
> active modification to that inode that must be replayed, and the
> state of the inode on disk at the time of recovering the attr intent
> should match the state of the inode in memory at the time the attr
> intent was started.
> 
> Hence there isn't a consistency model coherency problem here, and
> that means if there's no attr fork at the time the attr recovery is
> started, it *must* be a newly created inode. If the inode already
> existed and a transaction had to be run to create the attr fork
> (i.e. xfs_bmap_add_attrfork() had to be run) then that transaction
> would have been recovered from the log before attr replay started,
> and so xlog_recovery_iget() should see a non-zero ip->i_forkoff and
> initialise the attr fork correctly.
> 
> But this makes me wonder further. If the attr intent is logged in
> the same transaction as the inode is allocated, then the setting of
> ip->i_forkoff in xfs_init_new_inode() should also be logged in that
> transaction (because XFS_ILOG_CORE is used) and hence be replayed
> into the on-disk inode by recovery before the attr intent recovery
> starts. Hence xlog_recover_iget() should be initialising the attr
> fork through this path:
> 
> xlog_recover_iget
>   xfs_iget
>     xfs_iget_cache_miss
>       xfs_inode_from_disk
> 	if (ip->i_forkoff)
> 	  xfs_iformat_attr_fork()
> 
> This means the newly allocated inode would have an attr fork
> allocated to it, in extent format with zero extents. If we then look
> at xfs_inode_hasattr():
> 
> int
> xfs_inode_hasattr(
>         struct xfs_inode        *ip)
> {
>         if (!XFS_IFORK_Q(ip))
>                 return 0;
>         if (!ip->i_afp)
>                 return 0;
> > > > > >   if (ip->i_afp->if_format == XFS_DINODE_FMT_EXTENTS &&
> > > > > >       ip->i_afp->if_nextents == 0)
> > > > > >           return 0;
>         return 1;
> }
> 
> It would still say that it has no attrs even though the fork has
> been initialised and we go down the xfs_attr_init_add_state()
> branch.  This does:
> 
> static inline enum xfs_delattr_state
> xfs_attr_init_add_state(struct xfs_da_args *args)
> {
>         /*
>          * When called from the completion of a attr remove to
> determine the
>          * next state, the attribute fork may be null. This can occur
> only occur
>          * on a pure remove, but we grab the next state before we
> check if a
>          * replace operation is being performed. If we are called
> from any other
>          * context, i_afp is guaranteed to exist. Hence if the attr
> fork is
>          * null, we were called from a pure remove operation and so
> we are done.
>          */
> > > > >    if (!args->dp->i_afp)
> > > > >            return XFS_DAS_DONE;
> 
>         args->op_flags |= XFS_DA_OP_ADDNAME;
>         if (xfs_attr_is_shortform(args->dp))
>                 return XFS_DAS_SF_ADD;
>         if (xfs_attr_is_leaf(args->dp))
>                 return XFS_DAS_LEAF_ADD;
>         return XFS_DAS_NODE_ADD;
> }
> 
> Which would return XFS_DAS_DONE if there was no attr fork
> initialised and recovery would be skipped completely. Hence for this
> change to be required to trigger failures:
> 
> > > @@ -741,8 +741,8 @@ xfs_attr_set_iter(
> > >             fallthrough;
> > >     case XFS_DAS_SF_ADD:
> > >             if (!args->dp->i_afp) {
> > > -                   attr->xattri_dela_state =
> > > XFS_DAS_CREATE_FORK;
> > > -                   goto next_state;
> > > +//                 attr->xattri_dela_state =
> > > XFS_DAS_CREATE_FORK;
> > > +//                 goto next_state;
> > >             }
> 
> then it implies the only way we can get here is via a replace
> operation that has removed the attr fork during the remove and we
> hit this on the attr add. Yet, AFAICT, the attr fork does not get
> removed when a replace operation is in progress.
> 
> Maybe there's a new bug introduced in the PP patchset that triggers
> this - I'll do some more looking...
> 
> > > > This should avoid the need for parent pointers to ever need to
> > > > run
> > > > an extra transaction to create the attr fork. Hence, AFAICT,
> > > > this
> > > > new state to handle attr fork creation shouldn't ever be needed
> > > > for
> > > > parent pointers....
> > > > 
> > > > What am I missing?
> > > > 
> > > I hope the description helped?  I'll do some more poking around
> > > too and
> > > post back if I find anything else.
> > 
> > Yup, it most definitely helped. :)
> > 
> > You've pointed out something I had completely missed w.r.t. attr
> > intent replay ordering against replay of data fork modifications.
> > There's definitely an issue here, I think it might be a fundamental
> > issue with the recovery mechanism (and not parent pointers), and I
> > think we'll end up needing  something like this patch to fix it.
> > Let me bounce this around my head for a bit...
> 
> In summary, after further thought this turns out not to be an issue
> at all, so I'm back to "replace doesn't remove the attr fork, so
> how does this happen?"....
Just a follow up here...

So, I think what is happening is we are getting interleaved replays.
When a parent changes, the parent pointer attribute needs to be
updated.  But since this changes the attribute name, it cant be an attr
replace.  So we queue up an attr set and then a remove (in xfs_rename).
The test case sets the inject which starts a replay of a set and
remove.  The set expands the leaf, and bounces out the EAGAIN
into xfs_attri_item_recover.  We then fall into this error handling:

        ret = xfs_xattri_finish_update(attr,
done_item);                        
        if (ret == -EAGAIN)
{                                                   
                /* There's more work to do, so add it to this
transaction */    
                xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr-
>xattri_list);
        }

Which I think is queuing up the rest of the set operation, but on the
other side of the remove operation.  The remove operation removes the
fork, and we get the null pointer when the set operation resumes.

I havnt quite worked out a fix yet, but I'm pretty sure that's the
problem description....  I think what we're going to need is some sort
of defer_push operation or something similar.

Allison

> 
> Cheers,
> 
> Dave.
>
Darrick J. Wong June 30, 2022, 12:40 a.m. UTC | #6
On Tue, Jun 28, 2022 at 11:33:49PM -0700, Alli wrote:
> On Thu, 2022-06-16 at 15:32 +1000, Dave Chinner wrote:
> > On Thu, Jun 16, 2022 at 12:08:43PM +1000, Dave Chinner wrote:
> > > On Wed, Jun 15, 2022 at 04:40:07PM -0700, Alli wrote:
> > > > On Wed, 2022-06-15 at 11:09 +1000, Dave Chinner wrote:
> > > > > On Sat, Jun 11, 2022 at 02:41:44AM -0700, Allison Henderson
> > > > > wrote:
> > > > > > Recent parent pointer testing has exposed a bug in the
> > > > > > underlying
> > > > > > larp state machine.  A replace operation may remove an old
> > > > > > attr
> > > > > > before adding the new one, but if it is the only attr in the
> > > > > > fork,
> > > > > > then the fork is removed.  This later causes a null pointer
> > > > > > in
> > > > > > xfs_attr_try_sf_addname which expects the fork present.  This
> > > > > > patch adds an extra state to create the fork.
> > > > > 
> > > > > Hmmmm.
> > > > > 
> > > > > I thought I fixed those problems - in xfs_attr_sf_removename()
> > > > > there
> > > > > is this code:
> > > > > 
> > > > >         if (totsize == sizeof(xfs_attr_sf_hdr_t) &&
> > > > > xfs_has_attr2(mp)
> > > > > &&
> > > > >             (dp->i_df.if_format != XFS_DINODE_FMT_BTREE) &&
> > > > >             !(args->op_flags & (XFS_DA_OP_ADDNAME |
> > > > > XFS_DA_OP_REPLACE))) {
> > > > >                 xfs_attr_fork_remove(dp, args->trans);
> > > > Hmm, ok, let me shuffle in some traces around there to see where
> > > > things
> > > > fall off the rails
> > > > 
> > > > > A replace operation will have XFS_DA_OP_REPLACE set, and so the
> > > > > final remove from a sf directory will not remove the attr fork
> > > > > in
> > > > > this case. There is equivalent checks in the leaf/node remove
> > > > > name
> > > > > paths to avoid removing the attr fork if the last attr is
> > > > > removed
> > > > > while the attr fork is in those formats.
> > > > > 
> > > > > How do you reproduce this issue?
> > > > > 
> > > > 
> > > > Sure, you can apply this kernel set or download it here:
> > > > https://urldefense.com/v3/__https://github.com/allisonhenderson/xfs/tree/xfs_new_pptrs__;!!ACWV5N9M2RV99hQ!O7bhq9bR_z2xjlhlwWb78ZXwzigOh6P8V3_EkeL9AHFOpdYdr_irAwUocygT_G8LI-lKDL5_01Df49lTy27a$ 
> > > > 
> > > > Next you'll need this xfsprogs that has the neccassary updates to
> > > > run
> > > > parent pointers
> > > > https://urldefense.com/v3/__https://github.com/allisonhenderson/xfsprogs/tree/xfsprogs_new_pptrs__;!!ACWV5N9M2RV99hQ!O7bhq9bR_z2xjlhlwWb78ZXwzigOh6P8V3_EkeL9AHFOpdYdr_irAwUocygT_G8LI-lKDL5_01Df4w3Mct8F$ 
> > > > 
> > > > 
> > > > To reproduce the bug, you'll need to apply a quick patch on the
> > > > kernel
> > > > side:
> > > > diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> > > > index b86188b63897..f279afd43462 100644
> > > > --- a/fs/xfs/libxfs/xfs_attr.c
> > > > +++ b/fs/xfs/libxfs/xfs_attr.c
> > > > @@ -741,8 +741,8 @@ xfs_attr_set_iter(
> > > >  		fallthrough;
> > > >  	case XFS_DAS_SF_ADD:
> > > >  		if (!args->dp->i_afp) {
> > > > -			attr->xattri_dela_state = XFS_DAS_CREATE_FORK;
> > > > -			goto next_state;
> > > > +//			attr->xattri_dela_state =
> > > > XFS_DAS_CREATE_FORK;
> > > > +//			goto next_state;
> > > >  		}
> > > 
> > > Ah, so it's recovery that trips this....
> > > 
> > > > [  365.290048]  xfs_attr_try_sf_addname+0x2a/0xd0 [xfs]
> > > > [  365.290423]  xfs_attr_set_iter+0x2f9/0x1510 [xfs]
> > > > [  365.291592]  xfs_xattri_finish_update+0x66/0xd0 [xfs]
> > > > [  365.292008]  xfs_attr_finish_item+0x43/0x120 [xfs]
> > > > [  365.292410]  xfs_defer_finish_noroll+0x3c2/0xcc0 [xfs]
> > > > [  365.293196]  __xfs_trans_commit+0x333/0x610 [xfs]
> > > > [  365.294401]  xfs_trans_commit+0x10/0x20 [xfs]
> > > > [  365.294797]  xlog_finish_defer_ops+0x133/0x270 [xfs]
> > > > [  365.296054]  xlog_recover_process_intents+0x1f7/0x3e0 [xfs]
> > > 
> > > ayup.
> > > 
> > > > > > Additionally the new state will be used by parent pointers
> > > > > > which
> > > > > > need to add attributes to newly created inodes that do not
> > > > > > yet
> > > > > > have a fork.
> > > > > 
> > > > > We already have the capability of doing that in
> > > > > xfs_init_new_inode()
> > > > > by passing in init_xattrs == true. So when we are creating a
> > > > > new
> > > > > inode with parent pointers enabled, we know that we are going
> > > > > to be
> > > > > creating an xattr on the inode and so we should always set
> > > > > init_xattrs in that case.
> > > > Hmm, ok.  I'll add some tracing around in there too, if I back
> > > > out the
> > > > entire first patch, we crash out earlier in recovery path because
> > > > no
> > > > state is set.  If we enter xfs_attri_item_recover with no fork,
> > > > we end
> > > > up in the following switch:
> > > > 
> > > > 
> > > >         case XFS_ATTRI_OP_FLAGS_REPLACE:
> > > >                 args->value = nv- >value.i_addr;
> > > >                 args->valuelen = nv- >value.i_len;
> > > >                 args->total = xfs_attr_calc_size(args, &local);
> > > >                 if (xfs_inode_hasattr(args- >dp))
> > > >                         attr->xattri_dela_state =
> > > > xfs_attr_init_replace_state(args);
> > > >                 else
> > > >                         attr->xattri_dela_state =
> > > > xfs_attr_init_add_state(args);
> > > >                 break;
> > > > 
> > > > Which will leave the state unset if the fork is absent.
> > > 
> > > Yeah, OK, I think this is because we are combining attribute
> > > creation with inode creation. When log recovery replays inode core
> > > modifications, it replays the inode state into the cluster buffer
> > > and writes it. Then when we go to replay the attr intent at the end
> > > of recovery, the inode is read from disk via xlog_recover_iget(),
> > > but we don't initialise the attr fork because ip->i_forkoff is
> > > zero.
> > > i.e. it has no attrs at this point.
> > > 
> > > I suspect that we could catch that in xlog_recover_iget() when it
> > > is
> > > called from attr recovery. i.e. we detect newly created inodes and
> > > initialise the attr fork similar to what we do in
> > > xfs_init_new_inode(). I was thinking something like this:
> > > 
> > > 	if (init_xattrs && xfs_has_attr(mp)) {
> > > 		if (!ip->i_forkoff && !ip->i_nextents) {
> > > 			ip->i_forkoff = xfs_default_attroffset(ip) >>
> > > 3;
> > > 			ip->i_afp =
> > > xfs_ifork_alloc(XFS_DINODE_FMT_EXTENTS, 0);
> > > 		} else {
> > > 			ASSERT(ip->i_afp);
> > > 		}
> > > 	}
> > > 
> > > Would do the trick, but then I realised that the timing/ordering is
> > > very different to runtime: we don't replay the attr intent until
> > > the
> > > end of log recovery and all the inode changes have been replayed
> > > into the inode cluster buffer. That means we could have already
> > > replayed a bunch of data fork extent modifications into the inode,
> > > and so the default attr offset is almost certainly not a safe thing
> > > to be using here. Indeed, there might not be space in the inode for
> > > the attr we want to insert and so we might need to convert the data
> > > fork to a different format before we run the attr intent replay.
> > 
> > Ok, so after further thought, I don't think this can happen. If we
> > are replaying an attr intent, it means we crashed before the intent
> > done was recorded in the log. At this point in time the inode was
> > locked and so there could be no racing changes to the inode data
> > fork in the log. i.e. because of log item ordering, if the intent
> > done is not in the log, none of the future changes that occurred
> > after the intent done will be in the log, either. The inode on disk
> > will not contain them either because the intent done must be in the
> > log before the inode gets unpinned and is able to be written to
> > disk.
> > 
> > Hence if we've got an attr intent to replay, it must be the last
> > active modification to that inode that must be replayed, and the
> > state of the inode on disk at the time of recovering the attr intent
> > should match the state of the inode in memory at the time the attr
> > intent was started.
> > 
> > Hence there isn't a consistency model coherency problem here, and
> > that means if there's no attr fork at the time the attr recovery is
> > started, it *must* be a newly created inode. If the inode already
> > existed and a transaction had to be run to create the attr fork
> > (i.e. xfs_bmap_add_attrfork() had to be run) then that transaction
> > would have been recovered from the log before attr replay started,
> > and so xlog_recovery_iget() should see a non-zero ip->i_forkoff and
> > initialise the attr fork correctly.
> > 
> > But this makes me wonder further. If the attr intent is logged in
> > the same transaction as the inode is allocated, then the setting of
> > ip->i_forkoff in xfs_init_new_inode() should also be logged in that
> > transaction (because XFS_ILOG_CORE is used) and hence be replayed
> > into the on-disk inode by recovery before the attr intent recovery
> > starts. Hence xlog_recover_iget() should be initialising the attr
> > fork through this path:
> > 
> > xlog_recover_iget
> >   xfs_iget
> >     xfs_iget_cache_miss
> >       xfs_inode_from_disk
> > 	if (ip->i_forkoff)
> > 	  xfs_iformat_attr_fork()
> > 
> > This means the newly allocated inode would have an attr fork
> > allocated to it, in extent format with zero extents. If we then look
> > at xfs_inode_hasattr():
> > 
> > int
> > xfs_inode_hasattr(
> >         struct xfs_inode        *ip)
> > {
> >         if (!XFS_IFORK_Q(ip))
> >                 return 0;
> >         if (!ip->i_afp)
> >                 return 0;
> > > > > > >   if (ip->i_afp->if_format == XFS_DINODE_FMT_EXTENTS &&
> > > > > > >       ip->i_afp->if_nextents == 0)
> > > > > > >           return 0;
> >         return 1;
> > }
> > 
> > It would still say that it has no attrs even though the fork has
> > been initialised and we go down the xfs_attr_init_add_state()
> > branch.  This does:
> > 
> > static inline enum xfs_delattr_state
> > xfs_attr_init_add_state(struct xfs_da_args *args)
> > {
> >         /*
> >          * When called from the completion of a attr remove to
> > determine the
> >          * next state, the attribute fork may be null. This can occur
> > only occur
> >          * on a pure remove, but we grab the next state before we
> > check if a
> >          * replace operation is being performed. If we are called
> > from any other
> >          * context, i_afp is guaranteed to exist. Hence if the attr
> > fork is
> >          * null, we were called from a pure remove operation and so
> > we are done.
> >          */
> > > > > >    if (!args->dp->i_afp)
> > > > > >            return XFS_DAS_DONE;
> > 
> >         args->op_flags |= XFS_DA_OP_ADDNAME;
> >         if (xfs_attr_is_shortform(args->dp))
> >                 return XFS_DAS_SF_ADD;
> >         if (xfs_attr_is_leaf(args->dp))
> >                 return XFS_DAS_LEAF_ADD;
> >         return XFS_DAS_NODE_ADD;
> > }
> > 
> > Which would return XFS_DAS_DONE if there was no attr fork
> > initialised and recovery would be skipped completely. Hence for this
> > change to be required to trigger failures:
> > 
> > > > @@ -741,8 +741,8 @@ xfs_attr_set_iter(
> > > >             fallthrough;
> > > >     case XFS_DAS_SF_ADD:
> > > >             if (!args->dp->i_afp) {
> > > > -                   attr->xattri_dela_state =
> > > > XFS_DAS_CREATE_FORK;
> > > > -                   goto next_state;
> > > > +//                 attr->xattri_dela_state =
> > > > XFS_DAS_CREATE_FORK;
> > > > +//                 goto next_state;
> > > >             }
> > 
> > then it implies the only way we can get here is via a replace
> > operation that has removed the attr fork during the remove and we
> > hit this on the attr add. Yet, AFAICT, the attr fork does not get
> > removed when a replace operation is in progress.
> > 
> > Maybe there's a new bug introduced in the PP patchset that triggers
> > this - I'll do some more looking...
> > 
> > > > > This should avoid the need for parent pointers to ever need to
> > > > > run
> > > > > an extra transaction to create the attr fork. Hence, AFAICT,
> > > > > this
> > > > > new state to handle attr fork creation shouldn't ever be needed
> > > > > for
> > > > > parent pointers....
> > > > > 
> > > > > What am I missing?
> > > > > 
> > > > I hope the description helped?  I'll do some more poking around
> > > > too and
> > > > post back if I find anything else.
> > > 
> > > Yup, it most definitely helped. :)
> > > 
> > > You've pointed out something I had completely missed w.r.t. attr
> > > intent replay ordering against replay of data fork modifications.
> > > There's definitely an issue here, I think it might be a fundamental
> > > issue with the recovery mechanism (and not parent pointers), and I
> > > think we'll end up needing  something like this patch to fix it.
> > > Let me bounce this around my head for a bit...
> > 
> > In summary, after further thought this turns out not to be an issue
> > at all, so I'm back to "replace doesn't remove the attr fork, so
> > how does this happen?"....
> Just a follow up here...
> 
> So, I think what is happening is we are getting interleaved replays.
> When a parent changes, the parent pointer attribute needs to be
> updated.  But since this changes the attribute name, it cant be an attr
> replace.  So we queue up an attr set and then a remove (in xfs_rename).
> The test case sets the inject which starts a replay of a set and
> remove.  The set expands the leaf, and bounces out the EAGAIN
> into xfs_attri_item_recover.  We then fall into this error handling:
> 
>         ret = xfs_xattri_finish_update(attr,
> done_item);                        
>         if (ret == -EAGAIN)
> {                                                   
>                 /* There's more work to do, so add it to this
> transaction */    
>                 xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr-
> >xattri_list);
>         }
> 
> Which I think is queuing up the rest of the set operation, but on the
> other side of the remove operation.  The remove operation removes the
> fork, and we get the null pointer when the set operation resumes.
> 
> I havnt quite worked out a fix yet, but I'm pretty sure that's the
> problem description....  I think what we're going to need is some sort
> of defer_push operation or something similar.

Hm.  To summarize (poorly) a conversation we had on IRC, I guess the
problem here is that xfs_rename logs both an attr-set and an attr-remove
log item to a rename transaction.  Then log recovery finds both, starts
the first step of the attr-set, and we capture-and-commit the rest of
the chain.  Next it recovers the the attr-remove, starts the first step
of /that/ (which clobbers the attr fork) and capture-and-commits that.
Finally, we continue the attr-set, which promptly barfs because the attr
fork disappeared on it!

So I guess this is a gaping hole in how log recovery works because we
don't write enough information into the log intent items to be able to
reconstruct the full defer ops chains that were attached to a
transaction.  If we could do that, then we'd know that there exists a
data dependency between the set and the remove such that we must finish
the entire set chain before moving on to the remove chain.

However, we don't know this.  Thus far in the lifespan of defer ops, I
think we've been careful/lucky enough that we've never had this problem,
and reworking the log format to encode transaction chain details is
quite a lot of work.

---

A second thought -- the xattri log recovery routine recreates the defer
ops state, logs an xattrd for the recovered item, takes one step in the
state machine, and uses capture-and-commit to defer any more steps to a
later time.  What if instead the recovery routine didn't take that extra
step?  If all it did was recreate the incore defer ops state and
deferred everything else to xlog_finish_defer_ops, the attr set and attr
remove recoveries would not overlap and trip over each other.

Maybe a POC would be to have the _recover function log the xattrd for
the recovered log item and call xfs_defer_ops_capture_and_commit without
actually bothering with xfs_attri_finish_update?  Something like this?

	xfs_init_attr_trans(args, &tres, &total);
	error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE,
			&tp);
	if (error)
		goto out;

	args->trans = tp;
	done_item = xfs_trans_get_attrd(tp, attrip);

	xfs_ilock(ip, XFS_ILOCK_EXCL);
	xfs_trans_ijoin(tp, ip, 0);

	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
	error = xfs_defer_ops_capture_and_commit(tp, capture_list);
	if (error)
		goto out_unlock;

	xfs_iunlock(ip, XFS_ILOCK_EXCL);
	xfs_irele(ip);
	return 0;

---

A third thought I had for solving this problem would be to extend the
xattr log format to be able to record a "rename and rewrite" operation.
I only suggest this because directory online repair currently works by
creating a temporary file, using the regular xfs_dir_createname function
to build the directory structures (without touching the children) and
then uses swapext to exchange the directory structure blocks.  When
parent pointers come along, I guess we'll have to figure out how to
insert dirents at specific offsets, or we'll have to figure out how to
update all the parent pointers atomically.  Yuck.

While a "rename and reset" xattr op would work here, I suppose the
online repair case is a lot simpler -- it never needs to update the
dirent name, so all we really have to do is rewrite the xattr key,
remove the dabtree hash mapping, and insert a new one.

Hmm ok I'm going to go eat dinner and think about my second suggestion
tomorrow.

--D

> Allison
> 
> > 
> > Cheers,
> > 
> > Dave.
> > 
>
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index 836ab1b8ed7b..a94850d9b8b1 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -719,15 +719,31 @@  xfs_attr_set_iter(
 	struct xfs_attr_intent		*attr)
 {
 	struct xfs_da_args              *args = attr->xattri_da_args;
+	int				sf_size;
 	int				error = 0;
 
 	/* State machine switch */
+
 next_state:
 	switch (attr->xattri_dela_state) {
 	case XFS_DAS_UNINIT:
 		ASSERT(0);
 		return -EFSCORRUPTED;
+	case XFS_DAS_CREATE_FORK:
+		sf_size = sizeof(struct xfs_attr_sf_hdr) +
+			  xfs_attr_sf_entsize_byname(args->namelen,
+						     args->valuelen);
+		error = xfs_bmap_set_attrforkoff(args->dp, sf_size, NULL);
+		if (error)
+			return error;
+		args->dp->i_afp = kmem_cache_zalloc(xfs_ifork_cache, 0);
+		args->dp->i_afp->if_format = XFS_DINODE_FMT_EXTENTS;
+		fallthrough;
 	case XFS_DAS_SF_ADD:
+		if (!args->dp->i_afp) {
+			attr->xattri_dela_state = XFS_DAS_CREATE_FORK;
+			goto next_state;
+		}
 		return xfs_attr_sf_addname(attr);
 	case XFS_DAS_LEAF_ADD:
 		return xfs_attr_leaf_addname(attr);
@@ -920,8 +936,10 @@  xfs_attr_defer_add(
 	error = xfs_attr_intent_init(args, XFS_ATTRI_OP_FLAGS_SET, &new);
 	if (error)
 		return error;
-
-	new->xattri_dela_state = xfs_attr_init_add_state(args);
+	if (!args->dp->i_afp)
+		new->xattri_dela_state = XFS_DAS_CREATE_FORK;
+	else
+		new->xattri_dela_state = xfs_attr_init_add_state(args);
 	xfs_defer_add(args->trans, XFS_DEFER_OPS_TYPE_ATTR, &new->xattri_list);
 	trace_xfs_attr_defer_add(new->xattri_dela_state, args->dp);
 
diff --git a/fs/xfs/libxfs/xfs_attr.h b/fs/xfs/libxfs/xfs_attr.h
index e329da3e7afa..7600eac74db7 100644
--- a/fs/xfs/libxfs/xfs_attr.h
+++ b/fs/xfs/libxfs/xfs_attr.h
@@ -445,7 +445,7 @@  struct xfs_attr_list_context {
  */
 enum xfs_delattr_state {
 	XFS_DAS_UNINIT		= 0,	/* No state has been set yet */
-
+	XFS_DAS_CREATE_FORK,		/* Create the attr fork */
 	/*
 	 * Initial sequence states. The replace setup code relies on the
 	 * ADD and REMOVE states for a specific format to be sequential so
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 6833110d1bd4..edafb6b1bfd6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -977,7 +977,7 @@  xfs_bmap_add_attrfork_local(
 /*
  * Set an inode attr fork offset based on the format of the data fork.
  */
-static int
+int
 xfs_bmap_set_attrforkoff(
 	struct xfs_inode	*ip,
 	int			size,
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 16db95b11589..a35945d44b80 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -172,6 +172,7 @@  void	xfs_trim_extent(struct xfs_bmbt_irec *irec, xfs_fileoff_t bno,
 		xfs_filblks_t len);
 unsigned int xfs_bmap_compute_attr_offset(struct xfs_mount *mp);
 int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
+int	xfs_bmap_set_attrforkoff(struct xfs_inode *ip, int size, int *version);
 void	xfs_bmap_local_to_extents_empty(struct xfs_trans *tp,
 		struct xfs_inode *ip, int whichfork);
 void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork);
diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
index 4a28c2d77070..f524dbbb42d3 100644
--- a/fs/xfs/xfs_attr_item.c
+++ b/fs/xfs/xfs_attr_item.c
@@ -625,7 +625,9 @@  xfs_attri_item_recover(
 		args->value = nv->value.i_addr;
 		args->valuelen = nv->value.i_len;
 		args->total = xfs_attr_calc_size(args, &local);
-		if (xfs_inode_hasattr(args->dp))
+		if (!args->dp->i_afp)
+			attr->xattri_dela_state = XFS_DAS_CREATE_FORK;
+		else if (xfs_inode_hasattr(args->dp))
 			attr->xattri_dela_state = xfs_attr_init_replace_state(args);
 		else
 			attr->xattri_dela_state = xfs_attr_init_add_state(args);