[v24,00/11] Log Attribute Replay

Message ID: 20210824224434.968720-1-allison.henderson@oracle.com

Message

Allison Henderson Aug. 24, 2021, 10:44 p.m. UTC
Hi all,

This set is a subset of a larger series, parent pointers. Delayed attributes
allow attribute operations (set and remove) to be logged and committed in the
same way that other delayed operations are. This allows more complex
operations (like parent pointers) to be broken up into multiple smaller
transactions. To do this, the existing attr operations must be modified to
operate as delayed operations.  This means that they cannot roll, commit, or
finish transactions.  Instead, they return -EAGAIN to allow the calling
function to handle the transaction.  In this series, we focus only on the
delayed attribute portion.  We will introduce parent pointers in a later set.
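
As a concrete illustration of the -EAGAIN convention above, here is a minimal,
hypothetical sketch of the calling pattern.  The attr_do_one_step() helper and
the surrounding function are made up for illustration; only xfs_trans_roll()
is a real kernel function, and the real code drives this through the
deferred-ops machinery rather than an open-coded loop:

	/*
	 * Hypothetical sketch only: the attr code performs one state
	 * machine step per call and returns -EAGAIN, and the caller
	 * (not the attr code) rolls the transaction between steps.
	 */
	static int
	attr_caller_sketch(struct xfs_trans **tpp, struct xfs_attr_item *attr)
	{
		int	error;

		for (;;) {
			/* one step of the set/remove state machine (hypothetical) */
			error = attr_do_one_step(*tpp, attr);
			if (error != -EAGAIN)
				return error;

			/* the caller owns the transaction: roll it and continue */
			error = xfs_trans_roll(tpp);
			if (error)
				return error;
		}
	}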

The set as a whole is a bit much to digest at once, so I usually send out the
smaller sub-series to reduce reviewer burnout.  But the entire extended series
is visible through the included github links.

Updates since v23:

xfs: Return from xfs_attr_set_iter if there are no more rmtblks to process
   Fixed xfs/125 fail

xfs: Capture buffers for delayed ops
   NEW

xfs: Set up infrastructure for deferred attribute operations
   Commit message updated with new name
   Removed hasdelattr() from xfs_attri_validate

xfs: Implement attr logging and replay
   Fixed xfs_sb_version_hasdelattr to use sb_features_log_incompat
   Removed redundant flag set in xfs_trans_attr_finish_update
   Renamed XFS_SB_FEAT_INCOMPAT_LOG_DELATTR to XFS_SB_FEAT_INCOMPAT_LOG_XATTRS
   Renamed xfs_sb_version_hasdelattr to sb_version_haslogxattrs
   Removed unneeded xfs_qm_dqattach_locked in xfs_trans_attr_finish_update
   Changed xfs_hasdelattr to sb_version_haslogxattrs in xfs_attr_create_intent
   Simplified args pointer in xfs_attri_item_recover
   Removed unneeded buffer rejoin logic

RFC xfs: Skip flip flags for delayed attrs
   Fixed xfs/125 fail

xfs: Add xfs_attr_set_deferred and xfs_attr_remove_deferred
   Added delayed sample to avoid potential race when enabling feature
   Moved warn_once to after error check

xfs: Remove unused xfs_attr_*_args
   Removed local variable leaf_bp

xfs: Add log attribute error tag
   Renamed error tag to ERRTAG_LARP

xfs: Add larp debug option
   NEW

xfs: Merge xfs_delattr_context into xfs_attr_item
   Rebase adjustments

xfs: Add helper function xfs_attr_leaf_addname
    Rebase adjustments

This series can be viewed on github here:
https://github.com/allisonhenderson/xfs_work/tree/delayed_attrs_v24

As well as the extended delayed attribute and parent pointer series:
https://github.com/allisonhenderson/xfs_work/tree/delayed_attrs_v24_extended

And the test cases:
https://github.com/allisonhenderson/xfs_work/tree/pptr_xfstestsv4
In order to run the test cases, you will need to have the corresponding xfsprogs
changes as well, which can be found here:
https://github.com/allisonhenderson/xfs_work/tree/delayed_attrs_xfsprogs_v24
https://github.com/allisonhenderson/xfs_work/tree/delayed_attrs_xfsprogs_v24_extended

To run the xfs attribute tests, run:
check -g attr

To run them as delayed attributes, run:
echo 1 > /sys/fs/xfs/debug/larp;
check -g attr

To run parent pointer tests:
check -g parent

I've also made the corresponding updates to the userspace side, and ported
anything they need to work correctly.

Questions, comments and feedback appreciated!

Allison

Allison Collins (1):
  xfs: Add xfs_attr_set_deferred and xfs_attr_remove_deferred

Allison Henderson (10):
  xfs: Return from xfs_attr_set_iter if there are no more rmtblks to
    process
  xfs: Capture buffers for delayed ops
  xfs: Set up infrastructure for log attribute replay
  xfs: Implement attr logging and replay
  RFC xfs: Skip flip flags for delayed attrs
  xfs: Remove unused xfs_attr_*_args
  xfs: Add log attribute error tag
  xfs: Add larp debug option
  xfs: Merge xfs_delattr_context into xfs_attr_item
  xfs: Add helper function xfs_attr_leaf_addname

 fs/xfs/Makefile                 |   1 +
 fs/xfs/libxfs/xfs_attr.c        | 454 +++++++++---------
 fs/xfs/libxfs/xfs_attr.h        |  57 ++-
 fs/xfs/libxfs/xfs_attr_leaf.c   |   3 +-
 fs/xfs/libxfs/xfs_attr_remote.c |  37 +-
 fs/xfs/libxfs/xfs_attr_remote.h |   6 +-
 fs/xfs/libxfs/xfs_defer.c       |   8 +-
 fs/xfs/libxfs/xfs_defer.h       |   7 +-
 fs/xfs/libxfs/xfs_errortag.h    |   4 +-
 fs/xfs/libxfs/xfs_format.h      |  10 +-
 fs/xfs/libxfs/xfs_log_format.h  |  44 +-
 fs/xfs/libxfs/xfs_log_recover.h |   2 +
 fs/xfs/scrub/common.c           |   2 +
 fs/xfs/xfs_attr_item.c          | 814 ++++++++++++++++++++++++++++++++
 fs/xfs/xfs_attr_item.h          |  52 ++
 fs/xfs/xfs_attr_list.c          |   1 +
 fs/xfs/xfs_bmap_item.c          |   2 +-
 fs/xfs/xfs_buf.c                |   1 +
 fs/xfs/xfs_buf.h                |   1 +
 fs/xfs/xfs_error.c              |   3 +
 fs/xfs/xfs_extfree_item.c       |   2 +-
 fs/xfs/xfs_globals.c            |   1 +
 fs/xfs/xfs_ioctl32.c            |   2 +
 fs/xfs/xfs_iops.c               |   2 +
 fs/xfs/xfs_log.c                |  45 ++
 fs/xfs/xfs_log.h                |   1 +
 fs/xfs/xfs_log_recover.c        |   9 +
 fs/xfs/xfs_ondisk.h             |   2 +
 fs/xfs/xfs_refcount_item.c      |   2 +-
 fs/xfs/xfs_rmap_item.c          |   2 +-
 fs/xfs/xfs_sysctl.h             |   1 +
 fs/xfs/xfs_sysfs.c              |  24 +
 fs/xfs/xfs_trace.h              |   1 +
 33 files changed, 1333 insertions(+), 270 deletions(-)
 create mode 100644 fs/xfs/xfs_attr_item.c
 create mode 100644 fs/xfs/xfs_attr_item.h

Comments

Dave Chinner Aug. 31, 2021, 12:20 a.m. UTC | #1
On Tue, Aug 24, 2021 at 03:44:23PM -0700, Allison Henderson wrote:
> Hi all,
> 
> This set is a subset of a larger series parent pointers. Delayed attributes allow
> attribute operations (set and remove) to be logged and committed in the same
> way that other delayed operations do. This allows more complex operations (like
> parent pointers) to be broken up into multiple smaller transactions. To do
> this, the existing attr operations must be modified to operate as a delayed
> operation.  This means that they cannot roll, commit, or finish transactions.
> Instead, they return -EAGAIN to allow the calling function to handle the
> transaction.  In this series, we focus on only the delayed attribute portion.
> We will introduce parent pointers in a later set.
> 
> The set as a whole is a bit much to digest at once, so I usually send out the
> smaller sub series to reduce reviewer burn out.  But the entire extended series
> is visible through the included github links.

Ok, so like I did with Darrick's deferred inactivation series, the
first thing I'm doing here is throwing this patchset at
scalability/performance workloads and finding out what is different.

I've merged this series with 5.14 + xfs/for-next + xfs-cil-scale and
then run some tests on it. First up is fsmark creating zero length
files w/ 64 byte xattrs. This should stress only shortform attribute
manipulations.

I have not enabled delayed attributes yet (i.e.
/sys/fs/xfs/debug/larp = 0)

First thing I notice is the transaction commit rate during create is
up around 900k/s, so we are doing 3 transactions per inode - 1 for
create, 2 for attributes. That looks like a regression - existing
shortform attribute creation only takes a single transaction commit,
so this workload prior to this patchset only ran at 600k commits/s.

Note that the only reason I'm getting 900k transactions/s is the
CIL scalability patchset - without that the system tops out at ~800k
transactions/s and so this would be a significant performance
regression (20%) vs the current xfs/for-next code.

Essentially, this looks like we are doing an extra transaction
commit to defer the creation of the attribute, then doing another
transaction to actually modify the attribute. i.e.:

 - 11.04% xfs_attr_set
    - 8.70% xfs_trans_commit
       - 8.69% __xfs_trans_commit
	  - 5.10% xfs_defer_finish_noroll
	     - 3.74% xfs_defer_trans_roll
		- 3.57% xfs_trans_roll
		   - 3.13% __xfs_trans_commit
		      - 3.01% xlog_cil_commit
			   0.66% down_read
			   0.63% xfs_log_ticket_regrant
	     - 1.16% xfs_attr_finish_item
		- 1.06% xfs_trans_attr_finish_update
		   - 1.03% xfs_attr_set_iter
		      - 1.01% xfs_attr_sf_addname
			 - 0.99% xfs_attr_try_sf_addname
			    - 0.61% xfs_attr_shortform_addname
				 0.55% xfs_attr_shortform_add


AFAICT, for non-delayed attributes, this first transaction commit
logs the inode but does not create intent or intent done items
(returns NULL for both operations), so just rolls and runs the
->finish_item. So it would seem that the first transaction just
changes the inode timestamps and does nothing else.

Firstly, this means the inode timestamp change is not atomic w.r.t.
the attribute change it relates to, and it's essentially new
overhead for the non-delayed path.

Looking at the unlink path, I see the same thing - there's an extra
transaction for the attr remove path, the same as the attr set path.
This drives the unlink path to 1.1 million transaction commits/sec
instead of 800k/s, so it's likely that there's a substantial
performance regression here on a kernel without the CIL scalability
patchset.

IOWs, there's significant behavioural changes with the non-delayed
logging version of this patchset, both in terms of performance and
the atomicity of changes that appear in the journal and hence
recovery behaviour.

At this point I have to ask: why are we trying to retain the "old"
way of doing things (even for testing) if it is substantially
changing behaviour and on-disk journal contents for attribute
modifications?

So, let's turn on delayed logging:

$ sudo sh -c 'echo 1 > /sys/fs/xfs/debug/larp'
$ ~/tests/fsmark-50-test-xfs.sh -t 16 -X 64 -d /dev/mapper/fast -- -l size=2000m -d agcount=67
QUOTA=
MKFSOPTS= -l size=2000m -d agcount=67
DEV=/dev/mapper/fast
THREADS=16
.....

Message from syslogd@test4 at Aug 31 09:12:55 ...
 kernel:[ 2342.737931] XFS: Assertion failed: !test_bit(XFS_LI_DIRTY, &lip->li_flags), file: fs/xfs/xfs_trans.c, line: 652

Instant assert fail and the machine locks up hard.

Actually, now that I reproduce it with a full console trace (which
is terribly interleaved and almost impossible to read), there's
bad stuff all over the place. Null pointer dereferences in
xlog_cil_commit, "sleeping in atomic" failures, and the assert
failure above. 

Ok, run a single thread, and...

[   84.119162] BUG: kernel NULL pointer dereference, address: 000000000000000d
[   84.123541] #PF: supervisor write access in kernel mode
[   84.126028] #PF: error_code(0x0002) - not-present page
[   84.127312] PGD 0 P4D 0 
[   84.127966] Oops: 0002 [#1] PREEMPT SMP
[   84.128960] CPU: 8 PID: 5139 Comm: fs_mark Not tainted 5.14.0-dgc+ #552
[   84.130632] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014
[   84.132723] RIP: 0010:xlog_prepare_iovec+0x59/0xe0
[   84.133951] Code: 4c 89 f8 4c 29 e0 48 c1 f8 04 48 39 d0 7d 76 4d 8d 67 10 8b 43 34 8d 50 0c 83 e2 07 74 0c 83 c0 0b 83 c8 07 83 e8 0b 89 43 34 <45> 89 74 24 0c 48 63 43 34 48 03 43 28 49 89 04 24 c7 40 08 69 00
[   84.137980] RSP: 0018:ffffc900021d7838 EFLAGS: 00010202
[   84.138968] RAX: 0000000000000004 RBX: ffff888140f3e100 RCX: 0000000000000006
[   84.140321] RDX: 0000000000000004 RSI: ffffc900021d7880 RDI: ffff888140f3e100
[   84.141671] RBP: ffffc900021d7868 R08: ffffffff82a8fb88 R09: 000000000000494e
[   84.143069] R10: ffff88823ffd5000 R11: 00000000000319c8 R12: 0000000000000001
[   84.144447] R13: ffffc900021d7880 R14: 000000000000001b R15: 0000000000000000
[   84.145825] FS:  00007f51fb33c740(0000) GS:ffff88823bc00000(0000) knlGS:0000000000000000
[   84.147413] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   84.148535] CR2: 000000000000000d CR3: 00000001473b1003 CR4: 0000000000770ee0
[   84.149927] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   84.151291] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   84.152331] PKRU: 55555554
[   84.152732] Call Trace:
[   84.153106]  xfs_attri_item_format+0x87/0x230
[   84.153748]  xlog_cil_commit+0x253/0xa00
[   84.154329]  ? kvmalloc_node+0x79/0x80
[   84.154881]  __xfs_trans_commit+0xc1/0x330
[   84.155490]  xfs_trans_roll+0x53/0xe0
[   84.156027]  xfs_defer_trans_roll+0x10d/0x2b0
[   84.156659]  xfs_defer_finish_noroll+0xb1/0x650
[   84.157312]  __xfs_trans_commit+0x143/0x330
[   84.157918]  xfs_trans_commit+0x10/0x20
[   84.158498]  xfs_attr_set+0x41a/0x4e0
[   84.159030]  xfs_xattr_set+0x8d/0xe0
[   84.159554]  __vfs_setxattr+0x6b/0x90
[   84.160090]  __vfs_setxattr_noperm+0x7d/0x1f0
[   84.160718]  __vfs_setxattr_locked+0xdf/0x100
[   84.161346]  vfs_setxattr+0x9b/0x170
[   84.161862]  setxattr+0x110/0x200
[   84.162346]  ? _raw_spin_unlock+0xe/0x20
[   84.162914]  ? __handle_mm_fault+0xc1b/0x16d0
[   84.163556]  ? __might_sleep+0x49/0x80
[   84.164132]  __x64_sys_fsetxattr+0xb1/0xe0
[   84.164782]  do_syscall_64+0x35/0x80

Ok, there's the first failure.

This looks like it's a problem with xfs_attri_item_{size,format} in
calculating the number of bytes to log. They use ATTR_NVEC_SIZE() to
calculate the number of bytes to copy from the attribute item, which
rounds up the length to copy to 4 byte aligned values. I'm not sure
what this function is calculating:

/* iovec length must be 32-bit aligned */
static inline size_t ATTR_NVEC_SIZE(size_t size)
{
        return size == sizeof(int32_t) ? size :
	               sizeof(int32_t) + round_up(size, sizeof(int32_t));
}

It appears to be saying if the size == 4, then return 4, otherwise
return 4 + roundup(size)... which leads me to struct
xfs_attri_log_format:

struct xfs_attri_log_format {
        uint16_t        alfi_type;      /* attri log item type */
        uint16_t        alfi_size;      /* size of this item */
        uint32_t        __pad;          /* pad to 64 bit aligned */
        uint64_t        alfi_id;        /* attri identifier */
        uint64_t        alfi_ino;       /* the inode for this attr operation */
        uint32_t        alfi_op_flags;  /* marks the op as a set or remove */
        uint32_t        alfi_name_len;  /* attr name length */
        uint32_t        alfi_value_len; /* attr value length */
        uint32_t        alfi_attr_flags;/* attr flags */
};

I don't see where the extra 4 bytes for the attribute vector size
comes from. It's not needed to store the length, so this could
oversize the amount of data to be copied from the source
buffer by up to 7 bytes?

I can see that it might need rounding with the existing
log code (because the formatter is responsible for 32 bit alignment
of log vectors), but that goes away with the CIL scalability
patchset that always aligns iovecs to 4 byte alignment so the
formatters do not need to do that.

Hiding it in a "macro" is not necessary, either - look at how
xfs_inode_item_{data,attr}_fork_size handle the rounding up of the
local format fork size. They round up the fork byte count to 4
directly, and the format code copies those bytes because
xfs_idata_realloc() allocates those bytes.

However, for the attribute buffers, this isn't guaranteed. Look at
xfs_xattr_set():

        struct xfs_da_args      args = {
                .dp             = XFS_I(inode),
                .attr_filter    = handler->flags,
                .attr_flags     = flags,
                .name           = name,
                .namelen        = strlen(name),
                .value          = (void *)value,
                .valuelen       = size,
        };

There is no rounding up of the name or value lengths, and these end
up directly referenced by the deferred logging via xfs_attr_log_item()
and attrip->da_args->...

        attrip->attri_name = (void *)attr->xattri_da_args->name;
        attrip->attri_value = attr->xattri_da_args->value;
        attrip->attri_name_len = attr->xattri_da_args->namelen;
        attrip->attri_value_len = attr->xattri_da_args->valuelen;

We then pass those pointers directly to xlog_copy_iovec() but with a
rounded up length that is longer than the source buffer:

        xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_ATTR_NAME,
                        attrip->attri_name,
                        ATTR_NVEC_SIZE(attrip->attri_name_len));
        if (attrip->attri_value_len > 0)
                xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_ATTR_VALUE,
                                attrip->attri_value,
                                ATTR_NVEC_SIZE(attrip->attri_value_len));

So while this might not be the source of the above crash, it's
certainly a bug that needs fixing.
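
For the record, here is one hedged sketch of how the over-read could be
avoided with the current iovec API (an illustration only, not the actual fix
from any posted patch): copy only the real byte count, and round up only the
length accounted to the iovec, assuming the item size calculation has already
reserved the rounded-up space:

        struct xfs_log_iovec    *vecp = NULL;
        void                    *ptr;

        /*
         * Copy exactly attri_name_len bytes from the source buffer, but
         * account a 32-bit aligned region length so iovec alignment is
         * preserved without reading past the end of the name.
         */
        ptr = xlog_prepare_iovec(lv, &vecp, XLOG_REG_TYPE_ATTR_NAME);
        memcpy(ptr, attrip->attri_name, attrip->attri_name_len);
        xlog_finish_iovec(lv, vecp,
                        round_up(attrip->attri_name_len, sizeof(int32_t)));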

At this point, I'm just going to hack on the code to make it work,
and we can go from there...

Cheers,

Dave.
Allison Henderson Aug. 31, 2021, 6:27 p.m. UTC | #2
On 8/30/21 5:20 PM, Dave Chinner wrote:
> On Tue, Aug 24, 2021 at 03:44:23PM -0700, Allison Henderson wrote:
>> Hi all,
>>
>> This set is a subset of a larger series parent pointers. Delayed attributes allow
>> attribute operations (set and remove) to be logged and committed in the same
>> way that other delayed operations do. This allows more complex operations (like
>> parent pointers) to be broken up into multiple smaller transactions. To do
>> this, the existing attr operations must be modified to operate as a delayed
>> operation.  This means that they cannot roll, commit, or finish transactions.
>> Instead, they return -EAGAIN to allow the calling function to handle the
>> transaction.  In this series, we focus on only the delayed attribute portion.
>> We will introduce parent pointers in a later set.
>>
>> The set as a whole is a bit much to digest at once, so I usually send out the
>> smaller sub series to reduce reviewer burn out.  But the entire extended series
>> is visible through the included github links.

Some of this we worked out in the chat last night, but I will echo it
here for the archives.

> 
> Ok, so like I did with Darrick's deferred inactivation series, the
> first thing I'm doing here is throwing this patchset at
> scalability/performance workloads and finding out what is different.
> 
> I've merged this series with 5.14 + xfs/for-next + xfs-cil-scale and
> then run some tests on it. First up is fsmark creating zero length
> files w/ 64 byte xattrs. This should stress only shortform attribute
> manipulations.
> 
> I have not enabled delayed attributes yet (i.e.
> /sys/fs/xfs/debug/larp = 0)
> 
> First thing I notice is the transaction commit rate during create is
> up around 900k/s, so we are doing 3 transactions per inode - 1 for
> create, 2 for attributes. That looks like a regression - existing
> shortform attribute creation only takes a single transaction commit,
> so this workload prior to this patchset only ran at 600k commits/s.
> 
> Note that the only reason I'm getting 900k transactions/s is the
> CIL scalability patchset - without that the system tops out at ~800k
> transactions/s and so this would be a significant performance
> regression (20%) vs the current xfs/for-next code.
> 
> Essentially, this looks like we are doing an extra transaction
> commit to defer the creation of the attribute, then doing another
> transaction to actually modify the attribute. i.e.:
> 
>   - 11.04% xfs_attr_set
>      - 8.70% xfs_trans_commit
>         - 8.69% __xfs_trans_commit
> 	  - 5.10% xfs_defer_finish_noroll
> 	     - 3.74% xfs_defer_trans_roll
> 		- 3.57% xfs_trans_roll
> 		   - 3.13% __xfs_trans_commit
> 		      - 3.01% xlog_cil_commit
> 			   0.66% down_read
> 			   0.63% xfs_log_ticket_regrant
> 	     - 1.16% xfs_attr_finish_item
> 		- 1.06% xfs_trans_attr_finish_update
> 		   - 1.03% xfs_attr_set_iter
> 		      - 1.01% xfs_attr_sf_addname
> 			 - 0.99% xfs_attr_try_sf_addname
> 			    - 0.61% xfs_attr_shortform_addname
> 				 0.55% xfs_attr_shortform_add
> 
> 
> AFAICT, for non-delayed attributes, this first transaction commit
> logs the inode but does not create intent or intent done items
> (returns NULL for both operations), so just rolls and runs the
> ->finish_item. So it would seem that the first transaction just
> changes the inode timestamps and does nothing else.
> 
> Firstly, this means the inode timestamp change is not atomic w.r.t.
> the attribute change the timestamp change relates to and it's
> essentially new overhead for the non-delayed path.
> 
> Looking at the unlink path, I see the same thing - there's an extra
> transaction for the attr remove path, the same as the attr set path.
> This drives the unlink path to 1.1 million transaction commits/sec
> instead of 800k/s, so it's likely that there's a substantial
> performance regression here on a kernel without the CIL scalability
> patchset.
> 
> IOWs, there's significant behavioural changes with the non-delayed
> logging version of this patchset, both in terms of performance and
> the atomicity of changes that appear in the journal and hence
> recovery behaviour.
> 
> At this point I have to ask: why are we trying to retain the "old"
> way of doing things (even for testing) if it is substantially
> changing behaviour and on-disk journal contents for attribute
> modifications?
Per the chat discussion, we have to keep both methods since sb v4 would 
not use the new log entries.

> 
> So, lets turn on delayed logging:
> 
> $ sudo sh -c 'echo 1 > /sys/fs/xfs/debug/larp'
> $ ~/tests/fsmark-50-test-xfs.sh -t 16 -X 64 -d /dev/mapper/fast -- -l size=2000m -d agcount=67
> QUOTA=
> MKFSOPTS= -l size=2000m -d agcount=67
> DEV=/dev/mapper/fast
> THREADS=16
> .....
> 
> Message from syslogd@test4 at Aug 31 09:12:55 ...
>   kernel:[ 2342.737931] XFS: Assertion failed: !test_bit(XFS_LI_DIRTY, &lip->li_flags), file: fs/xfs/xfs_trans.c, line: 652
> 
> Instant assert fail and the machine locks up hard.
> 
> Actually, now that I reproduce it with a full console trace (which
> is terribly interleaved and almost impossible to read) there's
> bad stuff all over the place. Null pointer dereferences in
> xlog_cil_commit, "sleeping in atomic" failures, and the assert
> failure above.
> 
> Ok, run a single thread, and...
> 
> [   84.119162] BUG: kernel NULL pointer dereference, address: 000000000000000d
> [   84.123541] #PF: supervisor write access in kernel mode
> [   84.126028] #PF: error_code(0x0002) - not-present page
> [   84.127312] PGD 0 P4D 0
> [   84.127966] Oops: 0002 [#1] PREEMPT SMP
> [   84.128960] CPU: 8 PID: 5139 Comm: fs_mark Not tainted 5.14.0-dgc+ #552
> [   84.130632] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014
> [   84.132723] RIP: 0010:xlog_prepare_iovec+0x59/0xe0
> [   84.133951] Code: 4c 89 f8 4c 29 e0 48 c1 f8 04 48 39 d0 7d 76 4d 8d 67 10 8b 43 34 8d 50 0c 83 e2 07 74 0c 83 c0 0b 83 c8 07 83 e8 0b 89 43 34 <45> 89 74 24 0c 48 63 43 34 48 03 43 28 49 89 04 24 c7 40 08 69 00
> [   84.137980] RSP: 0018:ffffc900021d7838 EFLAGS: 00010202
> [   84.138968] RAX: 0000000000000004 RBX: ffff888140f3e100 RCX: 0000000000000006
> [   84.140321] RDX: 0000000000000004 RSI: ffffc900021d7880 RDI: ffff888140f3e100
> [   84.141671] RBP: ffffc900021d7868 R08: ffffffff82a8fb88 R09: 000000000000494e
> [   84.143069] R10: ffff88823ffd5000 R11: 00000000000319c8 R12: 0000000000000001
> [   84.144447] R13: ffffc900021d7880 R14: 000000000000001b R15: 0000000000000000
> [   84.145825] FS:  00007f51fb33c740(0000) GS:ffff88823bc00000(0000) knlGS:0000000000000000
> [   84.147413] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   84.148535] CR2: 000000000000000d CR3: 00000001473b1003 CR4: 0000000000770ee0
> [   84.149927] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   84.151291] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   84.152331] PKRU: 55555554
> [   84.152732] Call Trace:
> [   84.153106]  xfs_attri_item_format+0x87/0x230
> [   84.153748]  xlog_cil_commit+0x253/0xa00
> [   84.154329]  ? kvmalloc_node+0x79/0x80
> [   84.154881]  __xfs_trans_commit+0xc1/0x330
> [   84.155490]  xfs_trans_roll+0x53/0xe0
> [   84.156027]  xfs_defer_trans_roll+0x10d/0x2b0
> [   84.156659]  xfs_defer_finish_noroll+0xb1/0x650
> [   84.157312]  __xfs_trans_commit+0x143/0x330
> [   84.157918]  xfs_trans_commit+0x10/0x20
> [   84.158498]  xfs_attr_set+0x41a/0x4e0
> [   84.159030]  xfs_xattr_set+0x8d/0xe0
> [   84.159554]  __vfs_setxattr+0x6b/0x90
> [   84.160090]  __vfs_setxattr_noperm+0x7d/0x1f0
> [   84.160718]  __vfs_setxattr_locked+0xdf/0x100
> [   84.161346]  vfs_setxattr+0x9b/0x170
> [   84.161862]  setxattr+0x110/0x200
> [   84.162346]  ? _raw_spin_unlock+0xe/0x20
> [   84.162914]  ? __handle_mm_fault+0xc1b/0x16d0
> [   84.163556]  ? __might_sleep+0x49/0x80
> [   84.164132]  __x64_sys_fsetxattr+0xb1/0xe0
> [   84.164782]  do_syscall_64+0x35/0x80
> 
> Ok, there's the first failure.
> 
> This looks like it's a problem with xfs_attri_item_{size,format} in
> calculating the number of bytes to log. They use ATTR_NVEC_SIZE() to
> calculate the number of bytes of copy from the attribute item which
> rounds up the length to copy to 4 byte aligned values. I'm not sure
> what this function is calculating:
> 
> /* iovec length must be 32-bit aligned */
> static inline size_t ATTR_NVEC_SIZE(size_t size)
> {
>          return size == sizeof(int32_t) ? size :
> 	               sizeof(int32_t) + round_up(size, sizeof(int32_t));
> }
> 
> It appears to be saying if the size == 4, then return 4, otherwise
> return 4 + roundup(size)... which leads me to struct
> xfs_attri_log_format:
> 
> struct xfs_attri_log_format {
>          uint16_t        alfi_type;      /* attri log item type */
>          uint16_t        alfi_size;      /* size of this item */
>          uint32_t        __pad;          /* pad to 64 bit aligned */
>          uint64_t        alfi_id;        /* attri identifier */
>          uint64_t        alfi_ino;       /* the inode for this attr operation */
>          uint32_t        alfi_op_flags;  /* marks the op as a set or remove */
>          uint32_t        alfi_name_len;  /* attr name length */
>          uint32_t        alfi_value_len; /* attr value length */
>          uint32_t        alfi_attr_flags;/* attr flags */
> };
> 
> I don't see where the extra 4 bytes for the attribute vector size
> comes from. It's not needed to store the length, so this could
> oversize the amount of data to be copied from the source
> buffer by up to 7 bytes?
> 
> I can see that it might need rounding with the existing
> log code (because the formatter is responsible for 32 bit alignment
> of log vectors), but that goes away with the CIL scalability
> patchset that always aligns iovecs to 4 byte alignment so the
> formatters do not need to do that.

I think we figured this out last night: initially this was here for an
assertion check in the log code, but I think just the round-up will
suffice for the check.

> 
> Hiding it in a "macro" is not necessary, either - look at how
> xfs_inode_item_{data,attr}_fork_size handle the rounding up of the
> local format fork size. They round up the fork byte count to 4
> directly, and the format code copies those bytes because
> xfs_idata_realloc() allocates those bytes.
> 
> However, for the attribute buffers, this isn't guaranteed. Look at
> xfs_xattr_set():
> 
>          struct xfs_da_args      args = {
>                  .dp             = XFS_I(inode),
>                  .attr_filter    = handler->flags,
>                  .attr_flags     = flags,
>                  .name           = name,
>                  .namelen        = strlen(name),
>                  .value          = (void *)value,
>                  .valuelen       = size,
>          };
> 
> There is no rounding up of the name or value lengths, and these end
> up directly referenced by the deferred logging via xfs_attr_log_item()
> and attrip->da_args->...
> 
>          attrip->attri_name = (void *)attr->xattri_da_args->name;
>          attrip->attri_value = attr->xattri_da_args->value;
>          attrip->attri_name_len = attr->xattri_da_args->namelen;
>          attrip->attri_value_len = attr->xattri_da_args->valuelen;
> 
> We then pass those pointers directly to xlog_iovec_copy() but with a
> rounded up length that is longer than the source buffer:
> 
>          xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_ATTR_NAME,
>                          attrip->attri_name,
>                          ATTR_NVEC_SIZE(attrip->attri_name_len));
>          if (attrip->attri_value_len > 0)
>                  xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_ATTR_VALUE,
>                                  attrip->attri_value,
>                                  ATTR_NVEC_SIZE(attrip->attri_value_len));
> 
> So while this might not be the source of the above crash, it's
> certainly a bug that needs fixing.
> 
> At this point, I'm just going to hack on the code to make it work,
> and we can go from there...

I think when we left off last night, we were more concerned with the
extra transaction overhead in the perf captures.  I will see if I can
replicate what you are seeing with perf and maybe we can work out some
shortcuts.  Thanks for your help here!

Allison

> 
> Cheers,
> 
> Dave.
>
Dave Chinner Sept. 1, 2021, 6:29 a.m. UTC | #3
On Tue, Aug 31, 2021 at 11:27:48AM -0700, Allison Henderson wrote:
> 
> 
> On 8/30/21 5:20 PM, Dave Chinner wrote:
> > On Tue, Aug 24, 2021 at 03:44:23PM -0700, Allison Henderson wrote:
> > > Hi all,
> > > 
> > > This set is a subset of a larger series parent pointers. Delayed attributes allow
> > > attribute operations (set and remove) to be logged and committed in the same
> > > way that other delayed operations do. This allows more complex operations (like
> > > parent pointers) to be broken up into multiple smaller transactions. To do
> > > this, the existing attr operations must be modified to operate as a delayed
> > > operation.  This means that they cannot roll, commit, or finish transactions.
> > > Instead, they return -EAGAIN to allow the calling function to handle the
> > > transaction.  In this series, we focus on only the delayed attribute portion.
> > > We will introduce parent pointers in a later set.
> > > 
> > > The set as a whole is a bit much to digest at once, so I usually send out the
> > > smaller sub series to reduce reviewer burn out.  But the entire extended series
> > > is visible through the included github links.
> 
> Some of this we worked out in the chat last night, but I will echo it here
> for the archives
> 
> > 
> > Ok, so like I did with Darrick's deferred inactivation series, the
> > first thing I'm doing here is throwing this patchset at
> > scalability/performance workloads and finding out what is different.
> > 
> > I've merged this series with 5.14 + xfs/for-next + xfs-cil-scale and
> > then run some tests on it. First up is fsmark creating zero length
> > files w/ 64 byte xattrs. This should stress only shortform attribute
> > manipulations.
> > 
> > I have not enabled delayed attributes yet (i.e.
> > /sys/fs/xfs/debug/larp = 0)
> > 
> > First thing I notice is the transaction commit rate during create is
> > up around 900k/s, so we are doing 3 transactions per inode - 1 for
> > create, 2 for attributes. That looks like a regression - existing
> > shortform attribute creation only takes a single transaction commit,
> > so this workload prior to this patchset only ran at 600k commits/s.
> > 
> > Note that the only reason I'm getting 900k transactions/s is the
> > CIL scalability patchset - without that the system tops out at ~800k
> > transactions/s and so this would be a significant performance
> > regression (20%) vs the current xfs/for-next code.
> > 
> > Essentially, this looks like we are doing an extra transaction
> > commit to defer the creation of the attribute, then doing another
> > transaction to actually modify the attribute. i.e.:
> > 
> >   - 11.04% xfs_attr_set
> >      - 8.70% xfs_trans_commit
> >         - 8.69% __xfs_trans_commit
> > 	  - 5.10% xfs_defer_finish_noroll
> > 	     - 3.74% xfs_defer_trans_roll
> > 		- 3.57% xfs_trans_roll
> > 		   - 3.13% __xfs_trans_commit
> > 		      - 3.01% xlog_cil_commit
> > 			   0.66% down_read
> > 			   0.63% xfs_log_ticket_regrant
> > 	     - 1.16% xfs_attr_finish_item
> > 		- 1.06% xfs_trans_attr_finish_update
> > 		   - 1.03% xfs_attr_set_iter
> > 		      - 1.01% xfs_attr_sf_addname
> > 			 - 0.99% xfs_attr_try_sf_addname
> > 			    - 0.61% xfs_attr_shortform_addname
> > 				 0.55% xfs_attr_shortform_add
> > 
> > 
> > AFAICT, for non-delayed attributes, this first transaction commit
> > logs the inode but does not create intent or intent done items
> > (returns NULL for both operations), so just rolls and runs the
> > ->finish_item. So it would seem that the first transaction just
> > changes the inode timestamps and does nothing else.
> > 
> > Firstly, this means the inode timestamp change is not atomic w.r.t.
> > the attribute change the timestamp change relates to and it's
> > essentially new overhead for the non-delayed path.
> > 
> > Looking at the unlink path, I see the same thing - there's an extra
> > transaction for the attr remove path, the same as the attr set path.
> > This drives the unlink path to 1.1 million transaction commits/sec
> > instead of 800k/s, so it's likely that there's a substantial
> > performance regression here on a kernel without the CIL scalability
> > patchset.
> > 
> > IOWs, there's significant behavioural changes with the non-delayed
> > logging version of this patchset, both in terms of performance and
> > the atomicity of changes that appear in the journal and hence
> > recovery behaviour.
> > 
> > At this point I have to ask: why are we trying to retain the "old"
> > way of doing things (even for testing) if it is substantially
> > changing behaviour and on-disk journal contents for attribute
> > modifications?
> Per the chat discussion, we have to keep both methods since sb v4 would not
> use the new log entries.

Yup, but we likely have better reasons than this. Tests I've run
today have indicated that as the xattr size goes up, the performance
regression of the delayed attrs gets worse. With max-sized xattrs,
performance is only ~25% of the existing attr implementation.

Largely the problem is the number of copies of the xattr value we
end up with in memory when we are using delayed attrs. In the current
case, we have this all in a single syscall context:

	- xattr value allocated by VFS
	- attached to xfs_da_args
	- xfs_da_args passed to xfs_attr_rmtval_set_value() where we
	  allocate buffer(s) to hold the encoded name.
	- buffer is written to disk.
	- xattr VFS value freed.

And the xfs_buf is freed when memory pressure occurs.

So, essentially we have two copies of the xattr value and two
memcpy()s of it to get it to disk.

Now, with delayed attrs we have this in the syscall context:

	- xattr value allocated by VFS
	- attached to xfs_da_args
	- attached to xfs_attri_item
	- attri item intent gets created, allocates a 64k+ shadow
	  buffer and memcpy()s the value into it.
	- intent gets committed and attached to the CIL
	- xfs_da_args passed to xfs_attr_rmtval_set_value() where we
	  allocate buffer(s) to hold the encoded name.
	- buffer is written to disk.
	- xattr VFS value freed.

Now userspace runs more xattr creation syscalls, all queuing up
shadow buffers on the CIL. Memory footprint goes up massively.

Some time later, the CIL commits and we:
	- shadow buffer attached to CIL commit
	- memcpy() the value from the shadow buffer to the iclog
	- commit the CIL checkpoint
	- on CIL checkpoint completion, shadow buffer is freed.

And the xfs_bufs are freed when memory pressure occurs.

The result is that with delayed attrs, large attr creation hammers
the page allocator really badly - more than 50% of the CPU time is
spent trying to allocate pages and compact memory into contiguous
regions and so performance goes way down. e.g.

- 51.90% xlog_cil_commit
 - 36.96% kvmalloc_node
    - 36.51% __kmalloc_node
       - 36.32% kmalloc_large_node
	  - 36.24% __alloc_pages
	     - 34.53% __alloc_pages_slowpath.constprop.0
		- 33.80% __alloc_pages_direct_compact
		   - 33.62% try_to_compact_pages
		      - compact_zone_order
			 - 30.58% compact_zone
			      14.72% PageHuge

I suspect we need to look at our use of kvmalloc() and the flags we
pass it because I think if we can't get contiguous ranges
immediately, we should go straight to vmalloc rather than burn a
dozen CPUs trying to compact memory...
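
For example, a minimal sketch of that idea (a hypothetical helper, not an
actual patch from this thread): try the physically contiguous allocation with
__GFP_NORETRY | __GFP_NOWARN so it fails fast rather than compacting, and fall
straight back to vmalloc():

	/*
	 * Hypothetical helper: fail the kmalloc() attempt quickly instead
	 * of burning CPU in memory compaction, then fall back to vmalloc().
	 */
	static void *
	xlog_kvmalloc_sketch(size_t buf_size)
	{
		gfp_t	flags = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
		void	*p;

		p = kmalloc(buf_size, flags);
		if (!p)
			p = vmalloc(buf_size);
		return p;
	}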

The amount we log also goes way up.  At 16 threads and 4kB xattrs,
the log alone is sustaining more than 1GB/s throughput at 50k xattr
creates/s.  There is over 2GB/s going to disk.

With 64k xattrs, there is over 2.5GB/s being written to the log and
just over 3GB/s being written to disk. Performance is about 9000
xattr creates/s.

In comparison, with delayed attrs turned off for 64k xattrs, log
throughput is roughly 500MB/s, disk throughput is between 3.5-4GB/s
and the create rate is around 45,000 xattrs/s. So you can see that
logging large xattrs really hurts performance.

We want delayed attrs for modifying attributes atomically
with other inode modifications (i.e. parent pointers, security
labels, ACLs, etc) and none of these use cases use large xattrs -
they are all relatively small and so the logging and memory overhead
of delayed attrs isn't a huge deal for them. For anything else,
small or large, a 10-15% regression is a deal breaker (think ceph,
samba, etc).

So I think we're going to need to select attribute modification
behaviour at the call site. E.g. from the xattr syscalls we don't get
any real integrity benefit from logging xattrs, so maybe that path
always uses the unlogged code. Then we can change the internal create
path to use delayed attrs for parent pointers and security attrs.
As Darrick said, this is likely what the current larp debug knob
should turn into - caller-selected behaviour.

> > Ok, there's the first failure.
> > 
> > This looks like it's a problem with xfs_attri_item_{size,format} in
> > calculating the number of bytes to log. They use ATTR_NVEC_SIZE() to
> > calculate the number of bytes of copy from the attribute item which
> > rounds up the length to copy to 4 byte aligned values. I'm not sure
> > what this function is calculating:
> > 
> > /* iovec length must be 32-bit aligned */
> > static inline size_t ATTR_NVEC_SIZE(size_t size)
> > {
> >          return size == sizeof(int32_t) ? size :
> > 	               sizeof(int32_t) + round_up(size, sizeof(int32_t));
> > }
> > 
> > It appears to be saying if the size == 4, then return 4, otherwise
> > return 4 + roundup(size)... which leads me to struct
> > xfs_attri_log_format:
> > 
> > struct xfs_attri_log_format {
> >          uint16_t        alfi_type;      /* attri log item type */
> >          uint16_t        alfi_size;      /* size of this item */
> >          uint32_t        __pad;          /* pad to 64 bit aligned */
> >          uint64_t        alfi_id;        /* attri identifier */
> >          uint64_t        alfi_ino;       /* the inode for this attr operation */
> >          uint32_t        alfi_op_flags;  /* marks the op as a set or remove */
> >          uint32_t        alfi_name_len;  /* attr name length */
> >          uint32_t        alfi_value_len; /* attr value length */
> >          uint32_t        alfi_attr_flags;/* attr flags */
> > };
> > 
> > I don't see where the extra 4 bytes for the attribute vector size
> > comes from. It's not needed to store the length, so this could
> > oversize the amount of data to be copied from the source
> > buffer by up to 7 bytes?
> > 
> > I can see that it might need rounding with the existing
> > log code (because the formatter is responsible for 32 bit alignment
> > of log vectors), but that goes away with the CIL scalability
> > patchset that always aligns iovecs to 4 byte alignment so the
> > formatters do not need to do that.
> 
> I think we figured this out last night, initially this was here for an
> assertion check in the log code, but I think just the round up will suffice
> for the check.

I've got a cleaner solution in the works - we should be hiding the
alignment behind the log iovec API rather than forcing callers to
know about it and always get it right.
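
A rough sketch of what hiding the alignment behind the iovec API could look
like (the exact xfs_log_vec field accounting here is an assumption for
illustration, not the actual patch): the formatter passes the exact number of
bytes it copied and the finishing helper does the padding in one place.

	static inline void
	xlog_finish_iovec_padded(struct xfs_log_vec *lv,
			struct xfs_log_iovec *vec, int len)
	{
		/*
		 * Pad the region length to 32-bit alignment here, in one
		 * place, so ->iop_format implementations can pass the exact
		 * number of bytes they copied and never over-read their
		 * source buffers. (Field usage is an assumption of this
		 * sketch.)
		 */
		len = round_up(len, sizeof(int32_t));
		vec->i_len = len;
		lv->lv_buf_len += len;
		lv->lv_bytes += len;
	}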

> > Hiding it in a "macro" is not necessary, either - look at how
> > xfs_inode_item_{data,attr}_fork_size handle the rounding up of the
> > local format fork size. They round up the fork byte count to 4
> > directly, and the format code copies those bytes because
> > xfs_idata_realloc() allocates those bytes.
> > 
> > However, for the attribute buffers, this isn't guaranteed. Look at
> > xfs_xattr_set():
> > 
> >          struct xfs_da_args      args = {
> >                  .dp             = XFS_I(inode),
> >                  .attr_filter    = handler->flags,
> >                  .attr_flags     = flags,
> >                  .name           = name,
> >                  .namelen        = strlen(name),
> >                  .value          = (void *)value,
> >                  .valuelen       = size,
> >          };
> > 
> > There is no rounding up of the name or value lengths, and these end
> > up directly referenced by the deferred logging via xfs_attr_log_item()
> > and attrip->da_args->...
> > 
> >          attrip->attri_name = (void *)attr->xattri_da_args->name;
> >          attrip->attri_value = attr->xattri_da_args->value;
> >          attrip->attri_name_len = attr->xattri_da_args->namelen;
> >          attrip->attri_value_len = attr->xattri_da_args->valuelen;
> > 
> > We then pass those pointers directly to xlog_iovec_copy() but with a
> > rounded up length that is longer than the source buffer:
> > 
> >          xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_ATTR_NAME,
> >                          attrip->attri_name,
> >                          ATTR_NVEC_SIZE(attrip->attri_name_len));
> >          if (attrip->attri_value_len > 0)
> >                  xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_ATTR_VALUE,
> >                                  attrip->attri_value,
> >                                  ATTR_NVEC_SIZE(attrip->attri_value_len));
> > 
> > So while this might not be the source of the above crash, it's
> > certainly a bug that needs fixing.
> > 
> > At this point, I'm just going to hack on the code to make it work,
> > and we can go from there...
> 
> I think when we left off last night, we are more concerned with the extra
> transaction overhead in the perf captures.  I will see if I can replicate
> what you are seeing with perf and maybe we can work out some short cuts.
> Thank for your help here!

Yup, there are a few issues. I've hacked out the initial transaction
for non-delayed attrs and that brings the commit count down as
expected. It might be acceptable, but we'll see.

I've also added intent whiteouts to try to avoid having to log
intents that are completed before the intent has been written to the
log. With that in place, the above large xattr creation workloads
have relatively consistent log throughput of around 600-700MB/s, and
performance for 4k and 64kB xattrs roughly doubles. It's still way
behind the current implementation, though, because it hammers the
page allocator even harder now. However, intent whiteouts are
generic, so I'll write this up into a separate patchset that
addresses all the other types of intents we have as well.

There's some other things we can potentially do to minimise the
memory footprint and the number of memcpy()s of the logged values,
so I'll look into these tomorrow.

Cheers,

Dave.