[V2] xfs: implement cgroup writeback support

Message ID f13839700d372a4663a08a11883e9c89b13056ca.1521752282.git.shli@fb.com (mailing list archive)
State New, archived

Commit Message

Shaohua Li March 22, 2018, 9:11 p.m. UTC
From: Shaohua Li <shli@fb.com>

Basically this is a copy of commit 001e4a8775f6 ("ext4: implement
cgroup writeback support"). Tested with a fio job: verified that
writeback is throttled against the cgroup's io.max write bandwidth,
and that after moving the fio job to another cgroup, writeback is
throttled against the new cgroup's setting.

This only controls file data writes for a cgroup. For metadata, since
xfs dispatches metadata writes from dedicated threads, it's possible
that a low-prio app's metadata could harm a high-prio app's metadata.
A while back, Tejun had a patch to force metadata to belong to the
root cgroup for btrfs. I had a similar patch for xfs too, but since
Tejun's patch isn't upstream, I'll delay posting the xfs patch.

Cc: Tejun Heo <tj@kernel.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/xfs/xfs_aops.c  | 13 +++++++++++--
 fs/xfs/xfs_super.c |  1 +
 2 files changed, 12 insertions(+), 2 deletions(-)
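
For context, the two writeback hooks this patch calls were, around
v4.16, roughly as follows (a paraphrase of the CONFIG_CGROUP_WRITEBACK
versions in include/linux/writeback.h, not a verbatim copy; check the
actual tree for the exact definitions):

	static inline void wbc_init_bio(struct writeback_control *wbc,
					struct bio *bio)
	{
		/*
		 * Attach the bio to the blkcg that owns the pages under
		 * writeback; pageout() doesn't attach a wb, hence the check.
		 */
		if (wbc->wb)
			bio_associate_blkcg(bio, wbc->wb->blkcg_css);
	}

	static inline void wbc_account_io(struct writeback_control *wbc,
					  struct page *page, size_t bytes)
	{
		/* credit the bytes to the page's owner for wb switching */
		if (wbc->wb)
			__wbc_account_io(wbc, page, bytes);
	}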

Comments

Chris Mason March 23, 2018, 2 p.m. UTC | #1
On 22 Mar 2018, at 17:11, Shaohua Li wrote:

> From: Shaohua Li <shli@fb.com>
>
> Basically this is a copy of commit 001e4a8775f6(ext4: implement cgroup
> writeback support). Tested with a fio test, verified writeback is
> throttled against cgroup io.max write bandwidth, also verified moving
> the fio test to another cgroup and the writeback is throttled against
> new cgroup setting.
>
> This only controls the file data write for cgroup. For metadata, since
> xfs dispatches the metadata write in specific threads, it's possible low
> prio app's metadata could harm high prio app's metadata. A while back,
> Tejun has a patch to force metadata belonging to root cgroup for btrfs.
> I had a similiar patch for xfs too. But Since Tejun's patch isn't in
> upstream, I'll delay post the xfs patch.

I think we've worked out all the problems on the btrfs/generic patch 
set, we'll get those sent out shortly.

-chris
张本龙 March 23, 2018, 2:24 p.m. UTC | #2
Hi Shaohua and XFS,

May I ask how we are going to handle REQ_META bios issued from XFS?
You mentioned charging them to the root cgroup (also in an earlier
email discussion), but the 4.16.0-rc6 code does not seem to handle
them separately.

In our case, to support XFS cgroup writeback control we ported and
slightly adapted the code to 3.10.0, and ignoring xfs log bios
resulted in trouble. Threads from a throttled docker cgroup might call
submit_bio() in the following path under their own identity; this
docker blkcg had accumulated large amounts of data (e.g., 20GB), so
the log got blocked.
------------------------------------------------------------------------------------------------------------------------------------------------
xxx-agent 22825 [001] 78772.391023: probe:submit_bio: (ffffffff812f70c0) bio=ffff880fb6e1f300 bi_css=0
                  4f70c1 submit_bio
(/usr/lib/debug/lib/modules/3.10.0-514.16.1.el7.cgwb.rdt.x86_64/vmlinux)
                   4e440 xfs_buf_submit ([xfs])
                   6e59b xlog_bdstrat ([xfs])
                   704c5 xlog_sync ([xfs])
                   70683 xlog_state_release_iclog ([xfs])
                   71917 _xfs_log_force_lsn ([xfs])
                   5249e xfs_file_fsync ([xfs])
                  4374d5 do_fsync
(/usr/lib/debug/lib/modules/3.10.0-514.16.1.el7.cgwb.rdt.x86_64/vmlinux)
                  4377a0 sys_fsync
(/usr/lib/debug/lib/modules/3.10.0-514.16.1.el7.cgwb.rdt.x86_64/vmlinux)
                  8a0ac9 system_call
(/usr/lib/debug/lib/modules/3.10.0-514.16.1.el7.cgwb.rdt.x86_64/vmlinux)
                  1df9c4 [unknown]
(/home/xxx/xxx-agent/data/install/xxx-agent/16/xxx-agent)

Meanwhile some other container without bps limits, or kworkers from
the root cgroup, might get stuck like this:
----------------------------------------------------------------------------------------------------------------------------------------------
[79183.692355] INFO: task xxx:33434 blocked for more than 120 seconds.
[79183.730997] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[79183.778095] consul          D ffff880169958000     0 33434  24659 0x00000000
[79183.820478]  ffff880bce3e7e38 0000000000000082 ffff880c7af8bec0
ffff880bce3e7fd8
[79183.865232]  ffff880bce3e7fd8 ffff880bce3e7fd8 ffff880c7af8bec0
ffff880c7af8bec0
[79183.909997]  ffff880169bfa928 ffff880bce3e7ef4 0000000000000000
ffff880c7af8bec0
[79183.954738] Call Trace:
[79183.969516]  [<ffffffff81695ac9>] schedule+0x29/0x70
[79183.999357]  [<ffffffffa073497a>] _xfs_log_force_lsn+0x2fa/0x350 [xfs]
[79184.038497]  [<ffffffff810c84c0>] ? wake_up_state+0x20/0x20
[79184.071941]  [<ffffffffa071542e>] xfs_file_fsync+0x10e/0x1e0 [xfs]
[79184.109010]  [<ffffffff812374d5>] do_fsync+0x65/0xa0
[79184.138826]  [<ffffffff8120368f>] ? SyS_write+0x9f/0xe0
[79184.170182]  [<ffffffff812377a0>] SyS_fsync+0x10/0x20
[79184.200514]  [<ffffffff816a0ac9>] system_call_fastpath+0x16/0x1b
[79184.236613] INFO: task xxx:38778 blocked for more than 120 seconds.
[79184.275238] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[79184.322329] consul          D ffff880fe728af10     0 38778  32505 0x00000000
[79184.364694]  ffff881a18fe7c10 0000000000000082 ffff88196a74bec0
ffff881a18fe7fd8
[79184.409424]  ffff881a18fe7fd8 ffff881a18fe7fd8 ffff88196a74bec0
ffff881a18fe7d60
[79184.454167]  ffff881a18fe7d68 7fffffffffffffff ffff88196a74bec0
ffffffffffffffff
[79184.498909] Call Trace:
[79184.513673]  [<ffffffff81695ac9>] schedule+0x29/0x70
[79184.543475]  [<ffffffff81693529>] schedule_timeout+0x239/0x2c0
[79184.578468]  [<ffffffff810c4f99>] ? ttwu_do_wakeup+0x19/0xd0
[79184.612419]  [<ffffffff810c512d>] ? ttwu_do_activate.constprop.91+0x5d/0x70
[79184.654148]  [<ffffffff810c8308>] ? try_to_wake_up+0x1c8/0x320
[79184.689141]  [<ffffffff81695ea6>] wait_for_completion+0x116/0x170
[79184.725688]  [<ffffffff810c84c0>] ? wake_up_state+0x20/0x20
[79184.759126]  [<ffffffff810ac67c>] flush_work+0xfc/0x1c0
[79184.790481]  [<ffffffff810a85f0>] ? move_linked_works+0x90/0x90
[79184.826020]  [<ffffffffa073622a>] xlog_cil_force_lsn+0x8a/0x210 [xfs]
[79184.864644]  [<ffffffff811850df>] ? __filemap_fdatawrite_range+0xbf/0xe0
[79184.904838]  [<ffffffffa073470f>] _xfs_log_force_lsn+0x8f/0x350 [xfs]
[79184.943467]  [<ffffffff8118531f>] ? filemap_fdatawait_range+0x1f/0x30
[79184.982086]  [<ffffffff81694c42>] ? down_read+0x12/0x30
[79185.013466]  [<ffffffffa071542e>] xfs_file_fsync+0x10e/0x1e0 [xfs]
[79185.050533]  [<ffffffff812374d5>] do_fsync+0x65/0xa0
[79185.080335]  [<ffffffff8120368f>] ? SyS_write+0x9f/0xe0
[79185.111691]  [<ffffffff812377a0>] SyS_fsync+0x10/0x20
[79185.141986]  [<ffffffff816a0ac9>] system_call_fastpath+0x16/0x1b
[79185.178051] INFO: task kworker/1:0:12593 blocked for more than 120 seconds.
[79185.219258] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[79185.266353] kworker/1:0     D ffff8801698caf10     0 12593      2 0x00000000
[79185.308743] Workqueue: xfs-cil/sdb1 xlog_cil_push_work [xfs]
[79185.342757]  ffff880d00e3bbe8 0000000000000046 ffff880169732f10
ffff880d00e3bfd8
[79185.387475]  ffff880d00e3bfd8 ffff880d00e3bfd8 ffff880169732f10
ffff880169bfa800
[79185.432197]  ffff880169bfa928 ffff880bd3b4a5c0 ffff880169732f10
ffff880169bfa900
[79185.476946] Call Trace:
[79185.491725]  [<ffffffff81695ac9>] schedule+0x29/0x70
[79185.521548]  [<ffffffffa073377a>] xlog_state_get_iclog_space+0x10a/0x310 [xfs]
[79185.565010]  [<ffffffff810c84c0>] ? wake_up_state+0x20/0x20
[79185.598476]  [<ffffffffa0733c99>] xlog_write+0x1a9/0x720 [xfs]
[79185.633477]  [<ffffffffa07359c9>] xlog_cil_push+0x239/0x420 [xfs]
[79185.670037]  [<ffffffffa0735bc5>] xlog_cil_push_work+0x15/0x20 [xfs]
[79185.708138]  [<ffffffff810ab45b>] process_one_work+0x17b/0x470
[79185.743124]  [<ffffffff810ac413>] worker_thread+0x2a3/0x410
[79185.776542]  [<ffffffff810ac170>] ? rescuer_thread+0x460/0x460
[79185.811546]  [<ffffffff810b3a4f>] kthread+0xcf/0xe0
[79185.840835]  [<ffffffff810b3980>] ? kthread_create_on_node+0x140/0x140
[79185.879982]  [<ffffffff816a0a18>] ret_from_fork+0x58/0x90
[79185.912391]  [<ffffffff810b3980>] ? kthread_create_on_node+0x140/0x140

I'm not familiar with XFS, but it seems log bios can get partially
stuck in throttled cgroups, leaving other innocent groups waiting for
their completion. To cope with this we bypassed REQ_META log bios in
blk_throtl_bio(), roughly as sketched below.
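
A minimal sketch of that workaround (hedged: ours is a 3.10-based tree
where the flag lives in bio->bi_rw rather than bio->bi_opf, and this is
not the exact code we carry):

	/* at the top of blk_throtl_bio(), before any blkg accounting:
	 * never queue metadata/log bios in a throttled group */
	if (bio->bi_opf & REQ_META)
		return false;	/* false == dispatch immediately */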

So I'm just writing to make sure the issue is known. It would be great
if your other patch already fixes this!

Thanks,
Benlong


2018-03-23 5:11 GMT+08:00 Shaohua Li <shli@kernel.org>:
> From: Shaohua Li <shli@fb.com>
>
> Basically this is a copy of commit 001e4a8775f6(ext4: implement cgroup
> writeback support). Tested with a fio test, verified writeback is
> throttled against cgroup io.max write bandwidth, also verified moving
> the fio test to another cgroup and the writeback is throttled against
> new cgroup setting.
>
> This only controls the file data write for cgroup. For metadata, since
> xfs dispatches the metadata write in specific threads, it's possible low
> prio app's metadata could harm high prio app's metadata. A while back,
> Tejun has a patch to force metadata belonging to root cgroup for btrfs.
> I had a similiar patch for xfs too. But Since Tejun's patch isn't in
> upstream, I'll delay post the xfs patch.
>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Darrick J. Wong <darrick.wong@oracle.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  fs/xfs/xfs_aops.c  | 13 +++++++++++--
>  fs/xfs/xfs_super.c |  1 +
>  2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 19eadc8..5f70584 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -589,7 +589,8 @@ xfs_alloc_ioend(
>         struct inode            *inode,
>         unsigned int            type,
>         xfs_off_t               offset,
> -       struct buffer_head      *bh)
> +       struct buffer_head      *bh,
> +       struct writeback_control *wbc)
>  {
>         struct xfs_ioend        *ioend;
>         struct bio              *bio;
> @@ -606,6 +607,8 @@ xfs_alloc_ioend(
>         INIT_WORK(&ioend->io_work, xfs_end_io);
>         ioend->io_append_trans = NULL;
>         ioend->io_bio = bio;
> +       /* attach new bio to its cgroup */
> +       wbc_init_bio(wbc, bio);
>         return ioend;
>  }
>
> @@ -633,6 +636,8 @@ xfs_chain_bio(
>         ioend->io_bio->bi_write_hint = ioend->io_inode->i_write_hint;
>         submit_bio(ioend->io_bio);
>         ioend->io_bio = new;
> +       /* attach new bio to its cgroup */
> +       wbc_init_bio(wbc, new);
>  }
>
>  /*
> @@ -656,7 +661,8 @@ xfs_add_to_ioend(
>             offset != wpc->ioend->io_offset + wpc->ioend->io_size) {
>                 if (wpc->ioend)
>                         list_add(&wpc->ioend->io_list, iolist);
> -               wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset, bh);
> +               wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset,
> +                                            bh, wbc);
>         }
>
>         /*
> @@ -666,6 +672,9 @@ xfs_add_to_ioend(
>         while (xfs_bio_add_buffer(wpc->ioend->io_bio, bh) != bh->b_size)
>                 xfs_chain_bio(wpc->ioend, wbc, bh);
>
> +       /* Charge write size to its cgroup for cgroup switching track */
> +       wbc_account_io(wbc, bh->b_page, bh->b_size);
> +
>         wpc->ioend->io_size += bh->b_size;
>         wpc->last_block = bh->b_blocknr;
>         xfs_start_buffer_writeback(bh);
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 951271f..95c2d3d 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1666,6 +1666,7 @@ xfs_fs_fill_super(
>         sb->s_max_links = XFS_MAXLINK;
>         sb->s_time_gran = 1;
>         set_posix_acl_flag(sb);
> +       sb->s_iflags |= SB_I_CGROUPWB;
>
>         /* version 5 superblocks support inode version counters. */
>         if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
> --
> 2.9.5
>
Brian Foster March 23, 2018, 2:37 p.m. UTC | #3
On Thu, Mar 22, 2018 at 02:11:45PM -0700, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
> 
> Basically this is a copy of commit 001e4a8775f6(ext4: implement cgroup
> writeback support). Tested with a fio test, verified writeback is
> throttled against cgroup io.max write bandwidth, also verified moving
> the fio test to another cgroup and the writeback is throttled against
> new cgroup setting.
> 
> This only controls the file data write for cgroup. For metadata, since
> xfs dispatches the metadata write in specific threads, it's possible low
> prio app's metadata could harm high prio app's metadata. A while back,
> Tejun has a patch to force metadata belonging to root cgroup for btrfs.
> I had a similiar patch for xfs too. But Since Tejun's patch isn't in
> upstream, I'll delay post the xfs patch.
> 
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Darrick J. Wong <darrick.wong@oracle.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  fs/xfs/xfs_aops.c  | 13 +++++++++++--
>  fs/xfs/xfs_super.c |  1 +
>  2 files changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 19eadc8..5f70584 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -589,7 +589,8 @@ xfs_alloc_ioend(
>  	struct inode		*inode,
>  	unsigned int		type,
>  	xfs_off_t		offset,
> -	struct buffer_head	*bh)
> +	struct buffer_head	*bh,
> +	struct writeback_control *wbc)
>  {
>  	struct xfs_ioend	*ioend;
>  	struct bio		*bio;
> @@ -606,6 +607,8 @@ xfs_alloc_ioend(
>  	INIT_WORK(&ioend->io_work, xfs_end_io);
>  	ioend->io_append_trans = NULL;
>  	ioend->io_bio = bio;
> +	/* attach new bio to its cgroup */
> +	wbc_init_bio(wbc, bio);

Just a nit, but can we move the wbc_init_bio() calls to earlier in the
function where we setup/init the bio? (Same comment for
xfs_chain_bio()). Otherwise this looks fine to me and passes the posted
xfstests test.
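
Something like this, perhaps (a rough sketch against the 4.16-era
xfs_alloc_ioend(); the surrounding setup lines are paraphrased, so
check the actual function):

	bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_PAGES, xfs_ioend_bioset);
	bio_set_dev(bio, bh->b_bdev);
	bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
	/* attach the new bio to its cgroup together with the rest of
	 * the bio setup, rather than after the ioend fields */
	wbc_init_bio(wbc, bio);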

Brian

>  	return ioend;
>  }
>  
> @@ -633,6 +636,8 @@ xfs_chain_bio(
>  	ioend->io_bio->bi_write_hint = ioend->io_inode->i_write_hint;
>  	submit_bio(ioend->io_bio);
>  	ioend->io_bio = new;
> +	/* attach new bio to its cgroup */
> +	wbc_init_bio(wbc, new);
>  }
>  
>  /*
> @@ -656,7 +661,8 @@ xfs_add_to_ioend(
>  	    offset != wpc->ioend->io_offset + wpc->ioend->io_size) {
>  		if (wpc->ioend)
>  			list_add(&wpc->ioend->io_list, iolist);
> -		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset, bh);
> +		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset,
> +					     bh, wbc);
>  	}
>  
>  	/*
> @@ -666,6 +672,9 @@ xfs_add_to_ioend(
>  	while (xfs_bio_add_buffer(wpc->ioend->io_bio, bh) != bh->b_size)
>  		xfs_chain_bio(wpc->ioend, wbc, bh);
>  
> +	/* Charge write size to its cgroup for cgroup switching track */
> +	wbc_account_io(wbc, bh->b_page, bh->b_size);
> +
>  	wpc->ioend->io_size += bh->b_size;
>  	wpc->last_block = bh->b_blocknr;
>  	xfs_start_buffer_writeback(bh);
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 951271f..95c2d3d 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1666,6 +1666,7 @@ xfs_fs_fill_super(
>  	sb->s_max_links = XFS_MAXLINK;
>  	sb->s_time_gran = 1;
>  	set_posix_acl_flag(sb);
> +	sb->s_iflags |= SB_I_CGROUPWB;
>  
>  	/* version 5 superblocks support inode version counters. */
>  	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
> -- 
> 2.9.5
> 
Dave Chinner March 25, 2018, 9:59 p.m. UTC | #4
On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> Hi Shaohua and XFS,
> 
> May I ask how are we gonna handle REQ_META issued from XFS? As you
> mentioned about charging to root cgroup (also in an earlier email
> discussion), and seems the 4.16.0-rc6 code is not handling it
> separately.
> 
> In our case to support XFS cgroup writeback control, which was ported
> and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> trouble. Threads from throttled docker might submit_bio in following
> path by its own identity, this docker blkcg accumulated large amounts
> of data (e.g., 20GB), thus such log gets blocked.

And thus displaying the reason why I originally refused to merge
this code until regression tests were added to fstests to exercise
these sorts of issues. This stuff adds new internal filesystem IO
ordering constraints, so we need tests that exercise it and ensure
we don't accidentally break it in future.


> Not familiar with XFS, but seems log bios are partially stuck in
> throttled cgroups, leaving other innocent groups waiting for
> completion. To cope with this we bypassed REQ_META log bios in
> blk_throtl_bio().

Yup, the log is global and should not be throttled. Metadata is less
obvious as to what the correct thing to do is, because writes are
always done from internal kernel threads but reads are done from a
mix of kernel threads, user cgroup contexts and user data IO
completions. Hence there are situations where metadata reads may
need to be throttled because they are cgroup context only (e.g.
filesystem directory traversal) but others where reads should not
be throttled because they have global context (e.g. inside a
transaction when other buffers are already locked).

Getting this right and keeping it working requires regression tests
that get run on every release, whether it be upstream or distro
kernels, and that means we need tests in fstests to cover cgroup IO
control....

Cheers,

Dave.
Brian Foster March 26, 2018, 4:28 p.m. UTC | #5
On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> > Hi Shaohua and XFS,
> > 
> > May I ask how are we gonna handle REQ_META issued from XFS? As you
> > mentioned about charging to root cgroup (also in an earlier email
> > discussion), and seems the 4.16.0-rc6 code is not handling it
> > separately.
> > 
> > In our case to support XFS cgroup writeback control, which was ported
> > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> > trouble. Threads from throttled docker might submit_bio in following
> > path by its own identity, this docker blkcg accumulated large amounts
> > of data (e.g., 20GB), thus such log gets blocked.
> 
> And thus displaying the reason why I originally refused to merge
> this code until regression tests were added to fstests to exercise
> these sorts of issues. This stuff adds new internal filesystem IO
> ordering constraints, so we need tests that exercise it and ensure
> we don't accidentally break it in future.
> 

Hmm, but if the user issues fsync from the throttled cgroup then won't
that throttling occur today, regardless of cgroup aware writeback? My
understanding is that cgawb just accurately accounts writeback I/Os to
the owner of the cached pages. IOW, if the buffered writer and fsync
call are in the same throttled cgroup, then the throttling works just as
it would with cgawb and the writer being in a throttled cgroup.

So ISTM that this is an independent problem. What am I missing?

Shaohua,

Do you have a reference to the older metadata related patch mentioned in
the commit log that presumably addressed this?

Brian

> 
> > Not familiar with XFS, but seems log bios are partially stuck in
> > throttled cgroups, leaving other innocent groups waiting for
> > completion. To cope with this we bypassed REQ_META log bios in
> > blk_throtl_bio().
> 
> Yup, the log is global and should not be throttled. Metadata is less
> obvious as to what the correct thing to do is, because writes are
> always done from internal kernel threads but reads are done from a
> mix of kernel threads, user cgroup contexts and user data IO
> completions. Hence there are situations where metadata reads may
> need to be throttled because they are cgroup context only (e.g.
> filesystem directory traversal) but others where reads should not
> be throttled because they have global context (e.g. inside a
> transaction when other buffers are already locked).
> 
> Getting this right and keeping it working requires regression tests
> that get run on every release, whether it be upstream or distro
> kernels, and that means we need tests in fstests to cover cgroup IO
> control....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Shaohua Li March 27, 2018, 12:55 a.m. UTC | #6
On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote:
> On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> > > Hi Shaohua and XFS,
> > > 
> > > May I ask how are we gonna handle REQ_META issued from XFS? As you
> > > mentioned about charging to root cgroup (also in an earlier email
> > > discussion), and seems the 4.16.0-rc6 code is not handling it
> > > separately.
> > > 
> > > In our case to support XFS cgroup writeback control, which was ported
> > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> > > trouble. Threads from throttled docker might submit_bio in following
> > > path by its own identity, this docker blkcg accumulated large amounts
> > > of data (e.g., 20GB), thus such log gets blocked.
> > 
> > And thus displaying the reason why I originally refused to merge
> > this code until regression tests were added to fstests to exercise
> > these sorts of issues. This stuff adds new internal filesystem IO
> > ordering constraints, so we need tests that exercise it and ensure
> > we don't accidentally break it in future.
> > 
> 
> Hmm, but if the user issues fsync from the throttled cgroup then won't
> that throttling occur today, regardless of cgroup aware writeback? My
> understanding is that cgawb just accurately accounts writeback I/Os to
> the owner of the cached pages. IOW, if the buffered writer and fsync
> call are in the same throttled cgroup, then the throttling works just as
> it would with cgawb and the writer being in a throttled cgroup.
> 
> So ISTM that this is an independent problem. What am I missing?
> 
> Shaohua,
> 
> Do you have a reference to the older metadata related patch mentioned in
> the commit log that presumably addressed this?

The problem is priority inversion. Say you do an fsync in a low-prio
cgroup: the IO will be submitted with low prio. Now you do an fsync in
a high-prio cgroup: that cgroup will wait for the fsync IO to finish,
which was already submitted by the low-prio cgroup and runs at low
prio. This makes the high-prio cgroup run slow. The proposed patch
forces all metadata writes to be submitted from the root cgroup
regardless of which task submitted them, which fixes this issue.
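
Roughly, the idea is (a hedged sketch of the approach, not the actual
posted patch; bio_associate_blkcg() and blkcg_root_css are the
4.16-era interfaces):

	/* pin metadata bios to the root cgroup at submission time so a
	 * low-prio submitter can't slow down waiters in other groups */
	if (bio->bi_opf & REQ_META)
		bio_associate_blkcg(bio, blkcg_root_css);
	submit_bio(bio);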

Thanks,
Shaohua
Brian Foster March 27, 2018, 11:36 a.m. UTC | #7
On Mon, Mar 26, 2018 at 05:55:26PM -0700, Shaohua Li wrote:
> On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote:
> > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> > > > Hi Shaohua and XFS,
> > > > 
> > > > May I ask how are we gonna handle REQ_META issued from XFS? As you
> > > > mentioned about charging to root cgroup (also in an earlier email
> > > > discussion), and seems the 4.16.0-rc6 code is not handling it
> > > > separately.
> > > > 
> > > > In our case to support XFS cgroup writeback control, which was ported
> > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> > > > trouble. Threads from throttled docker might submit_bio in following
> > > > path by its own identity, this docker blkcg accumulated large amounts
> > > > of data (e.g., 20GB), thus such log gets blocked.
> > > 
> > > And thus displaying the reason why I originally refused to merge
> > > this code until regression tests were added to fstests to exercise
> > > these sorts of issues. This stuff adds new internal filesystem IO
> > > ordering constraints, so we need tests that exercise it and ensure
> > > we don't accidentally break it in future.
> > > 
> > 
> > Hmm, but if the user issues fsync from the throttled cgroup then won't
> > that throttling occur today, regardless of cgroup aware writeback? My
> > understanding is that cgawb just accurately accounts writeback I/Os to
> > the owner of the cached pages. IOW, if the buffered writer and fsync
> > call are in the same throttled cgroup, then the throttling works just as
> > it would with cgawb and the writer being in a throttled cgroup.
> > 
> > So ISTM that this is an independent problem. What am I missing?
> > 
> > Shaohua,
> > 
> > Do you have a reference to the older metadata related patch mentioned in
> > the commit log that presumably addressed this?
> 
> The problem is about priority reversion. Say you do a fsync in a low prio
> cgroup, the IO will be submitted with low prio. Now you do a fsync in a high
> prio cgroup, the cgroup will wait for fsync IO finished, which is already
> submitted by the low prio cgroup and run in low prio. This makes the high prio
> cgroup run slow. The proposed patch is to force all metadata write submitted
> from root cgroup regardless which task submitted, which can fix this issue.
> 

Right, but it seems to me that this can happen with or without cgroup
aware writeback. This patch just introduces the final bits required to
carry the page owner from however it is tracked in the writeback
machinery to the bio submitted by the fs. It doesn't
introduce/enable/implement I/O throttling itself, which is already in
place and usable (outside of the buffered write page owner problem
fixed by this patch), right?

So without this patch, if a task in throttled cgroup A does a bunch of
buffered writes and calls fsync, then another task in unthrottled cgroup
B calls fsync, aren't we (XFS) susceptible to priority inversion via
these same log I/O serialization points? If not, then what am I missing?

I'm not saying this isn't a problem that needs fixing, I just want to
make sure I understand the fundamental problem(s), what this cgawb patch
actually does and doesn't do and whether there is a logical dependency
between this patch and the proposed metadata filtering patch.

Brian

> Thanks,
> Shaohua
Brian Foster March 27, 2018, 11:50 a.m. UTC | #8
On Tue, Mar 27, 2018 at 06:54:27PM +0800, 张本龙 wrote:
> 2018-03-27 0:28 GMT+08:00 Brian Foster <bfoster@redhat.com>:
> 
> > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> > > > Hi Shaohua and XFS,
> > > >
> > > > May I ask how are we gonna handle REQ_META issued from XFS? As you
> > > > mentioned about charging to root cgroup (also in an earlier email
> > > > discussion), and seems the 4.16.0-rc6 code is not handling it
> > > > separately.
> > > >
> > > > In our case to support XFS cgroup writeback control, which was ported
> > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> > > > trouble. Threads from throttled docker might submit_bio in following
> > > > path by its own identity, this docker blkcg accumulated large amounts
> > > > of data (e.g., 20GB), thus such log gets blocked.
> > >
> > > And thus displaying the reason why I originally refused to merge
> > > this code until regression tests were added to fstests to exercise
> > > these sorts of issues. This stuff adds new internal filesystem IO
> > > ordering constraints, so we need tests that exercise it and ensure
> > > we don't accidentally break it in future.
> > >
> >
> > Hmm, but if the user issues fsync from the throttled cgroup then won't
> > that throttling occur today, regardless of cgroup aware writeback? My
> > understanding is that cgawb just accurately accounts writeback I/Os to
> > the owner of the cached pages. IOW, if the buffered writer and fsync
> > call are in the same throttled cgroup, then the throttling works just as
> > it would with cgawb and the writer being in a throttled cgroup.
> >
> > So ISTM that this is an independent problem. What am I missing?
> 
> 
> Yeah throttling would work for fsync traffic even without cgwb, but that's
> because the way bio_blkcg() is implemented: charging to bio->bi_css when
> set but submitter blkcg when not. So without cgwb normal traffic from
> writeback kworkers are attributed to blkcg_root, which is normally
> unlimited.
> 

Ok, that's what I thought. Your report shows everything backed up on an
fsync call, however, so I didn't really see how it related to this
patch. So unless Shaohua explains otherwise, my understanding is that
this is not necessarily caused by cgawb, but rather this is a separate
but similar gap in the broader block I/O throttling mechanism that also
needs to be addressed before throttling is usable (in practice) by
filesystems. You presumably deal with this by filtering out metadata
bios, while the patch Shaohua mentions deals with it by promoting
metadata bios to the root cgroup.

Brian

> And fsync might yield kernel INFO messages when bps * 120 < dirty data,
>> giving something like below. That's Ok though, since throttling and
>> all types of sync calls are contradictory in nature.
>> ---------------------------------------------------------------
> [Mar 2 14:27] INFO: task dd:40794 blocked for more than 120 seconds.
> [  +0.000747]       Not tainted 4.15.7 #1
> [  +0.000610] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [  +0.000614] dd              D    0 40794  40793 0x10000000
> [  +0.000610] Call Trace:
> [  +0.000619]  ? __schedule+0x28d/0x870
> [  +0.000611]  schedule+0x32/0x80
> [  +0.000615]  io_schedule+0x12/0x40
> [  +0.000612]  wait_on_page_bit_common+0xff/0x1a0
> [  +0.000609]  ? page_cache_tree_insert+0xd0/0xd0
> [  +0.000614]  __filemap_fdatawait_range+0xe2/0x150
> [  +0.000614]  file_write_and_wait_range+0x62/0xa0
> [  +0.000664]  xfs_file_fsync+0x65/0x1d0 [xfs]
> [  +0.000621]  do_fsync+0x38/0x60
> [  +0.000610]  SyS_fsync+0xc/0x10
> [  +0.000616]  do_syscall_64+0x6e/0x1a0
> [  +0.000612]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [  +0.000610] RIP: 0033:0x7f8131231020
> [  +0.000604] RSP: 002b:00007fff81c04f48 EFLAGS: 00000246 ORIG_RAX:
> 000000000000004a
> [  +0.000610] RAX: ffffffffffffffda RBX: 00007f8131af9690 RCX:
> 00007f8131231020
> [  +0.000619] RDX: 0000000004000000 RSI: 0000000000000000 RDI:
> 0000000000000001
> [  +0.000626] RBP: 0000000000000020 R08: 00000000ffffffff R09:
> 0000000000000000
> [  +0.000618] R10: 00007fff81c04d10 R11: 0000000000000246 R12:
> 0000000000000020
> [  +0.000620] R13: 0000000000000000 R14: 00007fff81c054cf R15:
> 00007fff81c05198
> 
> 
> > Shaohua,
> >
> > Do you have a reference to the older metadata related patch mentioned in
> > the commit log that presumably addressed this?
> >
> > Brian
> >
> > >
> > > > Not familiar with XFS, but seems log bios are partially stuck in
> > > > throttled cgroups, leaving other innocent groups waiting for
> > > > completion. To cope with this we bypassed REQ_META log bios in
> > > > blk_throtl_bio().
> > >
> > > Yup, the log is global and should not be throttled. Metadata is less
> > > obvious as to what the correct thing to do is, because writes are
> > > always done from internal kernel threads but reads are done from a
> > > mix of kernel threads, user cgroup contexts and user data IO
> > > completions. Hence there are situations where metadata reads may
> > > need to be throttled because they are cgroup context only (e.g.
> > > filesystem directory traversal) but others where reads should not
> > > be throttled because they have global context (e.g. inside a
> > > transaction when other buffers are already locked).
> > >
> 
> 
> Thanks for the explanation Dave. So there are both read and write xfs_buf,
> and reads really seem to be complex then.
> 
> 
> > > Getting this right and keeping it working requires regression tests
> > > that get run on every release, whether it be upstream or distro
> > > kernels, and that means we need tests in fstests to cover cgroup IO
> > > control....
> > >
> > > Cheers,
> > >
> > > Dave.
> > > --
> > > Dave Chinner
> > > david@fromorbit.com
Dave Chinner March 27, 2018, 9:56 p.m. UTC | #9
On Tue, Mar 27, 2018 at 07:36:48AM -0400, Brian Foster wrote:
> On Mon, Mar 26, 2018 at 05:55:26PM -0700, Shaohua Li wrote:
> > On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote:
> > > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> > > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> > > > > Hi Shaohua and XFS,
> > > > > 
> > > > > May I ask how are we gonna handle REQ_META issued from XFS? As you
> > > > > mentioned about charging to root cgroup (also in an earlier email
> > > > > discussion), and seems the 4.16.0-rc6 code is not handling it
> > > > > separately.
> > > > > 
> > > > > In our case to support XFS cgroup writeback control, which was ported
> > > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> > > > > trouble. Threads from throttled docker might submit_bio in following
> > > > > path by its own identity, this docker blkcg accumulated large amounts
> > > > > of data (e.g., 20GB), thus such log gets blocked.
> > > > 
> > > > And thus displaying the reason why I originally refused to merge
> > > > this code until regression tests were added to fstests to exercise
> > > > these sorts of issues. This stuff adds new internal filesystem IO
> > > > ordering constraints, so we need tests that exercise it and ensure
> > > > we don't accidentally break it in future.
> > > > 
> > > 
> > > Hmm, but if the user issues fsync from the throttled cgroup then won't
> > > that throttling occur today, regardless of cgroup aware writeback? My
> > > understanding is that cgawb just accurately accounts writeback I/Os to
> > > the owner of the cached pages. IOW, if the buffered writer and fsync
> > > call are in the same throttled cgroup, then the throttling works just as
> > > it would with cgawb and the writer being in a throttled cgroup.
> > > 
> > > So ISTM that this is an independent problem. What am I missing?
> > > 
> > > Shaohua,
> > > 
> > > Do you have a reference to the older metadata related patch mentioned in
> > > the commit log that presumably addressed this?
> > 
> > The problem is about priority reversion. Say you do a fsync in a low prio
> > cgroup, the IO will be submitted with low prio. Now you do a fsync in a high
> > prio cgroup, the cgroup will wait for fsync IO finished, which is already
> > submitted by the low prio cgroup and run in low prio. This makes the high prio
> > cgroup run slow. The proposed patch is to force all metadata write submitted
> > from root cgroup regardless which task submitted, which can fix this issue.
> > 
> 
> Right, but it seems to me that this can happen with or without cgroup
> aware writeback. This patch just introduces the final bits required to
> carry the page owner from however it is tracked in the writeback machine
> to the bio submitted by the fs. It doesn't introduce/enable/implement
> I/O throttling itself, which is already in place and usable (outside of
> the buffered write page owner problem fixed by this patch), right?
> 
> So without this patch, if a task in throttled cgroup A does a bunch of
> buffered writes and calls fsync, then another task in unthrottled cgroup
> B calls fsync, aren't we (XFS) susceptible to priority inversion via
> these same log I/O serialization points? If not, then what am I missing?

Well, I was originally told that bios sent by a filesystem without
cgroup info would be accounted to the root cgroup and hence not
throttled at all, i.e. metadata and any untagged data IO. In that
case there should be no problems whatsoever, as all XFS IO would be
issued at the same priority.

However, given the way people are talking about needing to bypass
block cgroup throttling via REQ_META hacks, I'm guessing that it
doesn't actually work that way, i.e. somewhere in the block layer the
current task's cgroup gets attached to bios without existing cgroup
information, and so we end up with priority inversion problems
whenever cgroups and throttling are in use, regardless of whether the
filesystem supports it or not....

Ah, yeah, the task cgroup gets added through this path:

submit_bio
  generic_make_request
    generic_make_request_checks
      blkcg_bio_issue_check
        bio_blkcg
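
where bio_blkcg() falls back to the submitting task's cgroup when the
bio carries none (paraphrased from the 4.16-era
include/linux/blk-cgroup.h):

	static inline struct blkcg *bio_blkcg(struct bio *bio)
	{
		/* use the cgroup attached to the bio, if any... */
		if (bio && bio->bi_css)
			return css_to_blkcg(bio->bi_css);
		/* ...otherwise fall back to the current task's cgroup */
		return css_to_blkcg(task_css(current, io_cgrp_id));
	}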

> I'm not saying this isn't a problem that needs fixing, I just want to
> make sure I understand the fundamental problem(s), what this cgawb patch
> actually does and doesn't do and whether there is a logical dependency
> between this patch and the proposed metadata filtering patch.

Seems like there's a lot more work to be done to make this stuff
work reliably than anyone has been telling us. Without regression
tests, we're never going to know if it actually works or not....

Cheers,

Dave.
张本龙 March 28, 2018, 4:37 a.m. UTC | #10
2018-03-27 19:36 GMT+08:00 Brian Foster <bfoster@redhat.com>:
> On Mon, Mar 26, 2018 at 05:55:26PM -0700, Shaohua Li wrote:
>> On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote:
>> > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
>> > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
>> > > > Hi Shaohua and XFS,
>> > > >
>> > > > May I ask how are we gonna handle REQ_META issued from XFS? As you
>> > > > mentioned about charging to root cgroup (also in an earlier email
>> > > > discussion), and seems the 4.16.0-rc6 code is not handling it
>> > > > separately.
>> > > >
>> > > > In our case to support XFS cgroup writeback control, which was ported
>> > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
>> > > > trouble. Threads from throttled docker might submit_bio in following
>> > > > path by its own identity, this docker blkcg accumulated large amounts
>> > > > of data (e.g., 20GB), thus such log gets blocked.
>> > >
>> > > And thus displaying the reason why I originally refused to merge
>> > > this code until regression tests were added to fstests to exercise
>> > > these sorts of issues. This stuff adds new internal filesystem IO
>> > > ordering constraints, so we need tests that exercise it and ensure
>> > > we don't accidentally break it in future.
>> > >
>> >
>> > Hmm, but if the user issues fsync from the throttled cgroup then won't
>> > that throttling occur today, regardless of cgroup aware writeback? My
>> > understanding is that cgawb just accurately accounts writeback I/Os to
>> > the owner of the cached pages. IOW, if the buffered writer and fsync
>> > call are in the same throttled cgroup, then the throttling works just as
>> > it would with cgawb and the writer being in a throttled cgroup.
>> >
>> > So ISTM that this is an independent problem. What am I missing?
>> >
>> > Shaohua,
>> >
>> > Do you have a reference to the older metadata related patch mentioned in
>> > the commit log that presumably addressed this?
>>
>> The problem is about priority reversion. Say you do a fsync in a low prio
>> cgroup, the IO will be submitted with low prio. Now you do a fsync in a high
>> prio cgroup, the cgroup will wait for fsync IO finished, which is already
>> submitted by the low prio cgroup and run in low prio. This makes the high prio
>> cgroup run slow. The proposed patch is to force all metadata write submitted
>> from root cgroup regardless which task submitted, which can fix this issue.
>>
>
> Right, but it seems to me that this can happen with or without cgroup
> aware writeback. This patch just introduces the final bits required to
> carry the page owner from however it is tracked in the writeback machine
> to the bio submitted by the fs. It doesn't introduce/enable/implement
> I/O throttling itself, which is already in place and usable (outside of
> the buffered write page owner problem fixed by this patch), right?
>
> So without this patch, if a task in throttled cgroup A does a bunch of
> buffered writes and calls fsync, then another task in unthrottled cgroup
> B calls fsync, aren't we (XFS) susceptible to priority inversion via
> these same log I/O serialization points? If not, then what am I missing?
>

Ok, first let's agree we're not talking about two groups fsyncing the
same file. What the original post demonstrates is that group-A MIGHT
submit the log in the fsync path via xlog_sync() -> xfs_buf_submit().
Threads from group-B are waiting for the log via flush_work(), and at
the same time so are kworkers running xlog_cil_push_work().

The problem is not fsync getting stuck on the target file. If it were,
group-B would be waiting on filemap_write_and_wait_range(), since
xfs_file_fsync() flushes the actual data before _xfs_log_force_lsn().

Here is the setup: group-A has 10 fio jobs, each running on a 20GB
file, plus some agents doing little IO; group-B just has the agents.
With this patch we set bps=20MB/s on A, so the large amount of fio
traffic blocks the group (note the fio jobs are not doing any fsync).
At this point, one agent does an fsync on a non-dirty file, effectively
skipping filemap_write_and_wait_range() and going straight to
xlog_sync(). Here we go: log bios are stuck in group-A, and group-B
and the kworkers wait indefinitely for log completion via various
paths. Without this patch, fio would write out at full disk speed, say
200MB/s, so dirty data would not accumulate in group-A.
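
For reference, the fsync path in question looks roughly like this (a
trimmed paraphrase of the 4.16-era xfs_file_fsync(), keeping only the
calls relevant here):

	STATIC int
	xfs_file_fsync(struct file *file, loff_t start, loff_t end,
		       int datasync)
	{
		...
		/* clean file: nothing to write, returns almost at once */
		error = file_write_and_wait_range(file, start, end);
		...
		if (lsn) {
			/* may reach xlog_sync() -> xfs_buf_submit() and
			 * issue log bios in the caller's cgroup */
			error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC,
						   &log_flushed);
			...
		}
		...
	}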

> I'm not saying this isn't a problem that needs fixing, I just want to
> make sure I understand the fundamental problem(s), what this cgawb patch
> actually does and doesn't do and whether there is a logical dependency
> between this patch and the proposed metadata filtering patch.
>
> Brian
>
>> Thanks,
>> Shaohua
张本龙 March 28, 2018, 9:55 a.m. UTC | #11
2018-03-27 19:50 GMT+08:00 Brian Foster <bfoster@redhat.com>:
> On Tue, Mar 27, 2018 at 06:54:27PM +0800, 张本龙 wrote:
>> 2018-03-27 0:28 GMT+08:00 Brian Foster <bfoster@redhat.com>:
>>
>> > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
>> > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
>> > > > Hi Shaohua and XFS,
>> > > >
>> > > > May I ask how are we gonna handle REQ_META issued from XFS? As you
>> > > > mentioned about charging to root cgroup (also in an earlier email
>> > > > discussion), and seems the 4.16.0-rc6 code is not handling it
>> > > > separately.
>> > > >
>> > > > In our case to support XFS cgroup writeback control, which was ported
>> > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
>> > > > trouble. Threads from throttled docker might submit_bio in following
>> > > > path by its own identity, this docker blkcg accumulated large amounts
>> > > > of data (e.g., 20GB), thus such log gets blocked.
>> > >
>> > > And thus displaying the reason why I originally refused to merge
>> > > this code until regression tests were added to fstests to exercise
>> > > these sorts of issues. This stuff adds new internal filesystem IO
>> > > ordering constraints, so we need tests that exercise it and ensure
>> > > we don't accidentally break it in future.
>> > >
>> >
>> > Hmm, but if the user issues fsync from the throttled cgroup then won't
>> > that throttling occur today, regardless of cgroup aware writeback? My
>> > understanding is that cgawb just accurately accounts writeback I/Os to
>> > the owner of the cached pages. IOW, if the buffered writer and fsync
>> > call are in the same throttled cgroup, then the throttling works just as
>> > it would with cgawb and the writer being in a throttled cgroup.
>> >
>> > So ISTM that this is an independent problem. What am I missing?
>>
>>
>> Yeah throttling would work for fsync traffic even without cgwb, but that's
>> because the way bio_blkcg() is implemented: charging to bio->bi_css when
>> set but submitter blkcg when not. So without cgwb normal traffic from
>> writeback kworkers are attributed to blkcg_root, which is normally
>> unlimited.
>>
>
> Ok, that's what I thought. Your report shows everything backed up on an
> fsync call, however, so I didn't really see how it related to this

Sorry for misleading, but the original report does not show things
backed up on fsync. It shows that fsync MIGHT submit the log.
--------------------------------------------------------------------------
xxx-agent 22825 [001] 78772.391023: probe:submit_bio:
                  4f70c1 submit_bio
                   4e440 xfs_buf_submit ([xfs])
                   6e59b xlog_bdstrat ([xfs])
                   704c5 xlog_sync ([xfs])
                   70683 xlog_state_release_iclog ([xfs])
                   71917 _xfs_log_force_lsn ([xfs])
                   5249e xfs_file_fsync ([xfs])
                  4374d5 do_fsync
                  4377a0 sys_fsync
                  8a0ac9 system_call
                  1df9c4 [unknown]
Please note this is NOT a kernel INFO message; it's a dynamic perf
probe trace showing who submits the log. Since the agent's docker
cgroup is now blocked by the other fio threads, the log gets stuck via
this path.

The following trace is a kernel INFO message: it shows another docker
cgroup waiting for log completion and timing out; the third trace, via
flush_work(), is similar. Again, I'm not completely sure about XFS
internals, but the traces and some code diving seem to suggest this.
-------------------------------------------------------------------------------------------------
[79183.692355] INFO: task xxx:33434 blocked for more than 120 seconds.
[79183.730997] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
......
[79183.954738] Call Trace:
[79183.969516]  [<ffffffff81695ac9>] schedule+0x29/0x70
                                          /*    inline xlog_wait here */
[79183.999357]  [<ffffffffa073497a>] _xfs_log_force_lsn+0x2fa/0x350 [xfs]
[79184.038497]  [<ffffffff810c84c0>] ? wake_up_state+0x20/0x20
[79184.071941]  [<ffffffffa071542e>] xfs_file_fsync+0x10e/0x1e0 [xfs]
[79184.109010]  [<ffffffff812374d5>] do_fsync+0x65/0xa0
[79184.138826]  [<ffffffff8120368f>] ? SyS_write+0x9f/0xe0
[79184.170182]  [<ffffffff812377a0>] SyS_fsync+0x10/0x20
[79184.200514]  [<ffffffff816a0ac9>] system_call_fastpath+0x16/0x1b

> patch. So unless Shaohua explains otherwise, my understanding is that
> this is not necessarily caused by cgawb, but rather this is a separate
> but similar gap in the broader block I/O throttling mechanism that also
> needs to be addressed before throttling is usable (in practice) by
> filesystems. You presumably deal with this by filtering out metadata
> bios, while the patch Shaohua mentions deals with it by promoting
> metadata bios to the root cgroup.

Either way would work, agreed. Though I wouldn't call it 'priority
inversion'; as Dave explained, it's more like 'the global log
mistakenly throttled'. The report is meant to provide some evidence
that this single patch is not enough for xfs writeback, and that we
should wait for the other patch from Shaohua.

>
> Brian
>
>> And fsync might yield kernel INFO messages when bps * 120 < dirty data,
>> giving something like below. That's Ok though, since throttling and all
>> types
>> of sync calls are contradictory in nature.
>> ------------------------------------------------------------
>> ---------------------------------
>> [Mar 2 14:27] INFO: task dd:40794 blocked for more than 120 seconds.
>> [  +0.000747]       Not tainted 4.15.7 #1
>> [  +0.000610] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [  +0.000614] dd              D    0 40794  40793 0x10000000
>> [  +0.000610] Call Trace:
>> [  +0.000619]  ? __schedule+0x28d/0x870
>> [  +0.000611]  schedule+0x32/0x80
>> [  +0.000615]  io_schedule+0x12/0x40
>> [  +0.000612]  wait_on_page_bit_common+0xff/0x1a0
>> [  +0.000609]  ? page_cache_tree_insert+0xd0/0xd0
>> [  +0.000614]  __filemap_fdatawait_range+0xe2/0x150
>> [  +0.000614]  file_write_and_wait_range+0x62/0xa0
>> [  +0.000664]  xfs_file_fsync+0x65/0x1d0 [xfs]
>> [  +0.000621]  do_fsync+0x38/0x60
>> [  +0.000610]  SyS_fsync+0xc/0x10
>> [  +0.000616]  do_syscall_64+0x6e/0x1a0
>> [  +0.000612]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>> [  +0.000610] RIP: 0033:0x7f8131231020
>> [  +0.000604] RSP: 002b:00007fff81c04f48 EFLAGS: 00000246 ORIG_RAX:
>> 000000000000004a
>> [  +0.000610] RAX: ffffffffffffffda RBX: 00007f8131af9690 RCX:
>> 00007f8131231020
>> [  +0.000619] RDX: 0000000004000000 RSI: 0000000000000000 RDI:
>> 0000000000000001
>> [  +0.000626] RBP: 0000000000000020 R08: 00000000ffffffff R09:
>> 0000000000000000
>> [  +0.000618] R10: 00007fff81c04d10 R11: 0000000000000246 R12:
>> 0000000000000020
>> [  +0.000620] R13: 0000000000000000 R14: 00007fff81c054cf R15:
>> 00007fff81c05198
>>
>>
>> > Shaohua,
>> >
>> > Do you have a reference to the older metadata related patch mentioned in
>> > the commit log that presumably addressed this?
>> >
>> > Brian
>> >
>> > >
>> > > > Not familiar with XFS, but seems log bios are partially stuck in
>> > > > throttled cgroups, leaving other innocent groups waiting for
>> > > > completion. To cope with this we bypassed REQ_META log bios in
>> > > > blk_throtl_bio().
>> > >
>> > > Yup, the log is global and should not be throttled. Metadata is less
>> > > obvious as to what the correct thing to do is, because writes are
>> > > always done from internal kernel threads but reads are done from a
>> > > mix of kernel threads, user cgroup contexts and user data IO
>> > > completions. Hence there are situations where metadata reads may
>> > > need to be throttled because they are cgroup context only (e.g.
>> > > filesystem directory traversal) but others where reads should not
>> > > be throttled because they have global context (e.g. inside a
>> > > transaction when other buffers are already locked).
>> > >
>>
>>
>> Thanks for the explanation Dave. So there are both read and write xfs_buf,
>> and reads really seem to be complex then.
>>
>>
>> > > Getting this right and keeping it working requires regression tests
>> > > that get run on every release, whether it be upstream or distro
>> > > kernels, and that means we need tests in fstests to cover cgroup IO
>> > > control....
>> > >
>> > > Cheers,
>> > >
>> > > Dave.
>> > > --
>> > > Dave Chinner
>> > > david@fromorbit.com
Brian Foster March 28, 2018, 11:24 a.m. UTC | #12
On Wed, Mar 28, 2018 at 12:37:32PM +0800, 张本龙 wrote:
> 2018-03-27 19:36 GMT+08:00 Brian Foster <bfoster@redhat.com>:
> > On Mon, Mar 26, 2018 at 05:55:26PM -0700, Shaohua Li wrote:
> >> On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote:
> >> > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> >> > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> >> > > > Hi Shaohua and XFS,
> >> > > >
> >> > > > May I ask how are we gonna handle REQ_META issued from XFS? As you
> >> > > > mentioned about charging to root cgroup (also in an earlier email
> >> > > > discussion), and seems the 4.16.0-rc6 code is not handling it
> >> > > > separately.
> >> > > >
> >> > > > In our case to support XFS cgroup writeback control, which was ported
> >> > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> >> > > > trouble. Threads from throttled docker might submit_bio in following
> >> > > > path by its own identity, this docker blkcg accumulated large amounts
> >> > > > of data (e.g., 20GB), thus such log gets blocked.
> >> > >
> >> > > And thus displaying the reason why I originally refused to merge
> >> > > this code until regression tests were added to fstests to exercise
> >> > > these sorts of issues. This stuff adds new internal filesystem IO
> >> > > ordering constraints, so we need tests that exercise it and ensure
> >> > > we don't accidentally break it in future.
> >> > >
> >> >
> >> > Hmm, but if the user issues fsync from the throttled cgroup then won't
> >> > that throttling occur today, regardless of cgroup aware writeback? My
> >> > understanding is that cgawb just accurately accounts writeback I/Os to
> >> > the owner of the cached pages. IOW, if the buffered writer and fsync
> >> > call are in the same throttled cgroup, then the throttling works just as
> >> > it would with cgawb and the writer being in a throttled cgroup.
> >> >
> >> > So ISTM that this is an independent problem. What am I missing?
> >> >
> >> > Shaohua,
> >> >
> >> > Do you have a reference to the older metadata related patch mentioned in
> >> > the commit log that presumably addressed this?
> >>
> >> The problem is about priority reversion. Say you do a fsync in a low prio
> >> cgroup, the IO will be submitted with low prio. Now you do a fsync in a high
> >> prio cgroup, the cgroup will wait for fsync IO finished, which is already
> >> submitted by the low prio cgroup and run in low prio. This makes the high prio
> >> cgroup run slow. The proposed patch is to force all metadata write submitted
> >> from root cgroup regardless which task submitted, which can fix this issue.
> >>
> >
> > Right, but it seems to me that this can happen with or without cgroup
> > aware writeback. This patch just introduces the final bits required to
> > carry the page owner from however it is tracked in the writeback machine
> > to the bio submitted by the fs. It doesn't introduce/enable/implement
> > I/O throttling itself, which is already in place and usable (outside of
> > the buffered write page owner problem fixed by this patch), right?
> >
> > So without this patch, if a task in throttled cgroup A does a bunch of
> > buffered writes and calls fsync, then another task in unthrottled cgroup
> > B calls fsync, aren't we (XFS) susceptible to priority inversion via
> > these same log I/O serialization points? If not, then what am I missing?
> >
> 
> Ok, first let's agree we're not talking about 2 groups fsyncing the same
> file. What's demonstrated in the original post is that group-A MIGHT
> submit the xlog in the fsync path via xlog_sync() -> xfs_buf_submit().
> Threads from group-B are waiting for the log from flush_work(), as are
> kworkers from xlog_cil_push_work().
> 
> The problem is not about fsync getting stuck on the target file. If it
> were, group-B would be waiting on filemap_write_and_wait_range(), since
> xfs_file_fsync() flushes the real data before
> _xfs_log_force_lsn().
> 

Yes, I don't mean to suggest we're contending on the same file across
different cgroups. Rather, we're getting stuck on a dependency of a
global resource (the log) between groups with very different priorities.

> Here is the setup: group-A has 10 fio jobs each running on 20GB files,
> plus some agents with not much IO; group-B just has the agents. With
> this patch we set a bps limit of 20MB/s on A, so the large amount of
> fio traffic keeps the group throttled (note the fio jobs are not doing
> any fsync though). At this point, one agent does an fsync on a non-dirty
> file, bypassing filemap_write_and_wait_range() and going straight to
> xlog_sync(). Here we go: log bios are stuck in group-A. Then group-B and
> kworkers wait indefinitely on log integrity from various paths. Without
> this patch fio would write out at full disk speed, say 200MB/s, so data
> never accumulates in group-A.
> 

I see. So in your particular setup, cgawb is required to reproduce the
problem because you aren't actually pushing data through the cgroup via
fsync.
Therefore, cgawb is what actually enables the throttling for your use
case. The throttling slows down further I/O from group A and thus
creates the problematic conditions for contention on the log.

However, it looks like the only requirement to cause this particular
condition is that an fsync submits a log bio in an active and throttled
cgroup that ends up blocked for a period of time. Other log-serializing
tasks in other cgroups are then susceptible to the first cgroup's
throttle if they have to wait on log I/O completion. So replace your
cgawb use case with one that creates similar conditions, e.g. somebody
else waiting on fsync -> filemap_write_and_wait_range() on a separate
file in the same (throttled) group, and afaict the same thing can occur
without cgawb in the picture at all.
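
For reference, here are the two wait points in question, as a simplified
sketch of the ~4.16-era xfs_file_fsync() (paraphrased from
fs/xfs/xfs_file.c; locking, the datasync handling and most error
handling trimmed):

STATIC int
xfs_file_fsync(
        struct file             *file,
        loff_t                  start,
        loff_t                  end,
        int                     datasync)
{
        struct inode            *inode = file->f_mapping->host;
        struct xfs_inode        *ip = XFS_I(inode);
        int                     error, log_flushed = 0;
        xfs_lsn_t               lsn = 0;

        /* Wait point 1: flush and wait on the file's own dirty data. */
        error = filemap_write_and_wait_range(inode->i_mapping, start, end);
        if (error)
                return error;

        /*
         * Wait point 2: force the log up to the inode's last-modified
         * LSN. A clean file does no waiting at point 1 but can still
         * block here behind a log bio submitted from a throttled cgroup.
         */
        if (ip->i_itemp)
                lsn = ip->i_itemp->ili_last_lsn;
        if (lsn)
                error = _xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC,
                                           &log_flushed);

        return error;
}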

But between this and your other email I think I see how this isn't what
you are actually reproducing. Thanks for the explanation.

Brian

> > I'm not saying this isn't a problem that needs fixing, I just want to
> > make sure I understand the fundamental problem(s), what this cgawb patch
> > actually does and doesn't do and whether there is a logical dependency
> > between this patch and the proposed metadata filtering patch.
> >
> > Brian
> >
> >> Thanks,
> >> Shaohua
Brian Foster March 28, 2018, 11:32 a.m. UTC | #13
On Wed, Mar 28, 2018 at 08:56:14AM +1100, Dave Chinner wrote:
> On Tue, Mar 27, 2018 at 07:36:48AM -0400, Brian Foster wrote:
> > On Mon, Mar 26, 2018 at 05:55:26PM -0700, Shaohua Li wrote:
> > > On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote:
> > > > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> > > > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> > > > > > Hi Shaohua and XFS,
> > > > > > 
> > > > > > May I ask how are we gonna handle REQ_META issued from XFS? As you
> > > > > > mentioned about charging to root cgroup (also in an earlier email
> > > > > > discussion), and seems the 4.16.0-rc6 code is not handling it
> > > > > > separately.
> > > > > > 
> > > > > > In our case to support XFS cgroup writeback control, which was ported
> > > > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> > > > > > trouble. Threads from throttled docker might submit_bio in following
> > > > > > path by its own identity, this docker blkcg accumulated large amounts
> > > > > > of data (e.g., 20GB), thus such log gets blocked.
> > > > > 
> > > > > And thus displaying the reason why I originally refused to merge
> > > > > this code until regression tests were added to fstests to exercise
> > > > > these sorts of issues. This stuff adds new internal filesystem IO
> > > > > ordering constraints, so we need tests that exercise it and ensure
> > > > > we don't accidentally break it in future.
> > > > > 
> > > > 
> > > > Hmm, but if the user issues fsync from the throttled cgroup then won't
> > > > that throttling occur today, regardless of cgroup aware writeback? My
> > > > understanding is that cgawb just accurately accounts writeback I/Os to
> > > > the owner of the cached pages. IOW, if the buffered writer and fsync
> > > > call are in the same throttled cgroup, then the throttling works just as
> > > > it would with cgawb and the writer being in a throttled cgroup.
> > > > 
> > > > So ISTM that this is an independent problem. What am I missing?
> > > > 
> > > > Shaohua,
> > > > 
> > > > Do you have a reference to the older metadata related patch mentioned in
> > > > the commit log that presumably addressed this?
> > > 
> > > The problem is about priority inversion. Say you do an fsync in a low prio
> > > cgroup; the IO will be submitted with low prio. Now you do an fsync in a
> > > high prio cgroup: that cgroup will wait for the fsync IO to finish, but the
> > > IO was already submitted by the low prio cgroup and runs at low prio. This
> > > makes the high prio cgroup run slow. The proposed patch forces all metadata
> > > writes to be submitted from the root cgroup regardless of which task
> > > submitted them, which fixes this issue.
> > > 
> > 
> > Right, but it seems to me that this can happen with or without cgroup
> > aware writeback. This patch just introduces the final bits required to
> > > carry the page owner from however it is tracked in the writeback machinery
> > to the bio submitted by the fs. It doesn't introduce/enable/implement
> > I/O throttling itself, which is already in place and usable (outside of
> > the buffered write page owner problem fixed by this patch), right?
> > 
> > So without this patch, if a task in throttled cgroup A does a bunch of
> > buffered writes and calls fsync, then another task in unthrottled cgroup
> > B calls fsync, aren't we (XFS) susceptible to priority inversion via
> > these same log I/O serialization points? If not, then what am I missing?
> 
> Well, I was originally told that bios sent by a filesystem without
> cgroup info would be accounted to the root cgroup and hence not
> throttled at all, i.e. metadata and any untagged data IO. In that
> case, there should be no problems whatsoever, as all XFS IO would
> be issued at the same priority.
> 

It sounds like this was actually the plan at some point but the
associated patch (to which there is still no reference?) was never
committed. I have no idea why that is.. whether it was just lost or
nacked with outstanding issues..?

> However, given the way people are talking about needing to bypass
> block cgroup throttling via REQ_META hacks, I'm guessing that it
> doesn't actually work that way. i.e. somewhere in the block layer it
> attaches the current task cgroup to bios without existing cgroup
> information, and so we end up with priority inversion problems
> whenever cgroups and throttling are in use regardless of whether the
> filesystem supports it or not....
> 
> Ah, yeah, the task cgroup gets added through this path:
> 
> submit_bio
>   generic_make_request
>     generic_make_request_checks
>       blkcg_bio_issue_check
>         bio_blkcg
> 

Right, this is what I thought based on the code. For anything not
explicitly associated with a cgroup, it pretty much depends on the user
context of the I/O submission.
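
A minimal sketch of that fallback, paraphrased from the ~4.16-era
include/linux/blk-cgroup.h (details vary by kernel version):

static inline struct blkcg *bio_blkcg(struct bio *bio)
{
        /* use the css the filesystem attached, e.g. via wbc_init_bio() */
        if (bio && bio->bi_css)
                return css_to_blkcg(bio->bi_css);
        /* otherwise fall back to the submitting task's io cgroup */
        return css_to_blkcg(task_css(current, io_cgrp_id));
}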

> > I'm not saying this isn't a problem that needs fixing, I just want to
> > make sure I understand the fundamental problem(s), what this cgawb patch
> > actually does and doesn't do and whether there is a logical dependency
> > between this patch and the proposed metadata filtering patch.
> 
> Seems like there's a lot more work to be done to make this stuff
> work reliably than anyone has been telling us. Without regression
> tests, we're never going to know if it actually works or not....
> 

I'm wondering if there's something simple/high-level we can do in the
filesystem to sort of flip the default block layer condition above on
its head with regard to metadata. For example, forcibly associate all
metadata bios with the root cgroup somewhere down in the xfs_buf_submit()
path. So rather than doing metadata I/O on behalf of a variety of random
cgroups based purely on userspace configuration and activity, we start
with a consistent and predictable base behavior.
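
As a rough illustration, a hypothetical hook on the xfs_buf_submit()
path (untested sketch: bio_associate_blkcg() and blkcg_root_css exist
in this era's block layer, but the XBF_THROTTLE_OK flag is made up
here to show the opt-in idea):

STATIC void
xfs_buf_bio_set_blkcg(
        struct xfs_buf  *bp,
        struct bio      *bio)
{
        /*
         * Pin metadata bios to the root cgroup before submission so
         * they are never throttled against the submitter's cgroup.
         * bio_associate_blkcg() is a no-op if a css is already
         * attached, so a hypothetical opt-in buffer flag could leave
         * specific buffers (e.g. directory reads) in the caller's
         * context.
         */
        if (!(bp->b_flags & XBF_THROTTLE_OK))
                bio_associate_blkcg(bio, blkcg_root_css);
}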

The former (current) behavior leaves us open to these kinds of
problematic blocked states. The latter approach turns those gaps into
cases that may simply not honor the throttling limits for certain
metadata-heavy workloads until finer-grained control is implemented in
the fs (which can be done as needed). That's also advantageous because
by default we won't throttle certain I/Os that might have complex
locking/transaction dependencies, and instead opt in to such metadata
throttling in the obvious contexts where it's
appropriate. For example, the directory read example you mentioned
before could set a state on the buffer that prevents preemptive
association so the bio is actually associated with current userspace
context in the block layer and throttled appropriately.

Of course, I haven't looked into how reasonable any of that is to
implement. It would be nice to see whatever patch was previously
proposed for reference. :/

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner March 28, 2018, 10:35 p.m. UTC | #14
On Wed, Mar 28, 2018 at 07:32:32AM -0400, Brian Foster wrote:
> On Wed, Mar 28, 2018 at 08:56:14AM +1100, Dave Chinner wrote:
> > On Tue, Mar 27, 2018 at 07:36:48AM -0400, Brian Foster wrote:
> > > On Mon, Mar 26, 2018 at 05:55:26PM -0700, Shaohua Li wrote:
> > > > On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote:
> > > > > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> > > > > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> > > > > > > Hi Shaohua and XFS,
> > > > > > > 
> > > > > > > May I ask how are we gonna handle REQ_META issued from XFS? As you
> > > > > > > mentioned about charging to root cgroup (also in an earlier email
> > > > > > > discussion), and seems the 4.16.0-rc6 code is not handling it
> > > > > > > separately.
> > > > > > > 
> > > > > > > In our case to support XFS cgroup writeback control, which was ported
> > > > > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> > > > > > > trouble. Threads from throttled docker might submit_bio in following
> > > > > > > path by its own identity, this docker blkcg accumulated large amounts
> > > > > > > of data (e.g., 20GB), thus such log gets blocked.
> > > > > > 
> > > > > > And thus displaying the reason why I originally refused to merge
> > > > > > this code until regression tests were added to fstests to exercise
> > > > > > these sorts of issues. This stuff adds new internal filesystem IO
> > > > > > ordering constraints, so we need tests that exercise it and ensure
> > > > > > we don't accidentally break it in future.
> > > > > > 
> > > > > 
> > > > > Hmm, but if the user issues fsync from the throttled cgroup then won't
> > > > > that throttling occur today, regardless of cgroup aware writeback? My
> > > > > understanding is that cgawb just accurately accounts writeback I/Os to
> > > > > the owner of the cached pages. IOW, if the buffered writer and fsync
> > > > > call are in the same throttled cgroup, then the throttling works just as
> > > > > it would with cgawb and the writer being in a throttled cgroup.
> > > > > 
> > > > > So ISTM that this is an independent problem. What am I missing?
> > > > > 
> > > > > Shaohua,
> > > > > 
> > > > > Do you have a reference to the older metadata related patch mentioned in
> > > > > the commit log that presumably addressed this?
> > > > 
> > > > The problem is about priority inversion. Say you do an fsync in a low prio
> > > > cgroup; the IO will be submitted with low prio. Now you do an fsync in a
> > > > high prio cgroup: that cgroup will wait for the fsync IO to finish, but the
> > > > IO was already submitted by the low prio cgroup and runs at low prio. This
> > > > makes the high prio cgroup run slow. The proposed patch forces all metadata
> > > > writes to be submitted from the root cgroup regardless of which task
> > > > submitted them, which fixes this issue.
> > > > 
> > > 
> > > Right, but it seems to me that this can happen with or without cgroup
> > > aware writeback. This patch just introduces the final bits required to
> > > carry the page owner from however it is tracked in the writeback machinery
> > > to the bio submitted by the fs. It doesn't introduce/enable/implement
> > > I/O throttling itself, which is already in place and usable (outside of
> > > the buffered write page owner problem fixed by this patch), right?
> > > 
> > > So without this patch, if a task in throttled cgroup A does a bunch of
> > > buffered writes and calls fsync, then another task in unthrottled cgroup
> > > B calls fsync, aren't we (XFS) susceptible to priority inversion via
> > > these same log I/O serialization points? If not, then what am I missing?
> > 
> > Well, I was originally told that bios sent by a filesystem without
> > cgroup info would be accounted to the root cgroup and hence not
> > throttled at all, i.e. metadata and any untagged data IO. In that
> > case, there should be no problems whatsoever, as all XFS IO would
> > be issued at the same priority.
> > 
> 
> It sounds like this was actually the plan at some point but the
> associated patch (to which there is still no reference?) was never
> committed. I have no idea why that is.. whether it was just lost or
> nacked with outstanding issues..?

I've never seen the patch that is being spoken of. I'm guessing it
lives in some dark corner of FB's internal kernel tree and they have
no motivation to push stuff like this upstream....

> > However, given the way people are talking about needing to bypass
> > block cgroup throttling via REQ_META hacks, I'm guessing that it
> > doesn't actually work that way. i.e. somewhere in the block layer it
> > attaches the current task cgroup to bios without existing cgroup
> > information, and so we end up with priority inversion problems
> > whenever cgroups and throttling are in use regardless of whether the
> > filesystem supports it or not....
> > 
> > Ah, yeah, the task cgroup gets added through this path:
> > 
> > submit_bio
> >   generic_make_request
> >     generic_make_request_checks
> >       blkcg_bio_issue_check
> >         bio_blkcg
> > 
> 
> Right, this is what I thought based on the code. For anything not
> explicitly associated with a cgroup, it pretty much depends on the user
> context of the I/O submission.

I suspect that the code as originally proposed didn't have this
"fall back to task context" logic in it, i.e. it was added later,
thereby changing the default behaviour and introducing all these
filesystem-level issues.

> > > I'm not saying this isn't a problem that needs fixing, I just want to
> > > make sure I understand the fundamental problem(s), what this cgawb patch
> > > actually does and doesn't do and whether there is a logical dependency
> > > between this patch and the proposed metadata filtering patch.
> > 
> > Seems like there's a lot more work to be done to make this stuff
> > work reliably than anyone has been telling us. Without regression
> > tests, we're never going to know if it actually works or not....
> > 
> 
> I'm wondering if there's something simple/high-level we can do in the
> filesystem to sort of flip the default block layer condition above on
> its head with regard to metadata.

I'm guessing that's what the patches that look at REQ_META and
assign it to the root cgroup do - they give those bios global scope
rather than whatever context is attached to them, and prevent them
from being throttled by the existing/default cgroup configuration.

> For example, forcibly associate all
> metadata bios with the root cgroup somewhere down in the xfs_buf_submit()
> path. So rather than doing metadata I/O on behalf of a variety of random
> cgroups based purely on userspace configuration and activity, we start
> with a consistent and predictable base behavior.

I'd prefer the block layer does that by default when it comes across
an untagged REQ_META bio. That way we don't have to sprinkle magic
cgroup pixie dust through every single filesystem to deal with this
same problem....
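
Something like the following early in the submission path, perhaps (a
hypothetical sketch, not an existing patch; blkcg_root_css is the root
cgroup's css exported by the block layer):

/* default untagged metadata bios to the root cgroup */
static inline void bio_default_meta_blkcg(struct bio *bio)
{
        if ((bio->bi_opf & REQ_META) && !bio->bi_css)
                bio_associate_blkcg(bio, blkcg_root_css);
}

Called from generic_make_request_checks() before
blkcg_bio_issue_check(), that would give untagged REQ_META IO global
scope by default without per-filesystem changes.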

Cheers,

Dave.
diff mbox

Patch

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 19eadc8..5f70584 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -589,7 +589,8 @@  xfs_alloc_ioend(
 	struct inode		*inode,
 	unsigned int		type,
 	xfs_off_t		offset,
-	struct buffer_head	*bh)
+	struct buffer_head	*bh,
+	struct writeback_control *wbc)
 {
 	struct xfs_ioend	*ioend;
 	struct bio		*bio;
@@ -606,6 +607,8 @@  xfs_alloc_ioend(
 	INIT_WORK(&ioend->io_work, xfs_end_io);
 	ioend->io_append_trans = NULL;
 	ioend->io_bio = bio;
+	/* attach new bio to its cgroup */
+	wbc_init_bio(wbc, bio);
 	return ioend;
 }
 
@@ -633,6 +636,8 @@  xfs_chain_bio(
 	ioend->io_bio->bi_write_hint = ioend->io_inode->i_write_hint;
 	submit_bio(ioend->io_bio);
 	ioend->io_bio = new;
+	/* attach new bio to its cgroup */
+	wbc_init_bio(wbc, new);
 }
 
 /*
@@ -656,7 +661,8 @@  xfs_add_to_ioend(
 	    offset != wpc->ioend->io_offset + wpc->ioend->io_size) {
 		if (wpc->ioend)
 			list_add(&wpc->ioend->io_list, iolist);
-		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset, bh);
+		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset,
+					     bh, wbc);
 	}
 
 	/*
@@ -666,6 +672,9 @@  xfs_add_to_ioend(
 	while (xfs_bio_add_buffer(wpc->ioend->io_bio, bh) != bh->b_size)
 		xfs_chain_bio(wpc->ioend, wbc, bh);
 
+	/* Charge write size to its cgroup for cgroup switching track */
+	wbc_account_io(wbc, bh->b_page, bh->b_size);
+
 	wpc->ioend->io_size += bh->b_size;
 	wpc->last_block = bh->b_blocknr;
 	xfs_start_buffer_writeback(bh);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 951271f..95c2d3d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1666,6 +1666,7 @@  xfs_fs_fill_super(
 	sb->s_max_links = XFS_MAXLINK;
 	sb->s_time_gran = 1;
 	set_posix_acl_flag(sb);
+	sb->s_iflags |= SB_I_CGROUPWB;
 
 	/* version 5 superblocks support inode version counters. */
 	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
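
For context on the one-line xfs_super.c hunk: SB_I_CGROUPWB is what
opts a filesystem into cgroup writeback. The writeback code gates
per-inode cgroup tracking on it roughly like this (paraphrased from
the ~4.16-era include/linux/backing-dev.h; simplified):

static inline bool inode_cgwb_enabled(struct inode *inode)
{
        struct backing_dev_info *bdi = inode_to_bdi(inode);

        /* both the memory and io controllers must be on cgroup v2 */
        return cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
               cgroup_subsys_on_dfl(io_cgrp_subsys) &&
               bdi_cap_account_dirty(bdi) &&
               (bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
               (inode->i_sb->s_iflags & SB_I_CGROUPWB);
}

Without the superblock flag this returns false, wbc->wb is never set,
and the wbc_init_bio()/wbc_account_io() calls added above are
effectively no-ops, leaving writeback attributed to the root cgroup.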