Message ID: f13839700d372a4663a08a11883e9c89b13056ca.1521752282.git.shli@fb.com (mailing list archive)
State: New, archived
On 22 Mar 2018, at 17:11, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
>
> Basically this is a copy of commit 001e4a8775f6 ("ext4: implement cgroup
> writeback support"). Tested with a fio test, verified writeback is
> throttled against cgroup io.max write bandwidth, also verified moving
> the fio test to another cgroup and the writeback is throttled against
> the new cgroup setting.
>
> This only controls the file data write for cgroup. For metadata, since
> xfs dispatches the metadata write in specific threads, it's possible a low
> prio app's metadata could harm a high prio app's metadata. A while back,
> Tejun had a patch to force metadata to belong to the root cgroup for btrfs.
> I had a similar patch for xfs too. But since Tejun's patch isn't
> upstream, I'll delay posting the xfs patch.

I think we've worked out all the problems on the btrfs/generic patch set; we'll get those sent out shortly.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Shaohua and XFS,

May I ask how we are going to handle REQ_META bios issued from XFS? You mentioned charging them to the root cgroup (also in an earlier email discussion), but the 4.16.0-rc6 code does not seem to handle them separately.

In our case, to support XFS cgroup writeback control (which was ported and slightly adapted to 3.10.0), ignoring xfs log bios resulted in trouble. Threads from a throttled docker container might submit_bio in the following path under their own identity; this docker blkcg had accumulated a large amount of data (e.g., 20GB), so such a log write gets blocked:

------------------------------------------------------------------------
xxx-agent 22825 [001] 78772.391023: probe:submit_bio: (ffffffff812f70c0) bio=ffff880fb6e1f300 bi_css=0
    4f70c1 submit_bio (/usr/lib/debug/lib/modules/3.10.0-514.16.1.el7.cgwb.rdt.x86_64/vmlinux)
     4e440 xfs_buf_submit ([xfs])
     6e59b xlog_bdstrat ([xfs])
     704c5 xlog_sync ([xfs])
     70683 xlog_state_release_iclog ([xfs])
     71917 _xfs_log_force_lsn ([xfs])
     5249e xfs_file_fsync ([xfs])
    4374d5 do_fsync (/usr/lib/debug/lib/modules/3.10.0-514.16.1.el7.cgwb.rdt.x86_64/vmlinux)
    4377a0 sys_fsync (/usr/lib/debug/lib/modules/3.10.0-514.16.1.el7.cgwb.rdt.x86_64/vmlinux)
    8a0ac9 system_call (/usr/lib/debug/lib/modules/3.10.0-514.16.1.el7.cgwb.rdt.x86_64/vmlinux)
    1df9c4 [unknown] (/home/xxx/xxx-agent/data/install/xxx-agent/16/xxx-agent)

Meanwhile some other container without bps limits, or kworkers from the root cgroup, might get stuck:

------------------------------------------------------------------------
[79183.692355] INFO: task xxx:33434 blocked for more than 120 seconds.
[79183.730997] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[79183.778095] consul D ffff880169958000 0 33434 24659 0x00000000
[79183.820478] ffff880bce3e7e38 0000000000000082 ffff880c7af8bec0 ffff880bce3e7fd8
[79183.865232] ffff880bce3e7fd8 ffff880bce3e7fd8 ffff880c7af8bec0 ffff880c7af8bec0
[79183.909997] ffff880169bfa928 ffff880bce3e7ef4 0000000000000000 ffff880c7af8bec0
[79183.954738] Call Trace:
[79183.969516] [<ffffffff81695ac9>] schedule+0x29/0x70
[79183.999357] [<ffffffffa073497a>] _xfs_log_force_lsn+0x2fa/0x350 [xfs]
[79184.038497] [<ffffffff810c84c0>] ? wake_up_state+0x20/0x20
[79184.071941] [<ffffffffa071542e>] xfs_file_fsync+0x10e/0x1e0 [xfs]
[79184.109010] [<ffffffff812374d5>] do_fsync+0x65/0xa0
[79184.138826] [<ffffffff8120368f>] ? SyS_write+0x9f/0xe0
[79184.170182] [<ffffffff812377a0>] SyS_fsync+0x10/0x20
[79184.200514] [<ffffffff816a0ac9>] system_call_fastpath+0x16/0x1b
[79184.236613] INFO: task xxx:38778 blocked for more than 120 seconds.
[79184.275238] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[79184.322329] consul D ffff880fe728af10 0 38778 32505 0x00000000
[79184.364694] ffff881a18fe7c10 0000000000000082 ffff88196a74bec0 ffff881a18fe7fd8
[79184.409424] ffff881a18fe7fd8 ffff881a18fe7fd8 ffff88196a74bec0 ffff881a18fe7d60
[79184.454167] ffff881a18fe7d68 7fffffffffffffff ffff88196a74bec0 ffffffffffffffff
[79184.498909] Call Trace:
[79184.513673] [<ffffffff81695ac9>] schedule+0x29/0x70
[79184.543475] [<ffffffff81693529>] schedule_timeout+0x239/0x2c0
[79184.578468] [<ffffffff810c4f99>] ? ttwu_do_wakeup+0x19/0xd0
[79184.612419] [<ffffffff810c512d>] ? ttwu_do_activate.constprop.91+0x5d/0x70
[79184.654148] [<ffffffff810c8308>] ? try_to_wake_up+0x1c8/0x320
[79184.689141] [<ffffffff81695ea6>] wait_for_completion+0x116/0x170
[79184.725688] [<ffffffff810c84c0>] ? wake_up_state+0x20/0x20
[79184.759126] [<ffffffff810ac67c>] flush_work+0xfc/0x1c0
[79184.790481] [<ffffffff810a85f0>] ? move_linked_works+0x90/0x90
[79184.826020] [<ffffffffa073622a>] xlog_cil_force_lsn+0x8a/0x210 [xfs]
[79184.864644] [<ffffffff811850df>] ? __filemap_fdatawrite_range+0xbf/0xe0
[79184.904838] [<ffffffffa073470f>] _xfs_log_force_lsn+0x8f/0x350 [xfs]
[79184.943467] [<ffffffff8118531f>] ? filemap_fdatawait_range+0x1f/0x30
[79184.982086] [<ffffffff81694c42>] ? down_read+0x12/0x30
[79185.013466] [<ffffffffa071542e>] xfs_file_fsync+0x10e/0x1e0 [xfs]
[79185.050533] [<ffffffff812374d5>] do_fsync+0x65/0xa0
[79185.080335] [<ffffffff8120368f>] ? SyS_write+0x9f/0xe0
[79185.111691] [<ffffffff812377a0>] SyS_fsync+0x10/0x20
[79185.141986] [<ffffffff816a0ac9>] system_call_fastpath+0x16/0x1b
[79185.178051] INFO: task kworker/1:0:12593 blocked for more than 120 seconds.
[79185.219258] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[79185.266353] kworker/1:0 D ffff8801698caf10 0 12593 2 0x00000000
[79185.308743] Workqueue: xfs-cil/sdb1 xlog_cil_push_work [xfs]
[79185.342757] ffff880d00e3bbe8 0000000000000046 ffff880169732f10 ffff880d00e3bfd8
[79185.387475] ffff880d00e3bfd8 ffff880d00e3bfd8 ffff880169732f10 ffff880169bfa800
[79185.432197] ffff880169bfa928 ffff880bd3b4a5c0 ffff880169732f10 ffff880169bfa900
[79185.476946] Call Trace:
[79185.491725] [<ffffffff81695ac9>] schedule+0x29/0x70
[79185.521548] [<ffffffffa073377a>] xlog_state_get_iclog_space+0x10a/0x310 [xfs]
[79185.565010] [<ffffffff810c84c0>] ? wake_up_state+0x20/0x20
[79185.598476] [<ffffffffa0733c99>] xlog_write+0x1a9/0x720 [xfs]
[79185.633477] [<ffffffffa07359c9>] xlog_cil_push+0x239/0x420 [xfs]
[79185.670037] [<ffffffffa0735bc5>] xlog_cil_push_work+0x15/0x20 [xfs]
[79185.708138] [<ffffffff810ab45b>] process_one_work+0x17b/0x470
[79185.743124] [<ffffffff810ac413>] worker_thread+0x2a3/0x410
[79185.776542] [<ffffffff810ac170>] ? rescuer_thread+0x460/0x460
[79185.811546] [<ffffffff810b3a4f>] kthread+0xcf/0xe0
[79185.840835] [<ffffffff810b3980>] ? kthread_create_on_node+0x140/0x140
[79185.879982] [<ffffffff816a0a18>] ret_from_fork+0x58/0x90
[79185.912391] [<ffffffff810b3980>] ? kthread_create_on_node+0x140/0x140

I'm not familiar with XFS, but it seems log bios get partially stuck in throttled cgroups, leaving other innocent groups waiting for completion. To cope with this we bypassed REQ_META log bios in blk_throtl_bio(). I'm just writing to make sure you know about the issue; it would be great if your other patch already fixes this!

Thanks,
Benlong

2018-03-23 5:11 GMT+08:00 Shaohua Li <shli@kernel.org>:
> From: Shaohua Li <shli@fb.com>
>
> Basically this is a copy of commit 001e4a8775f6 ("ext4: implement cgroup
> writeback support"). Tested with a fio test, verified writeback is
> throttled against cgroup io.max write bandwidth, also verified moving
> the fio test to another cgroup and the writeback is throttled against
> the new cgroup setting.
>
> This only controls the file data write for cgroup. For metadata, since
> xfs dispatches the metadata write in specific threads, it's possible a low
> prio app's metadata could harm a high prio app's metadata. A while back,
> Tejun had a patch to force metadata to belong to the root cgroup for btrfs.
> I had a similar patch for xfs too. But since Tejun's patch isn't
> upstream, I'll delay posting the xfs patch.
>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Darrick J. Wong <darrick.wong@oracle.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  fs/xfs/xfs_aops.c  | 13 +++++++++++--
>  fs/xfs/xfs_super.c |  1 +
>  2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 19eadc8..5f70584 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -589,7 +589,8 @@ xfs_alloc_ioend(
>  	struct inode *inode,
>  	unsigned int type,
>  	xfs_off_t offset,
> -	struct buffer_head *bh)
> +	struct buffer_head *bh,
> +	struct writeback_control *wbc)
>  {
>  	struct xfs_ioend *ioend;
>  	struct bio *bio;
> @@ -606,6 +607,8 @@ xfs_alloc_ioend(
>  	INIT_WORK(&ioend->io_work, xfs_end_io);
>  	ioend->io_append_trans = NULL;
>  	ioend->io_bio = bio;
> +	/* attach new bio to its cgroup */
> +	wbc_init_bio(wbc, bio);
>  	return ioend;
>  }
>
> @@ -633,6 +636,8 @@ xfs_chain_bio(
>  	ioend->io_bio->bi_write_hint = ioend->io_inode->i_write_hint;
>  	submit_bio(ioend->io_bio);
>  	ioend->io_bio = new;
> +	/* attach new bio to its cgroup */
> +	wbc_init_bio(wbc, new);
>  }
>
>  /*
> @@ -656,7 +661,8 @@ xfs_add_to_ioend(
>  	    offset != wpc->ioend->io_offset + wpc->ioend->io_size) {
>  		if (wpc->ioend)
>  			list_add(&wpc->ioend->io_list, iolist);
> -		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset, bh);
> +		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset,
> +				bh, wbc);
>  	}
>
>  	/*
> @@ -666,6 +672,9 @@ xfs_add_to_ioend(
>  	while (xfs_bio_add_buffer(wpc->ioend->io_bio, bh) != bh->b_size)
>  		xfs_chain_bio(wpc->ioend, wbc, bh);
>
> +	/* Charge write size to its cgroup for cgroup switching track */
> +	wbc_account_io(wbc, bh->b_page, bh->b_size);
> +
>  	wpc->ioend->io_size += bh->b_size;
>  	wpc->last_block = bh->b_blocknr;
>  	xfs_start_buffer_writeback(bh);
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 951271f..95c2d3d 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1666,6 +1666,7 @@ xfs_fs_fill_super(
>  	sb->s_max_links = XFS_MAXLINK;
>  	sb->s_time_gran = 1;
>  	set_posix_acl_flag(sb);
> +	sb->s_iflags |= SB_I_CGROUPWB;
>
>  	/* version 5 superblocks support inode version counters. */
>  	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
> --
> 2.9.5
On Thu, Mar 22, 2018 at 02:11:45PM -0700, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
>
> Basically this is a copy of commit 001e4a8775f6 ("ext4: implement cgroup
> writeback support"). Tested with a fio test, verified writeback is
> throttled against cgroup io.max write bandwidth, also verified moving
> the fio test to another cgroup and the writeback is throttled against
> the new cgroup setting.
>
> This only controls the file data write for cgroup. For metadata, since
> xfs dispatches the metadata write in specific threads, it's possible a low
> prio app's metadata could harm a high prio app's metadata. A while back,
> Tejun had a patch to force metadata to belong to the root cgroup for btrfs.
> I had a similar patch for xfs too. But since Tejun's patch isn't
> upstream, I'll delay posting the xfs patch.
>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Darrick J. Wong <darrick.wong@oracle.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  fs/xfs/xfs_aops.c  | 13 +++++++++++--
>  fs/xfs/xfs_super.c |  1 +
>  2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 19eadc8..5f70584 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -589,7 +589,8 @@ xfs_alloc_ioend(
>  	struct inode *inode,
>  	unsigned int type,
>  	xfs_off_t offset,
> -	struct buffer_head *bh)
> +	struct buffer_head *bh,
> +	struct writeback_control *wbc)
> {
>  	struct xfs_ioend *ioend;
>  	struct bio *bio;
> @@ -606,6 +607,8 @@ xfs_alloc_ioend(
>  	INIT_WORK(&ioend->io_work, xfs_end_io);
>  	ioend->io_append_trans = NULL;
>  	ioend->io_bio = bio;
> +	/* attach new bio to its cgroup */
> +	wbc_init_bio(wbc, bio);

Just a nit, but can we move the wbc_init_bio() calls to earlier in the function, where we set up/init the bio? (Same comment for xfs_chain_bio().) Otherwise this looks fine to me and passes the posted xfstests test.

Brian

>  	return ioend;
>  }
>
> @@ -633,6 +636,8 @@ xfs_chain_bio(
>  	ioend->io_bio->bi_write_hint = ioend->io_inode->i_write_hint;
>  	submit_bio(ioend->io_bio);
>  	ioend->io_bio = new;
> +	/* attach new bio to its cgroup */
> +	wbc_init_bio(wbc, new);
>  }
>
>  /*
> @@ -656,7 +661,8 @@ xfs_add_to_ioend(
>  	    offset != wpc->ioend->io_offset + wpc->ioend->io_size) {
>  		if (wpc->ioend)
>  			list_add(&wpc->ioend->io_list, iolist);
> -		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset, bh);
> +		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset,
> +				bh, wbc);
>  	}
>
>  	/*
> @@ -666,6 +672,9 @@ xfs_add_to_ioend(
>  	while (xfs_bio_add_buffer(wpc->ioend->io_bio, bh) != bh->b_size)
>  		xfs_chain_bio(wpc->ioend, wbc, bh);
>
> +	/* Charge write size to its cgroup for cgroup switching track */
> +	wbc_account_io(wbc, bh->b_page, bh->b_size);
> +
>  	wpc->ioend->io_size += bh->b_size;
>  	wpc->last_block = bh->b_blocknr;
>  	xfs_start_buffer_writeback(bh);
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 951271f..95c2d3d 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1666,6 +1666,7 @@ xfs_fs_fill_super(
>  	sb->s_max_links = XFS_MAXLINK;
>  	sb->s_time_gran = 1;
>  	set_posix_acl_flag(sb);
> +	sb->s_iflags |= SB_I_CGROUPWB;
>
>  	/* version 5 superblocks support inode version counters. */
>  	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
> --
> 2.9.5
On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> Hi Shaohua and XFS,
>
> May I ask how are we gonna handle REQ_META issued from XFS? As you
> mentioned about charging to root cgroup (also in an earlier email
> discussion), and seems the 4.16.0-rc6 code is not handling it
> separately.
>
> In our case to support XFS cgroup writeback control, which was ported
> and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> trouble. Threads from throttled docker might submit_bio in following
> path by its own identity, this docker blkcg accumulated large amounts
> of data (e.g., 20GB), thus such log gets blocked.

And thus displaying the reason why I originally refused to merge this
code until regression tests were added to fstests to exercise these
sorts of issues. This stuff adds new internal filesystem IO ordering
constraints, so we need tests that exercise it and ensure we don't
accidentally break it in future.

> Not familiar with XFS, but seems log bios are partially stuck in
> throttled cgroups, leaving other innocent groups waiting for
> completion. To cope with this we bypassed REQ_META log bios in
> blk_throtl_bio().

Yup, the log is global and should not be throttled. Metadata is less
obvious as to what the correct thing to do is, because writes are
always done from internal kernel threads but reads are done from a
mix of kernel threads, user cgroup contexts and user data IO
completions. Hence there are situations where metadata reads may
need to be throttled because they are cgroup context only (e.g.
filesystem directory traversal) but others where reads should not
be throttled because they have global context (e.g. inside a
transaction when other buffers are already locked).

Getting this right and keeping it working requires regression tests
that get run on every release, whether it be upstream or distro
kernels, and that means we need tests in fstests to cover cgroup IO
control....

Cheers,

Dave.
On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> > Hi Shaohua and XFS,
> >
> > May I ask how are we gonna handle REQ_META issued from XFS? As you
> > mentioned about charging to root cgroup (also in an earlier email
> > discussion), and seems the 4.16.0-rc6 code is not handling it
> > separately.
> >
> > In our case to support XFS cgroup writeback control, which was ported
> > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> > trouble. Threads from throttled docker might submit_bio in following
> > path by its own identity, this docker blkcg accumulated large amounts
> > of data (e.g., 20GB), thus such log gets blocked.
>
> And thus displaying the reason why I originally refused to merge
> this code until regression tests were added to fstests to exercise
> these sorts of issues. This stuff adds new internal filesystem IO
> ordering constraints, so we need tests that exercise it and ensure
> we don't accidentally break it in future.
>

Hmm, but if the user issues fsync from the throttled cgroup then won't
that throttling occur today, regardless of cgroup aware writeback? My
understanding is that cgawb just accurately accounts writeback I/Os to
the owner of the cached pages. IOW, if the buffered writer and fsync
call are in the same throttled cgroup, then the throttling works just as
it would with cgawb and the writer being in a throttled cgroup.

So ISTM that this is an independent problem. What am I missing?

Shaohua,

Do you have a reference to the older metadata related patch mentioned in
the commit log that presumably addressed this?

Brian

> > Not familiar with XFS, but seems log bios are partially stuck in
> > throttled cgroups, leaving other innocent groups waiting for
> > completion. To cope with this we bypassed REQ_META log bios in
> > blk_throtl_bio().
>
> Yup, the log is global and should not be throttled. Metadata is less
> obvious as to what the correct thing to do is, because writes are
> always done from internal kernel threads but reads are done from a
> mix of kernel threads, user cgroup contexts and user data IO
> completions. Hence there are situations where metadata reads may
> need to be throttled because they are cgroup context only (e.g.
> filesystem directory traversal) but others where reads should not
> be throttled because they have global context (e.g. inside a
> transaction when other buffers are already locked).
>
> Getting this right and keeping it working requires regression tests
> that get run on every release, whether it be upstream or distro
> kernels, and that means we need tests in fstests to cover cgroup IO
> control....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote:
> On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> > > Hi Shaohua and XFS,
> > >
> > > May I ask how are we gonna handle REQ_META issued from XFS? As you
> > > mentioned about charging to root cgroup (also in an earlier email
> > > discussion), and seems the 4.16.0-rc6 code is not handling it
> > > separately.
> > >
> > > In our case to support XFS cgroup writeback control, which was ported
> > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> > > trouble. Threads from throttled docker might submit_bio in following
> > > path by its own identity, this docker blkcg accumulated large amounts
> > > of data (e.g., 20GB), thus such log gets blocked.
> >
> > And thus displaying the reason why I originally refused to merge
> > this code until regression tests were added to fstests to exercise
> > these sorts of issues. This stuff adds new internal filesystem IO
> > ordering constraints, so we need tests that exercise it and ensure
> > we don't accidentally break it in future.
> >
>
> Hmm, but if the user issues fsync from the throttled cgroup then won't
> that throttling occur today, regardless of cgroup aware writeback? My
> understanding is that cgawb just accurately accounts writeback I/Os to
> the owner of the cached pages. IOW, if the buffered writer and fsync
> call are in the same throttled cgroup, then the throttling works just as
> it would with cgawb and the writer being in a throttled cgroup.
>
> So ISTM that this is an independent problem. What am I missing?
>
> Shaohua,
>
> Do you have a reference to the older metadata related patch mentioned in
> the commit log that presumably addressed this?

The problem is priority inversion. Say you do an fsync in a low-prio
cgroup: the IO will be submitted with low prio. Now you do an fsync in a
high-prio cgroup: that cgroup will wait for the fsync IO to finish, but
the IO was already submitted by the low-prio cgroup and runs at low prio.
This makes the high-prio cgroup run slow. The proposed patch forces all
metadata writes to be submitted from the root cgroup, regardless of which
task submitted them, which fixes this issue.

Thanks,
Shaohua
On Mon, Mar 26, 2018 at 05:55:26PM -0700, Shaohua Li wrote:
> On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote:
> > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> > > > Hi Shaohua and XFS,
> > > >
> > > > May I ask how are we gonna handle REQ_META issued from XFS? As you
> > > > mentioned about charging to root cgroup (also in an earlier email
> > > > discussion), and seems the 4.16.0-rc6 code is not handling it
> > > > separately.
> > > >
> > > > In our case to support XFS cgroup writeback control, which was ported
> > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> > > > trouble. Threads from throttled docker might submit_bio in following
> > > > path by its own identity, this docker blkcg accumulated large amounts
> > > > of data (e.g., 20GB), thus such log gets blocked.
> > >
> > > And thus displaying the reason why I originally refused to merge
> > > this code until regression tests were added to fstests to exercise
> > > these sorts of issues. This stuff adds new internal filesystem IO
> > > ordering constraints, so we need tests that exercise it and ensure
> > > we don't accidentally break it in future.
> > >
> >
> > Hmm, but if the user issues fsync from the throttled cgroup then won't
> > that throttling occur today, regardless of cgroup aware writeback? My
> > understanding is that cgawb just accurately accounts writeback I/Os to
> > the owner of the cached pages. IOW, if the buffered writer and fsync
> > call are in the same throttled cgroup, then the throttling works just as
> > it would with cgawb and the writer being in a throttled cgroup.
> >
> > So ISTM that this is an independent problem. What am I missing?
> >
> > Shaohua,
> >
> > Do you have a reference to the older metadata related patch mentioned in
> > the commit log that presumably addressed this?
>
> The problem is about priority reversion. Say you do a fsync in a low prio
> cgroup, the IO will be submitted with low prio. Now you do a fsync in a high
> prio cgroup, the cgroup will wait for fsync IO finished, which is already
> submitted by the low prio cgroup and run in low prio. This makes the high prio
> cgroup run slow. The proposed patch is to force all metadata write submitted
> from root cgroup regardless which task submitted, which can fix this issue.
>

Right, but it seems to me that this can happen with or without cgroup
aware writeback. This patch just introduces the final bits required to
carry the page owner from however it is tracked in the writeback
machinery to the bio submitted by the fs. It doesn't
introduce/enable/implement I/O throttling itself, which is already in
place and usable (outside of the buffered write page owner problem fixed
by this patch), right?

So without this patch, if a task in throttled cgroup A does a bunch of
buffered writes and calls fsync, then another task in unthrottled cgroup
B calls fsync, aren't we (XFS) susceptible to priority inversion via
these same log I/O serialization points? If not, then what am I missing?

I'm not saying this isn't a problem that needs fixing, I just want to
make sure I understand the fundamental problem(s), what this cgawb patch
actually does and doesn't do, and whether there is a logical dependency
between this patch and the proposed metadata filtering patch.

Brian
On Tue, Mar 27, 2018 at 06:54:27PM +0800, 张本龙 wrote:
> 2018-03-27 0:28 GMT+08:00 Brian Foster <bfoster@redhat.com>:
> > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> > > > Hi Shaohua and XFS,
> > > >
> > > > May I ask how are we gonna handle REQ_META issued from XFS? As you
> > > > mentioned about charging to root cgroup (also in an earlier email
> > > > discussion), and seems the 4.16.0-rc6 code is not handling it
> > > > separately.
> > > >
> > > > In our case to support XFS cgroup writeback control, which was ported
> > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> > > > trouble. Threads from throttled docker might submit_bio in following
> > > > path by its own identity, this docker blkcg accumulated large amounts
> > > > of data (e.g., 20GB), thus such log gets blocked.
> > >
> > > And thus displaying the reason why I originally refused to merge
> > > this code until regression tests were added to fstests to exercise
> > > these sorts of issues. This stuff adds new internal filesystem IO
> > > ordering constraints, so we need tests that exercise it and ensure
> > > we don't accidentally break it in future.
> > >
> >
> > Hmm, but if the user issues fsync from the throttled cgroup then won't
> > that throttling occur today, regardless of cgroup aware writeback? My
> > understanding is that cgawb just accurately accounts writeback I/Os to
> > the owner of the cached pages. IOW, if the buffered writer and fsync
> > call are in the same throttled cgroup, then the throttling works just as
> > it would with cgawb and the writer being in a throttled cgroup.
> >
> > So ISTM that this is an independent problem. What am I missing?
>
> Yeah throttling would work for fsync traffic even without cgwb, but that's
> because the way bio_blkcg() is implemented: charging to bio->bi_css when
> set but submitter blkcg when not. So without cgwb normal traffic from
> writeback kworkers are attributed to blkcg_root, which is normally
> unlimited.
>

Ok, that's what I thought. Your report shows everything backed up on an
fsync call, however, so I didn't really see how it related to this
patch. So unless Shaohua explains otherwise, my understanding is that
this is not necessarily caused by cgawb, but rather this is a separate
but similar gap in the broader block I/O throttling mechanism that also
needs to be addressed before throttling is usable (in practice) by
filesystems. You presumably deal with this by filtering out metadata
bios, while the patch Shaohua mentions deals with it by promoting
metadata bios to the root cgroup.

Brian

> And fsync might yield kernel INFO messages when bps * 120 < dirty data,
> giving something like below. That's OK though, since throttling and all
> types of sync calls are contradictory in nature.
> ---------------------------------------------------------------------
> [Mar  2 14:27] INFO: task dd:40794 blocked for more than 120 seconds.
> [ +0.000747]  Not tainted 4.15.7 #1
> [ +0.000610] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ +0.000614] dd  D  0  40794  40793 0x10000000
> [ +0.000610] Call Trace:
> [ +0.000619]  ? __schedule+0x28d/0x870
> [ +0.000611]  schedule+0x32/0x80
> [ +0.000615]  io_schedule+0x12/0x40
> [ +0.000612]  wait_on_page_bit_common+0xff/0x1a0
> [ +0.000609]  ? page_cache_tree_insert+0xd0/0xd0
> [ +0.000614]  __filemap_fdatawait_range+0xe2/0x150
> [ +0.000614]  file_write_and_wait_range+0x62/0xa0
> [ +0.000664]  xfs_file_fsync+0x65/0x1d0 [xfs]
> [ +0.000621]  do_fsync+0x38/0x60
> [ +0.000610]  SyS_fsync+0xc/0x10
> [ +0.000616]  do_syscall_64+0x6e/0x1a0
> [ +0.000612]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [ +0.000610] RIP: 0033:0x7f8131231020
> [ +0.000604] RSP: 002b:00007fff81c04f48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
> [ +0.000610] RAX: ffffffffffffffda RBX: 00007f8131af9690 RCX: 00007f8131231020
> [ +0.000619] RDX: 0000000004000000 RSI: 0000000000000000 RDI: 0000000000000001
> [ +0.000626] RBP: 0000000000000020 R08: 00000000ffffffff R09: 0000000000000000
> [ +0.000618] R10: 00007fff81c04d10 R11: 0000000000000246 R12: 0000000000000020
> [ +0.000620] R13: 0000000000000000 R14: 00007fff81c054cf R15: 00007fff81c05198
>
> > Shaohua,
> >
> > Do you have a reference to the older metadata related patch mentioned in
> > the commit log that presumably addressed this?
> >
> > Brian
> >
> > > > Not familiar with XFS, but seems log bios are partially stuck in
> > > > throttled cgroups, leaving other innocent groups waiting for
> > > > completion. To cope with this we bypassed REQ_META log bios in
> > > > blk_throtl_bio().
> > >
> > > Yup, the log is global and should not be throttled. Metadata is less
> > > obvious as to what the correct thing to do is, because writes are
> > > always done from internal kernel threads but reads are done from a
> > > mix of kernel threads, user cgroup contexts and user data IO
> > > completions. Hence there are situations where metadata reads may
> > > need to be throttled because they are cgroup context only (e.g.
> > > filesystem directory traversal) but others where reads should not
> > > be throttled because they have global context (e.g. inside a
> > > transaction when other buffers are already locked).
>
> Thanks for the explanation Dave. So there are both read and write
> xfs_buf paths, and reads really do seem to be the complex case.
>
> > > Getting this right and keeping it working requires regression tests
> > > that get run on every release, whether it be upstream or distro
> > > kernels, and that means we need tests in fstests to cover cgroup IO
> > > control....
> > >
> > > Cheers,
> > >
> > > Dave.
On Tue, Mar 27, 2018 at 07:36:48AM -0400, Brian Foster wrote: > On Mon, Mar 26, 2018 at 05:55:26PM -0700, Shaohua Li wrote: > > On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote: > > > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote: > > > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote: > > > > > Hi Shaohua and XFS, > > > > > > > > > > May I ask how are we gonna handle REQ_META issued from XFS? As you > > > > > mentioned about charging to root cgroup (also in an earlier email > > > > > discussion), and seems the 4.16.0-rc6 code is not handling it > > > > > separately. > > > > > > > > > > In our case to support XFS cgroup writeback control, which was ported > > > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in > > > > > trouble. Threads from throttled docker might submit_bio in following > > > > > path by its own identity, this docker blkcg accumulated large amounts > > > > > of data (e.g., 20GB), thus such log gets blocked. > > > > > > > > And thus displaying the reason why I originally refused to merge > > > > this code until regression tests were added to fstests to exercise > > > > these sorts of issues. This stuff adds new internal filesystem IO > > > > ordering constraints, so we need tests that exercise it and ensure > > > > we don't accidentally break it in future. > > > > > > > > > > Hmm, but if the user issues fsync from the throttled cgroup then won't > > > that throttling occur today, regardless of cgroup aware writeback? My > > > understanding is that cgawb just accurately accounts writeback I/Os to > > > the owner of the cached pages. IOW, if the buffered writer and fsync > > > call are in the same throttled cgroup, then the throttling works just as > > > it would with cgawb and the writer being in a throttled cgroup. > > > > > > So ISTM that this is an independent problem. What am I missing? 
> > > > > > Shaohua, > > > > > > Do you have a reference to the older metadata related patch mentioned in > > > the commit log that presumably addressed this? > > > > The problem is about priority inversion. Say you do an fsync in a low prio > > cgroup, the IO will be submitted with low prio. Now you do an fsync in a high > > prio cgroup, the cgroup will wait for the fsync IO to finish, which was already > > submitted by the low prio cgroup and runs at low prio. This makes the high prio > > cgroup run slow. The proposed patch is to force all metadata writes to be submitted > > from the root cgroup regardless of which task submitted them, which can fix this issue. > > > > Right, but it seems to me that this can happen with or without cgroup > aware writeback. This patch just introduces the final bits required to > carry the page owner from however it is tracked in the writeback machinery > to the bio submitted by the fs. It doesn't introduce/enable/implement > I/O throttling itself, which is already in place and usable (outside of > the buffered write page owner problem fixed by this patch), right? > > So without this patch, if a task in throttled cgroup A does a bunch of > buffered writes and calls fsync, then another task in unthrottled cgroup > B calls fsync, aren't we (XFS) susceptible to priority inversion via > these same log I/O serialization points? If not, then what am I missing? Well, I was originally told that bios sent by a filesystem without cgroup info would be accounted to the root cgroup and hence not throttled at all, i.e. metadata, and any untagged data IO. In that case, there should be no problems whatsoever, as all XFS IO should be issued at the same priority. However, given the way people are talking about needing to bypass block cgroup throttling via REQ_META hacks, I'm guessing that it doesn't actually work that way. i.e.
somewhere in the block layer it attaches the current task cgroup to bios without existing cgroup information, and so we end up with priority inversion problems whenever cgroups and throttling are in use regardless of whether the filesystem supports it or not.... Ah, yeah, the task cgroup gets added through this path: submit_bio generic_make_request generic_make_request_checks blkcg_bio_issue_check bio_blkcg > I'm not saying this isn't a problem that needs fixing, I just want to > make sure I understand the fundamental problem(s), what this cgawb patch > actually does and doesn't do and whether there is a logical dependency > between this patch and the proposed metadata filtering patch. Seems like there's a lot more work to be done to make this stuff work reliably than anyone has been telling us. Without regression tests, we're never going to know if it actually works or not.... Cheers, Dave.
2018-03-27 19:36 GMT+08:00 Brian Foster <bfoster@redhat.com>: > On Mon, Mar 26, 2018 at 05:55:26PM -0700, Shaohua Li wrote: >> On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote: >> > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote: >> > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote: >> > > > Hi Shaohua and XFS, >> > > > >> > > > May I ask how are we gonna handle REQ_META issued from XFS? As you >> > > > mentioned about charging to root cgroup (also in an earlier email >> > > > discussion), and seems the 4.16.0-rc6 code is not handling it >> > > > separately. >> > > > >> > > > In our case to support XFS cgroup writeback control, which was ported >> > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in >> > > > trouble. Threads from throttled docker might submit_bio in following >> > > > path by its own identity, this docker blkcg accumulated large amounts >> > > > of data (e.g., 20GB), thus such log gets blocked. >> > > >> > > And thus displaying the reason why I originally refused to merge >> > > this code until regression tests were added to fstests to exercise >> > > these sorts of issues. This stuff adds new internal filesystem IO >> > > ordering constraints, so we need tests that exercise it and ensure >> > > we don't accidentally break it in future. >> > > >> > >> > Hmm, but if the user issues fsync from the throttled cgroup then won't >> > that throttling occur today, regardless of cgroup aware writeback? My >> > understanding is that cgawb just accurately accounts writeback I/Os to >> > the owner of the cached pages. IOW, if the buffered writer and fsync >> > call are in the same throttled cgroup, then the throttling works just as >> > it would with cgawb and the writer being in a throttled cgroup. >> > >> > So ISTM that this is an independent problem. What am I missing? 
>> > >> > Shaohua, >> > >> > Do you have a reference to the older metadata related patch mentioned in >> > the commit log that presumably addressed this? >> >> The problem is about priority inversion. Say you do an fsync in a low prio >> cgroup, the IO will be submitted with low prio. Now you do an fsync in a high >> prio cgroup, the cgroup will wait for the fsync IO to finish, which was already >> submitted by the low prio cgroup and runs at low prio. This makes the high prio >> cgroup run slow. The proposed patch is to force all metadata writes to be submitted >> from the root cgroup regardless of which task submitted them, which can fix this issue. >> > > Right, but it seems to me that this can happen with or without cgroup > aware writeback. This patch just introduces the final bits required to > carry the page owner from however it is tracked in the writeback machinery > to the bio submitted by the fs. It doesn't introduce/enable/implement > I/O throttling itself, which is already in place and usable (outside of > the buffered write page owner problem fixed by this patch), right? > > So without this patch, if a task in throttled cgroup A does a bunch of > buffered writes and calls fsync, then another task in unthrottled cgroup > B calls fsync, aren't we (XFS) susceptible to priority inversion via > these same log I/O serialization points? If not, then what am I missing? > OK, first let's agree we're not talking about two groups fsyncing the same file. What's demonstrated in the original post is that group-A MIGHT submit the xlog in the fsync path via xlog_sync() -> xfs_buf_submit(). Threads from group-B are waiting for the log from flush_work(), and at the same time so are kworkers from xlog_cil_push_work(). The problem is not about fsync getting stuck on the target file. If it were, group-B would actually be waiting on filemap_write_and_wait_range(), as xfs_file_fsync() flushes the real data before _xfs_log_force_lsn().
Here is the setup: group-A has 10 fio jobs each running on 20GB files, and also some agents with not much IO; group-B just has the agents. With this patch we set a bps limit of 20MB/s on A, so the large amount of fio traffic backs the group up (note the fio jobs are not doing any fsync, though). At this point, one agent does an fsync to a non-dirty file, bypassing filemap_write_and_wait_range() and going straight to xlog_sync(). Here we go: log bios are stuck in group-A. Then group-B and kworkers wait for log integrity indefinitely from various paths. Without this patch, fio would write out at full disk speed, say 200MB/s, so nothing would accumulate in group-A. > I'm not saying this isn't a problem that needs fixing, I just want to > make sure I understand the fundamental problem(s), what this cgawb patch > actually does and doesn't do and whether there is a logical dependency > between this patch and the proposed metadata filtering patch. > > Brian > >> Thanks, >> Shaohua >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html
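The setup above can be recreated along these lines with the cgroup v2 `io.max` interface mentioned in the original commit message. This is only a sketch under stated assumptions: the device major:minor, mount points, and fio parameters are placeholders that need adapting to the actual test box (the original report ran on an adapted 3.10 kernel with the v1 blkio interface, so the exact knobs differ there):

```shell
# Sketch only: requires root, a cgroup2 mount with the io controller
# enabled, and the right MAJ:MIN for the filesystem's backing device.
DEV=259:0                                   # assumption: adjust to your disk

# group-A: 10 heavy buffered writers plus the agents, write-throttled.
mkdir -p /sys/fs/cgroup/group-A
echo "$DEV wbps=20971520" > /sys/fs/cgroup/group-A/io.max   # ~20MB/s
echo $$ > /sys/fs/cgroup/group-A/cgroup.procs
fio --name=dirty --rw=write --bs=1M --size=20g --numjobs=10 \
    --directory=/mnt/xfs &                  # note: no fsync in the jobs

# group-B: only the agents, unthrottled.
mkdir -p /sys/fs/cgroup/group-B

# With cgawb, group-A's writeback is throttled and dirty data piles up.
# An agent in group-A fsyncing a clean file then issues the log write
# (xlog_sync) inside the throttled group, and group-B's fsyncs stall
# behind it on the global log.
```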
2018-03-27 19:50 GMT+08:00 Brian Foster <bfoster@redhat.com>: > On Tue, Mar 27, 2018 at 06:54:27PM +0800, 张本龙 wrote: >> 2018-03-27 0:28 GMT+08:00 Brian Foster <bfoster@redhat.com>: >> >> > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote: >> > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote: >> > > > Hi Shaohua and XFS, >> > > > >> > > > May I ask how are we gonna handle REQ_META issued from XFS? As you >> > > > mentioned about charging to root cgroup (also in an earlier email >> > > > discussion), and seems the 4.16.0-rc6 code is not handling it >> > > > separately. >> > > > >> > > > In our case to support XFS cgroup writeback control, which was ported >> > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in >> > > > trouble. Threads from throttled docker might submit_bio in following >> > > > path by its own identity, this docker blkcg accumulated large amounts >> > > > of data (e.g., 20GB), thus such log gets blocked. >> > > >> > > And thus displaying the reason why I originally refused to merge >> > > this code until regression tests were added to fstests to exercise >> > > these sorts of issues. This stuff adds new internal filesystem IO >> > > ordering constraints, so we need tests that exercise it and ensure >> > > we don't accidentally break it in future. >> > > >> > >> > Hmm, but if the user issues fsync from the throttled cgroup then won't >> > that throttling occur today, regardless of cgroup aware writeback? My >> > understanding is that cgawb just accurately accounts writeback I/Os to >> > the owner of the cached pages. IOW, if the buffered writer and fsync >> > call are in the same throttled cgroup, then the throttling works just as >> > it would with cgawb and the writer being in a throttled cgroup. >> > >> > So ISTM that this is an independent problem. What am I missing? 
>> >> Yeah throttling would work for fsync traffic even without cgwb, but that's >> because of the way bio_blkcg() is implemented: charging to bio->bi_css when >> set but the submitter's blkcg when not. So without cgwb normal traffic from >> writeback kworkers are attributed to blkcg_root, which is normally >> unlimited. >> > > Ok, that's what I thought. Your report shows everything backed up on an > fsync call, however, so I didn't really see how it related to this Sorry for the confusion, but the original report is not showing things backed up on fsync. It shows fsync MIGHT submit the log. -------------------------------------------------------------------------- xxx-agent 22825 [001] 78772.391023: probe:submit_bio: 4f70c1 submit_bio 4e440 xfs_buf_submit ([xfs]) 6e59b xlog_bdstrat ([xfs]) 704c5 xlog_sync ([xfs]) 70683 xlog_state_release_iclog ([xfs]) 71917 _xfs_log_force_lsn ([xfs]) 5249e xfs_file_fsync ([xfs]) 4374d5 do_fsync 4377a0 sys_fsync 8a0ac9 system_call 1df9c4 [unknown] Please note this is NOT a kernel INFO message, it's a dynamic perf probe trace to see who is submitting the log. As the agent's docker is now blocked by other fio threads, the log would be stuck from this path. The following trace is kernel INFO; it shows another docker waiting for log integrity and timing out, and similar for the third trace via flush_work(). Again, not completely sure about XFS, but the traces and code diving seem to suggest it. ------------------------------------------------------------------------------------------------- [79183.692355] INFO: task xxx:33434 blocked for more than 120 seconds. [79183.730997] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ...... [79183.954738] Call Trace: [79183.969516] [<ffffffff81695ac9>] schedule+0x29/0x70 /* inline xlog_wait here */ [79183.999357] [<ffffffffa073497a>] _xfs_log_force_lsn+0x2fa/0x350 [xfs] [79184.038497] [<ffffffff810c84c0>] ?
wake_up_state+0x20/0x20 [79184.071941] [<ffffffffa071542e>] xfs_file_fsync+0x10e/0x1e0 [xfs] [79184.109010] [<ffffffff812374d5>] do_fsync+0x65/0xa0 [79184.138826] [<ffffffff8120368f>] ? SyS_write+0x9f/0xe0 [79184.170182] [<ffffffff812377a0>] SyS_fsync+0x10/0x20 [79184.200514] [<ffffffff816a0ac9>] system_call_fastpath+0x16/0x1b > patch. So unless Shaohua explains otherwise, my understanding is that > this is not necessarily caused by cgawb, but rather this is a separate > but similar gap in the broader block I/O throttling mechanism that also > needs to be addressed before throttling is usable (in practice) by > filesystems. You presumably deal with this by filtering out metadata > bios, while the patch Shaohua mentions deals with it by promoting > metadata bios to the root cgroup. Either way would work, agreed. Though I wouldn't call it 'priority inversion', as Dave explained it's more like 'global log mistakenly throttled'. The report is to provide some evidence that this single patch is not enough for xfs writeback. And we should wait for the other patch from Shaohua. > > Brian > >> And fsync might yield kernel INFO messages when bps * 120 < dirty data, >> giving something like below. That's Ok though, since throttling and all >> types >> of sync calls are contradictory in nature. >> ------------------------------------------------------------ >> --------------------------------- >> [Mar 2 14:27] INFO: task dd:40794 blocked for more than 120 seconds. >> [ +0.000747] Not tainted 4.15.7 #1 >> [ +0.000610] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables >> this message. >> [ +0.000614] dd D 0 40794 40793 0x10000000 >> [ +0.000610] Call Trace: >> [ +0.000619] ? __schedule+0x28d/0x870 >> [ +0.000611] schedule+0x32/0x80 >> [ +0.000615] io_schedule+0x12/0x40 >> [ +0.000612] wait_on_page_bit_common+0xff/0x1a0 >> [ +0.000609] ? 
page_cache_tree_insert+0xd0/0xd0 >> [ +0.000614] __filemap_fdatawait_range+0xe2/0x150 >> [ +0.000614] file_write_and_wait_range+0x62/0xa0 >> [ +0.000664] xfs_file_fsync+0x65/0x1d0 [xfs] >> [ +0.000621] do_fsync+0x38/0x60 >> [ +0.000610] SyS_fsync+0xc/0x10 >> [ +0.000616] do_syscall_64+0x6e/0x1a0 >> [ +0.000612] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 >> [ +0.000610] RIP: 0033:0x7f8131231020 >> [ +0.000604] RSP: 002b:00007fff81c04f48 EFLAGS: 00000246 ORIG_RAX: >> 000000000000004a >> [ +0.000610] RAX: ffffffffffffffda RBX: 00007f8131af9690 RCX: >> 00007f8131231020 >> [ +0.000619] RDX: 0000000004000000 RSI: 0000000000000000 RDI: >> 0000000000000001 >> [ +0.000626] RBP: 0000000000000020 R08: 00000000ffffffff R09: >> 0000000000000000 >> [ +0.000618] R10: 00007fff81c04d10 R11: 0000000000000246 R12: >> 0000000000000020 >> [ +0.000620] R13: 0000000000000000 R14: 00007fff81c054cf R15: >> 00007fff81c05198 >> >> >> > Shaohua, >> > >> > Do you have a reference to the older metadata related patch mentioned in >> > the commit log that presumably addressed this? >> > >> > Brian >> > >> > > >> > > > Not familiar with XFS, but seems log bios are partially stuck in >> > > > throttled cgroups, leaving other innocent groups waiting for >> > > > completion. To cope with this we bypassed REQ_META log bios in >> > > > blk_throtl_bio(). >> > > >> > > Yup, the log is global and should not be throttled. Metadata is less >> > > obvious as to what the correct thing to do is, because writes are >> > > always done from internal kernel threads but reads are done from a >> > > mix of kernel threads, user cgroup contexts and user data IO >> > > completions. Hence there are situations where metadata reads may >> > > need to be throttled because they are cgroup context only (e.g. >> > > filesystem directory traversal) but others where reads should not >> > > be throttled because they have global context (e.g. inside a >> > > transaction when other buffers are already locked). 
>> > > >> >> >> Thanks for the explanation Dave. So there are both read and write xfs_buf, >> and reads really seem to be complex then. >> >> >> > > Getting this right and keeping it working requires regression tests >> > > that get run on every release, whether it be upstream or distro >> > > kernels, and that means we need tests in fstests to cover cgroup IO >> > > control.... >> > > >> > > Cheers, >> > > >> > > Dave. >> > > -- >> > > Dave Chinner >> > > david@fromorbit.com >> > > -- >> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in >> > > the body of a message to majordomo@vger.kernel.org >> > > More majordomo info at http://vger.kernel.org/majordomo-info.html >> > -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Mar 28, 2018 at 12:37:32PM +0800, 张本龙 wrote: > 2018-03-27 19:36 GMT+08:00 Brian Foster <bfoster@redhat.com>: > > On Mon, Mar 26, 2018 at 05:55:26PM -0700, Shaohua Li wrote: > >> On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote: > >> > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote: > >> > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote: > >> > > > Hi Shaohua and XFS, > >> > > > > >> > > > May I ask how are we gonna handle REQ_META issued from XFS? As you > >> > > > mentioned about charging to root cgroup (also in an earlier email > >> > > > discussion), and seems the 4.16.0-rc6 code is not handling it > >> > > > separately. > >> > > > > >> > > > In our case to support XFS cgroup writeback control, which was ported > >> > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in > >> > > > trouble. Threads from throttled docker might submit_bio in following > >> > > > path by its own identity, this docker blkcg accumulated large amounts > >> > > > of data (e.g., 20GB), thus such log gets blocked. > >> > > > >> > > And thus displaying the reason why I originally refused to merge > >> > > this code until regression tests were added to fstests to exercise > >> > > these sorts of issues. This stuff adds new internal filesystem IO > >> > > ordering constraints, so we need tests that exercise it and ensure > >> > > we don't accidentally break it in future. > >> > > > >> > > >> > Hmm, but if the user issues fsync from the throttled cgroup then won't > >> > that throttling occur today, regardless of cgroup aware writeback? My > >> > understanding is that cgawb just accurately accounts writeback I/Os to > >> > the owner of the cached pages. IOW, if the buffered writer and fsync > >> > call are in the same throttled cgroup, then the throttling works just as > >> > it would with cgawb and the writer being in a throttled cgroup. > >> > > >> > So ISTM that this is an independent problem. What am I missing? 
> >> > > >> > Shaohua, > >> > > >> > Do you have a reference to the older metadata related patch mentioned in > >> > the commit log that presumably addressed this? > >> > >> The problem is about priority reversion. Say you do a fsync in a low prio > >> cgroup, the IO will be submitted with low prio. Now you do a fsync in a high > >> prio cgroup, the cgroup will wait for fsync IO finished, which is already > >> submitted by the low prio cgroup and run in low prio. This makes the high prio > >> cgroup run slow. The proposed patch is to force all metadata write submitted > >> from root cgroup regardless which task submitted, which can fix this issue. > >> > > > > Right, but it seems to me that this can happen with or without cgroup > > aware writeback. This patch just introduces the final bits required to > > carry the page owner from however it is tracked in the writeback machine > > to the bio submitted by the fs. It doesn't introduce/enable/implement > > I/O throttling itself, which is already in place and usable (outside of > > the buffered write page owner problem fixed by this patch), right? > > > > So without this patch, if a task in throttled cgroup A does a bunch of > > buffered writes and calls fsync, then another task in unthrottled cgroup > > B calls fsync, aren't we (XFS) susceptible to priority inversion via > > these same log I/O serialization points? If not, then what am I missing? > > > > Ok first let's agree we're not talking about 2 groups fsync the same > file. What's demonstrated in the original post, is that group-A MIGHT > submit xlog in the fsync path via xlog_sync() -> xfs_buf_submit(). > Threads from group-B are waiting for the log from flush_work(), at the > same time kworkers from xlog_cil_push_work(). > > The problem is not about fsync get stuck on the target file. Actually > group-B should be waiting on filemap_write_and_wait_range() if it > were, as xfs_file_fsync() would flush the real data before > _xfs_log_force_lsn. 
> Yes, I don't mean to suggest we're contending on the same file across different cgroups. Rather, we're getting stuck on a dependency of a global resource (the log) between groups with very different priorities. > Here is the setup: group-A has 10 fio jobs each running on 20GB files, > and also some agents with not much IO; group-B just has the agents. > With this patch we set a bps=20Mb/s to A, thus the large amounts of > fio traffic are blocking the group (Note the fio are not doing any > fsync though). At this time, one agent does an fsync to a non-dirty > file, bypassing filemap_write_and_wait_range() and doing xlog_sync(). > Here we go, log bios are stuck in group-A. Then group-B and kworkers > are waiting for log integrity indefinitely from various paths. Without > this patch fio should writeout at full disk speed, say 200MB/s, so > it's not accumulated in group-A. > I see. So in your particular setup, cgawb is required to reproduce because you aren't actually pushing data through the cgroup via fsync. Therefore, cgawb is what actually enables the throttling for your use case. The throttling slows down further I/O from group A and thus creates the problematic conditions for contention on the log. However, it looks like the only requirement to cause this particular condition is that an fsync submits a log bio in an active and throttled cgroup that ends up blocked for a period of time. Other log serializing tasks in other cgroups are now susceptible to the first cgroup's throttle if they have to wait on log I/O completion. So replace your cgawb use case with one that creates similar conditions by somebody else waiting on fsync -> filemap_write_and_wait_range() on a separate file in the same (throttled) group and afaict the same thing can occur without cgawb in the picture at all. But between this and your other email I think I see how this isn't what you are actually reproducing. Thanks for the explanation. 
Brian > > I'm not saying this isn't a problem that needs fixing, I just want to > > make sure I understand the fundamental problem(s), what this cgawb patch > > actually does and doesn't do and whether there is a logical dependency > > between this patch and the proposed metadata filtering patch. > > > > Brian > > > >> Thanks, > >> Shaohua > >> -- > >> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > >> the body of a message to majordomo@vger.kernel.org > >> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Mar 28, 2018 at 08:56:14AM +1100, Dave Chinner wrote: > On Tue, Mar 27, 2018 at 07:36:48AM -0400, Brian Foster wrote: > > On Mon, Mar 26, 2018 at 05:55:26PM -0700, Shaohua Li wrote: > > > On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote: > > > > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote: > > > > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote: > > > > > > Hi Shaohua and XFS, > > > > > > > > > > > > May I ask how are we gonna handle REQ_META issued from XFS? As you > > > > > > mentioned about charging to root cgroup (also in an earlier email > > > > > > discussion), and seems the 4.16.0-rc6 code is not handling it > > > > > > separately. > > > > > > > > > > > > In our case to support XFS cgroup writeback control, which was ported > > > > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in > > > > > > trouble. Threads from throttled docker might submit_bio in following > > > > > > path by its own identity, this docker blkcg accumulated large amounts > > > > > > of data (e.g., 20GB), thus such log gets blocked. > > > > > > > > > > And thus displaying the reason why I originally refused to merge > > > > > this code until regression tests were added to fstests to exercise > > > > > these sorts of issues. This stuff adds new internal filesystem IO > > > > > ordering constraints, so we need tests that exercise it and ensure > > > > > we don't accidentally break it in future. > > > > > > > > > > > > > Hmm, but if the user issues fsync from the throttled cgroup then won't > > > > that throttling occur today, regardless of cgroup aware writeback? My > > > > understanding is that cgawb just accurately accounts writeback I/Os to > > > > the owner of the cached pages. IOW, if the buffered writer and fsync > > > > call are in the same throttled cgroup, then the throttling works just as > > > > it would with cgawb and the writer being in a throttled cgroup. > > > > > > > > So ISTM that this is an independent problem. 
What am I missing? > > > > > > > > Shaohua, > > > > > > > > Do you have a reference to the older metadata related patch mentioned in > > > > the commit log that presumably addressed this? > > > > > > The problem is about priority reversion. Say you do a fsync in a low prio > > > cgroup, the IO will be submitted with low prio. Now you do a fsync in a high > > > prio cgroup, the cgroup will wait for fsync IO finished, which is already > > > submitted by the low prio cgroup and run in low prio. This makes the high prio > > > cgroup run slow. The proposed patch is to force all metadata write submitted > > > from root cgroup regardless which task submitted, which can fix this issue. > > > > > > > Right, but it seems to me that this can happen with or without cgroup > > aware writeback. This patch just introduces the final bits required to > > carry the page owner from however it is tracked in the writeback machine > > to the bio submitted by the fs. It doesn't introduce/enable/implement > > I/O throttling itself, which is already in place and usable (outside of > > the buffered write page owner problem fixed by this patch), right? > > > > So without this patch, if a task in throttled cgroup A does a bunch of > > buffered writes and calls fsync, then another task in unthrottled cgroup > > B calls fsync, aren't we (XFS) susceptible to priority inversion via > > these same log I/O serialization points? If not, then what am I missing? > > Well, I was originally told that bios send by a filesystem without > cgroup info would be accounted to the root cgroup and hence not > throttled at all. i.e. metadata, and any untagged data IO. In that > case, there should be no problems what-so-ever as all XFS IO should > be issued at the same priority > It sounds like this was actually the plan at some point but the associated patch (to which there is still no reference?) was never committed. I have no idea why that is.. whether it was just lost or nacked with outstanding issues..? 
> However, given the way people are talking about needing to bypass > block cgroup throttling via REQ_META hacks, I'm guessing that it > doesn't actually work that way. i.e. somewhere in the block layer it > attaches the current task cgroup to bios without existing cgroup > information, and so we end up with priority inversion problems > whenever cgroups and throttling are in use regardless of whether the > filesystem supports it or not.... > > Ah, yeah, the task cgroup gets added through this path: > > submit_bio > generic_make_request > generic_make_request_checks > blkcg_bio_issue_check > bio_blkcg > Right, this is what I thought based on the code. For anything not explicitly associated with a cgroup, it pretty much depends on the user context of the I/O submission. > > I'm not saying this isn't a problem that needs fixing, I just want to > > make sure I understand the fundamental problem(s), what this cgawb patch > > actually does and doesn't do and whether there is a logical dependency > > between this patch and the proposed metadata filtering patch. > > Seems like there's a lot more work to be done to make this stuff > work reliably than anyone has been telling us. Without regression > tests, we're never going to know if it actually works or not.... > I'm wondering if there's something simple/high-level we can do in the filesystem to sort of flip the default block layer condition above on its head with regard to metadata. For example, forcibly associate all metadata bios to the root cgroup somewhere down in the xfs_buf_submit() path. So rather than doing metadata I/O on behalf of a variety of random cgroups based purely on userspace configuration and activity, we start with a consistent and predictable base behavior. The former (current) behavior leaves us open to these kinds of problematic blocked states. 
The latter approach changes those gaps into something that instead may just not honor the throttling limits for certain metadata heavy workloads until finer grained control is implemented in the fs (which can be implemented as needed). That's also advantageous because by default we won't throttle certain I/Os that might have complex locking/transaction dependencies and instead opt-in to such metadata throttling in the obvious contexts where it's appropriate. For example, the directory read example you mentioned before could set a state on the buffer that prevents preemptive association so the bio is actually associated with current userspace context in the block layer and throttled appropriately. Of course, I haven't looked into how reasonable any of that is to implement. It would be nice to see whatever patch was previously proposed for reference. :/ Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Mar 28, 2018 at 07:32:32AM -0400, Brian Foster wrote:
> On Wed, Mar 28, 2018 at 08:56:14AM +1100, Dave Chinner wrote:
> > On Tue, Mar 27, 2018 at 07:36:48AM -0400, Brian Foster wrote:
> > > On Mon, Mar 26, 2018 at 05:55:26PM -0700, Shaohua Li wrote:
> > > > On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote:
> > > > > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> > > > > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> > > > > > > Hi Shaohua and XFS,
> > > > > > >
> > > > > > > May I ask how are we gonna handle REQ_META issued from XFS? As you
> > > > > > > mentioned about charging to root cgroup (also in an earlier email
> > > > > > > discussion), and seems the 4.16.0-rc6 code is not handling it
> > > > > > > separately.
> > > > > > >
> > > > > > > In our case to support XFS cgroup writeback control, which was ported
> > > > > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> > > > > > > trouble. Threads from throttled docker might submit_bio in following
> > > > > > > path by its own identity, this docker blkcg accumulated large amounts
> > > > > > > of data (e.g., 20GB), thus such log gets blocked.
> > > > > >
> > > > > > And thus displaying the reason why I originally refused to merge
> > > > > > this code until regression tests were added to fstests to exercise
> > > > > > these sorts of issues. This stuff adds new internal filesystem IO
> > > > > > ordering constraints, so we need tests that exercise it and ensure
> > > > > > we don't accidentally break it in future.
> > > > > >
> > > > >
> > > > > Hmm, but if the user issues fsync from the throttled cgroup then won't
> > > > > that throttling occur today, regardless of cgroup aware writeback? My
> > > > > understanding is that cgawb just accurately accounts writeback I/Os to
> > > > > the owner of the cached pages. IOW, if the buffered writer and fsync
> > > > > call are in the same throttled cgroup, then the throttling works just as
> > > > > it would with cgawb and the writer being in a throttled cgroup.
> > > > >
> > > > > So ISTM that this is an independent problem. What am I missing?
> > > > >
> > > > > Shaohua,
> > > > >
> > > > > Do you have a reference to the older metadata related patch mentioned in
> > > > > the commit log that presumably addressed this?
> > > >
> > > > The problem is about priority reversion. Say you do a fsync in a low prio
> > > > cgroup, the IO will be submitted with low prio. Now you do a fsync in a high
> > > > prio cgroup, the cgroup will wait for fsync IO finished, which is already
> > > > submitted by the low prio cgroup and run in low prio. This makes the high prio
> > > > cgroup run slow. The proposed patch is to force all metadata write submitted
> > > > from root cgroup regardless which task submitted, which can fix this issue.
> > > >
> > >
> > > Right, but it seems to me that this can happen with or without cgroup
> > > aware writeback. This patch just introduces the final bits required to
> > > carry the page owner from however it is tracked in the writeback machine
> > > to the bio submitted by the fs. It doesn't introduce/enable/implement
> > > I/O throttling itself, which is already in place and usable (outside of
> > > the buffered write page owner problem fixed by this patch), right?
> > >
> > > So without this patch, if a task in throttled cgroup A does a bunch of
> > > buffered writes and calls fsync, then another task in unthrottled cgroup
> > > B calls fsync, aren't we (XFS) susceptible to priority inversion via
> > > these same log I/O serialization points? If not, then what am I missing?
> >
> > Well, I was originally told that bios send by a filesystem without
> > cgroup info would be accounted to the root cgroup and hence not
> > throttled at all. i.e. metadata, and any untagged data IO. In that
> > case, there should be no problems what-so-ever as all XFS IO should
> > be issued at the same priority
> >
>
> It sounds like this was actually the plan at some point but the
> associated patch (to which there is still no reference?) was never
> committed. I have no idea why that is.. whether it was just lost or
> nacked with outstanding issues..?

I've never seen the patch that is being spoken of. I'm guessing it
lives in some dark corner of FB's internal kernel tree and they have
no motivation to push stuff like this upstream....

> > However, given the way people are talking about needing to bypass
> > block cgroup throttling via REQ_META hacks, I'm guessing that it
> > doesn't actually work that way. i.e. somewhere in the block layer it
> > attaches the current task cgroup to bios without existing cgroup
> > information, and so we end up with priority inversion problems
> > whenever cgroups and throttling are in use regardless of whether the
> > filesystem supports it or not....
> >
> > Ah, yeah, the task cgroup gets added through this path:
> >
> > submit_bio
> > generic_make_request
> > generic_make_request_checks
> > blkcg_bio_issue_check
> > bio_blkcg
> >
>
> Right, this is what I thought based on the code. For anything not
> explicitly associated with a cgroup, it pretty much depends on the user
> context of the I/O submission.

I suspect that the code when originally proposed didn't have this
"fall back to task context" code in it. i.e. it was added later,
thereby changing the default behaviour and introducing all these
filesystem level issues.

> > > I'm not saying this isn't a problem that needs fixing, I just want to
> > > make sure I understand the fundamental problem(s), what this cgawb patch
> > > actually does and doesn't do and whether there is a logical dependency
> > > between this patch and the proposed metadata filtering patch.
> >
> > Seems like there's a lot more work to be done to make this stuff
> > work reliably than anyone has been telling us. Without regression
> > tests, we're never going to know if it actually works or not....
> >
>
> I'm wondering if there's something simple/high-level we can do in the
> filesystem to sort of flip the default block layer condition above on
> its head with regard to metadata.

I'm guessing that's what the patches that look at REQ_META and assign
it to the root cgroup does - it makes them global scope rather than
whatever is attached to them and prevents them from being throttled
by the existing/default cgroup configuration.

> For example, forcibly associate all
> metadata bios to the root cgroup somewhere down in the xfs_buf_submit()
> path. So rather than doing metadata I/O on behalf of a variety of random
> cgroups based purely on userspace configuration and activity, we start
> with a consistent and predictable base behavior.

I'd prefer the block layer does that by default when it comes across
an untagged REQ_META bio. That way we don't have to sprinkle magic
cgroup pixie dust through every single filesystem to deal with this
same problem....

Cheers,

Dave.
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 19eadc8..5f70584 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -589,7 +589,8 @@ xfs_alloc_ioend(
 	struct inode		*inode,
 	unsigned int		type,
 	xfs_off_t		offset,
-	struct buffer_head	*bh)
+	struct buffer_head	*bh,
+	struct writeback_control *wbc)
 {
 	struct xfs_ioend	*ioend;
 	struct bio		*bio;
@@ -606,6 +607,8 @@ xfs_alloc_ioend(
 	INIT_WORK(&ioend->io_work, xfs_end_io);
 	ioend->io_append_trans = NULL;
 	ioend->io_bio = bio;
+	/* attach new bio to its cgroup */
+	wbc_init_bio(wbc, bio);
 	return ioend;
 }
 
@@ -633,6 +636,8 @@ xfs_chain_bio(
 	ioend->io_bio->bi_write_hint = ioend->io_inode->i_write_hint;
 	submit_bio(ioend->io_bio);
 	ioend->io_bio = new;
+	/* attach new bio to its cgroup */
+	wbc_init_bio(wbc, new);
 }
 
 /*
@@ -656,7 +661,8 @@ xfs_add_to_ioend(
 	    offset != wpc->ioend->io_offset + wpc->ioend->io_size) {
 		if (wpc->ioend)
 			list_add(&wpc->ioend->io_list, iolist);
-		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset, bh);
+		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset,
+				bh, wbc);
 	}
 
 	/*
@@ -666,6 +672,9 @@ xfs_add_to_ioend(
 	while (xfs_bio_add_buffer(wpc->ioend->io_bio, bh) != bh->b_size)
 		xfs_chain_bio(wpc->ioend, wbc, bh);
 
+	/* Charge write size to its cgroup for cgroup switching track */
+	wbc_account_io(wbc, bh->b_page, bh->b_size);
+
 	wpc->ioend->io_size += bh->b_size;
 	wpc->last_block = bh->b_blocknr;
 	xfs_start_buffer_writeback(bh);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 951271f..95c2d3d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1666,6 +1666,7 @@ xfs_fs_fill_super(
 	sb->s_max_links = XFS_MAXLINK;
 	sb->s_time_gran = 1;
 	set_posix_acl_flag(sb);
+	sb->s_iflags |= SB_I_CGROUPWB;
 
 	/* version 5 superblocks support inode version counters. */
 	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
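A sketch of how the behavior enabled by SB_I_CGROUPWB above could be exercised from userspace, along the lines of the fio test described in the thread. This requires root, cgroup v2 with the io controller enabled, and an XFS filesystem on the limited device; the device number (8:0), cgroup name, and mount path are examples only.

```shell
# Create a cgroup and cap buffered-writeback bandwidth at 10 MB/s for
# device 8:0 (cgroup v2 io.max syntax: "MAJ:MIN wbps=<bytes/sec>").
mkdir -p /sys/fs/cgroup/wbtest
echo "8:0 wbps=10485760" > /sys/fs/cgroup/wbtest/io.max

# Move the current shell into the cgroup, then dirty pages on an XFS
# mount; with cgroup writeback wired up, the flusher's bios are charged
# to this cgroup instead of root and throttled accordingly.
echo $$ > /sys/fs/cgroup/wbtest/cgroup.procs
fio --name=wb --rw=write --bs=1M --size=1G \
    --filename=/mnt/xfs/testfile --end_fsync=1
```

Moving the fio process to a differently limited cgroup mid-run (via cgroup.procs) is the way to observe the wbc_account_io()-driven "cgroup switching track" the patch mentions: subsequent writeback should throttle against the new cgroup's limit.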