[cgroup/for-4.3-fixes] cgroup, writeback: don't enable cgroup writeback on traditional hierarchies

Message ID 20150923210729.GA23180@mtj.duckdns.org
State New

Commit Message

Tejun Heo Sept. 23, 2015, 9:07 p.m. UTC
inode_cgwb_enabled() gates cgroup writeback support.  If it returns
true, each inode is attached to the corresponding memory domain which
gets mapped to io domain.  It currently only tests whether the
filesystem and bdi support cgroup writeback; however, cgroup writeback
support doesn't work on traditional hierarchies and thus it should
also test whether memcg and iocg are on the default hierarchy.

This caused traditional hierarchy setups to hit the cgroup writeback
path inadvertently and ended up creating separate writeback domains
for each memcg and mapping them all to the root iocg uncovering a
couple issues in the cgroup writeback path.

cgroup writeback was never meant to be enabled on traditional
hierarchies.  Make inode_cgwb_enabled() test whether both memcg and
iocg are on the default hierarchy.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Artem Bityutskiy <dedekind1@gmail.com>
Reported-by: Dexuan Cui <decui@microsoft.com>
Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
Link: http://lkml.kernel.org/g/f30d4a6aa8a546ff88f73021d026a453@SIXPR30MB031.064d.mgd.msft.net
---
Hello,

So, this should make the regression go away.  It doesn't fix the
underlying bugs but they shouldn't get triggered by people not
experimenting with cgroup.

I'm gonna keep digging the underlying issues but this should make the
regressions go away.  If it's okay, I think it'd be better to route
this through cgroup/for-4.3-fixes as it's gonna cause a conflict with
for-4.4 branch and handling the merge there is easier.

Thanks.

 include/linux/backing-dev.h |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Artem Bityutskiy Sept. 24, 2015, 8:09 a.m. UTC | #1
On Wed, 2015-09-23 at 17:07 -0400, Tejun Heo wrote:
> Hello,
> 
> So, this should make the regression go away.  It doesn't fix the
> underlying bugs but they shouldn't get triggered by people not
> experimenting with cgroup.

Tejun,

this hits the nail on the head and makes the problem go away.

I've tested the tip of Linus's tree (v4.3-rc2+) plus this patch - no
data corruption after reboots.

I've tested just the tip of Linus's tree (v4.3-rc2+) without this
patch, and I do see the data corruption after reboots.

Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>

Artem.
Dexuan Cui Sept. 24, 2015, 8:40 a.m. UTC | #2
> On Wed, 2015-09-23 at 17:07 -0400, Tejun Heo wrote:
> > Hello,
> >
> > So, this should make the regression go away.  It doesn't fix the
> > underlying bugs but they shouldn't get triggered by people not
> > experimenting with cgroup.
>
> Tejun,
>
> this hits the nail on the head and makes the problem go away.
>
> I've tested the tip of Linus's tree (v4.3-rc2+) plus this patch - no
> data corruption after reboots.
>
> I've tested just the tip of Linus's tree (v4.3-rc2+) without this
> patch, and I do see the data corruption after reboots.
>
> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
>
> Artem.

I can confirm the patch fixes my "slow write" issue too.

Tested-by: Dexuan Cui <decui@microsoft.com>


-- Dexuan
Jens Axboe Sept. 24, 2015, 4:17 p.m. UTC | #3
On 09/23/2015 03:07 PM, Tejun Heo wrote:
> inode_cgwb_enabled() gates cgroup writeback support.  If it returns
> true, each inode is attached to the corresponding memory domain which
> gets mapped to io domain.  It currently only tests whether the
> filesystem and bdi support cgroup writeback; however, cgroup writeback
> support doesn't work on traditional hierarchies and thus it should
> also test whether memcg and iocg are on the default hierarchy.
>
> This caused traditional hierarchy setups to hit the cgroup writeback
> path inadvertently and ended up creating separate writeback domains
> for each memcg and mapping them all to the root iocg uncovering a
> couple issues in the cgroup writeback path.
>
> cgroup writeback was never meant to be enabled on traditional
> hierarchies.  Make inode_cgwb_enabled() test whether both memcg and
> iocg are on the default hierarchy.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Artem Bityutskiy <dedekind1@gmail.com>
> Reported-by: Dexuan Cui <decui@microsoft.com>
> Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
> Link: http://lkml.kernel.org/g/f30d4a6aa8a546ff88f73021d026a453@SIXPR30MB031.064d.mgd.msft.net
> ---
> Hello,
>
> So, this should make the regression go away.  It doesn't fix the
> underlying bugs but they shouldn't get triggered by people not
> experimenting with cgroup.
>
> I'm gonna keep digging the underlying issues but this should make the
> regressions go away.  If it's okay, I think it'd be better to route
> this through cgroup/for-4.3-fixes as it's gonna cause a conflict with
> for-4.4 branch and handling the merge there is easier.
>
> Thanks.
>
>   include/linux/backing-dev.h |   11 +++++++++--
>   1 file changed, 9 insertions(+), 2 deletions(-)

I'll ack this since it works around both the corruption issue and the
performance regression, so we can avoid having to revert parts of this.
And I know you'll keep hunting and get the real issue fixed in the
meantime.

Acked-by: Jens Axboe <axboe@fb.com>
Tejun Heo Sept. 24, 2015, 8:47 p.m. UTC | #4
Hello,

On Thu, Sep 24, 2015 at 08:40:18AM +0000, Dexuan Cui wrote:
> I can confirm the patch fixes my "slow write" issue too.
> 
> Tested-by: Dexuan Cui <decui@microsoft.com>

Yeah, this should make it go away w/o using cgroup writeback
explicitly; however, I think the proper solution for cgroup writeback
is moving bandwidth estimation from memory domain to io domain so that
two separate bw estimations wouldn't interfere with each other leading
to unexpected outcomes.  I'll work on the changes.

Thanks.
Tejun Heo Sept. 24, 2015, 8:48 p.m. UTC | #5
On Wed, Sep 23, 2015 at 05:07:29PM -0400, Tejun Heo wrote:
> inode_cgwb_enabled() gates cgroup writeback support.  If it returns
> true, each inode is attached to the corresponding memory domain which
> gets mapped to io domain.  It currently only tests whether the
> filesystem and bdi support cgroup writeback; however, cgroup writeback
> support doesn't work on traditional hierarchies and thus it should
> also test whether memcg and iocg are on the default hierarchy.
> 
> This caused traditional hierarchy setups to hit the cgroup writeback
> path inadvertently and ended up creating separate writeback domains
> for each memcg and mapping them all to the root iocg uncovering a
> couple issues in the cgroup writeback path.
> 
> cgroup writeback was never meant to be enabled on traditional
> hierarchies.  Make inode_cgwb_enabled() test whether both memcg and
> iocg are on the default hierarchy.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Artem Bityutskiy <dedekind1@gmail.com>
> Reported-by: Dexuan Cui <decui@microsoft.com>
> Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
> Link: http://lkml.kernel.org/g/f30d4a6aa8a546ff88f73021d026a453@SIXPR30MB031.064d.mgd.msft.net

Applying to cgroup/for-4.3-fixes.

Thanks.
Tejun Heo Sept. 28, 2015, 9:39 p.m. UTC | #6
Hello,

On Thu, Sep 24, 2015 at 04:47:36PM -0400, Tejun Heo wrote:
> On Thu, Sep 24, 2015 at 08:40:18AM +0000, Dexuan Cui wrote:
> > I can confirm the patch fixes my "slow write" issue too.
> > 
> > Tested-by: Dexuan Cui <decui@microsoft.com>
> 
> Yeah, this should make it go away w/o using cgroup writeback
> explicitly; however, I think the proper solution for cgroup writeback
> is moving bandwidth estimation from memory domain to io domain so that
> two separate bw estimations wouldn't interfere with each other leading
> to unexpected outcomes.  I'll work on the changes.

So, this one actually turns out to be mostly caused by enabling cgroup
writeback when it shouldn't be.  balance_dirty_pages() ended up
looking at a different bdi_writeback from the actual writeback path so
the throttling was completely off, so making sure that cgroup
writeback doesn't get turned on traditional hierarchies is the right
solution here.

While auditing the behavior, I noticed a couple non-critical issues.
Will post patches to fix them soon.

Thanks.

Patch

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 5a5d79e..d5eb4ad1 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -13,6 +13,7 @@ 
 #include <linux/sched.h>
 #include <linux/blkdev.h>
 #include <linux/writeback.h>
+#include <linux/memcontrol.h>
 #include <linux/blk-cgroup.h>
 #include <linux/backing-dev-defs.h>
 #include <linux/slab.h>
@@ -252,13 +253,19 @@  int inode_congested(struct inode *inode, int cong_bits);
  * @inode: inode of interest
  *
  * cgroup writeback requires support from both the bdi and filesystem.
- * Test whether @inode has both.
+ * Also, both memcg and iocg have to be on the default hierarchy.  Test
+ * whether all conditions are met.
+ *
+ * Note that the test result may change dynamically on the same inode
+ * depending on how memcg and iocg are configured.
  */
 static inline bool inode_cgwb_enabled(struct inode *inode)
 {
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
 
-	return bdi_cap_account_dirty(bdi) &&
+	return cgroup_on_dfl(mem_cgroup_root_css->cgroup) &&
+		cgroup_on_dfl(blkcg_root_css->cgroup) &&
+		bdi_cap_account_dirty(bdi) &&
 		(bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
 		(inode->i_sb->s_iflags & SB_I_CGROUPWB);
 }
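
For readers who want to see the effect of the new gating condition in isolation, the following is a minimal userspace sketch, not kernel code. The struct and both function names are hypothetical stand-ins introduced here for illustration; each boolean field only models the result of the corresponding kernel test named in its comment. It assumes nothing about how those tests are implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace model (assumption, not a kernel structure).
 * Each field stands in for the outcome of one kernel-side test.
 */
struct cgwb_model {
	bool memcg_on_dfl;      /* models cgroup_on_dfl(mem_cgroup_root_css->cgroup) */
	bool blkcg_on_dfl;      /* models cgroup_on_dfl(blkcg_root_css->cgroup) */
	bool bdi_account_dirty; /* models bdi_cap_account_dirty(bdi) */
	bool bdi_cgwb_cap;      /* models bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK */
	bool sb_cgroupwb;       /* models inode->i_sb->s_iflags & SB_I_CGROUPWB */
};

/* Condition before the patch: only bdi and filesystem support are tested. */
bool cgwb_enabled_before(const struct cgwb_model *m)
{
	assert(m != NULL);
	return m->bdi_account_dirty && m->bdi_cgwb_cap && m->sb_cgroupwb;
}

/* Condition after the patch: memcg and iocg must also be on the default hierarchy. */
bool cgwb_enabled_after(const struct cgwb_model *m)
{
	assert(m != NULL);
	return m->memcg_on_dfl && m->blkcg_on_dfl && cgwb_enabled_before(m);
}
```

In this model, a traditional hierarchy setup on a filesystem and bdi that both support cgroup writeback (both on_dfl fields false, the rest true) passes the old condition but fails the new one, which is the case the commit message describes. Because the on_dfl fields reflect how the controllers are currently mounted, the result for the same inode can change over time, as the updated comment in the patch notes.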