[22/51] writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK

Message ID	1432329245-5844-23-git-send-email-tj@kernel.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: Tejun Heo <tj@kernel.org> To: axboe@kernel.dk Cc: linux-kernel@vger.kernel.org, jack@suse.cz, hch@infradead.org, hannes@cmpxchg.org, linux-fsdevel@vger.kernel.org, vgoyal@redhat.com, lizefan@huawei.com, cgroups@vger.kernel.org, linux-mm@kvack.org, mhocko@suse.cz, clm@fb.com, fengguang.wu@intel.com, david@fromorbit.com, gthelen@google.com, khlebnikov@yandex-team.ru, Tejun Heo <tj@kernel.org> Subject: [PATCH 22/51] writeback: add {CONFIG\|BDI_CAP\|FS}_CGROUP_WRITEBACK Date: Fri, 22 May 2015 17:13:36 -0400 Message-Id: <1432329245-5844-23-git-send-email-tj@kernel.org> In-Reply-To: <1432329245-5844-1-git-send-email-tj@kernel.org> References: <1432329245-5844-1-git-send-email-tj@kernel.org> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Message ID

1432329245-5844-23-git-send-email-tj@kernel.org (mailing list archive)

State

New, archived

Headers

From: Tejun Heo <tj@kernel.org>
To: axboe@kernel.dk
Cc: linux-kernel@vger.kernel.org, jack@suse.cz, hch@infradead.org,
	hannes@cmpxchg.org, linux-fsdevel@vger.kernel.org,
	vgoyal@redhat.com, lizefan@huawei.com, cgroups@vger.kernel.org,
	linux-mm@kvack.org, mhocko@suse.cz, clm@fb.com,
	fengguang.wu@intel.com, david@fromorbit.com, gthelen@google.com,
	khlebnikov@yandex-team.ru, Tejun Heo <tj@kernel.org>
Subject: [PATCH 22/51] writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK
Date: Fri, 22 May 2015 17:13:36 -0400
Message-Id: <1432329245-5844-23-git-send-email-tj@kernel.org>
In-Reply-To: <1432329245-5844-1-git-send-email-tj@kernel.org>
References: <1432329245-5844-1-git-send-email-tj@kernel.org>
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk

Commit Message

Tejun Heo May 22, 2015, 9:13 p.m. UTC

cgroup writeback requires support from both bdi and filesystem sides.
Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
default.  Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
both MEMCG and BLK_CGROUP are enabled.

inode_cgwb_enabled() which determines whether a given inode's both bdi
and fs support cgroup writeback is added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 block/blk-core.c            |  2 +-
 include/linux/backing-dev.h | 32 +++++++++++++++++++++++++++++++-
 include/linux/fs.h          |  1 +
 init/Kconfig                |  5 +++++
 4 files changed, 38 insertions(+), 2 deletions(-)

Comments

Jan Kara June 30, 2015, 9:37 a.m. UTC | #1

On Fri 22-05-15 17:13:36, Tejun Heo wrote:
> cgroup writeback requires support from both bdi and filesystem sides.
> Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
> support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
> default.  Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
> both MEMCG and BLK_CGROUP are enabled.
> 
> inode_cgwb_enabled() which determines whether a given inode's both bdi
> and fs support cgroup writeback is added.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Jan Kara <jack@suse.cz>

Hum, you later changed this to use a per-sb flag instead of a per-fs-type
flag, right? We could do it as well here but OK.

One more question - what does prevent us from supporting CGROUP_WRITEBACK
for all bdis capable of writeback? I guess the reason is that currently
blkcgs are bound to request_queue and we have to have blkcg(s) for
CGROUP_WRITEBACK to work, am I right? But in principle tracking writeback
state and doing writeback per memcg doesn't seem to be bound to any device
properties so we could do that right?

Anyway, this patch looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.com>

								Honza

> ---
>  block/blk-core.c            |  2 +-
>  include/linux/backing-dev.h | 32 +++++++++++++++++++++++++++++++-
>  include/linux/fs.h          |  1 +
>  init/Kconfig                |  5 +++++
>  4 files changed, 38 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index f46688f..e0f726f 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -620,7 +620,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
>  
>  	q->backing_dev_info.ra_pages =
>  			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
> -	q->backing_dev_info.capabilities = 0;
> +	q->backing_dev_info.capabilities = BDI_CAP_CGROUP_WRITEBACK;
>  	q->backing_dev_info.name = "block";
>  	q->node = node_id;
>  
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index bfdaa18..6bb3123 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -134,12 +134,15 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
>   * BDI_CAP_NO_WRITEBACK:   Don't write pages back
>   * BDI_CAP_NO_ACCT_WB:     Don't automatically account writeback pages
>   * BDI_CAP_STRICTLIMIT:    Keep number of dirty pages below bdi threshold.
> + *
> + * BDI_CAP_CGROUP_WRITEBACK: Supports cgroup-aware writeback.
>   */
>  #define BDI_CAP_NO_ACCT_DIRTY	0x00000001
>  #define BDI_CAP_NO_WRITEBACK	0x00000002
>  #define BDI_CAP_NO_ACCT_WB	0x00000004
>  #define BDI_CAP_STABLE_WRITES	0x00000008
>  #define BDI_CAP_STRICTLIMIT	0x00000010
> +#define BDI_CAP_CGROUP_WRITEBACK 0x00000020
>  
>  #define BDI_CAP_NO_ACCT_AND_WRITEBACK \
>  	(BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)
> @@ -229,4 +232,31 @@ static inline int bdi_sched_wait(void *word)
>  	return 0;
>  }
>  
> -#endif		/* _LINUX_BACKING_DEV_H */
> +#ifdef CONFIG_CGROUP_WRITEBACK
> +
> +/**
> + * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
> + * @inode: inode of interest
> + *
> + * cgroup writeback requires support from both the bdi and filesystem.
> + * Test whether @inode has both.
> + */
> +static inline bool inode_cgwb_enabled(struct inode *inode)
> +{
> +	struct backing_dev_info *bdi = inode_to_bdi(inode);
> +
> +	return bdi_cap_account_dirty(bdi) &&
> +		(bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
> +		(inode->i_sb->s_type->fs_flags & FS_CGROUP_WRITEBACK);
> +}
> +
> +#else	/* CONFIG_CGROUP_WRITEBACK */
> +
> +static inline bool inode_cgwb_enabled(struct inode *inode)
> +{
> +	return false;
> +}
> +
> +#endif	/* CONFIG_CGROUP_WRITEBACK */
> +
> +#endif	/* _LINUX_BACKING_DEV_H */
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ce100b87..74e0ae0 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1897,6 +1897,7 @@ struct file_system_type {
>  #define FS_HAS_SUBTYPE		4
>  #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
>  #define FS_USERNS_DEV_MOUNT	16 /* A userns mount does not imply MNT_NODEV */
> +#define FS_CGROUP_WRITEBACK	32	/* Supports cgroup-aware writeback */
>  #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
>  	struct dentry *(*mount) (struct file_system_type *, int,
>  		       const char *, void *);
> diff --git a/init/Kconfig b/init/Kconfig
> index dc24dec..d4f7633 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1141,6 +1141,11 @@ config DEBUG_BLK_CGROUP
>  	Enable some debugging help. Currently it exports additional stat
>  	files in a cgroup which can be useful for debugging.
>  
> +config CGROUP_WRITEBACK
> +	bool
> +	depends on MEMCG && BLK_CGROUP
> +	default y
> +
>  endif # CGROUPS
>  
>  config CHECKPOINT_RESTORE
> -- 
> 2.4.0
>

Tejun Heo July 2, 2015, 1:10 a.m. UTC | #2

Hello, Jan.

On Tue, Jun 30, 2015 at 11:37:51AM +0200, Jan Kara wrote:
> Hum, you later changed this to use a per-sb flag instead of a per-fs-type
> flag, right? We could do it as well here but OK.

The commits were already in stable branch at that point and landed in
mainline during this merge window, so I'm afraid the review points
will have to be addressed as additional patches.

> One more question - what does prevent us from supporting CGROUP_WRITEBACK
> for all bdis capable of writeback? I guess the reason is that currently
> blkcgs are bound to request_queue and we have to have blkcg(s) for
> CGROUP_WRITEBACK to work, am I right? But in principle tracking writeback
> state and doing writeback per memcg doesn't seem to be bound to any device
> properties so we could do that right?

The main issue is that cgroup should somehow know how the processes
are mapped to the underlying IO layer - the IO domain should somehow
be defined.  We can introduce an intermediate abstraction which maps
to blkcg and whatever other cgroup controllers which may define cgroup
IO domains but given that such cases would be fairly niche, I think
we'd be better off making those corner cases represent themselves
using blkcg rather than introducing an additional layer.

Thanks.

Jan Kara July 3, 2015, 10:49 a.m. UTC | #3

On Wed 01-07-15 21:10:56, Tejun Heo wrote:
> Hello, Jan.
> 
> On Tue, Jun 30, 2015 at 11:37:51AM +0200, Jan Kara wrote:
> > Hum, you later changed this to use a per-sb flag instead of a per-fs-type
> > flag, right? We could do it as well here but OK.
> 
> The commits were already in stable branch at that point and landed in
> mainline during this merge window, so I'm afraid the review points
> will have to be addressed as additional patches.

Yeah, I know but I just didn't get to the series earlier. Anyway, I didn't
find fundamental issues so it's easy to change things in followup patches.

> > One more question - what does prevent us from supporting CGROUP_WRITEBACK
> > for all bdis capable of writeback? I guess the reason is that currently
> > blkcgs are bound to request_queue and we have to have blkcg(s) for
> > CGROUP_WRITEBACK to work, am I right? But in principle tracking writeback
> > state and doing writeback per memcg doesn't seem to be bound to any device
> > properties so we could do that right?
> 
> The main issue is that cgroup should somehow know how the processes
> are mapped to the underlying IO layer - the IO domain should somehow
> be defined.  We can introduce an intermediate abstraction which maps
> to blkcg and whatever other cgroup controllers which may define cgroup
> IO domains but given that such cases would be fairly niche, I think
> we'd be better off making those corner cases represent themselves
> using blkcg rather than introducing an additional layer.

Well, unless there is some specific mapping for the device, we could just
fall back to attributing everything to the root cgroup. We would still
account dirty pages in memcg, throttle writers in memcg when there are too
many dirty pages, issue writeback for inodes in memcg with enough dirty
pages etc. Just all IO from different memcgs would be equal so no
separation would be there. But it would still seem better that just
ignoring the split of dirty pages among memcgs as we do now... Thoughts?

								Honza

Tejun Heo July 3, 2015, 5:14 p.m. UTC | #4

Hello,

On Fri, Jul 03, 2015 at 12:49:57PM +0200, Jan Kara wrote:
> Well, unless there is some specific mapping for the device, we could just
> fall back to attributing everything to the root cgroup. We would still
> account dirty pages in memcg, throttle writers in memcg when there are too
> many dirty pages, issue writeback for inodes in memcg with enough dirty
> pages etc. Just all IO from different memcgs would be equal so no
> separation would be there. But it would still seem better that just
> ignoring the split of dirty pages among memcgs as we do now... Thoughts?

Sure, if you mark a bdi as capable of supporing cgroup writeback
without enforcing any IO isolation, the above would be what's
happening.  I'm not convinced this would be something actually useful
tho.  Sure, it changes the behavior but is still gonna be a crapshoot.

Thanks.

diff --git a/block/blk-core.c b/block/blk-core.c
index f46688f..e0f726f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -620,7 +620,7 @@  struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 
 	q->backing_dev_info.ra_pages =
 			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
-	q->backing_dev_info.capabilities = 0;
+	q->backing_dev_info.capabilities = BDI_CAP_CGROUP_WRITEBACK;
 	q->backing_dev_info.name = "block";
 	q->node = node_id;
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index bfdaa18..6bb3123 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -134,12 +134,15 @@  int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
  * BDI_CAP_NO_WRITEBACK:   Don't write pages back
  * BDI_CAP_NO_ACCT_WB:     Don't automatically account writeback pages
  * BDI_CAP_STRICTLIMIT:    Keep number of dirty pages below bdi threshold.
+ *
+ * BDI_CAP_CGROUP_WRITEBACK: Supports cgroup-aware writeback.
  */
 #define BDI_CAP_NO_ACCT_DIRTY	0x00000001
 #define BDI_CAP_NO_WRITEBACK	0x00000002
 #define BDI_CAP_NO_ACCT_WB	0x00000004
 #define BDI_CAP_STABLE_WRITES	0x00000008
 #define BDI_CAP_STRICTLIMIT	0x00000010
+#define BDI_CAP_CGROUP_WRITEBACK 0x00000020
 
 #define BDI_CAP_NO_ACCT_AND_WRITEBACK \
 	(BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)
@@ -229,4 +232,31 @@  static inline int bdi_sched_wait(void *word)
 	return 0;
 }
 
-#endif		/* _LINUX_BACKING_DEV_H */
+#ifdef CONFIG_CGROUP_WRITEBACK
+
+/**
+ * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
+ * @inode: inode of interest
+ *
+ * cgroup writeback requires support from both the bdi and filesystem.
+ * Test whether @inode has both.
+ */
+static inline bool inode_cgwb_enabled(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	return bdi_cap_account_dirty(bdi) &&
+		(bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
+		(inode->i_sb->s_type->fs_flags & FS_CGROUP_WRITEBACK);
+}
+
+#else	/* CONFIG_CGROUP_WRITEBACK */
+
+static inline bool inode_cgwb_enabled(struct inode *inode)
+{
+	return false;
+}
+
+#endif	/* CONFIG_CGROUP_WRITEBACK */
+
+#endif	/* _LINUX_BACKING_DEV_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ce100b87..74e0ae0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1897,6 +1897,7 @@  struct file_system_type {
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_USERNS_DEV_MOUNT	16 /* A userns mount does not imply MNT_NODEV */
+#define FS_CGROUP_WRITEBACK	32	/* Supports cgroup-aware writeback */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	struct dentry *(*mount) (struct file_system_type *, int,
 		       const char *, void *);
diff --git a/init/Kconfig b/init/Kconfig
index dc24dec..d4f7633 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1141,6 +1141,11 @@  config DEBUG_BLK_CGROUP
 	Enable some debugging help. Currently it exports additional stat
 	files in a cgroup which can be useful for debugging.
 
+config CGROUP_WRITEBACK
+	bool
+	depends on MEMCG && BLK_CGROUP
+	default y
+
 endif # CGROUPS
 
 config CHECKPOINT_RESTORE

[22/51] writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK

Commit Message

Comments

Patch