
[41/51] writeback: make wakeup_flusher_threads() handle multiple bdi_writeback's

Message ID 1432329245-5844-42-git-send-email-tj@kernel.org (mailing list archive)
State New, archived

Commit Message

Tejun Heo May 22, 2015, 9:13 p.m. UTC
wakeup_flusher_threads() currently only starts writeback on the root
wb (bdi_writeback).  For cgroup writeback support, update the function
to wake up all wbs and distribute the number of pages to write
according to the proportion of each wb's write bandwidth, which is
implemented in wb_split_bdi_pages().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 46 insertions(+), 2 deletions(-)
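
For illustration, the proportional split described in the commit message reduces to the arithmetic in the following minimal userspace sketch, which mirrors wb_split_bdi_pages() from the patch below; the bandwidth numbers are made up for the example.

#include <stdio.h>
#include <limits.h>

/*
 * Userspace sketch of the split: each wb gets a share of nr_pages
 * proportional to its average write bandwidth relative to the bdi total,
 * rounded up; clean/degenerate cases fall back to the full nr_pages.
 */
static long split_pages(long nr_pages, unsigned long this_bw,
			unsigned long tot_bw)
{
	if (nr_pages == LONG_MAX)		/* "write everything" is never split */
		return LONG_MAX;
	if (!tot_bw || this_bw >= tot_bw)	/* err on the side of writing more */
		return nr_pages;
	return (long)(((unsigned long long)nr_pages * this_bw + tot_bw - 1) / tot_bw);
}

int main(void)
{
	/* e.g. three wbs on one bdi with bandwidths 300, 100 and 600 */
	unsigned long bw[] = { 300, 100, 600 }, tot_bw = 1000;
	long nr_pages = 1024;

	for (int i = 0; i < 3; i++)
		printf("wb%d gets %ld of %ld pages\n",
		       i, split_pages(nr_pages, bw[i], tot_bw), nr_pages);
	return 0;	/* prints 308, 103 and 615 - rounding up slightly overshoots */
}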

Comments

Jan Kara July 1, 2015, 8:15 a.m. UTC | #1
On Fri 22-05-15 17:13:55, Tejun Heo wrote:
> wakeup_flusher_threads() currently only starts writeback on the root
> wb (bdi_writeback).  For cgroup writeback support, update the function
> to wake up all wbs and distribute the number of pages to write
> according to the proportion of each wb's write bandwidth, which is
> implemented in wb_split_bdi_pages().
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Jan Kara <jack@suse.cz>

I was looking at who uses wakeup_flusher_threads(). There are two use cases:

1) sync() - we want to write back everything
2) We want to relieve memory pressure by cleaning and subsequently
   reclaiming pages.

Neither of these cares too much about the number of pages as long as you
write enough. So, similarly to how we don't split the passed nr_pages
argument among bdis, I wouldn't split nr_pages among wbs. Just pass
nr_pages to each wb and be done with it...

								Honza

> ---
>  fs/fs-writeback.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 46 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 92aaf64..508e10c 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -198,6 +198,41 @@ int inode_congested(struct inode *inode, int cong_bits)
>  }
>  EXPORT_SYMBOL_GPL(inode_congested);
>  
> +/**
> + * wb_split_bdi_pages - split nr_pages to write according to bandwidth
> + * @wb: target bdi_writeback to split @nr_pages to
> + * @nr_pages: number of pages to write for the whole bdi
> + *
> + * Split @wb's portion of @nr_pages according to @wb's write bandwidth in
> + * relation to the total write bandwidth of all wb's w/ dirty inodes on
> + * @wb->bdi.
> + */
> +static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
> +{
> +	unsigned long this_bw = wb->avg_write_bandwidth;
> +	unsigned long tot_bw = atomic_long_read(&wb->bdi->tot_write_bandwidth);
> +
> +	if (nr_pages == LONG_MAX)
> +		return LONG_MAX;
> +
> +	/*
> +	 * This may be called on clean wb's and proportional distribution
> +	 * may not make sense, just use the original @nr_pages in those
> +	 * cases.  In general, we wanna err on the side of writing more.
> +	 */
> +	if (!tot_bw || this_bw >= tot_bw)
> +		return nr_pages;
> +	else
> +		return DIV_ROUND_UP_ULL((u64)nr_pages * this_bw, tot_bw);
> +}
> +
> +#else	/* CONFIG_CGROUP_WRITEBACK */
> +
> +static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
> +{
> +	return nr_pages;
> +}
> +
>  #endif	/* CONFIG_CGROUP_WRITEBACK */
>  
>  void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> @@ -1187,8 +1222,17 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
>  		nr_pages = get_nr_dirty_pages();
>  
>  	rcu_read_lock();
> -	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
> -		wb_start_writeback(&bdi->wb, nr_pages, false, reason);
> +	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
> +		struct bdi_writeback *wb;
> +		struct wb_iter iter;
> +
> +		if (!bdi_has_dirty_io(bdi))
> +			continue;
> +
> +		bdi_for_each_wb(wb, bdi, &iter, 0)
> +			wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
> +					   false, reason);
> +	}
>  	rcu_read_unlock();
>  }
>  
> -- 
> 2.4.0
>
Tejun Heo July 2, 2015, 2:37 a.m. UTC | #2
Hello,

On Wed, Jul 01, 2015 at 10:15:28AM +0200, Jan Kara wrote:
> I was looking at who uses wakeup_flusher_threads(). There are two use cases:
> 
> 1) sync() - we want to write back everything
> 2) We want to relieve memory pressure by cleaning and subsequently
>    reclaiming pages.
> 
> Neither of these cares too much about the number of pages as long as you
> write enough.

What's enough, though?  Saying "yeah, let's try about 1000 pages" is one
thing and "let's try about 1000 pages on each of 100 cgroups" is a
quite different operation.  Given the nature of "let's try to write
some", I'd venture to say that writing somewhat less is a lot better
behavior than trying to write out a possibly huge amount, given the
fluctuation such behavior may cause system-wide and how non-obvious
the reasons for such fluctuations would be.

> So, similarly to how we don't split the passed nr_pages argument among bdis, I

bdi's are bound by actual hardware; wb's aren't.  They are purely
logical constructs and there can be a lot of them.  Again, trying to
write 1024 pages on each of 100 devices and trying to write 1024 * 100
pages to a single device are quite different.

Thanks.
Jan Kara July 3, 2015, 1:02 p.m. UTC | #3
On Wed 01-07-15 22:37:06, Tejun Heo wrote:
> Hello,
> 
> On Wed, Jul 01, 2015 at 10:15:28AM +0200, Jan Kara wrote:
> > I was looking at who uses wakeup_flusher_threads(). There are two use cases:
> > 
> > 1) sync() - we want to write back everything
> > 2) We want to relieve memory pressure by cleaning and subsequently
> >    reclaiming pages.
> > 
> > Neither of these cares too much about the number of pages as long as you
> > write enough.
> 
> What's enough, though?  Saying "yeah, let's try about 1000 pages" is one
> thing and "let's try about 1000 pages on each of 100 cgroups" is a
> quite different operation.  Given the nature of "let's try to write
> some", I'd venture to say that writing somewhat less is a lot better
> behavior than trying to write out a possibly huge amount, given the
> fluctuation such behavior may cause system-wide and how non-obvious
> the reasons for such fluctuations would be.
> 
> > So, similarly to how we don't split the passed nr_pages argument among bdis, I
> 
> bdi's are bound by actual hardware; wb's aren't.  They are purely
> logical constructs and there can be a lot of them.  Again, trying to
> write 1024 pages on each of 100 devices and trying to write 1024 * 100
> pages to a single device are quite different.

OK, I agree with your device vs. logical construct argument. However, when
splitting pages based on the average throughput each cgroup generates, we
know nothing about the actual number of dirty pages in each cgroup, so we
may end up writing far fewer pages than we originally wanted, since a
cgroup which was assigned a big chunk needn't have many pages available. So
your algorithm is basically bound to undershoot the requested number of
pages in some cases...

Another concern is that if we have two cgroups with the same amount of
dirty pages, but cgroup A has them randomly scattered (and thus has much
lower bandwidth) and cgroup B has them in a sequential fashion (and thus
higher bandwidth), you end up cleaning (and subsequently reclaiming) more
from cgroup B. That may be good for immediate memory pressure but could be
considered unfair by the cgroup owner.

Maybe it would be better to split the number of pages to write based on the
fraction of dirty pages each cgroup has in the bdi?

								Honza
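
For illustration only, a sketch of the alternative Jan suggests: splitting nr_pages by each wb's share of the bdi's dirty pages instead of its write bandwidth. The wb_dirty_pages() helper and the bdi->tot_dirty_pages counter do not exist in this series and are invented purely for this sketch.

/*
 * Hypothetical alternative, not part of this series: split nr_pages by
 * each wb's share of the bdi's dirty pages instead of its write
 * bandwidth.  wb_dirty_pages() and bdi->tot_dirty_pages are assumed
 * helpers/fields invented for illustration.
 */
static long wb_split_bdi_pages_by_dirty(struct bdi_writeback *wb,
					long nr_pages)
{
	unsigned long this_dirty = wb_dirty_pages(wb);
	unsigned long tot_dirty = atomic_long_read(&wb->bdi->tot_dirty_pages);

	if (nr_pages == LONG_MAX)
		return LONG_MAX;

	/* clean or degenerate cases: fall back to the full request */
	if (!tot_dirty || this_dirty >= tot_dirty)
		return nr_pages;

	return DIV_ROUND_UP_ULL((u64)nr_pages * this_dirty, tot_dirty);
}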
Tejun Heo July 3, 2015, 4:33 p.m. UTC | #4
Hello,

On Fri, Jul 03, 2015 at 03:02:13PM +0200, Jan Kara wrote:
...
> OK, I agree with your device vs. logical construct argument. However, when
> splitting pages based on the average throughput each cgroup generates, we
> know nothing about the actual number of dirty pages in each cgroup, so we
> may end up writing far fewer pages than we originally wanted, since a
> cgroup which was assigned a big chunk needn't have many pages available. So
> your algorithm is basically bound to undershoot the requested number of
> pages in some cases...

Sure, but nr_to_write has never been a strict thing except when we're
writing out everything.  We don't overshoot it, but writing out less
than requested is not unusual.  Also, note that write bandwidth is the
primary measure that we base the distribution of dirty pages on.  Sure,
there can be cases where the two deviate, but it is a better measure to
use than, say, the number of currently dirty pages.

> Another concern is that if we have two cgroups with the same amount of
> dirty pages, but cgroup A has them randomly scattered (and thus has much
> lower bandwidth) and cgroup B has them in a sequential fashion (and thus
> higher bandwidth), you end up cleaning (and subsequently reclaiming) more
> from cgroup B. That may be good for immediate memory pressure but could be
> considered unfair by the cgroup owner.
> 
> Maybe it would be better to split the number of pages to write based on the
> fraction of dirty pages each cgroup has in the bdi?

Dirty pages are already distributed according to write bandwidth;
write bandwidth is the de-facto currency of dirty page distribution.
If it can be shown that some other measure is better for this purpose,
sure, but I don't see why we'd deviate based on a vague feeling that
something else might be better, and given how these mechanisms are
used, I don't think going either way would matter much.

Thanks.

Patch

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 92aaf64..508e10c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -198,6 +198,41 @@  int inode_congested(struct inode *inode, int cong_bits)
 }
 EXPORT_SYMBOL_GPL(inode_congested);
 
+/**
+ * wb_split_bdi_pages - split nr_pages to write according to bandwidth
+ * @wb: target bdi_writeback to split @nr_pages to
+ * @nr_pages: number of pages to write for the whole bdi
+ *
+ * Split @wb's portion of @nr_pages according to @wb's write bandwidth in
+ * relation to the total write bandwidth of all wb's w/ dirty inodes on
+ * @wb->bdi.
+ */
+static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
+{
+	unsigned long this_bw = wb->avg_write_bandwidth;
+	unsigned long tot_bw = atomic_long_read(&wb->bdi->tot_write_bandwidth);
+
+	if (nr_pages == LONG_MAX)
+		return LONG_MAX;
+
+	/*
+	 * This may be called on clean wb's and proportional distribution
+	 * may not make sense, just use the original @nr_pages in those
+	 * cases.  In general, we wanna err on the side of writing more.
+	 */
+	if (!tot_bw || this_bw >= tot_bw)
+		return nr_pages;
+	else
+		return DIV_ROUND_UP_ULL((u64)nr_pages * this_bw, tot_bw);
+}
+
+#else	/* CONFIG_CGROUP_WRITEBACK */
+
+static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
+{
+	return nr_pages;
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
@@ -1187,8 +1222,17 @@  void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
 		nr_pages = get_nr_dirty_pages();
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
-		wb_start_writeback(&bdi->wb, nr_pages, false, reason);
+	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
+		struct bdi_writeback *wb;
+		struct wb_iter iter;
+
+		if (!bdi_has_dirty_io(bdi))
+			continue;
+
+		bdi_for_each_wb(wb, bdi, &iter, 0)
+			wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
+					   false, reason);
+	}
 	rcu_read_unlock();
 }