block: transfer source bio's cgroup tags to clone via bio_associate_blkcg() (was: Re: blkio cgroups controller doesn't work with LVM?)
diff mbox

Message ID 20160302175656.GA59991@redhat.com
State Rejected, archived
Delegated to: Mike Snitzer
Headers show

Commit Message

Mike Snitzer March 2, 2016, 5:56 p.m. UTC
On Wed, Mar 02 2016 at 11:06P -0500,
Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On Thu, Feb 25, 2016 at 09:53:14AM -0500, Mike Snitzer wrote:
> > Right, LVM created devices are bio-based DM devices in the kernel.
> > bio-based block devices do _not_ have an IO scheduler.  Their underlying
> > request-based device does.
> 
> dm devices are not the actual resource source, so I don't think it'd
> work too well to put io controllers on them (can't really do things
> like proportional control without owning the queue).
> 
> > I'm not well-versed on the top-level cgroup interface and how it maps to
> > associated resources that are established in the kernel.  But it could
> > be that the configuration of blkio cgroup against a bio-based LVM device
> > needs to be passed through to the underlying request-based device
> > (e.g. /dev/sda4 in Chris's case)?
> > 
> > I'm also wondering whether the latest cgroup work that Tejun has just
> > finished (afaik to support buffered IO in the IO controller) will afford
> > us a more meaningful reason to work to make cgroups' blkio controller
> > actually work with bio-based devices like LVM's DM devices?
> > 
> > I'm very much open to advice on how to proceed with investigating this
> > integration work.  Tejun, Vivek, anyone else: if you have advice on next
> > steps for DM on this front _please_ yell, thanks!
> 
> I think the only thing necessary is dm transferring bio cgroup tags to
> the bio's that it ends up passing down the stack.  Please take a look
> at fs/btrfs/extent_io.c::btrfs_bio_clone() for an example.  We
> probably should introduce a wrapper for this so that each site doesn't
> need to ifdef it.
> 
> Thanks.

OK, I think this should do it.  Nikolay and/or others can you test this
patch using blkio cgroups controller with LVM devices and report back?

From: Mike Snitzer <snitzer@redhat.com>
Date: Wed, 2 Mar 2016 12:37:39 -0500
Subject: [PATCH] block: transfer source bio's cgroup tags to clone via bio_associate_blkcg()

Move btrfs_bio_clone()'s support for transferring a source bio's cgroup
tags to a clone into both bio_clone_bioset() and __bio_clone_fast().
The former is used by btrfs (MD and blk-core also use it via bio_split).
The latter is used by both DM and bcache.

This should enable the blkio cgroups controller to work with all
stacking bio-based block devices.

Reported-by: Nikolay Borisov <kernel@kyup.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/bio.c          | 10 ++++++++++
 fs/btrfs/extent_io.c |  6 ------
 2 files changed, 10 insertions(+), 6 deletions(-)

Comments

kernel@kyup.com March 2, 2016, 6:03 p.m. UTC | #1
Thanks for the patch I will likely have time to test this sometime next week.
But just to be sure - the expected behavior would be that processes
writing to dm-based devices would experience the fair-shair
scheduling of CFQ (provided that the physical devices that back those
DM devices use CFQ), correct?

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mike Snitzer March 2, 2016, 6:05 p.m. UTC | #2
On Wed, Mar 02 2016 at  1:03P -0500,
Nikolay Borisov <kernel@kyup.com> wrote:

> Thanks for the patch I will likely have time to test this sometime next week.
> But just to be sure - the expected behavior would be that processes
> writing to dm-based devices would experience the fair-shair
> scheduling of CFQ (provided that the physical devices that back those
> DM devices use CFQ), correct?

Yes, that is the goal.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Vivek Goyal March 2, 2016, 7:18 p.m. UTC | #3
On Wed, Mar 02, 2016 at 08:03:10PM +0200, Nikolay Borisov wrote:
> Thanks for the patch I will likely have time to test this sometime next week.
> But just to be sure - the expected behavior would be that processes
> writing to dm-based devices would experience the fair-shair
> scheduling of CFQ (provided that the physical devices that back those
> DM devices use CFQ), correct?

Nikolay,

I am not sure how well it will work with CFQ of underlying device. It will
get cgroup information right for buffered writes. But cgroup information
for reads and direct writes will come from submitter's context and if dm
layer gets in between, then many a times submitter might be a worker
thread and IO will be attributed to that worker's cgroup (root cgroup).

Give it a try anyway.

Thanks
Vivek

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Nikolay Borisov March 2, 2016, 7:59 p.m. UTC | #4
On Wednesday, March 2, 2016, Vivek Goyal <vgoyal@redhat.com> wrote:

> On Wed, Mar 02, 2016 at 08:03:10PM +0200, Nikolay Borisov wrote:
> > Thanks for the patch I will likely have time to test this sometime next
> week.
> > But just to be sure - the expected behavior would be that processes
> > writing to dm-based devices would experience the fair-shair
> > scheduling of CFQ (provided that the physical devices that back those
> > DM devices use CFQ), correct?
>
> Nikolay,
>
> I am not sure how well it will work with CFQ of underlying device. It will
> get cgroup information right for buffered writes. But cgroup information


 Right, what's your definition of  buffered writes? My mental model is that
when a process submits a write request to a dm device , the bio is going to
be put on a devi e workqueue which would then  be serviced by a background
worker thread and later the submitter notified. Do you refer to this whole
gamut of operations as buffered writes?

for reads and direct writes will come from submitter's context and if dm
> layer gets in between, then many a times submitter might be a worker
> thread and IO will be attributed to that worker's cgroup (root cgroup).


Be that as it may, proivded that the worker thread is in the  'correct'
cgroup,  then the appropriate babdwidth policies should apply, no?

>
> Give it a try anyway.


Most certainly I will :)


>
> Thanks
> Vivek
>
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Vivek Goyal March 2, 2016, 8:10 p.m. UTC | #5
On Wed, Mar 02, 2016 at 09:59:13PM +0200, Nikolay Borisov wrote:
> On Wednesday, March 2, 2016, Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > On Wed, Mar 02, 2016 at 08:03:10PM +0200, Nikolay Borisov wrote:
> > > Thanks for the patch I will likely have time to test this sometime next
> > week.
> > > But just to be sure - the expected behavior would be that processes
> > > writing to dm-based devices would experience the fair-shair
> > > scheduling of CFQ (provided that the physical devices that back those
> > > DM devices use CFQ), correct?
> >
> > Nikolay,
> >
> > I am not sure how well it will work with CFQ of underlying device. It will
> > get cgroup information right for buffered writes. But cgroup information
> 
> 
>  Right, what's your definition of  buffered writes?

Writes which go through page cache.

> My mental model is that
> when a process submits a write request to a dm device , the bio is going to
> be put on a devi e workqueue which would then  be serviced by a background
> worker thread and later the submitter notified. Do you refer to this whole
> gamut of operations as buffered writes?

No, once the bio is submitted to dm device it could be a buffered write or
a direct write.

> 
> for reads and direct writes will come from submitter's context and if dm
> > layer gets in between, then many a times submitter might be a worker
> > thread and IO will be attributed to that worker's cgroup (root cgroup).
> 
> 
> Be that as it may, proivded that the worker thread is in the  'correct'
> cgroup,  then the appropriate babdwidth policies should apply, no?

Worker thread will most likely be in root cgroup. So if a worker thread
is submitting bio, it will be attributed to root cgroup.

We had similar issue with IO priority and it did not work reliably with
CFQ on underlying device when dm devices were sitting on top.

If we really want to give it a try, I guess we will have to put cgroup
info of submitter early in bio at the time of bio creation even for all
kind of IO. Not sure if it is worth the effort.

For the case of IO throttling, I think you should put throttling rules on
the dm device itself. That means as long as filesystem supports the
cgroups, you should be getting right cgroup information for all kind of
IO and throttling should work just fine.

Thanks
Vivek

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Nikolay Borisov March 2, 2016, 8:19 p.m. UTC | #6
On Wednesday, March 2, 2016, Vivek Goyal <vgoyal@redhat.com> wrote:

> On Wed, Mar 02, 2016 at 09:59:13PM +0200, Nikolay Borisov wrote:
> > On Wednesday, March 2, 2016, Vivek Goyal <vgoyal@redhat.com
> <javascript:;>> wrote:
> >
> > > On Wed, Mar 02, 2016 at 08:03:10PM +0200, Nikolay Borisov wrote:
> > > > Thanks for the patch I will likely have time to test this sometime
> next
> > > week.
> > > > But just to be sure - the expected behavior would be that processes
> > > > writing to dm-based devices would experience the fair-shair
> > > > scheduling of CFQ (provided that the physical devices that back those
> > > > DM devices use CFQ), correct?
> > >
> > > Nikolay,
> > >
> > > I am not sure how well it will work with CFQ of underlying device. It
> will
> > > get cgroup information right for buffered writes. But cgroup
> information
> >
> >
> >  Right, what's your definition of  buffered writes?
>
> Writes which go through page cache.
>
> > My mental model is that
> > when a process submits a write request to a dm device , the bio is going
> to
> > be put on a devi e workqueue which would then  be serviced by a
> background
> > worker thread and later the submitter notified. Do you refer to this
> whole
> > gamut of operations as buffered writes?
>
> No, once the bio is submitted to dm device it could be a buffered write or
> a direct write.
>
> >
> > for reads and direct writes will come from submitter's context and if dm
> > > layer gets in between, then many a times submitter might be a worker
> > > thread and IO will be attributed to that worker's cgroup (root cgroup).
> >
> >
> > Be that as it may, proivded that the worker thread is in the  'correct'
> > cgroup,  then the appropriate babdwidth policies should apply, no?
>
> Worker thread will most likely be in root cgroup. So if a worker thread
> is submitting bio, it will be attributed to root cgroup.
>
> We had similar issue with IO priority and it did not work reliably with
> CFQ on underlying device when dm devices were sitting on top.
>
> If we really want to give it a try, I guess we will have to put cgroup
> info of submitter early in bio at the time of bio creation even for all
> kind of IO. Not sure if it is worth the effort.
>
> For the case of IO throttling, I think you should put throttling rules on
> the dm device itself. That means as long as filesystem supports the
> cgroups, you should be getting right cgroup information for all kind of
> IO and throttling should work just fine.


Throttling does work even now,  but the use case I had in mind was
proportional
distribution of IO. Imagine 50  or so dm devices, hosting IO intensive
workloads. In
this situation, I'd  be interested each of them getting proportional IO
based on the weights
set in the blkcg controller for each respective cgroup for every workload.

>
> Thanks
> Vivek
>
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Chris Friesen March 2, 2016, 8:34 p.m. UTC | #7
On 03/02/2016 02:10 PM, Vivek Goyal wrote:
> On Wed, Mar 02, 2016 at 09:59:13PM +0200, Nikolay Borisov wrote:

> We had similar issue with IO priority and it did not work reliably with
> CFQ on underlying device when dm devices were sitting on top.
>
> If we really want to give it a try, I guess we will have to put cgroup
> info of submitter early in bio at the time of bio creation even for all
> kind of IO. Not sure if it is worth the effort.

As it stands, imagine that you have a hypervisor node running many VMs (or 
containers), each of which is assigned a separate logical volume (possibly 
thin-provisioned) as its rootfs.

Ideally we want the disk accesses by those VMs to be "fair" relative to each 
other, and we want to guarantee a certain amount of bandwidth for the host as well.

Without this sort of feature, how can we accomplish that?

> For the case of IO throttling, I think you should put throttling rules on
> the dm device itself. That means as long as filesystem supports the
> cgroups, you should be getting right cgroup information for all kind of
> IO and throttling should work just fine.

IO throttling isn't all that useful, since it requires you to know in advance 
what your IO rate is.  And it doesn't adjust nicely as the number of competing 
entities changes the way that weight-based schemes do.

Chris

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Vivek Goyal March 2, 2016, 8:45 p.m. UTC | #8
On Wed, Mar 02, 2016 at 10:19:38PM +0200, Nikolay Borisov wrote:
> On Wednesday, March 2, 2016, Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > On Wed, Mar 02, 2016 at 09:59:13PM +0200, Nikolay Borisov wrote:
> > > On Wednesday, March 2, 2016, Vivek Goyal <vgoyal@redhat.com
> > <javascript:;>> wrote:
> > >
> > > > On Wed, Mar 02, 2016 at 08:03:10PM +0200, Nikolay Borisov wrote:
> > > > > Thanks for the patch I will likely have time to test this sometime
> > next
> > > > week.
> > > > > But just to be sure - the expected behavior would be that processes
> > > > > writing to dm-based devices would experience the fair-shair
> > > > > scheduling of CFQ (provided that the physical devices that back those
> > > > > DM devices use CFQ), correct?
> > > >
> > > > Nikolay,
> > > >
> > > > I am not sure how well it will work with CFQ of underlying device. It
> > will
> > > > get cgroup information right for buffered writes. But cgroup
> > information
> > >
> > >
> > >  Right, what's your definition of  buffered writes?
> >
> > Writes which go through page cache.
> >
> > > My mental model is that
> > > when a process submits a write request to a dm device , the bio is going
> > to
> > > be put on a devi e workqueue which would then  be serviced by a
> > background
> > > worker thread and later the submitter notified. Do you refer to this
> > whole
> > > gamut of operations as buffered writes?
> >
> > No, once the bio is submitted to dm device it could be a buffered write or
> > a direct write.
> >
> > >
> > > for reads and direct writes will come from submitter's context and if dm
> > > > layer gets in between, then many a times submitter might be a worker
> > > > thread and IO will be attributed to that worker's cgroup (root cgroup).
> > >
> > >
> > > Be that as it may, proivded that the worker thread is in the  'correct'
> > > cgroup,  then the appropriate babdwidth policies should apply, no?
> >
> > Worker thread will most likely be in root cgroup. So if a worker thread
> > is submitting bio, it will be attributed to root cgroup.
> >
> > We had similar issue with IO priority and it did not work reliably with
> > CFQ on underlying device when dm devices were sitting on top.
> >
> > If we really want to give it a try, I guess we will have to put cgroup
> > info of submitter early in bio at the time of bio creation even for all
> > kind of IO. Not sure if it is worth the effort.
> >
> > For the case of IO throttling, I think you should put throttling rules on
> > the dm device itself. That means as long as filesystem supports the
> > cgroups, you should be getting right cgroup information for all kind of
> > IO and throttling should work just fine.
> 
> 
> Throttling does work even now,  but the use case I had in mind was
> proportional
> distribution of IO. Imagine 50  or so dm devices, hosting IO intensive
> workloads. In
> this situation, I'd  be interested each of them getting proportional IO
> based on the weights
> set in the blkcg controller for each respective cgroup for every workload.
> 

I see what you are trying to do. Carry the cgroup information from top to
bottom of IO stack for all kind of IO.

I guess we also need to call  bio_associate_current() when dm accepts
bio from the submitter.

Thanks
Vivek

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Vivek Goyal March 2, 2016, 9:04 p.m. UTC | #9
On Wed, Mar 02, 2016 at 02:34:50PM -0600, Chris Friesen wrote:
> On 03/02/2016 02:10 PM, Vivek Goyal wrote:
> >On Wed, Mar 02, 2016 at 09:59:13PM +0200, Nikolay Borisov wrote:
> 
> >We had similar issue with IO priority and it did not work reliably with
> >CFQ on underlying device when dm devices were sitting on top.
> >
> >If we really want to give it a try, I guess we will have to put cgroup
> >info of submitter early in bio at the time of bio creation even for all
> >kind of IO. Not sure if it is worth the effort.
> 
> As it stands, imagine that you have a hypervisor node running many VMs (or
> containers), each of which is assigned a separate logical volume (possibly
> thin-provisioned) as its rootfs.
> 
> Ideally we want the disk accesses by those VMs to be "fair" relative to each
> other, and we want to guarantee a certain amount of bandwidth for the host
> as well.
> 
> Without this sort of feature, how can we accomplish that?

As of now, you can't. I will try adding bio_associate_current() and see
if that along with Mike's patches gets you what you are looking for.

On a side note, have you tried using CFQ's proportional logic with multile
VMs. Say partition the disk and pass each parition to VM/container and
do the IO. My main concern is that by default each cgroup can add
significant idling overhead and kill overall throughput of disk
(especially for random IO or if cgroup does not have enough IO to keep
disk busy).

One can disable group idling but that kills service differentiation for
most of the workloads.

So I was curious to know if CFQ's proportional bandwidth division is
helping you in real life. (without dm of course).

> 
> >For the case of IO throttling, I think you should put throttling rules on
> >the dm device itself. That means as long as filesystem supports the
> >cgroups, you should be getting right cgroup information for all kind of
> >IO and throttling should work just fine.
> 
> IO throttling isn't all that useful, since it requires you to know in
> advance what your IO rate is.  And it doesn't adjust nicely as the number of
> competing entities changes the way that weight-based schemes do.

Agreed that absolute limits are less useful as compared to dynamic limits
provided by weights. This is more useful for scenario where a cloud
provider does not want to provide disk bandwidth if user has not paid for
it (even if disk bandwidth is available).

Thanks
Vivek

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
kernel@kyup.com March 14, 2016, 3:08 p.m. UTC | #10
On 03/02/2016 07:56 PM, Mike Snitzer wrote:
> On Wed, Mar 02 2016 at 11:06P -0500,
> Tejun Heo <tj@kernel.org> wrote:
> 
>> Hello,
>>
>> On Thu, Feb 25, 2016 at 09:53:14AM -0500, Mike Snitzer wrote:
>>> Right, LVM created devices are bio-based DM devices in the kernel.
>>> bio-based block devices do _not_ have an IO scheduler.  Their underlying
>>> request-based device does.
>>
>> dm devices are not the actual resource source, so I don't think it'd
>> work too well to put io controllers on them (can't really do things
>> like proportional control without owning the queue).
>>
>>> I'm not well-versed on the top-level cgroup interface and how it maps to
>>> associated resources that are established in the kernel.  But it could
>>> be that the configuration of blkio cgroup against a bio-based LVM device
>>> needs to be passed through to the underlying request-based device
>>> (e.g. /dev/sda4 in Chris's case)?
>>>
>>> I'm also wondering whether the latest cgroup work that Tejun has just
>>> finished (afaik to support buffered IO in the IO controller) will afford
>>> us a more meaningful reason to work to make cgroups' blkio controller
>>> actually work with bio-based devices like LVM's DM devices?
>>>
>>> I'm very much open to advice on how to proceed with investigating this
>>> integration work.  Tejun, Vivek, anyone else: if you have advice on next
>>> steps for DM on this front _please_ yell, thanks!
>>
>> I think the only thing necessary is dm transferring bio cgroup tags to
>> the bio's that it ends up passing down the stack.  Please take a look
>> at fs/btrfs/extent_io.c::btrfs_bio_clone() for an example.  We
>> probably should introduce a wrapper for this so that each site doesn't
>> need to ifdef it.
>>
>> Thanks.
> 
> OK, I think this should do it.  Nikolay and/or others can you test this
> patch using blkio cgroups controller with LVM devices and report back?
> 
> From: Mike Snitzer <snitzer@redhat.com>
> Date: Wed, 2 Mar 2016 12:37:39 -0500
> Subject: [PATCH] block: transfer source bio's cgroup tags to clone via bio_associate_blkcg()
> 
> Move btrfs_bio_clone()'s support for transferring a source bio's cgroup
> tags to a clone into both bio_clone_bioset() and __bio_clone_fast().
> The former is used by btrfs (MD and blk-core also use it via bio_split).
> The latter is used by both DM and bcache.
> 
> This should enable the blkio cgroups controller to work with all
> stacking bio-based block devices.
> 
> Reported-by: Nikolay Borisov <kernel@kyup.com>
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  block/bio.c          | 10 ++++++++++
>  fs/btrfs/extent_io.c |  6 ------
>  2 files changed, 10 insertions(+), 6 deletions(-)


So I had a chance to test the settings here is what I got when running 
2 container, using LVM-thin for their root device and having applied 
your patch: 

When the 2 containers are using the same blkio.weight values (500) I 
get the following from running DD simultaneously on the 2 containers: 

[root@c1501 ~]# dd if=test.img of=test2.img bs=1M count=3000 oflag=direct
3000+0 records in
3000+0 records out
3145728000 bytes (3.1 GB) copied, 165.171 s, 19.0 MB/s

[root@c1500 ~]# dd if=test.img of=test2.img bs=1M count=3000 oflag=direct
3000+0 records in
3000+0 records out
3145728000 bytes (3.1 GB) copied, 166.165 s, 18.9 MB/s

Also iostat showed the 2 volumes using almost the same amount of 
IO (around 20mb r/w). I then increase the weight for c1501 to 1000 i.e. 
twice the bandwidth that c1500 has, so I would expect its dd to complete
twice as fast: 

[root@c1501 ~]# dd if=test.img of=test2.img bs=1M count=3000 oflag=direct
3000+0 records in
3000+0 records out
3145728000 bytes (3.1 GB) copied, 150.892 s, 20.8 MB/s


[root@c1500 ~]# dd if=test.img of=test2.img bs=1M count=3000 oflag=direct
3000+0 records in
3000+0 records out
3145728000 bytes (3.1 GB) copied, 157.167 s, 20.0 MB/s

Now repeating the same tests but this time using the page-cache 
(echo 3 > /proc/sys/vm/drop_caches) was executed before each test run: 

With equal weights (500):
[root@c1501 ~]# dd if=test.img of=test2.img bs=1M count=3000
3000+0 records in
3000+0 records out
3145728000 bytes (3.1 GB) copied, 114.923 s, 27.4 MB/s

[root@c1500 ~]# dd if=test.img of=test2.img bs=1M count=3000
3000+0 records in
3000+0 records out
3145728000 bytes (3.1 GB) copied, 120.245 s, 26.2 MB/s

With (c1501's weight equal to twice that of c1500 (1000)):

[root@c1501 ~]# dd if=test.img of=test2.img bs=1M count=3000
3000+0 records in
3000+0 records out
3145728000 bytes (3.1 GB) copied, 99.0181 s, 31.8 MB/s

[root@c1500 ~]# dd if=test.img of=test2.img bs=1M count=3000
3000+0 records in
3000+0 records out
3145728000 bytes (3.1 GB) copied, 122.872 s, 25.6 MB/s

I'd say that for buffered IO your patch does indeed make a difference, 
and this sort of aligns with what Vivek said about the patch
working for buffered writes but not for direct. 

I will proceed now and test his patch applied for the case of 
direct writes. 

Hope this helps. 


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Nikolay Borisov March 14, 2016, 3:31 p.m. UTC | #11
On 03/14/2016 05:08 PM, Nikolay Borisov wrote:
> 
> 
> On 03/02/2016 07:56 PM, Mike Snitzer wrote:
>> On Wed, Mar 02 2016 at 11:06P -0500,
>> Tejun Heo <tj@kernel.org> wrote:
>>
>>> Hello,
>>>
>>> On Thu, Feb 25, 2016 at 09:53:14AM -0500, Mike Snitzer wrote:
>>>> Right, LVM created devices are bio-based DM devices in the kernel.
>>>> bio-based block devices do _not_ have an IO scheduler.  Their underlying
>>>> request-based device does.
>>>
>>> dm devices are not the actual resource source, so I don't think it'd
>>> work too well to put io controllers on them (can't really do things
>>> like proportional control without owning the queue).
>>>
>>>> I'm not well-versed on the top-level cgroup interface and how it maps to
>>>> associated resources that are established in the kernel.  But it could
>>>> be that the configuration of blkio cgroup against a bio-based LVM device
>>>> needs to be passed through to the underlying request-based device
>>>> (e.g. /dev/sda4 in Chris's case)?
>>>>
>>>> I'm also wondering whether the latest cgroup work that Tejun has just
>>>> finished (afaik to support buffered IO in the IO controller) will afford
>>>> us a more meaningful reason to work to make cgroups' blkio controller
>>>> actually work with bio-based devices like LVM's DM devices?
>>>>
>>>> I'm very much open to advice on how to proceed with investigating this
>>>> integration work.  Tejun, Vivek, anyone else: if you have advice on next
>>>> steps for DM on this front _please_ yell, thanks!
>>>
>>> I think the only thing necessary is dm transferring bio cgroup tags to
>>> the bio's that it ends up passing down the stack.  Please take a look
>>> at fs/btrfs/extent_io.c::btrfs_bio_clone() for an example.  We
>>> probably should introduce a wrapper for this so that each site doesn't
>>> need to ifdef it.
>>>
>>> Thanks.
>>
>> OK, I think this should do it.  Nikolay and/or others can you test this
>> patch using blkio cgroups controller with LVM devices and report back?
>>
>> From: Mike Snitzer <snitzer@redhat.com>
>> Date: Wed, 2 Mar 2016 12:37:39 -0500
>> Subject: [PATCH] block: transfer source bio's cgroup tags to clone via bio_associate_blkcg()
>>
>> Move btrfs_bio_clone()'s support for transferring a source bio's cgroup
>> tags to a clone into both bio_clone_bioset() and __bio_clone_fast().
>> The former is used by btrfs (MD and blk-core also use it via bio_split).
>> The latter is used by both DM and bcache.
>>
>> This should enable the blkio cgroups controller to work with all
>> stacking bio-based block devices.
>>
>> Reported-by: Nikolay Borisov <kernel@kyup.com>
>> Suggested-by: Tejun Heo <tj@kernel.org>
>> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>> ---
>>  block/bio.c          | 10 ++++++++++
>>  fs/btrfs/extent_io.c |  6 ------
>>  2 files changed, 10 insertions(+), 6 deletions(-)
> 
> 
> So I had a chance to test the settings here is what I got when running 
> 2 container, using LVM-thin for their root device and having applied 
> your patch: 
> 
> When the 2 containers are using the same blkio.weight values (500) I 
> get the following from running DD simultaneously on the 2 containers: 
> 
> [root@c1501 ~]# dd if=test.img of=test2.img bs=1M count=3000 oflag=direct
> 3000+0 records in
> 3000+0 records out
> 3145728000 bytes (3.1 GB) copied, 165.171 s, 19.0 MB/s
> 
> [root@c1500 ~]# dd if=test.img of=test2.img bs=1M count=3000 oflag=direct
> 3000+0 records in
> 3000+0 records out
> 3145728000 bytes (3.1 GB) copied, 166.165 s, 18.9 MB/s
> 
> Also iostat showed the 2 volumes using almost the same amount of 
> IO (around 20mb r/w). I then increase the weight for c1501 to 1000 i.e. 
> twice the bandwidth that c1500 has, so I would expect its dd to complete
> twice as fast: 
> 
> [root@c1501 ~]# dd if=test.img of=test2.img bs=1M count=3000 oflag=direct
> 3000+0 records in
> 3000+0 records out
> 3145728000 bytes (3.1 GB) copied, 150.892 s, 20.8 MB/s
> 
> 
> [root@c1500 ~]# dd if=test.img of=test2.img bs=1M count=3000 oflag=direct
> 3000+0 records in
> 3000+0 records out
> 3145728000 bytes (3.1 GB) copied, 157.167 s, 20.0 MB/s
> 
> Now repeating the same tests but this time using the page-cache 
> (echo 3 > /proc/sys/vm/drop_caches) was executed before each test run: 
> 
> With equal weights (500):
> [root@c1501 ~]# dd if=test.img of=test2.img bs=1M count=3000
> 3000+0 records in
> 3000+0 records out
> 3145728000 bytes (3.1 GB) copied, 114.923 s, 27.4 MB/s
> 
> [root@c1500 ~]# dd if=test.img of=test2.img bs=1M count=3000
> 3000+0 records in
> 3000+0 records out
> 3145728000 bytes (3.1 GB) copied, 120.245 s, 26.2 MB/s
> 
> With (c1501's weight equal to twice that of c1500 (1000)):
> 
> [root@c1501 ~]# dd if=test.img of=test2.img bs=1M count=3000
> 3000+0 records in
> 3000+0 records out
> 3145728000 bytes (3.1 GB) copied, 99.0181 s, 31.8 MB/s
> 
> [root@c1500 ~]# dd if=test.img of=test2.img bs=1M count=3000
> 3000+0 records in
> 3000+0 records out
> 3145728000 bytes (3.1 GB) copied, 122.872 s, 25.6 MB/s

And another test which makes it obvious that your patch works:

[root@c1501 ~]# dd if=test.img of=test2.img bs=1M count=6000
6000+0 records in
6000+0 records out
6291456000 bytes (6.3 GB) copied, 210.466 s, 29.9 MB/s

[root@c1500 ~]# dd if=test.img of=test2.img bs=1M count=3000
3000+0 records in
3000+0 records out
3145728000 bytes (3.1 GB) copied, 201.118 s, 15.6 MB/s


So a file that is twice the size of another one (6vs3 g) is copied for
almost the same amount of time with 2x the bandwidth.


> 
> I'd say that for buffered IO your patch does indeed make a difference, 
> and this sort of aligns with what Vivek said about the patch
> working for buffered writes but not for direct. 
> 
> I will proceed now and test his patch applied for the case of 
> direct writes. 
> 
> Hope this helps. 
> 
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mike Snitzer March 14, 2016, 7:49 p.m. UTC | #12
On Mon, Mar 14 2016 at 11:31am -0400,
Nikolay Borisov <n.borisov@siteground.com> wrote:

> 
> 
> On 03/14/2016 05:08 PM, Nikolay Borisov wrote:
> > 
> > 
> > On 03/02/2016 07:56 PM, Mike Snitzer wrote:
> >> On Wed, Mar 02 2016 at 11:06P -0500,
> >> Tejun Heo <tj@kernel.org> wrote:
> >>
> >>> Hello,
> >>>
> >>> On Thu, Feb 25, 2016 at 09:53:14AM -0500, Mike Snitzer wrote:
> >>>> Right, LVM created devices are bio-based DM devices in the kernel.
> >>>> bio-based block devices do _not_ have an IO scheduler.  Their underlying
> >>>> request-based device does.
> >>>
> >>> dm devices are not the actual resource source, so I don't think it'd
> >>> work too well to put io controllers on them (can't really do things
> >>> like proportional control without owning the queue).
> >>>
> >>>> I'm not well-versed on the top-level cgroup interface and how it maps to
> >>>> associated resources that are established in the kernel.  But it could
> >>>> be that the configuration of blkio cgroup against a bio-based LVM device
> >>>> needs to be passed through to the underlying request-based device
> >>>> (e.g. /dev/sda4 in Chris's case)?
> >>>>
> >>>> I'm also wondering whether the latest cgroup work that Tejun has just
> >>>> finished (afaik to support buffered IO in the IO controller) will afford
> >>>> us a more meaningful reason to work to make cgroups' blkio controller
> >>>> actually work with bio-based devices like LVM's DM devices?
> >>>>
> >>>> I'm very much open to advice on how to proceed with investigating this
> >>>> integration work.  Tejun, Vivek, anyone else: if you have advice on next
> >>>> steps for DM on this front _please_ yell, thanks!
> >>>
> >>> I think the only thing necessary is dm transferring bio cgroup tags to
> >>> the bio's that it ends up passing down the stack.  Please take a look
> >>> at fs/btrfs/extent_io.c::btrfs_bio_clone() for an example.  We
> >>> probably should introduce a wrapper for this so that each site doesn't
> >>> need to ifdef it.
> >>>
> >>> Thanks.
> >>
> >> OK, I think this should do it.  Nikolay and/or others can you test this
> >> patch using blkio cgroups controller with LVM devices and report back?
> >>
> >> From: Mike Snitzer <snitzer@redhat.com>
> >> Date: Wed, 2 Mar 2016 12:37:39 -0500
> >> Subject: [PATCH] block: transfer source bio's cgroup tags to clone via bio_associate_blkcg()
> >>
> >> Move btrfs_bio_clone()'s support for transferring a source bio's cgroup
> >> tags to a clone into both bio_clone_bioset() and __bio_clone_fast().
> >> The former is used by btrfs (MD and blk-core also use it via bio_split).
> >> The latter is used by both DM and bcache.
> >>
> >> This should enable the blkio cgroups controller to work with all
> >> stacking bio-based block devices.
> >>
> >> Reported-by: Nikolay Borisov <kernel@kyup.com>
> >> Suggested-by: Tejun Heo <tj@kernel.org>
> >> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> >> ---
> >>  block/bio.c          | 10 ++++++++++
> >>  fs/btrfs/extent_io.c |  6 ------
> >>  2 files changed, 10 insertions(+), 6 deletions(-)
> > 
> > 
> > So I had a chance to test the settings here is what I got when running 
> > 2 container, using LVM-thin for their root device and having applied 
> > your patch: 
> > 
> > When the 2 containers are using the same blkio.weight values (500) I 
> > get the following from running DD simultaneously on the 2 containers: 
> > 
> > [root@c1501 ~]# dd if=test.img of=test2.img bs=1M count=3000 oflag=direct
> > 3000+0 records in
> > 3000+0 records out
> > 3145728000 bytes (3.1 GB) copied, 165.171 s, 19.0 MB/s
> > 
> > [root@c1500 ~]# dd if=test.img of=test2.img bs=1M count=3000 oflag=direct
> > 3000+0 records in
> > 3000+0 records out
> > 3145728000 bytes (3.1 GB) copied, 166.165 s, 18.9 MB/s
> > 
> > Also iostat showed the 2 volumes using almost the same amount of 
> > IO (around 20mb r/w). I then increase the weight for c1501 to 1000 i.e. 
> > twice the bandwidth that c1500 has, so I would expect its dd to complete
> > twice as fast: 
> > 
> > [root@c1501 ~]# dd if=test.img of=test2.img bs=1M count=3000 oflag=direct
> > 3000+0 records in
> > 3000+0 records out
> > 3145728000 bytes (3.1 GB) copied, 150.892 s, 20.8 MB/s
> > 
> > 
> > [root@c1500 ~]# dd if=test.img of=test2.img bs=1M count=3000 oflag=direct
> > 3000+0 records in
> > 3000+0 records out
> > 3145728000 bytes (3.1 GB) copied, 157.167 s, 20.0 MB/s
> > 
> > Now repeating the same tests but this time using the page-cache 
> > (echo 3 > /proc/sys/vm/drop_caches) was executed before each test run: 
> > 
> > With equal weights (500):
> > [root@c1501 ~]# dd if=test.img of=test2.img bs=1M count=3000
> > 3000+0 records in
> > 3000+0 records out
> > 3145728000 bytes (3.1 GB) copied, 114.923 s, 27.4 MB/s
> > 
> > [root@c1500 ~]# dd if=test.img of=test2.img bs=1M count=3000
> > 3000+0 records in
> > 3000+0 records out
> > 3145728000 bytes (3.1 GB) copied, 120.245 s, 26.2 MB/s
> > 
> > With (c1501's weight equal to twice that of c1500 (1000)):
> > 
> > [root@c1501 ~]# dd if=test.img of=test2.img bs=1M count=3000
> > 3000+0 records in
> > 3000+0 records out
> > 3145728000 bytes (3.1 GB) copied, 99.0181 s, 31.8 MB/s
> > 
> > [root@c1500 ~]# dd if=test.img of=test2.img bs=1M count=3000
> > 3000+0 records in
> > 3000+0 records out
> > 3145728000 bytes (3.1 GB) copied, 122.872 s, 25.6 MB/s
> 
> And another test which makes it obvious that your patch works:
> 
> [root@c1501 ~]# dd if=test.img of=test2.img bs=1M count=6000
> 6000+0 records in
> 6000+0 records out
> 6291456000 bytes (6.3 GB) copied, 210.466 s, 29.9 MB/s
> 
> [root@c1500 ~]# dd if=test.img of=test2.img bs=1M count=3000
> 3000+0 records in
> 3000+0 records out
> 3145728000 bytes (3.1 GB) copied, 201.118 s, 15.6 MB/s
> 
> 
> So a file that is twice the size of another one (6vs3 g) is copied for
> almost the same amount of time with 2x the bandwidth.

Great.

Jens, can you pick up the patch in question ("[PATCH] block: transfer
source bio's cgroup tags to clone via bio_associate_blkcg()") that I
posted in this thread?

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Nikolay Borisov March 14, 2016, 10:08 p.m. UTC | #13
>
> Great.
>
> Jens, can you pick up the patch in question ("[PATCH] block: transfer
> source bio's cgroup tags to clone via bio_associate_blkcg()") that I
> posted in this thread?

And what about Vivek's patch of associating the source bio with the io
context of the process issuing the io?
Would that help in the DIO case?

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Patch
diff mbox

diff --git a/block/bio.c b/block/bio.c
index cf75915..25812be 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -584,6 +584,11 @@  void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
 	bio->bi_rw = bio_src->bi_rw;
 	bio->bi_iter = bio_src->bi_iter;
 	bio->bi_io_vec = bio_src->bi_io_vec;
+
+#ifdef CONFIG_BLK_CGROUP
+	if (bio_src->bi_css)
+		bio_associate_blkcg(bio, bio_src->bi_css);
+#endif
 }
 EXPORT_SYMBOL(__bio_clone_fast);
 
@@ -689,6 +694,11 @@  integrity_clone:
 		}
 	}
 
+#ifdef CONFIG_BLK_CGROUP
+	if (bio_src->bi_css)
+		bio_associate_blkcg(bio, bio_src->bi_css);
+#endif
+
 	return bio;
 }
 EXPORT_SYMBOL(bio_clone_bioset);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 392592d..8abc330 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2691,12 +2691,6 @@  struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask)
 		btrfs_bio->csum = NULL;
 		btrfs_bio->csum_allocated = NULL;
 		btrfs_bio->end_io = NULL;
-
-#ifdef CONFIG_BLK_CGROUP
-		/* FIXME, put this into bio_clone_bioset */
-		if (bio->bi_css)
-			bio_associate_blkcg(new, bio->bi_css);
-#endif
 	}
 	return new;
 }