
[v1,2/6] fs: use on-stack-bio if backing device has BDI_CAP_SYNC capability

Message ID 20170808124959.GB31390@bombadil.infradead.org (mailing list archive)
State New, archived

Commit Message

Matthew Wilcox Aug. 8, 2017, 12:49 p.m. UTC
On Tue, Aug 08, 2017 at 03:50:20PM +0900, Minchan Kim wrote:
> There is no need to use dynamic bio allocation for BDI_CAP_SYNC
> devices. They can work with an on-stack bio without concern about
> waiting for bio allocation from the mempool under heavy memory pressure.

This seems ... more complex than necessary?  Why not simply do this:

Comments

Matthew Wilcox Aug. 8, 2017, 1:29 p.m. UTC | #1
On Tue, Aug 08, 2017 at 05:49:59AM -0700, Matthew Wilcox wrote:
> +	struct bio sbio;
> +	struct bio_vec sbvec;

... this needs to be sbvec[nr_pages], of course.

> -		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> +		if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
> +			bio = &sbio;
> +			bio_init(bio, &sbvec, nr_pages);

... and this needs to be 'sbvec', not '&sbvec'.
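
For clarity, the two corrections folded into the quoted hunk would read
roughly like this (a sketch of the fix, not a posted revision; the VLA is
later bounded by a constant, as Dan suggests further down):

	struct bio sbio;
	struct bio_vec sbvec[nr_pages];	/* VLA sized to the read request */
	...
	if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
		bio = &sbio;
		bio_init(bio, sbvec, nr_pages);	/* array decays to a pointer */
		sbio.bi_bdev = bdev;
		sbio.bi_iter.bi_sector = blocks[0] << (blkbits - 9);
	}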
Minchan Kim Aug. 9, 2017, 1:48 a.m. UTC | #2
Hi Matthew,

On Tue, Aug 08, 2017 at 05:49:59AM -0700, Matthew Wilcox wrote:
> On Tue, Aug 08, 2017 at 03:50:20PM +0900, Minchan Kim wrote:
> > There is no need to use dynamic bio allocation for BDI_CAP_SYNC
> > devices. They can work with an on-stack bio without concern about
> > waiting for bio allocation from the mempool under heavy memory pressure.
> 
> This seems ... more complex than necessary?  Why not simply do this:
> 
> diff --git a/fs/mpage.c b/fs/mpage.c
> index baff8f820c29..6db6bf5131ed 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -157,6 +157,8 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
>  	unsigned page_block;
>  	unsigned first_hole = blocks_per_page;
>  	struct block_device *bdev = NULL;
> +	struct bio sbio;
> +	struct bio_vec sbvec;
>  	int length;
>  	int fully_mapped = 1;
>  	unsigned nblocks;
> @@ -281,10 +283,17 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
>  								page))
>  				goto out;
>  		}
> -		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> +		if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
> +			bio = &sbio;
> +			bio_init(bio, &sbvec, nr_pages);
> +			sbio.bi_bdev = bdev;
> +			sbio.bi_iter.bi_sector = blocks[0] << (blkbits - 9);
> +		} else {
> +			bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
>  				min_t(int, nr_pages, BIO_MAX_PAGES), gfp);
> -		if (bio == NULL)
> -			goto confused;
> +			if (bio == NULL)
> +				goto confused;
> +		}
>  	}
>  
>  	length = first_hole << blkbits;
> @@ -301,6 +310,8 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
>  	else
>  		*last_block_in_bio = blocks[blocks_per_page - 1];
>  out:
> +	if (bio == &sbio)
> +		bio = mpage_bio_submit(REQ_OP_READ, 0, bio);

Looks nicer but one nitpick:

To reuse mpage_bio_submit, we need to call bio_get on the on-stack bio,
which doesn't make sense to me, but if you think it's more readable and
are OK with the overhead of two unnecessary atomic instructions
(bio_get/bio_put), I will do it in the next spin.

Thanks.
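
For readers outside the thread: bio_put() on a bio with no bio_set behind
it ends up freeing the bio, which must never happen to stack memory, and
mpage_end_io() calls bio_put() unconditionally. The pattern Minchan is
describing would look roughly like this (my sketch, not code from the
series):

	bio_get(bio);	/* refcount 1 -> 2, marks the bio BIO_REFFED */
	mpage_bio_submit(REQ_OP_READ, 0, bio);
	/*
	 * mpage_end_io() -> bio_put() now only drops the count back to 1,
	 * so the on-stack bio is never handed to the allocator's free path.
	 */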
Minchan Kim Aug. 9, 2017, 1:51 a.m. UTC | #3
On Tue, Aug 08, 2017 at 06:29:04AM -0700, Matthew Wilcox wrote:
> On Tue, Aug 08, 2017 at 05:49:59AM -0700, Matthew Wilcox wrote:
> > +	struct bio sbio;
> > +	struct bio_vec sbvec;
> 
> ... this needs to be sbvec[nr_pages], of course.
> 
> > -		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> > +		if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
> > +			bio = &sbio;
> > +			bio_init(bio, &sbvec, nr_pages);
> 
> ... and this needs to be 'sbvec', not '&sbvec'.

I don't get why we need sbvec[nr_pages].
The on-stack bio works per-page.
Am I missing something?
Matthew Wilcox Aug. 9, 2017, 2:31 a.m. UTC | #4
On Wed, Aug 09, 2017 at 10:51:13AM +0900, Minchan Kim wrote:
> On Tue, Aug 08, 2017 at 06:29:04AM -0700, Matthew Wilcox wrote:
> > On Tue, Aug 08, 2017 at 05:49:59AM -0700, Matthew Wilcox wrote:
> > > +	struct bio sbio;
> > > +	struct bio_vec sbvec;
> > 
> > ... this needs to be sbvec[nr_pages], of course.
> > 
> > > -		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> > > +		if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
> > > +			bio = &sbio;
> > > +			bio_init(bio, &sbvec, nr_pages);
> > 
> > ... and this needs to be 'sbvec', not '&sbvec'.
> 
> I don't get it why we need sbvec[nr_pages].
> On-stack-bio works with per-page.
> May I miss something?

The way I redid it, it will work with an arbitrary number of pages.
Minchan Kim Aug. 9, 2017, 2:41 a.m. UTC | #5
On Tue, Aug 08, 2017 at 07:31:22PM -0700, Matthew Wilcox wrote:
> On Wed, Aug 09, 2017 at 10:51:13AM +0900, Minchan Kim wrote:
> > On Tue, Aug 08, 2017 at 06:29:04AM -0700, Matthew Wilcox wrote:
> > > On Tue, Aug 08, 2017 at 05:49:59AM -0700, Matthew Wilcox wrote:
> > > > +	struct bio sbio;
> > > > +	struct bio_vec sbvec;
> > > 
> > > ... this needs to be sbvec[nr_pages], of course.
> > > 
> > > > -		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> > > > +		if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
> > > > +			bio = &sbio;
> > > > +			bio_init(bio, &sbvec, nr_pages);
> > > 
> > > ... and this needs to be 'sbvec', not '&sbvec'.
> > 
> > I don't get it why we need sbvec[nr_pages].
> > On-stack-bio works with per-page.
> > May I miss something?
> 
> The way I redid it, it will work with an arbitrary number of pages.

IIUC, that would be a good thing with dynamic bio allocation, where the
allocated bio is passed back and forth, but an on-stack bio cannot work
like that. It has to be done per-page, so is it worth it?
Matthew Wilcox Aug. 10, 2017, 3:04 a.m. UTC | #6
On Wed, Aug 09, 2017 at 11:41:50AM +0900, Minchan Kim wrote:
> On Tue, Aug 08, 2017 at 07:31:22PM -0700, Matthew Wilcox wrote:
> > On Wed, Aug 09, 2017 at 10:51:13AM +0900, Minchan Kim wrote:
> > > On Tue, Aug 08, 2017 at 06:29:04AM -0700, Matthew Wilcox wrote:
> > > > On Tue, Aug 08, 2017 at 05:49:59AM -0700, Matthew Wilcox wrote:
> > > > > +	struct bio sbio;
> > > > > +	struct bio_vec sbvec;
> > > > 
> > > > ... this needs to be sbvec[nr_pages], of course.
> > > > 
> > > > > -		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> > > > > +		if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
> > > > > +			bio = &sbio;
> > > > > +			bio_init(bio, &sbvec, nr_pages);
> > > > 
> > > > ... and this needs to be 'sbvec', not '&sbvec'.
> > > 
> > > I don't get it why we need sbvec[nr_pages].
> > > On-stack-bio works with per-page.
> > > May I miss something?
> > 
> > The way I redid it, it will work with an arbitrary number of pages.
> 
> IIUC, it would be good things with dynamic bio alloction with passing
> allocated bio back and forth but on-stack bio cannot work like that.
> It should be done in per-page so it is worth?

I'm not passing the bio back and forth between do_mpage_readpage() and
its callers.  The version I sent allows for multiple pages in a single
on-stack bio (when called from mpage_readpages()).
Dan Williams Aug. 10, 2017, 3:06 a.m. UTC | #7
On Wed, Aug 9, 2017 at 8:04 PM, Matthew Wilcox <willy@infradead.org> wrote:
> On Wed, Aug 09, 2017 at 11:41:50AM +0900, Minchan Kim wrote:
>> On Tue, Aug 08, 2017 at 07:31:22PM -0700, Matthew Wilcox wrote:
>> > On Wed, Aug 09, 2017 at 10:51:13AM +0900, Minchan Kim wrote:
>> > > On Tue, Aug 08, 2017 at 06:29:04AM -0700, Matthew Wilcox wrote:
>> > > > On Tue, Aug 08, 2017 at 05:49:59AM -0700, Matthew Wilcox wrote:
>> > > > > +     struct bio sbio;
>> > > > > +     struct bio_vec sbvec;
>> > > >
>> > > > ... this needs to be sbvec[nr_pages], of course.
>> > > >
>> > > > > -             bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
>> > > > > +             if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
>> > > > > +                     bio = &sbio;
>> > > > > +                     bio_init(bio, &sbvec, nr_pages);
>> > > >
>> > > > ... and this needs to be 'sbvec', not '&sbvec'.
>> > >
>> > > I don't get it why we need sbvec[nr_pages].
>> > > On-stack-bio works with per-page.
>> > > May I miss something?
>> >
>> > The way I redid it, it will work with an arbitrary number of pages.
>>
>> IIUC, it would be good things with dynamic bio alloction with passing
>> allocated bio back and forth but on-stack bio cannot work like that.
>> It should be done in per-page so it is worth?
>
> I'm not passing the bio back and forth between do_mpage_readpage() and
> its callers.  The version I sent allows for multiple pages in a single
> on-stack bio (when called from mpage_readpages()).

I like it, but do you think we should switch to sbvec[<constant>] to
preclude pathological cases where nr_pages is large?
Minchan Kim Aug. 10, 2017, 4 a.m. UTC | #8
On Wed, Aug 09, 2017 at 08:04:33PM -0700, Matthew Wilcox wrote:
> On Wed, Aug 09, 2017 at 11:41:50AM +0900, Minchan Kim wrote:
> > On Tue, Aug 08, 2017 at 07:31:22PM -0700, Matthew Wilcox wrote:
> > > On Wed, Aug 09, 2017 at 10:51:13AM +0900, Minchan Kim wrote:
> > > > On Tue, Aug 08, 2017 at 06:29:04AM -0700, Matthew Wilcox wrote:
> > > > > On Tue, Aug 08, 2017 at 05:49:59AM -0700, Matthew Wilcox wrote:
> > > > > > +	struct bio sbio;
> > > > > > +	struct bio_vec sbvec;
> > > > > 
> > > > > ... this needs to be sbvec[nr_pages], of course.
> > > > > 
> > > > > > -		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> > > > > > +		if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
> > > > > > +			bio = &sbio;
> > > > > > +			bio_init(bio, &sbvec, nr_pages);
> > > > > 
> > > > > ... and this needs to be 'sbvec', not '&sbvec'.
> > > > 
> > > > I don't get it why we need sbvec[nr_pages].
> > > > On-stack-bio works with per-page.
> > > > May I miss something?
> > > 
> > > The way I redid it, it will work with an arbitrary number of pages.
> > 
> > IIUC, it would be good things with dynamic bio alloction with passing
> > allocated bio back and forth but on-stack bio cannot work like that.
> > It should be done in per-page so it is worth?
> 
> I'm not passing the bio back and forth between do_mpage_readpage() and
> its callers.  The version I sent allows for multiple pages in a single
> on-stack bio (when called from mpage_readpages()).

I'm confused, so I want to confirm your intent before respinning.
Please correct me if I'm missing something.

The version you sent used an on-stack bio within do_mpage_readpage;
that's why I said sbvec[nr_pages] would be pointless, because it works
on a per-page basis unless we use dynamic bio allocation.

But I guess you are now suggesting using the on-stack bio in
mpage_readpages, so that a single on-stack bio on mpage_readpages's
stack can batch multiple pages in its bvecs.

Right?
Christoph Hellwig Aug. 11, 2017, 10:46 a.m. UTC | #9
On Wed, Aug 09, 2017 at 08:06:24PM -0700, Dan Williams wrote:
> I like it, but do you think we should switch to sbvec[<constant>] to
> preclude pathological cases where nr_pages is large?

Yes, please.

Then I'd like to see that the on-stack bio even matters for
mpage_readpage / mpage_writepage.  Compared to all the buffer head
overhead the bio allocation should not actually matter in practice.
Jens Axboe Aug. 11, 2017, 2:26 p.m. UTC | #10
On 08/11/2017 04:46 AM, Christoph Hellwig wrote:
> On Wed, Aug 09, 2017 at 08:06:24PM -0700, Dan Williams wrote:
>> I like it, but do you think we should switch to sbvec[<constant>] to
>> preclude pathological cases where nr_pages is large?
> 
> Yes, please.
> 
> Then I'd like to see that the on-stack bio even matters for
> mpage_readpage / mpage_writepage.  Compared to all the buffer head
> overhead the bio allocation should not actually matter in practice.

I'm skeptical about that path, too. I also wonder how far we could go
with just doing a per-cpu bio recycling facility, to reduce the cost
of having to allocate a bio. The on-stack bio parts are fine for the
simple use case, where simple means that the patch just special-cases
the allocation, and doesn't have to change much else.

I had a patch for bio recycling and batched freeing a year or two
ago; I'll see if I can find and resurrect it.
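
No details of that patch are posted at this point in the thread, but the
general shape of such a per-cpu recycling facility might be as follows
(every name here is invented for illustration; Jens's actual branch
appears further down):

	#define CPU_ALLOC_CACHE_SIZE	16	/* hypothetical cache depth */

	struct bio_alloc_cache {
		unsigned int	nr;
		struct bio	*cache[CPU_ALLOC_CACHE_SIZE];
	};
	static DEFINE_PER_CPU(struct bio_alloc_cache, bio_alloc_cache);

	static struct bio *bio_cached_alloc(gfp_t gfp, unsigned int nr_vecs)
	{
		struct bio_alloc_cache *c = get_cpu_ptr(&bio_alloc_cache);
		struct bio *bio = c->nr ? c->cache[--c->nr] : NULL;

		put_cpu_ptr(&bio_alloc_cache);
		if (!bio)
			return bio_alloc(gfp, nr_vecs);	/* miss: regular allocator */
		bio_reset(bio);
		return bio;
	}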
Minchan Kim Aug. 14, 2017, 8:48 a.m. UTC | #11
Hi Christoph,

On Fri, Aug 11, 2017 at 12:46:15PM +0200, Christoph Hellwig wrote:
> On Wed, Aug 09, 2017 at 08:06:24PM -0700, Dan Williams wrote:
> > I like it, but do you think we should switch to sbvec[<constant>] to
> > preclude pathological cases where nr_pages is large?
> 
> Yes, please.

Still, I don't understand how sbvec[nr_pages] with an on-stack bio in
do_mpage_readpage can help the performance.

IIUC, do_mpage_readpage works on a per-page basis. IOW, it passes just
one page, not multiple pages, so if we use an on-stack bio, we just add
*a page* via bio_add_page and submit the bio before the function returns.

So, rather than sbvec[1], why do we need sbvec[nr_pages]?

Please, let me open my eyes. :)

Thanks.
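
For reference, the flow Minchan describes, reduced to a sketch
(paraphrased from fs/mpage.c of that era, not the verbatim source): each
call adds a single page's worth of data, and a second page can only be
added by a later call, via the bio handed back through the return value.

	/* inside do_mpage_readpage(), once a bio exists: */
	length = first_hole << blkbits;
	if (bio_add_page(bio, page, length, 0) < length) {
		/* the bio is full: submit it and start a new one */
		bio = mpage_bio_submit(REQ_OP_READ, 0, bio);
		goto alloc_new;
	}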
Minchan Kim Aug. 14, 2017, 8:50 a.m. UTC | #12
Hi Jens,

On Fri, Aug 11, 2017 at 08:26:59AM -0600, Jens Axboe wrote:
> On 08/11/2017 04:46 AM, Christoph Hellwig wrote:
> > On Wed, Aug 09, 2017 at 08:06:24PM -0700, Dan Williams wrote:
> >> I like it, but do you think we should switch to sbvec[<constant>] to
> >> preclude pathological cases where nr_pages is large?
> > 
> > Yes, please.
> > 
> > Then I'd like to see that the on-stack bio even matters for
> > mpage_readpage / mpage_writepage.  Compared to all the buffer head
> > overhead the bio allocation should not actually matter in practice.
> 
> I'm skeptical for that path, too. I also wonder how far we could go
> with just doing a per-cpu bio recycling facility, to reduce the cost
> of having to allocate a bio. The on-stack bio parts are fine for
> simple use case, where simple means that the patch just special
> cases the allocation, and doesn't have to change much else.
> 
> I had a patch for bio recycling and batched freeing a year or two
> ago, I'll see if I can find and resurrect it.

So, you want to go with the per-cpu bio recycling approach to
remove rw_page?

If so, do you want me to hold this patchset?
Jens Axboe Aug. 14, 2017, 2:36 p.m. UTC | #13
On 08/14/2017 02:50 AM, Minchan Kim wrote:
> Hi Jens,
> 
> On Fri, Aug 11, 2017 at 08:26:59AM -0600, Jens Axboe wrote:
>> On 08/11/2017 04:46 AM, Christoph Hellwig wrote:
>>> On Wed, Aug 09, 2017 at 08:06:24PM -0700, Dan Williams wrote:
>>>> I like it, but do you think we should switch to sbvec[<constant>] to
>>>> preclude pathological cases where nr_pages is large?
>>>
>>> Yes, please.
>>>
>>> Then I'd like to see that the on-stack bio even matters for
>>> mpage_readpage / mpage_writepage.  Compared to all the buffer head
>>> overhead the bio allocation should not actually matter in practice.
>>
>> I'm skeptical for that path, too. I also wonder how far we could go
>> with just doing a per-cpu bio recycling facility, to reduce the cost
>> of having to allocate a bio. The on-stack bio parts are fine for
>> simple use case, where simple means that the patch just special
>> cases the allocation, and doesn't have to change much else.
>>
>> I had a patch for bio recycling and batched freeing a year or two
>> ago, I'll see if I can find and resurrect it.
> 
> So, you want to go with per-cpu bio recycling approach to
> remove rw_page?
> 
> So, do you want me to hold this patchset?

I don't want to hold this series up, but I do think the recycling is
a cleaner approach since we don't need to special case anything. I
hope I'll get some time to dust it off, retest, and post soon.
Minchan Kim Aug. 14, 2017, 3:06 p.m. UTC | #14
On Mon, Aug 14, 2017 at 08:36:00AM -0600, Jens Axboe wrote:
> On 08/14/2017 02:50 AM, Minchan Kim wrote:
> > Hi Jens,
> > 
> > On Fri, Aug 11, 2017 at 08:26:59AM -0600, Jens Axboe wrote:
> >> On 08/11/2017 04:46 AM, Christoph Hellwig wrote:
> >>> On Wed, Aug 09, 2017 at 08:06:24PM -0700, Dan Williams wrote:
> >>>> I like it, but do you think we should switch to sbvec[<constant>] to
> >>>> preclude pathological cases where nr_pages is large?
> >>>
> >>> Yes, please.
> >>>
> >>> Then I'd like to see that the on-stack bio even matters for
> >>> mpage_readpage / mpage_writepage.  Compared to all the buffer head
> >>> overhead the bio allocation should not actually matter in practice.
> >>
> >> I'm skeptical for that path, too. I also wonder how far we could go
> >> with just doing a per-cpu bio recycling facility, to reduce the cost
> >> of having to allocate a bio. The on-stack bio parts are fine for
> >> simple use case, where simple means that the patch just special
> >> cases the allocation, and doesn't have to change much else.
> >>
> >> I had a patch for bio recycling and batched freeing a year or two
> >> ago, I'll see if I can find and resurrect it.
> > 
> > So, you want to go with per-cpu bio recycling approach to
> > remove rw_page?
> > 
> > So, do you want me to hold this patchset?
> 
> I don't want to hold this series up, but I do think the recycling is
> a cleaner approach since we don't need to special case anything. I
> hope I'll get some time to dust it off, retest, and post soon.

I don't know how your bio recycling works, but my worry when I first
heard about per-cpu bio recycling is that if the pool is not reserved
for BDI_CAP_SYNCHRONOUS devices (IOW, if it is shared by several storage
devices), bios can be consumed by a slow device (e.g., eMMC), so that a
bio for the fastest device in the system (e.g., zram in an embedded
system) can get stuck waiting on a bio until the IO for the slow device
is completed.

I guess that would not be a rare case for a swap device under severe
memory pressure, because lots of the page cache has already been
reclaimed by the time anonymous pages start to be reclaimed, so many
bios can be consumed by eMMC to fetch code while swap IO to fetch heap
data gets stuck, although zram-swap is much faster than eMMC.
As well, the time spent waiting for a bio even among the fastest
devices is a simple waste, I guess.

To me, the on-stack bio suggested by Christoph Hellwig doesn't diverge
from the current path a lot and is simple enough to change.

Anyway, I'm okay with either way if we can remove rw_page without
any regression, because the maintenance of both rw_page and
make_request is rather a burden for zram, too.
Jens Axboe Aug. 14, 2017, 3:14 p.m. UTC | #15
On 08/14/2017 09:06 AM, Minchan Kim wrote:
> On Mon, Aug 14, 2017 at 08:36:00AM -0600, Jens Axboe wrote:
>> On 08/14/2017 02:50 AM, Minchan Kim wrote:
>>> Hi Jens,
>>>
>>> On Fri, Aug 11, 2017 at 08:26:59AM -0600, Jens Axboe wrote:
>>>> On 08/11/2017 04:46 AM, Christoph Hellwig wrote:
>>>>> On Wed, Aug 09, 2017 at 08:06:24PM -0700, Dan Williams wrote:
>>>>>> I like it, but do you think we should switch to sbvec[<constant>] to
>>>>>> preclude pathological cases where nr_pages is large?
>>>>>
>>>>> Yes, please.
>>>>>
>>>>> Then I'd like to see that the on-stack bio even matters for
>>>>> mpage_readpage / mpage_writepage.  Compared to all the buffer head
>>>>> overhead the bio allocation should not actually matter in practice.
>>>>
>>>> I'm skeptical for that path, too. I also wonder how far we could go
>>>> with just doing a per-cpu bio recycling facility, to reduce the cost
>>>> of having to allocate a bio. The on-stack bio parts are fine for
>>>> simple use case, where simple means that the patch just special
>>>> cases the allocation, and doesn't have to change much else.
>>>>
>>>> I had a patch for bio recycling and batched freeing a year or two
>>>> ago, I'll see if I can find and resurrect it.
>>>
>>> So, you want to go with per-cpu bio recycling approach to
>>> remove rw_page?
>>>
>>> So, do you want me to hold this patchset?
>>
>> I don't want to hold this series up, but I do think the recycling is
>> a cleaner approach since we don't need to special case anything. I
>> hope I'll get some time to dust it off, retest, and post soon.
> 
> I don't know how your bio recycling works. But my worry when I heard
> per-cpu bio recycling firstly is if it's not reserved pool for
> BDI_CAP_SYNCHRONOUS(IOW, if it is shared by several storages),
> BIOs can be consumed by slow device(e.g., eMMC) so that a bio for
> fastest device(e.g., zram in embedded system) in the system can be
> stucked to wait on bio until IO for slow deivce is completed.
> 
> I guess it would be a not rare case for swap device under severe
> memory pressure because lots of page cache are already reclaimed when
> anonymous page start to be reclaimed so that many BIOs can be consumed
> for eMMC to fetch code but swap IO to fetch heap data would be stucked
> although zram-swap is much faster than eMMC.
> As well, time to wait to get BIO among even fastest devices is
> simple waste, I guess.

I don't think that's a valid concern. First of all, for the recycling,
it's not like you wait on someone else who is using a recycled bio;
if one isn't there, you simply go to the regular bio allocator. There
is no waiting for a free one. The idea is to make allocation faster,
since we can avoid going to the memory allocator in most cases, and to
speed up freeing as well, since we can do that in batches too.

Secondly, generally you don't have slow devices and fast devices
intermingled when running workloads. That's the rare case.

> To me, bio suggested by Christoph Hellwig isn't diverge current
> path a lot and simple enough to change.

It doesn't diverge it a lot, but it does split it up.

> Anyway, I'm okay with either way if we can remove rw_page without
> any regression because the maintainance of both rw_page and
> make_request is rather burden for zram, too.

Agree, the ultimate goal of both is to eliminate the need for the
rw_page hack.
Minchan Kim Aug. 14, 2017, 3:31 p.m. UTC | #16
On Mon, Aug 14, 2017 at 09:14:03AM -0600, Jens Axboe wrote:
> On 08/14/2017 09:06 AM, Minchan Kim wrote:
> > On Mon, Aug 14, 2017 at 08:36:00AM -0600, Jens Axboe wrote:
> >> On 08/14/2017 02:50 AM, Minchan Kim wrote:
> >>> Hi Jens,
> >>>
> >>> On Fri, Aug 11, 2017 at 08:26:59AM -0600, Jens Axboe wrote:
> >>>> On 08/11/2017 04:46 AM, Christoph Hellwig wrote:
> >>>>> On Wed, Aug 09, 2017 at 08:06:24PM -0700, Dan Williams wrote:
> >>>>>> I like it, but do you think we should switch to sbvec[<constant>] to
> >>>>>> preclude pathological cases where nr_pages is large?
> >>>>>
> >>>>> Yes, please.
> >>>>>
> >>>>> Then I'd like to see that the on-stack bio even matters for
> >>>>> mpage_readpage / mpage_writepage.  Compared to all the buffer head
> >>>>> overhead the bio allocation should not actually matter in practice.
> >>>>
> >>>> I'm skeptical for that path, too. I also wonder how far we could go
> >>>> with just doing a per-cpu bio recycling facility, to reduce the cost
> >>>> of having to allocate a bio. The on-stack bio parts are fine for
> >>>> simple use case, where simple means that the patch just special
> >>>> cases the allocation, and doesn't have to change much else.
> >>>>
> >>>> I had a patch for bio recycling and batched freeing a year or two
> >>>> ago, I'll see if I can find and resurrect it.
> >>>
> >>> So, you want to go with per-cpu bio recycling approach to
> >>> remove rw_page?
> >>>
> >>> So, do you want me to hold this patchset?
> >>
> >> I don't want to hold this series up, but I do think the recycling is
> >> a cleaner approach since we don't need to special case anything. I
> >> hope I'll get some time to dust it off, retest, and post soon.
> > 
> > I don't know how your bio recycling works. But my worry when I heard
> > per-cpu bio recycling firstly is if it's not reserved pool for
> > BDI_CAP_SYNCHRONOUS(IOW, if it is shared by several storages),
> > BIOs can be consumed by slow device(e.g., eMMC) so that a bio for
> > fastest device(e.g., zram in embedded system) in the system can be
> > stucked to wait on bio until IO for slow deivce is completed.
> > 
> > I guess it would be a not rare case for swap device under severe
> > memory pressure because lots of page cache are already reclaimed when
> > anonymous page start to be reclaimed so that many BIOs can be consumed
> > for eMMC to fetch code but swap IO to fetch heap data would be stucked
> > although zram-swap is much faster than eMMC.
> > As well, time to wait to get BIO among even fastest devices is
> > simple waste, I guess.
> 
> I don't think that's a valid concern. First of all, for the recycling,
> it's not like you get to wait on someone else using a recycled bio,
> if it's not there you simply go to the regular bio allocator. There
> is no waiting for free. The idea is to have allocation be faster since
> we can avoid going to the memory allocator for most cases, and speed
> up freeing as well, since we can do that in batches too.

I doubt it will perform that much better because, at the beginning of
this thread [1], Ross said that even with dynamic bio allocation and
without rw_page, there was no regression in some testing.
[1] http://lkml.kernel.org/r/<20170728165604.10455-1-ross.zwisler@linux.intel.com>

> 
> Secondly, generally you don't have slow devices and fast devices
> intermingled when running workloads. That's the rare case.

Not true. zram is a really popular swap device for embedded systems,
where a low-cost product can have really poor, slow NAND compared to
lz4/lzo [de]compression.

> 
> > To me, bio suggested by Christoph Hellwig isn't diverge current
> > path a lot and simple enough to change.
> 
> It doesn't diverge it a lot, but it does split it up.
> 
> > Anyway, I'm okay with either way if we can remove rw_page without
> > any regression because the maintainance of both rw_page and
> > make_request is rather burden for zram, too.
> 
> Agree, the ultimate goal of both is to eliminate the need for the
> rw_page hack.

Yeb.
Jens Axboe Aug. 14, 2017, 3:38 p.m. UTC | #17
On 08/14/2017 09:31 AM, Minchan Kim wrote:
>> Secondly, generally you don't have slow devices and fast devices
>> intermingled when running workloads. That's the rare case.
> 
> Not true. zRam is really popular swap for embedded devices where
> one of low cost product has a really poor slow nand compared to
> lz4/lzo [de]comression.

I guess that's true for some cases. But as I said earlier, the recycling
really doesn't care about this at all. They can happily coexist, and not
step on each other's toes.
Jens Axboe Aug. 14, 2017, 4:17 p.m. UTC | #18
On 08/14/2017 09:38 AM, Jens Axboe wrote:
> On 08/14/2017 09:31 AM, Minchan Kim wrote:
>>> Secondly, generally you don't have slow devices and fast devices
>>> intermingled when running workloads. That's the rare case.
>>
>> Not true. zRam is really popular swap for embedded devices where
>> one of low cost product has a really poor slow nand compared to
>> lz4/lzo [de]comression.
> 
> I guess that's true for some cases. But as I said earlier, the recycling
> really doesn't care about this at all. They can happily coexist, and not
> step on each others toes.

Dusted it off, result is here against -rc5:

http://git.kernel.dk/cgit/linux-block/log/?h=cpu-alloc-cache

I'd like to split the number of units we cache and the number of units
we free; right now they are both CPU_ALLOC_CACHE_SIZE. This means that
once we hit that count, we free all of them, and then store the one we
were asked to free. That always keeps 1 local, but maybe it'd make more
sense to free just CPU_ALLOC_CACHE_SIZE/2 (or something like that)
so that we retain more than 1 per CPU in case an app is preempted while
sleeping for IO and the new task on that CPU then issues IO as well.
Probably minor.
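
Continuing the invented-name sketch from above (again, an illustration,
not the actual branch), the half-flush policy being described would look
something like:

	static void bio_cached_free(struct bio *bio)
	{
		struct bio_alloc_cache *c = get_cpu_ptr(&bio_alloc_cache);

		if (c->nr == CPU_ALLOC_CACHE_SIZE) {
			/* flush half, so a preempting task can still hit */
			while (c->nr > CPU_ALLOC_CACHE_SIZE / 2)
				bio_put(c->cache[--c->nr]);
		}
		c->cache[c->nr++] = bio;
		put_cpu_ptr(&bio_alloc_cache);
	}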

Ran a quick test on nullb0 with 32 sync readers. The test was O_DIRECT
on the block device, so I disabled the __blkdev_direct_IO_simple()
bypass. With the above branch, we get ~18.0M IOPS, and without we get
~14M IOPS. Both ran with iostats disabled, to avoid any interference
from that.
Minchan Kim Aug. 16, 2017, 4:48 a.m. UTC | #19
Hi Jens,

On Mon, Aug 14, 2017 at 10:17:09AM -0600, Jens Axboe wrote:
> On 08/14/2017 09:38 AM, Jens Axboe wrote:
> > On 08/14/2017 09:31 AM, Minchan Kim wrote:
> >>> Secondly, generally you don't have slow devices and fast devices
> >>> intermingled when running workloads. That's the rare case.
> >>
> >> Not true. zRam is really popular swap for embedded devices where
> >> one of low cost product has a really poor slow nand compared to
> >> lz4/lzo [de]comression.
> > 
> > I guess that's true for some cases. But as I said earlier, the recycling
> > really doesn't care about this at all. They can happily coexist, and not
> > step on each others toes.
> 
> Dusted it off, result is here against -rc5:
> 
> http://git.kernel.dk/cgit/linux-block/log/?h=cpu-alloc-cache
> 
> I'd like to split the amount of units we cache and the amount of units
> we free, right now they are both CPU_ALLOC_CACHE_SIZE. This means that
> once we hit that count, we free all of the, and then store the one we
> were asked to free. That always keeps 1 local, but maybe it'd make more
> sense to cache just free CPU_ALLOC_CACHE_SIZE/2 (or something like that)
> so that we retain more than 1 per cpu in case and app preempts when
> sleeping for IO and the new task on that CPU then issues IO as well.
> Probably minor.
> 
> Ran a quick test on nullb0 with 32 sync readers. The test was O_DIRECT
> on the block device, so I disabled the __blkdev_direct_IO_simple()
> bypass. With the above branch, we get ~18.0M IOPS, and without we get
> ~14M IOPS. Both ran with iostats disabled, to avoid any interference
> from that.

Looks promising.
If bio recycling works well enough, I think we don't need to introduce
a new split in the path for an on-stack bio.
I will test your version on zram-swap!

Thanks.
Jens Axboe Aug. 16, 2017, 3:56 p.m. UTC | #20
On 08/15/2017 10:48 PM, Minchan Kim wrote:
> Hi Jens,
> 
> On Mon, Aug 14, 2017 at 10:17:09AM -0600, Jens Axboe wrote:
>> On 08/14/2017 09:38 AM, Jens Axboe wrote:
>>> On 08/14/2017 09:31 AM, Minchan Kim wrote:
>>>>> Secondly, generally you don't have slow devices and fast devices
>>>>> intermingled when running workloads. That's the rare case.
>>>>
>>>> Not true. zRam is really popular swap for embedded devices where
>>>> one of low cost product has a really poor slow nand compared to
>>>> lz4/lzo [de]comression.
>>>
>>> I guess that's true for some cases. But as I said earlier, the recycling
>>> really doesn't care about this at all. They can happily coexist, and not
>>> step on each others toes.
>>
>> Dusted it off, result is here against -rc5:
>>
>> http://git.kernel.dk/cgit/linux-block/log/?h=cpu-alloc-cache
>>
>> I'd like to split the amount of units we cache and the amount of units
>> we free, right now they are both CPU_ALLOC_CACHE_SIZE. This means that
>> once we hit that count, we free all of the, and then store the one we
>> were asked to free. That always keeps 1 local, but maybe it'd make more
>> sense to cache just free CPU_ALLOC_CACHE_SIZE/2 (or something like that)
>> so that we retain more than 1 per cpu in case and app preempts when
>> sleeping for IO and the new task on that CPU then issues IO as well.
>> Probably minor.
>>
>> Ran a quick test on nullb0 with 32 sync readers. The test was O_DIRECT
>> on the block device, so I disabled the __blkdev_direct_IO_simple()
>> bypass. With the above branch, we get ~18.0M IOPS, and without we get
>> ~14M IOPS. Both ran with iostats disabled, to avoid any interference
>> from that.
> 
> Looks promising.
> If recycling bio works well enough, I think we don't need to introduce
> new split in the path for on-stack bio.
> I will test your version on zram-swap!

Thanks, let me know how it goes. It's quite possible that we'll need
a few further tweaks, but at least the basis should be there.
Minchan Kim Aug. 21, 2017, 6:13 a.m. UTC | #21
Hi Jens,

On Wed, Aug 16, 2017 at 09:56:12AM -0600, Jens Axboe wrote:
> On 08/15/2017 10:48 PM, Minchan Kim wrote:
> > Hi Jens,
> > 
> > On Mon, Aug 14, 2017 at 10:17:09AM -0600, Jens Axboe wrote:
> >> On 08/14/2017 09:38 AM, Jens Axboe wrote:
> >>> On 08/14/2017 09:31 AM, Minchan Kim wrote:
> >>>>> Secondly, generally you don't have slow devices and fast devices
> >>>>> intermingled when running workloads. That's the rare case.
> >>>>
> >>>> Not true. zRam is really popular swap for embedded devices where
> >>>> one of low cost product has a really poor slow nand compared to
> >>>> lz4/lzo [de]comression.
> >>>
> >>> I guess that's true for some cases. But as I said earlier, the recycling
> >>> really doesn't care about this at all. They can happily coexist, and not
> >>> step on each others toes.
> >>
> >> Dusted it off, result is here against -rc5:
> >>
> >> http://git.kernel.dk/cgit/linux-block/log/?h=cpu-alloc-cache
> >>
> >> I'd like to split the amount of units we cache and the amount of units
> >> we free, right now they are both CPU_ALLOC_CACHE_SIZE. This means that
> >> once we hit that count, we free all of the, and then store the one we
> >> were asked to free. That always keeps 1 local, but maybe it'd make more
> >> sense to cache just free CPU_ALLOC_CACHE_SIZE/2 (or something like that)
> >> so that we retain more than 1 per cpu in case and app preempts when
> >> sleeping for IO and the new task on that CPU then issues IO as well.
> >> Probably minor.
> >>
> >> Ran a quick test on nullb0 with 32 sync readers. The test was O_DIRECT
> >> on the block device, so I disabled the __blkdev_direct_IO_simple()
> >> bypass. With the above branch, we get ~18.0M IOPS, and without we get
> >> ~14M IOPS. Both ran with iostats disabled, to avoid any interference
> >> from that.
> > 
> > Looks promising.
> > If recycling bio works well enough, I think we don't need to introduce
> > new split in the path for on-stack bio.
> > I will test your version on zram-swap!
> 
> Thanks, let me know how it goes. It's quite possible that we'll need
> a few further tweaks, but at least the basis should be there.

Sorry for my late reply.

I just finished the swap-in testing with zram-swap, which is critical
for latency.

For the testing, I made a memcg and put $NR_CPU (mine is 12) processes
in there; each process consumes 1G, so the total is 12G, while my system
has 16GB of memory, so there was no global reclaim.
Then 'echo 1 > /mnt/memcg/group/force.empty' swaps all the pages out,
the programs wait for my signal to swap back in, and I send the signal
to every process to swap in every page and measure the elapsed time for
the swap-in.

The value is the average time in usec elapsed swapping 1G of pages back
in for each process; I repeated it 10 times and the stddev is very
stable.

swapin:
base(with rw_page)      1100806.73(100.00%)
no-rw_page              1146856.95(104.18%)
Jens's pcp              1146910.00(104.19%)
onstack-bio             1114872.18(101.28%)

In my test, there is no difference between dynamic bio allocation
(i.e., no-rw_page) and the pcp approach, but onstack-bio is much faster,
so it's almost the same as rw_page.

The swapout test measures the elapsed time for
'echo 1 > /mnt/memcg/test_group/force.empty', so the unit is seconds.

swapout:
base(with rw_page)      7.72(100.00%)
no-rw_page              8.36(108.29%)
Jens's pcp              8.31(107.64%)
onstack-bio             8.19(106.09%)

rw_page's swapout is 6% or more faster than the others.

I tried pmbench with no memcg to see the performance under global
reclaim. Also, I ran a background IO job which reads data from an HDD.
The value is the average time in usec for a page access, so smaller is
better.

base(with rw_page)              14.42(100.00%)
no-rw_page                      15.66(108.60%)
Jens's pcp                      15.81(109.64%)
onstack-bio                     15.42(106.93%)

It's similar to the swapout test in the memcg.
6% or more is not trivial, so I doubt we can remove rw_page
at this moment. :(

I will look into the details with perf.
If you have further optimizations or suggestions, feel free to
suggest them. I am happy to test them.

Thanks.

Patch

diff --git a/fs/mpage.c b/fs/mpage.c
index baff8f820c29..6db6bf5131ed 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -157,6 +157,8 @@  do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 	unsigned page_block;
 	unsigned first_hole = blocks_per_page;
 	struct block_device *bdev = NULL;
+	struct bio sbio;
+	struct bio_vec sbvec;
 	int length;
 	int fully_mapped = 1;
 	unsigned nblocks;
@@ -281,10 +283,17 @@  do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 								page))
 				goto out;
 		}
-		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
+		if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
+			bio = &sbio;
+			bio_init(bio, &sbvec, nr_pages);
+			sbio.bi_bdev = bdev;
+			sbio.bi_iter.bi_sector = blocks[0] << (blkbits - 9);
+		} else {
+			bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
 				min_t(int, nr_pages, BIO_MAX_PAGES), gfp);
-		if (bio == NULL)
-			goto confused;
+			if (bio == NULL)
+				goto confused;
+		}
 	}
 
 	length = first_hole << blkbits;
@@ -301,6 +310,8 @@  do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 	else
 		*last_block_in_bio = blocks[blocks_per_page - 1];
 out:
+	if (bio == &sbio)
+		bio = mpage_bio_submit(REQ_OP_READ, 0, bio);
 	return bio;
 
 confused: