Message ID | 20170808124959.GB31390@bombadil.infradead.org (mailing list archive)
---|---
State | New, archived
On Tue, Aug 08, 2017 at 05:49:59AM -0700, Matthew Wilcox wrote:
> +	struct bio sbio;
> +	struct bio_vec sbvec;

... this needs to be sbvec[nr_pages], of course.

> -	bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> +	if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
> +		bio = &sbio;
> +		bio_init(bio, &sbvec, nr_pages);

... and this needs to be 'sbvec', not '&sbvec'.
Hi Matthew,

On Tue, Aug 08, 2017 at 05:49:59AM -0700, Matthew Wilcox wrote:
> On Tue, Aug 08, 2017 at 03:50:20PM +0900, Minchan Kim wrote:
> > There is no need to use dynamic bio allocation for BDI_CAP_SYNC
> > devices. They can work with an on-stack bio without concern about
> > waiting for bio allocation from the mempool under heavy memory pressure.
>
> This seems ... more complex than necessary? Why not simply do this:
>
> diff --git a/fs/mpage.c b/fs/mpage.c
> index baff8f820c29..6db6bf5131ed 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -157,6 +157,8 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
> 	unsigned page_block;
> 	unsigned first_hole = blocks_per_page;
> 	struct block_device *bdev = NULL;
> +	struct bio sbio;
> +	struct bio_vec sbvec;
> 	int length;
> 	int fully_mapped = 1;
> 	unsigned nblocks;
> @@ -281,10 +283,17 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
> 					page))
> 				goto out;
> 		}
> -		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> +		if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
> +			bio = &sbio;
> +			bio_init(bio, &sbvec, nr_pages);
> +			sbio.bi_bdev = bdev;
> +			sbio.bi_iter.bi_sector = blocks[0] << (blkbits - 9);
> +		} else {
> +			bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> 				min_t(int, nr_pages, BIO_MAX_PAGES), gfp);
> -		if (bio == NULL)
> -			goto confused;
> +			if (bio == NULL)
> +				goto confused;
> +		}
> 	}
>
> 	length = first_hole << blkbits;
> @@ -301,6 +310,8 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
> 	else
> 		*last_block_in_bio = blocks[blocks_per_page - 1];
>  out:
> +	if (bio == &sbio)
> +		bio = mpage_bio_submit(REQ_OP_READ, 0, bio);

Looks nicer, but one nitpick: to reuse mpage_bio_submit, we need to call bio_get on the on-stack bio, which doesn't make sense to me. But if you think it's more readable and are OK with the overhead of two unnecessary atomic instructions (bio_get/put), I will do it in the next spin.

Thanks.
On Tue, Aug 08, 2017 at 06:29:04AM -0700, Matthew Wilcox wrote:
> > +	struct bio sbio;
> > +	struct bio_vec sbvec;
>
> ... this needs to be sbvec[nr_pages], of course.
>
> > -	bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> > +	if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
> > +		bio = &sbio;
> > +		bio_init(bio, &sbvec, nr_pages);
>
> ... and this needs to be 'sbvec', not '&sbvec'.

I don't get why we need sbvec[nr_pages]. The on-stack bio works per-page. Am I missing something?
On Wed, Aug 09, 2017 at 10:51:13AM +0900, Minchan Kim wrote:
> I don't get it why we need sbvec[nr_pages].
> On-stack-bio works with per-page.
> May I miss something?

The way I redid it, it will work with an arbitrary number of pages.
On Tue, Aug 08, 2017 at 07:31:22PM -0700, Matthew Wilcox wrote:
> > I don't get it why we need sbvec[nr_pages].
> > On-stack-bio works with per-page.
> > May I miss something?
>
> The way I redid it, it will work with an arbitrary number of pages.

IIUC, that would be a good thing with dynamic bio allocation, where the allocated bio is passed back and forth, but an on-stack bio cannot work like that. It has to be done per-page, so is it worth it?
On Wed, Aug 09, 2017 at 11:41:50AM +0900, Minchan Kim wrote:
> IIUC, it would be good things with dynamic bio alloction with passing
> allocated bio back and forth but on-stack bio cannot work like that.
> It should be done in per-page so it is worth?

I'm not passing the bio back and forth between do_mpage_readpage() and its callers. The version I sent allows for multiple pages in a single on-stack bio (when called from mpage_readpages()).
On Wed, Aug 9, 2017 at 8:04 PM, Matthew Wilcox <willy@infradead.org> wrote:
> I'm not passing the bio back and forth between do_mpage_readpage() and
> its callers. The version I sent allows for multiple pages in a single
> on-stack bio (when called from mpage_readpages()).

I like it, but do you think we should switch to sbvec[<constant>] to preclude pathological cases where nr_pages is large?
On Wed, Aug 09, 2017 at 08:04:33PM -0700, Matthew Wilcox wrote:
> I'm not passing the bio back and forth between do_mpage_readpage() and
> its callers. The version I sent allows for multiple pages in a single
> on-stack bio (when called from mpage_readpages()).

I'm confused and want to confirm your intent before respinning; please correct me if I'm missing something. The version you sent used the on-stack bio within do_mpage_readpage, which is why I said sbvec[nr_pages] would be pointless: it works on a per-page basis unless we use dynamic bio allocation. But I guess you are now suggesting using the on-stack bio in mpage_readpages, so a single on-stack bio on mpage_readpages's stack can batch multiple pages in the bio's bvecs. Right?
On Wed, Aug 09, 2017 at 08:06:24PM -0700, Dan Williams wrote:
> I like it, but do you think we should switch to sbvec[<constant>] to
> preclude pathological cases where nr_pages is large?

Yes, please.

Then I'd like to see that the on-stack bio even matters for mpage_readpage / mpage_writepage. Compared to all the buffer head overhead, the bio allocation should not actually matter in practice.
On 08/11/2017 04:46 AM, Christoph Hellwig wrote:
> On Wed, Aug 09, 2017 at 08:06:24PM -0700, Dan Williams wrote:
>> I like it, but do you think we should switch to sbvec[<constant>] to
>> preclude pathological cases where nr_pages is large?
>
> Yes, please.
>
> Then I'd like to see that the on-stack bio even matters for
> mpage_readpage / mpage_writepage. Compared to all the buffer head
> overhead the bio allocation should not actually matter in practice.

I'm skeptical for that path, too. I also wonder how far we could go with just doing a per-cpu bio recycling facility to reduce the cost of having to allocate a bio. The on-stack bio parts are fine for the simple use case, where simple means that the patch just special-cases the allocation and doesn't have to change much else.

I had a patch for bio recycling and batched freeing a year or two ago; I'll see if I can find and resurrect it.
Hi Christoph,

On Fri, Aug 11, 2017 at 12:46:15PM +0200, Christoph Hellwig wrote:
> On Wed, Aug 09, 2017 at 08:06:24PM -0700, Dan Williams wrote:
> > I like it, but do you think we should switch to sbvec[<constant>] to
> > preclude pathological cases where nr_pages is large?
>
> Yes, please.

Still, I don't understand how sbvec[nr_pages] with an on-stack bio in do_mpage_readpage can help performance. IIUC, do_mpage_readpage works page-based. IOW, it is passed just one page, not multiple pages, so if we use an on-stack bio, we just add *a page* via bio_add_page and submit the bio before the function returns. So rather than sbvec[1], why do we need sbvec[nr_pages]? Please, let me open my eyes. :)

Thanks.
Hi Jens,

On Fri, Aug 11, 2017 at 08:26:59AM -0600, Jens Axboe wrote:
> I'm skeptical for that path, too. I also wonder how far we could go
> with just doing a per-cpu bio recycling facility, to reduce the cost
> of having to allocate a bio. The on-stack bio parts are fine for
> simple use case, where simple means that the patch just special
> cases the allocation, and doesn't have to change much else.
>
> I had a patch for bio recycling and batched freeing a year or two
> ago, I'll see if I can find and resurrect it.

So you want to go with the per-cpu bio recycling approach to remove rw_page? Do you want me to hold this patchset?
On 08/14/2017 02:50 AM, Minchan Kim wrote:
> So, you want to go with per-cpu bio recycling approach to
> remove rw_page?
>
> So, do you want me to hold this patchset?

I don't want to hold this series up, but I do think the recycling is a cleaner approach since we don't need to special case anything. I hope I'll get some time to dust it off, retest, and post soon.
On Mon, Aug 14, 2017 at 08:36:00AM -0600, Jens Axboe wrote:
> I don't want to hold this series up, but I do think the recycling is
> a cleaner approach since we don't need to special case anything. I
> hope I'll get some time to dust it off, retest, and post soon.

I don't know how your bio recycling works, but my worry when I first heard about per-cpu bio recycling is that if the pool is not reserved for BDI_CAP_SYNCHRONOUS devices (IOW, if it is shared by several storage devices), the bios can be consumed by a slow device (e.g., eMMC), so that a bio for the fastest device in the system (e.g., zram in an embedded system) can get stuck waiting on a bio until the IO for the slow device is completed.

I guess that would not be a rare case for a swap device under severe memory pressure, because lots of the page cache has already been reclaimed by the time anonymous pages start to be reclaimed, so many bios can be consumed by eMMC to fetch code while swap IO to fetch heap data gets stuck, even though zram-swap is much faster than eMMC. As well, time spent waiting for a bio even among the fastest devices is simply waste, I guess.

To me, the on-stack bio suggested by Christoph Hellwig doesn't diverge from the current path a lot and is simple enough to change.

Anyway, I'm okay with either way if we can remove rw_page without any regression, because maintaining both rw_page and make_request is rather a burden for zram, too.
On 08/14/2017 09:06 AM, Minchan Kim wrote:
> I don't know how your bio recycling works. But my worry when I heard
> per-cpu bio recycling firstly is if it's not reserved pool for
> BDI_CAP_SYNCHRONOUS(IOW, if it is shared by several storages),
> BIOs can be consumed by slow device(e.g., eMMC) so that a bio for
> fastest device(e.g., zram in embedded system) in the system can be
> stucked to wait on bio until IO for slow deivce is completed.
>
> I guess it would be a not rare case for swap device under severe
> memory pressure because lots of page cache are already reclaimed when
> anonymous page start to be reclaimed so that many BIOs can be consumed
> for eMMC to fetch code but swap IO to fetch heap data would be stucked
> although zram-swap is much faster than eMMC.
> As well, time to wait to get BIO among even fastest devices is
> simple waste, I guess.

I don't think that's a valid concern. First of all, for the recycling, it's not like you get to wait on someone else using a recycled bio; if it's not there, you simply go to the regular bio allocator. There is no waiting for a free. The idea is to make allocation faster, since we can avoid going to the memory allocator for most cases, and to speed up freeing as well, since we can do that in batches too.

Secondly, you generally don't have slow devices and fast devices intermingled when running workloads. That's the rare case.

> To me, bio suggested by Christoph Hellwig isn't diverge current
> path a lot and simple enough to change.

It doesn't diverge it a lot, but it does split it up.

> Anyway, I'm okay with either way if we can remove rw_page without
> any regression because the maintainance of both rw_page and
> make_request is rather burden for zram, too.

Agree, the ultimate goal of both is to eliminate the need for the rw_page hack.
On Mon, Aug 14, 2017 at 09:14:03AM -0600, Jens Axboe wrote:
> I don't think that's a valid concern. First of all, for the recycling,
> it's not like you get to wait on someone else using a recycled bio,
> if it's not there you simply go to the regular bio allocator. There
> is no waiting for free. The idea is to have allocation be faster since
> we can avoid going to the memory allocator for most cases, and speed
> up freeing as well, since we can do that in batches too.

I'm not sure how much it helps, because at the beginning of this thread[1], Ross said that even with dynamic bio allocation and without rw_page, there was no regression in some testing.

[1] http://lkml.kernel.org/r/<20170728165604.10455-1-ross.zwisler@linux.intel.com>

> Secondly, generally you don't have slow devices and fast devices
> intermingled when running workloads. That's the rare case.

Not true. zram is a really popular swap device for embedded systems, where some low-cost products have really poor, slow NAND compared to lz4/lzo [de]compression.

> > Anyway, I'm okay with either way if we can remove rw_page without
> > any regression because the maintainance of both rw_page and
> > make_request is rather burden for zram, too.
>
> Agree, the ultimate goal of both is to eliminate the need for the
> rw_page hack.

Yeb.
On 08/14/2017 09:31 AM, Minchan Kim wrote:
>> Secondly, generally you don't have slow devices and fast devices
>> intermingled when running workloads. That's the rare case.
>
> Not true. zRam is really popular swap for embedded devices where
> one of low cost product has a really poor slow nand compared to
> lz4/lzo [de]comression.

I guess that's true for some cases. But as I said earlier, the recycling really doesn't care about this at all. They can happily coexist, and not step on each other's toes.
On 08/14/2017 09:38 AM, Jens Axboe wrote:
> I guess that's true for some cases. But as I said earlier, the recycling
> really doesn't care about this at all. They can happily coexist, and not
> step on each others toes.

Dusted it off; the result is here against -rc5:

http://git.kernel.dk/cgit/linux-block/log/?h=cpu-alloc-cache

I'd like to split the number of units we cache and the number of units we free; right now they are both CPU_ALLOC_CACHE_SIZE. This means that once we hit that count, we free all of them and then store the one we were asked to free. That always keeps 1 local, but maybe it'd make more sense to free just CPU_ALLOC_CACHE_SIZE/2 (or something like that), so that we retain more than 1 per cpu in case an app is preempted while sleeping for IO and the new task on that CPU then issues IO as well. Probably minor.

Ran a quick test on nullb0 with 32 sync readers. The test was O_DIRECT on the block device, so I disabled the __blkdev_direct_IO_simple() bypass. With the above branch, we get ~18.0M IOPS, and without we get ~14M IOPS. Both ran with iostats disabled, to avoid any interference from that.
Hi Jens,

On Mon, Aug 14, 2017 at 10:17:09AM -0600, Jens Axboe wrote:
> Dusted it off, result is here against -rc5:
>
> http://git.kernel.dk/cgit/linux-block/log/?h=cpu-alloc-cache
>
> Ran a quick test on nullb0 with 32 sync readers. The test was O_DIRECT
> on the block device, so I disabled the __blkdev_direct_IO_simple()
> bypass. With the above branch, we get ~18.0M IOPS, and without we get
> ~14M IOPS. Both ran with iostats disabled, to avoid any interference
> from that.

Looks promising. If the bio recycling works well enough, I think we don't need to introduce a new split in the path for the on-stack bio. I will test your version on zram-swap!

Thanks.
On 08/15/2017 10:48 PM, Minchan Kim wrote:
> Looks promising.
> If recycling bio works well enough, I think we don't need to introduce
> new split in the path for on-stack bio.
> I will test your version on zram-swap!

Thanks, let me know how it goes. It's quite possible that we'll need a few further tweaks, but at least the basis should be there.
Hi Jens,

On Wed, Aug 16, 2017 at 09:56:12AM -0600, Jens Axboe wrote:
> Thanks, let me know how it goes. It's quite possible that we'll need
> a few further tweaks, but at least the basis should be there.

Sorry for my late reply. I just finished the swap-in testing with zram-swap, which is critical for latency.

For the testing, I made a memcg and put $NR_CPU (mine is 12) processes in it, each consuming 1G, so the total is 12G, while my system has 16GB of memory, so there was no global reclaim. Then, "echo 1 > /mnt/memcg/group/force.empty" to swap all pages out; the programs then wait for my signal, and I signal every process to swap in every page and measure the elapsed time for the swap-in. The value is the average usec elapsed to swap in 1G of pages per process; I repeated it 10 times and the stddev is very stable.

swapin:

base (with rw_page)   1100806.73 (100.00%)
no-rw_page            1146856.95 (104.18%)
Jens's pcp            1146910.00 (104.19%)
onstack-bio           1114872.18 (101.28%)

In my test, there is no difference between dynamic bio allocation (i.e., no-rw_page) and the pcp approach, but onstack-bio is much faster, almost the same as rw_page.

The swapout test measures the elapsed time for "echo 1 > /mnt/memcg/test_group/force.empty", so the unit is seconds.

swapout:

base (with rw_page)   7.72 (100.00%)
no-rw_page            8.36 (108.29%)
Jens's pcp            8.31 (107.64%)
onstack-bio           8.19 (106.09%)

rw_page's swapout is 6% or more faster than the others.

I tried pmbench with no memcg to see the performance under global reclaim. I also ran a background IO job reading data from HDD. The value is the average usec elapsed per page access, so smaller is better.

base (with rw_page)   14.42 (100.00%)
no-rw_page            15.66 (108.60%)
Jens's pcp            15.81 (109.64%)
onstack-bio           15.42 (106.93%)

It's similar to the swapout test in memcg. 6% or more is not trivial, so I doubt we can remove rw_page at this moment. :( I will look into the details with perf. If you have further optimizations or suggestions, feel free to share them; I am happy to test them.

Thanks.
diff --git a/fs/mpage.c b/fs/mpage.c
index baff8f820c29..6db6bf5131ed 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -157,6 +157,8 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 	unsigned page_block;
 	unsigned first_hole = blocks_per_page;
 	struct block_device *bdev = NULL;
+	struct bio sbio;
+	struct bio_vec sbvec;
 	int length;
 	int fully_mapped = 1;
 	unsigned nblocks;
@@ -281,10 +283,17 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 					page))
 				goto out;
 		}
-		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
+		if (bdi_cap_synchronous_io(inode_to_bdi(inode))) {
+			bio = &sbio;
+			bio_init(bio, &sbvec, nr_pages);
+			sbio.bi_bdev = bdev;
+			sbio.bi_iter.bi_sector = blocks[0] << (blkbits - 9);
+		} else {
+			bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
 				min_t(int, nr_pages, BIO_MAX_PAGES), gfp);
-		if (bio == NULL)
-			goto confused;
+			if (bio == NULL)
+				goto confused;
+		}
 	}
 
 	length = first_hole << blkbits;
@@ -301,6 +310,8 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 	else
 		*last_block_in_bio = blocks[blocks_per_page - 1];
 out:
+	if (bio == &sbio)
+		bio = mpage_bio_submit(REQ_OP_READ, 0, bio);
 	return bio;
 
 confused: