Message ID: 20200113153746.26654-1-willy@infradead.org
Series: Replacing the readpages a_op
On 13 Jan 2020, at 10:37, Matthew Wilcox wrote:

> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
>
> I think everybody hates the readpages API.  The fundamental problem
> with it is that it passes the pages to be read on a doubly linked
> list, using the ->lru list in the struct page.  That means the
> filesystems have to do the work of calling
> add_to_page_cache{,_lru,_locked}, and handling failures (because
> another task is also accessing that chunk of the file, and so it
> fails).
>
> This is an attempt to add a ->readahead op to replace ->readpages.
> I've converted two users, iomap/xfs and cifs.  The cifs conversion
> is lacking fscache support, and that's just because I didn't want to
> do that work; I don't believe there's anything fundamental to it.
> But I wanted to do iomap because it is The Infrastructure Of The
> Future and cifs because it is the sole remaining user of
> add_to_page_cache_locked(), which enables the last two patches in
> the series.  By the way, that gives CIFS access to the workingset
> shadow infrastructure, which it had to ignore before because it
> couldn't put pages onto the lru list at the right time.

I've always kind of liked the compromise of sending the lists.  It's
really good at the common case and doesn't have massive problems when
things break down.  Just glancing through the patches, the old
readpages is called in bigger chunks, so for massive reads we can do
more effective readahead on metadata.  I don't think any of us
actually do, but we could.

With this new operation, our window is constant, and much smaller.

> The fundamental question is, how do we indicate to the
> implementation of ->readahead what pages to operate on?  I've gone
> with passing a pagevec.  This has the obvious advantage that it's a
> data structure that already exists and is used within filemap for
> batches of pages.  I had to add a bit of new infrastructure to
> support iterating over the pages in the pagevec, but with that done,
> it's quite nice.
>
> I think the biggest problem is that the size of the pagevec is
> limited to 15 pages (60kB).  So that'll mean that if the readahead
> window bumps all the way up to 256kB, we may end up making 5 BIOs
> (and merging them) instead of one.  I'd kind of like to be able to
> allocate variable length pagevecs while allowing regular pagevecs to
> be allocated on the stack, but I can't figure out a way to do that.
> eg this doesn't work:
>
> -	struct page *pages[PAGEVEC_SIZE];
> +	union {
> +		struct page *pages[PAGEVEC_SIZE];
> +		struct page *_pages[];
> +	}
>
> and if we just allocate them, useful and wonderful tools are going
> to point out when pages[16] is accessed that we've overstepped the
> end of the array.
>
> I have considered alternatives to the pagevec like just having the
> ->readahead implementation look up the pages in the i_pages XArray
> directly.  That didn't work out too well.

Btrfs basically does this now, honestly iomap isn't that far away.
Given how sensible iomap is for this, I'd rather see us pile into that
abstraction than try to pass pagevecs for large ranges.  Otherwise, if
the lists are awkward we can make some helpers to make it less error
prone?

-chris
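For readers following along: the pagevec under discussion is a small
fixed-size batch of page pointers.  Below is a user-space sketch of
that batch-plus-iterator shape (stand-in types and a hypothetical
iteration helper, not the kernel's actual pagevec code):

```c
/* User-space model (not kernel code) of the pagevec idea in the cover
 * letter: a small fixed-size batch of page pointers plus an iteration
 * helper.  "struct page" is a stand-in; PAGEVEC_SIZE matches the
 * 15-entry limit mentioned above. */
#include <assert.h>
#include <stddef.h>

#define PAGEVEC_SIZE 15

struct page { unsigned long index; };

struct pagevec {
	unsigned char nr;		/* pages currently in the batch */
	struct page *pages[PAGEVEC_SIZE];
};

/* Add a page; returns the number of free slots remaining, so the
 * caller knows when the batch is full and must be processed. */
static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page)
{
	pvec->pages[pvec->nr++] = page;
	return PAGEVEC_SIZE - pvec->nr;
}

/* The kind of iteration helper the cover letter says had to be added:
 * visit each page in the batch in order. */
#define pagevec_for_each(pvec, i, page) \
	for ((i) = 0; (i) < (pvec)->nr && ((page) = (pvec)->pages[(i)]); (i)++)

/* Sum of indices, as a stand-in for "issue reads for these pages". */
static unsigned long process_batch(struct pagevec *pvec)
{
	unsigned i;
	struct page *page;
	unsigned long sum = 0;

	pagevec_for_each(pvec, i, page)
		sum += page->index;
	pvec->nr = 0;			/* batch consumed */
	return sum;
}
```

The 15-entry limit criticized in the thread falls directly out of the
fixed array: once it fills, the caller has to flush and start a new
batch, which is where the multiple-BIOs-per-readahead-window concern
comes from.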
On Mon, Jan 13, 2020 at 04:42:10PM +0000, Chris Mason wrote:
> On 13 Jan 2020, at 10:37, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > I think everybody hates the readpages API.  The fundamental
> > problem with it is that it passes the pages to be read on a doubly
> > linked list, using the ->lru list in the struct page.  That means
> > the filesystems have to do the work of calling
> > add_to_page_cache{,_lru,_locked}, and handling failures (because
> > another task is also accessing that chunk of the file, and so it
> > fails).
>
> I've always kind of liked the compromise of sending the lists.  It's
> really good at the common case and doesn't have massive problems
> when things break down.

I think we'll have to disagree on that point.  Linked lists are awful
for the CPU in the common case, and the error handling code for
"things break down" is painful.  I'm pretty sure I spotted three bugs
in the CIFS implementation.

> Just glancing through the patches, the old readpages is called in
> bigger chunks, so for massive reads we can do more effective
> readahead on metadata.  I don't think any of us actually do, but we
> could.
>
> With this new operation, our window is constant, and much smaller.
>
> > The fundamental question is, how do we indicate to the
> > implementation of ->readahead what pages to operate on?  I've gone
> > with passing a pagevec.  This has the obvious advantage that it's
> > a data structure that already exists and is used within filemap
> > for batches of pages.  I had to add a bit of new infrastructure to
> > support iterating over the pages in the pagevec, but with that
> > done, it's quite nice.
> >
> > I think the biggest problem is that the size of the pagevec is
> > limited to 15 pages (60kB).  So that'll mean that if the readahead
> > window bumps all the way up to 256kB, we may end up making 5 BIOs
> > (and merging them) instead of one.  I'd kind of like to be able to
> > allocate variable length pagevecs while allowing regular pagevecs
> > to be allocated on the stack, but I can't figure out a way to do
> > that.  eg this doesn't work:
> >
> > -	struct page *pages[PAGEVEC_SIZE];
> > +	union {
> > +		struct page *pages[PAGEVEC_SIZE];
> > +		struct page *_pages[];
> > +	}
> >
> > and if we just allocate them, useful and wonderful tools are going
> > to point out when pages[16] is accessed that we've overstepped the
> > end of the array.
> >
> > I have considered alternatives to the pagevec like just having the
> > ->readahead implementation look up the pages in the i_pages XArray
> > directly.  That didn't work out too well.
>
> Btrfs basically does this now, honestly iomap isn't that far away.
> Given how sensible iomap is for this, I'd rather see us pile into
> that abstraction than try to pass pagevecs for large ranges.
> Otherwise, if the lists are awkward we can make some helpers to make
> it less error prone?

I did do a couple of helpers for lists for iomap before deciding the
whole thing was too painful.  I didn't look at btrfs until just now,
but, um ...

    int extent_readpages(struct address_space *mapping,
                         struct list_head *pages, unsigned nr_pages)
    ..
        struct page *pagepool[16];
    ..
        while (!list_empty(pages)) {
    ..
            list_del(&page->lru);
            if (add_to_page_cache_lru(page, mapping, page->index,
    ..
            pagepool[nr++] = page;

you're basically doing exactly what i'm proposing to be the new
interface!  OK, you get one extra page per batch ;-P
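The extent_readpages pattern quoted above -- drain the linked list
into a small fixed pool and work a batch at a time -- can be modelled
in user space like this (simplified stand-in list and page types, not
btrfs's real code):

```c
/* User-space sketch of the batching pattern in btrfs's
 * extent_readpages as quoted above: pull pages off a linked list into
 * a 16-slot "pagepool" array and process it one full batch at a time.
 * The singly linked list stands in for the ->lru list linkage. */
#include <assert.h>
#include <stddef.h>

struct page {
	struct page *next;	/* stand-in for the ->lru linkage */
	unsigned long index;
};

/* Process one batch; here we just accumulate indices as a stand-in
 * for building and submitting a bio for the batch. */
static void process_pool(struct page **pool, int nr, unsigned long *total)
{
	for (int i = 0; i < nr; i++)
		*total += pool[i]->index;
}

static unsigned long drain_pages(struct page *head)
{
	struct page *pagepool[16];
	unsigned long total = 0;
	int nr = 0;

	while (head) {
		struct page *page = head;

		head = page->next;	/* list_del() equivalent */
		pagepool[nr++] = page;
		if (nr == 16) {		/* pool full: flush the batch */
			process_pool(pagepool, nr, &total);
			nr = 0;
		}
	}
	process_pool(pagepool, nr, &total);	/* flush the remainder */
	return total;
}
```

This is the sense in which the list-based callers already behave like
the proposed batched interface: the list is only ever consumed a
small, fixed-size array at a time.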
On Mon, Jan 13, 2020 at 04:42:10PM +0000, Chris Mason wrote:
> Btrfs basically does this now, honestly iomap isn't that far away.
> Given how sensible iomap is for this, I'd rather see us pile into
> that abstraction than try to pass pagevecs for large ranges.
> Otherwise, if

I completely misread this at first and thought you were proposing we
pass a bio_vec to ->readahead.  Initially, this is a layering
violation, completely unnecessary to have all these extra offset/size
fields being allocated and passed around.  But ... the bio_vec and the
skb_frag_t are now the same data structure, so both block and network
use it.  It may make sense to have this as the common data structure
for 'unit of IO'.  The bio supports having the bi_vec allocated
externally to the data structure while the skbuff would need to copy
the array.

Maybe we need a more neutral name than bio_vec so as to not upset
people.  page_frag, perhaps [1].

[1] Yes, I know about the one in include/linux/mm_types_task.h
On 13 Jan 2020, at 12:40, Matthew Wilcox wrote:

> On Mon, Jan 13, 2020 at 04:42:10PM +0000, Chris Mason wrote:
>
> I did do a couple of helpers for lists for iomap before deciding the
> whole thing was too painful.  I didn't look at btrfs until just now,
> but, um ...
>
>     int extent_readpages(struct address_space *mapping,
>                          struct list_head *pages, unsigned nr_pages)
>     ..
>         struct page *pagepool[16];
>     ..
>         while (!list_empty(pages)) {
>     ..
>             list_del(&page->lru);
>             if (add_to_page_cache_lru(page, mapping, page->index,
>     ..
>             pagepool[nr++] = page;
>
> you're basically doing exactly what i'm proposing to be the new
> interface!  OK, you get one extra page per batch ;-P

This is true, I didn't explain that part well ;)  Depending on
compression etc we might end up poking the xarray inside the actual IO
functions, but the main difference is that btrfs is building a single
bio.  You're moving the plug so you'll merge into single bio, but I'd
rather build 2MB bios than merge them.

I guess it doesn't feel like enough of a win to justify the churn.  If
we find a way to do much larger pagevecs, I think this makes more
sense.

-chris
On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
> This is true, I didn't explain that part well ;)  Depending on
> compression etc we might end up poking the xarray inside the actual
> IO functions, but the main difference is that btrfs is building a
> single bio.  You're moving the plug so you'll merge into single bio,
> but I'd rather build 2MB bios than merge them.

Why don't we store a bio pointer inside the plug?  You're opencoding
that, iomap is opencoding that, and I bet there's a dozen other places
where we pass a bio around.  Then blk_finish_plug can submit the bio.
On 1/13/20 2:58 PM, Matthew Wilcox wrote:
> On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
>> This is true, I didn't explain that part well ;)  Depending on
>> compression etc we might end up poking the xarray inside the actual
>> IO functions, but the main difference is that btrfs is building a
>> single bio.  You're moving the plug so you'll merge into single
>> bio, but I'd rather build 2MB bios than merge them.
>
> Why don't we store a bio pointer inside the plug?  You're opencoding
> that, iomap is opencoding that, and I bet there's a dozen other
> places where we pass a bio around.  Then blk_finish_plug can submit
> the bio.

Plugs aren't necessarily a bio, they can be callbacks too.
On Mon, Jan 13, 2020 at 03:00:40PM -0700, Jens Axboe wrote:
> On 1/13/20 2:58 PM, Matthew Wilcox wrote:
> > On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
> >> This is true, I didn't explain that part well ;)  Depending on
> >> compression etc we might end up poking the xarray inside the
> >> actual IO functions, but the main difference is that btrfs is
> >> building a single bio.  You're moving the plug so you'll merge
> >> into single bio, but I'd rather build 2MB bios than merge them.
> >
> > Why don't we store a bio pointer inside the plug?  You're
> > opencoding that, iomap is opencoding that, and I bet there's a
> > dozen other places where we pass a bio around.  Then
> > blk_finish_plug can submit the bio.
>
> Plugs aren't necessarily a bio, they can be callbacks too.

I'm thinking something as simple as this:

@@ -1711,6 +1711,7 @@ void blk_start_plug(struct blk_plug *plug)

 	INIT_LIST_HEAD(&plug->mq_list);
 	INIT_LIST_HEAD(&plug->cb_list);
+	plug->bio = NULL;
 	plug->rq_count = 0;
 	plug->multiple_queues = false;

@@ -1786,6 +1787,8 @@ void blk_finish_plug(struct blk_plug *plug)
 {
 	if (plug != current->plug)
 		return;
+	if (plug->bio)
+		submit_bio(plug->bio);
 	blk_flush_plug_list(plug, false);

 	current->plug = NULL;

@@ -1160,6 +1160,7 @@ extern void blk_set_queue_dying(struct request_queue *);
 struct blk_plug {
 	struct list_head mq_list; /* blk-mq requests */
 	struct list_head cb_list; /* md requires an unplug callback */
+	struct bio *bio;
 	unsigned short rq_count;
 	bool multiple_queues;
 };

with accessors to 'get_current_bio()' and 'set_current_bio()'.
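As a rough user-space model of the sketch above (toy types and
signatures -- the real blk_start_plug/blk_finish_plug and submit_bio
look different), the idea is that the plug holds one pending bio,
merges little I/Os into it, and submits it when the plug is finished:

```c
/* Toy model of "store a bio pointer inside the plug": the plug holds
 * one pending bio, small additions merge into it, and finishing the
 * plug submits whatever is held.  Not the real block layer. */
#include <assert.h>
#include <stddef.h>

struct bio { int sectors; };

struct plug {
	struct bio *bio;	/* pending bio, still owned by the builder */
	int submitted;		/* count of submissions, for the model */
};

static void submit_bio(struct plug *plug, struct bio *bio)
{
	(void)bio;		/* a real implementation would issue it */
	plug->submitted++;
}

static void start_plug(struct plug *plug)
{
	plug->bio = NULL;
	plug->submitted = 0;
}

/* Merge into the held bio if one exists -- the "do lots of little
 * IOs and have them combined" behaviour -- else start holding it. */
static void plug_add_bio(struct plug *plug, struct bio *bio)
{
	if (plug->bio) {
		plug->bio->sectors += bio->sectors;	/* pretend they merge */
		return;
	}
	plug->bio = bio;
}

static void finish_plug(struct plug *plug)
{
	if (plug->bio)
		submit_bio(plug, plug->bio);
	plug->bio = NULL;
}
```

The objections that follow in the thread are about exactly the parts
this toy ignores: what happens to the held bio on preemption, and how
it composes with nested plugs.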
On 1/13/20 3:10 PM, Matthew Wilcox wrote:
> On Mon, Jan 13, 2020 at 03:00:40PM -0700, Jens Axboe wrote:
>> On 1/13/20 2:58 PM, Matthew Wilcox wrote:
>>> On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
>>>> This is true, I didn't explain that part well ;)  Depending on
>>>> compression etc we might end up poking the xarray inside the
>>>> actual IO functions, but the main difference is that btrfs is
>>>> building a single bio.  You're moving the plug so you'll merge
>>>> into single bio, but I'd rather build 2MB bios than merge them.
>>>
>>> Why don't we store a bio pointer inside the plug?  You're
>>> opencoding that, iomap is opencoding that, and I bet there's a
>>> dozen other places where we pass a bio around.  Then
>>> blk_finish_plug can submit the bio.
>>
>> Plugs aren't necessarily a bio, they can be callbacks too.
>
> I'm thinking something as simple as this:
>
> @@ -1711,6 +1711,7 @@ void blk_start_plug(struct blk_plug *plug)
>
>  	INIT_LIST_HEAD(&plug->mq_list);
>  	INIT_LIST_HEAD(&plug->cb_list);
> +	plug->bio = NULL;
>  	plug->rq_count = 0;
>  	plug->multiple_queues = false;
>
> @@ -1786,6 +1787,8 @@ void blk_finish_plug(struct blk_plug *plug)
>  {
>  	if (plug != current->plug)
>  		return;
> +	if (plug->bio)
> +		submit_bio(plug->bio);
>  	blk_flush_plug_list(plug, false);
>
>  	current->plug = NULL;
> @@ -1160,6 +1160,7 @@ extern void blk_set_queue_dying(struct request_queue *);
>  struct blk_plug {
>  	struct list_head mq_list; /* blk-mq requests */
>  	struct list_head cb_list; /* md requires an unplug callback */
> +	struct bio *bio;
>  	unsigned short rq_count;
>  	bool multiple_queues;
>  };
>
> with accessors to 'get_current_bio()' and 'set_current_bio()'.

It's a little odd imho, the plugging generally collects requests.
Sounds like what you're looking for is some plug owner private data,
which just happens to be a bio in this case?

Is this over repeated calls to some IO generating helper?  Would it be
more efficient if that helper could generate the full bio in one go,
instead of piecemeal?
On Mon, Jan 13, 2020 at 10:54 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jan 13, 2020 at 04:42:10PM +0000, Chris Mason wrote:
> > Btrfs basically does this now, honestly iomap isn't that far away.
> > Given how sensible iomap is for this, I'd rather see us pile into
> > that abstraction than try to pass pagevecs for large ranges.
> > Otherwise, if
>
> I completely misread this at first and thought you were proposing we
> pass a bio_vec to ->readahead.  Initially, this is a layering
> violation, completely unnecessary to have all these extra
> offset/size fields being allocated and passed around.  But ... the
> bio_vec and the skb_frag_t are now the same data structure, so both
> block and network use it.  It may make sense to have this as the
> common data structure for 'unit of IO'.  The bio supports having the
> bi_vec allocated externally to the data structure while the skbuff
> would need to copy the array.
>
> Maybe we need a more neutral name than bio_vec so as to not upset
> people.  page_frag, perhaps [1].
>
> [1] Yes, I know about the one in include/linux/mm_types_task.h

Note that bio_vecs support page merging, so page fragment isn't very
descriptive.  Not sure what a good name would be, but fragment isn't
it.  But any shared type would imply that others should support that
as well if you're passing it around (or by pointer).
On Mon, Jan 13, 2020 at 03:14:26PM -0700, Jens Axboe wrote:
> On 1/13/20 3:10 PM, Matthew Wilcox wrote:
> > On Mon, Jan 13, 2020 at 03:00:40PM -0700, Jens Axboe wrote:
> >> On 1/13/20 2:58 PM, Matthew Wilcox wrote:
> >>> On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
> >>>> This is true, I didn't explain that part well ;)  Depending on
> >>>> compression etc we might end up poking the xarray inside the
> >>>> actual IO functions, but the main difference is that btrfs is
> >>>> building a single bio.  You're moving the plug so you'll merge
> >>>> into single bio, but I'd rather build 2MB bios than merge them.
> >>>
> >>> Why don't we store a bio pointer inside the plug?  You're
> >>> opencoding that, iomap is opencoding that, and I bet there's a
> >>> dozen other places where we pass a bio around.  Then
> >>> blk_finish_plug can submit the bio.
> >>
> >> Plugs aren't necessarily a bio, they can be callbacks too.
> >
> > I'm thinking something as simple as this:
>
> It's a little odd imho, the plugging generally collect requests.
> Sounds what you're looking for is some plug owner private data,
> which just happens to be a bio in this case?
>
> Is this over repeated calls to some IO generating helper?  Would it
> be more efficient if that helper could generate the full bio in one
> go, instead of piecemeal?

The issue is around ->readpages.  Take a look at how iomap_readpages
works, for example.  We're under a plug (taken in mm/readahead.c), but
we still go through the rigamarole of keeping a pointer to the bio in
ctx.bio and passing ctx around so that we don't end up with many
fragments which have to be recombined into a single bio at the end.

I think what I want is a bio I can reach from current, somehow.  And
the plug feels like a natural place to keep it because it's basically
saying "I want to do lots of little IOs and have them combined".  The
fact that the iomap code has a bio that it precombines fragments into
suggests to me that the existing antifragmentation code in the
plugging mechanism isn't good enough.  So let's make it better by
storing a bio in the plug and then we can get rid of the bio in the
iomap code.
On 1/13/20 3:27 PM, Matthew Wilcox wrote:
> On Mon, Jan 13, 2020 at 03:14:26PM -0700, Jens Axboe wrote:
>> On 1/13/20 3:10 PM, Matthew Wilcox wrote:
>>> On Mon, Jan 13, 2020 at 03:00:40PM -0700, Jens Axboe wrote:
>>>> On 1/13/20 2:58 PM, Matthew Wilcox wrote:
>>>>> On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
>>>>>> This is true, I didn't explain that part well ;)  Depending on
>>>>>> compression etc we might end up poking the xarray inside the
>>>>>> actual IO functions, but the main difference is that btrfs is
>>>>>> building a single bio.  You're moving the plug so you'll merge
>>>>>> into single bio, but I'd rather build 2MB bios than merge them.
>>>>>
>>>>> Why don't we store a bio pointer inside the plug?  You're
>>>>> opencoding that, iomap is opencoding that, and I bet there's a
>>>>> dozen other places where we pass a bio around.  Then
>>>>> blk_finish_plug can submit the bio.
>>>>
>>>> Plugs aren't necessarily a bio, they can be callbacks too.
>>>
>>> I'm thinking something as simple as this:
>>
>> It's a little odd imho, the plugging generally collect requests.
>> Sounds what you're looking for is some plug owner private data,
>> which just happens to be a bio in this case?
>>
>> Is this over repeated calls to some IO generating helper?  Would it
>> be more efficient if that helper could generate the full bio in one
>> go, instead of piecemeal?
>
> The issue is around ->readpages.  Take a look at how iomap_readpages
> works, for example.  We're under a plug (taken in mm/readahead.c),
> but we still go through the rigamarole of keeping a pointer to the
> bio in ctx.bio and passing ctx around so that we don't end up with
> many fragments which have to be recombined into a single bio at the
> end.
>
> I think what I want is a bio I can reach from current, somehow.  And
> the plug feels like a natural place to keep it because it's
> basically saying "I want to do lots of little IOs and have them
> combined".  The fact that the iomap code has a bio that it
> precombines fragments into suggests to me that the existing
> antifragmentation code in the plugging mechanism isn't good enough.
> So let's make it better by storing a bio in the plug and then we can
> get rid of the bio in the iomap code.

But it doesn't fit the plug model very well.  If you get preempted, it
could go away.  Or if you have nested plugs, it also won't work at
all.  I think it's much saner to keep the "current" bio _outside_ of
the plug, and work on it in peace until you want to submit it.
On 13 Jan 2020, at 17:27, Matthew Wilcox wrote:

> On Mon, Jan 13, 2020 at 03:14:26PM -0700, Jens Axboe wrote:
>> On 1/13/20 3:10 PM, Matthew Wilcox wrote:
>>> On Mon, Jan 13, 2020 at 03:00:40PM -0700, Jens Axboe wrote:
>>>> On 1/13/20 2:58 PM, Matthew Wilcox wrote:
>>>>> On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
>>>>>> This is true, I didn't explain that part well ;)  Depending on
>>>>>> compression etc we might end up poking the xarray inside the
>>>>>> actual IO functions, but the main difference is that btrfs is
>>>>>> building a single bio.  You're moving the plug so you'll merge
>>>>>> into single bio, but I'd rather build 2MB bios than merge them.
>>>>>
>>>>> Why don't we store a bio pointer inside the plug?  You're
>>>>> opencoding that, iomap is opencoding that, and I bet there's a
>>>>> dozen other places where we pass a bio around.  Then
>>>>> blk_finish_plug can submit the bio.
>>>>
>>>> Plugs aren't necessarily a bio, they can be callbacks too.
>>>
>>> I'm thinking something as simple as this:
>>
>> It's a little odd imho, the plugging generally collect requests.
>> Sounds what you're looking for is some plug owner private data,
>> which just happens to be a bio in this case?
>>
>> Is this over repeated calls to some IO generating helper?  Would it
>> be more efficient if that helper could generate the full bio in one
>> go, instead of piecemeal?
>
> The issue is around ->readpages.  Take a look at how iomap_readpages
> works, for example.  We're under a plug (taken in mm/readahead.c),
> but we still go through the rigamarole of keeping a pointer to the
> bio in ctx.bio and passing ctx around so that we don't end up with
> many fragments which have to be recombined into a single bio at the
> end.
>
> I think what I want is a bio I can reach from current, somehow.  And
> the plug feels like a natural place to keep it because it's
> basically saying "I want to do lots of little IOs and have them
> combined".  The fact that the iomap code has a bio that it
> precombines fragments into suggests to me that the existing
> antifragmentation code in the plugging mechanism isn't good enough.
> So let's make it better by storing a bio in the plug and then we can
> get rid of the bio in the iomap code.

Both btrfs and xfs do this, we have a bio that we pass around and
build and submit.  We both also do some gymnastics in writepages to
avoid waiting for the bios we've been building to finish while we're
building them.

I love the idea of the plug api having a way to hold that for us, but
sometimes we really are building the bios, and we don't want the plug
to let it go if we happen to schedule.

-chris
On Mon, Jan 13, 2020 at 10:34:16PM +0000, Chris Mason wrote:
> On 13 Jan 2020, at 17:27, Matthew Wilcox wrote:
> > On Mon, Jan 13, 2020 at 03:14:26PM -0700, Jens Axboe wrote:
> >> On 1/13/20 3:10 PM, Matthew Wilcox wrote:
> >>> On Mon, Jan 13, 2020 at 03:00:40PM -0700, Jens Axboe wrote:
> >>>> On 1/13/20 2:58 PM, Matthew Wilcox wrote:
> >>>>> Why don't we store a bio pointer inside the plug?  You're
> >>>>> opencoding that, iomap is opencoding that, and I bet there's a
> >>>>> dozen other places where we pass a bio around.  Then
> >>>>> blk_finish_plug can submit the bio.
> >>>>
> >>>> Plugs aren't necessarily a bio, they can be callbacks too.
> >>
> >> It's a little odd imho, the plugging generally collect requests.
> >> Sounds what you're looking for is some plug owner private data,
> >> which just happens to be a bio in this case?
> >>
> >> Is this over repeated calls to some IO generating helper?  Would
> >> it be more efficient if that helper could generate the full bio
> >> in one go, instead of piecemeal?
> >
> > The issue is around ->readpages.  Take a look at how
> > iomap_readpages works, for example.  We're under a plug (taken in
> > mm/readahead.c), but we still go through the rigamarole of keeping
> > a pointer to the bio in ctx.bio and passing ctx around so that we
> > don't end up with many fragments which have to be recombined into
> > a single bio at the end.
> >
> > I think what I want is a bio I can reach from current, somehow.
> > And the plug feels like a natural place to keep it because it's
> > basically saying "I want to do lots of little IOs and have them
> > combined".  The fact that the iomap code has a bio that it
> > precombines fragments into suggests to me that the existing
> > antifragmentation code in the plugging mechanism isn't good
> > enough.  So let's make it better by storing a bio in the plug and
> > then we can get rid of the bio in the iomap code.
>
> Both btrfs and xfs do this, we have a bio that we pass around and
> build and submit.  We both also do some gymnastics in writepages to
> avoid waiting for the bios we've been building to finish while we're
> building them.
>
> I love the idea of the plug api having a way to hold that for us,
> but sometimes we really are building the bios, and we don't want the
> plug to let it go if we happen to schedule.

The plug wouldn't have to let the bio go.  I appreciate the plug does
let requests go on context switch, but it wouldn't have to let the bio
go.  This bio is being stored on the stack, just as now, so it's still
there.
On 13 Jan 2020, at 20:01, Matthew Wilcox wrote:

> On Mon, Jan 13, 2020 at 10:34:16PM +0000, Chris Mason wrote:
>>> I think what I want is a bio I can reach from current, somehow.
>>> And the plug feels like a natural place to keep it because it's
>>> basically saying "I want to do lots of little IOs and have them
>>> combined".  The fact that the iomap code has a bio that it
>>> precombines fragments into suggests to me that the existing
>>> antifragmentation code in the plugging mechanism isn't good
>>> enough.  So let's make it better by storing a bio in the plug and
>>> then we can get rid of the bio in the iomap code.
>>
>> Both btrfs and xfs do this, we have a bio that we pass around and
>> build and submit.  We both also do some gymnastics in writepages to
>> avoid waiting for the bios we've been building to finish while
>> we're building them.
>>
>> I love the idea of the plug api having a way to hold that for us,
>> but sometimes we really are building the bios, and we don't want
>> the plug to let it go if we happen to schedule.
>
> The plug wouldn't have to let the bio go.  I appreciate the plug
> does let requests go on context switch, but it wouldn't have to let
> the bio go.  This bio is being stored on the stack, just as now, so
> it's still there.

Plugging is great because it's building up state for the block layer
across a wide variety of MM and FS code.  btrfs and xfs are building
state for themselves across a tiny cross section of their own code.
I'd rather not complicate magic state in current in places we can
easily pass the bio down.

-chris
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

I think everybody hates the readpages API.  The fundamental problem
with it is that it passes the pages to be read on a doubly linked
list, using the ->lru list in the struct page.  That means the
filesystems have to do the work of calling
add_to_page_cache{,_lru,_locked}, and handling failures (because
another task is also accessing that chunk of the file, and so it
fails).

This is an attempt to add a ->readahead op to replace ->readpages.
I've converted two users, iomap/xfs and cifs.  The cifs conversion is
lacking fscache support, and that's just because I didn't want to do
that work; I don't believe there's anything fundamental to it.  But I
wanted to do iomap because it is The Infrastructure Of The Future and
cifs because it is the sole remaining user of
add_to_page_cache_locked(), which enables the last two patches in the
series.  By the way, that gives CIFS access to the workingset shadow
infrastructure, which it had to ignore before because it couldn't put
pages onto the lru list at the right time.

The fundamental question is, how do we indicate to the implementation
of ->readahead what pages to operate on?  I've gone with passing a
pagevec.  This has the obvious advantage that it's a data structure
that already exists and is used within filemap for batches of pages.
I had to add a bit of new infrastructure to support iterating over the
pages in the pagevec, but with that done, it's quite nice.

I think the biggest problem is that the size of the pagevec is limited
to 15 pages (60kB).  So that'll mean that if the readahead window
bumps all the way up to 256kB, we may end up making 5 BIOs (and
merging them) instead of one.  I'd kind of like to be able to allocate
variable length pagevecs while allowing regular pagevecs to be
allocated on the stack, but I can't figure out a way to do that.
eg this doesn't work:

-	struct page *pages[PAGEVEC_SIZE];
+	union {
+		struct page *pages[PAGEVEC_SIZE];
+		struct page *_pages[];
+	}

and if we just allocate them, useful and wonderful tools are going to
point out when pages[16] is accessed that we've overstepped the end of
the array.

I have considered alternatives to the pagevec like just having the
->readahead implementation look up the pages in the i_pages XArray
directly.  That didn't work out too well.

Anyway, I want to draw your attention to the diffstat below.  Net 91
lines deleted, and that's with adding all the infrastructure for
->readahead and getting rid of none of the infrastructure for
->readpages.  There's probably a good couple of hundred lines of code
to be deleted there.

Matthew Wilcox (Oracle) (8):
  pagevec: Add an iterator
  mm: Fix the return type of __do_page_cache_readahead
  mm: Use a pagevec for readahead
  mm/fs: Add a_ops->readahead
  iomap,xfs: Convert from readpages to readahead
  cifs: Convert from readpages to readahead
  mm: Remove add_to_page_cache_locked
  mm: Unify all add_to_page_cache variants

 Documentation/filesystems/locking.rst |   8 +-
 Documentation/filesystems/vfs.rst     |   9 ++
 fs/cifs/file.c                        | 125 ++++------------------
 fs/iomap/buffered-io.c                |  60 +++--------
 fs/iomap/trace.h                      |  18 ++--
 fs/xfs/xfs_aops.c                     |  12 +--
 include/linux/fs.h                    |   3 +
 include/linux/iomap.h                 |   4 +-
 include/linux/pagemap.h               |  23 +----
 include/linux/pagevec.h               |  20 ++++
 mm/filemap.c                          |  72 ++++---------
 mm/internal.h                         |   2 +-
 mm/readahead.c                        | 141 +++++++++++++++++---------
 13 files changed, 203 insertions(+), 294 deletions(-)
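On the variable-length pagevec question above: a C99 flexible array
member gives runtime-sized batches, at the cost of exactly the thing
the cover letter wants to keep, namely stack allocation.  A hedged
user-space sketch of that trade-off (hypothetical `vpagevec` type,
not proposed kernel code):

```c
/* Sketch of the variable-length alternative discussed in the cover
 * letter: a pagevec with a flexible array member can be sized at
 * runtime, but it must be heap-allocated -- it can no longer live on
 * the stack, which is the trade-off that makes this awkward. */
#include <assert.h>
#include <stdlib.h>

struct page { unsigned long index; };

struct vpagevec {
	unsigned int nr;		/* pages currently stored */
	unsigned int capacity;		/* slots allocated */
	struct page *pages[];		/* flexible array member */
};

static struct vpagevec *vpagevec_alloc(unsigned int capacity)
{
	struct vpagevec *pvec =
		malloc(sizeof(*pvec) + capacity * sizeof(struct page *));

	if (pvec) {
		pvec->nr = 0;
		pvec->capacity = capacity;
	}
	return pvec;
}

static int vpagevec_add(struct vpagevec *pvec, struct page *page)
{
	if (pvec->nr == pvec->capacity)
		return -1;		/* batch full */
	pvec->pages[pvec->nr++] = page;
	return 0;
}
```

Because the array is bounds-tracked at runtime rather than declared
oversize, the "useful and wonderful tools" objection does not apply to
the heap variant; the unsolved part remains combining this with a
stack-allocated fixed-size form in one type.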