Message ID: 20200113153746.26654-1-willy@infradead.org
Series: Replacing the readpages a_op
On 13 Jan 2020, at 10:37, Matthew Wilcox wrote:

> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
>
> I think everybody hates the readpages API.  The fundamental problem
> with it is that it passes the pages to be read on a doubly linked
> list, using the ->lru list in the struct page.  That means the
> filesystems have to do the work of calling
> add_to_page_cache{,_lru,_locked}, and handling failures (because
> another task is also accessing that chunk of the file, and so it
> fails).
>
> This is an attempt to add a ->readahead op to replace ->readpages.
> I've converted two users, iomap/xfs and cifs.  The cifs conversion
> is lacking fscache support, and that's just because I didn't want to
> do that work; I don't believe there's anything fundamental to it.
> But I wanted to do iomap because it is The Infrastructure Of The
> Future and cifs because it is the sole remaining user of
> add_to_page_cache_locked(), which enables the last two patches in
> the series.  By the way, that gives CIFS access to the workingset
> shadow infrastructure, which it had to ignore before because it
> couldn't put pages onto the lru list at the right time.

I've always kind of liked the compromise of sending the lists.  It's
really good at the common case and doesn't have massive problems when
things break down.  Just glancing through the patches, the old
readpages is called in bigger chunks, so for massive reads we can do
more effective readahead on metadata.  I don't think any of us
actually do, but we could.

With this new operation, our window is constant, and much smaller.

> The fundamental question is, how do we indicate to the
> implementation of ->readahead what pages to operate on?  I've gone
> with passing a pagevec.  This has the obvious advantage that it's a
> data structure that already exists and is used within filemap for
> batches of pages.  I had to add a bit of new infrastructure to
> support iterating over the pages in the pagevec, but with that done,
> it's quite nice.
>
> I think the biggest problem is that the size of the pagevec is
> limited to 15 pages (60kB).  So that'll mean that if the readahead
> window bumps all the way up to 256kB, we may end up making 5 BIOs
> (and merging them) instead of one.  I'd kind of like to be able to
> allocate variable length pagevecs while allowing regular pagevecs to
> be allocated on the stack, but I can't figure out a way to do that.
> eg this doesn't work:
>
> -	struct page *pages[PAGEVEC_SIZE];
> +	union {
> +		struct page *pages[PAGEVEC_SIZE];
> +		struct page *_pages[];
> +	}
>
> and if we just allocate them, useful and wonderful tools are going
> to point out when pages[16] is accessed that we've overstepped the
> end of the array.
>
> I have considered alternatives to the pagevec like just having the
> ->readahead implementation look up the pages in the i_pages XArray
> directly.  That didn't work out too well.

Btrfs basically does this now, honestly iomap isn't that far away.
Given how sensible iomap is for this, I'd rather see us pile into that
abstraction than try to pass pagevecs for large ranges.  Otherwise, if
the lists are awkward we can make some helpers to make it less error
prone?

-chris
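For readers following along: the pagevec under discussion is a small
fixed-size batch of page pointers.  Below is a user-space sketch of
that batch-plus-iterator shape (stand-in types and a hypothetical
iteration helper, not the kernel's actual pagevec code):

```c
/* User-space model (not kernel code) of the pagevec idea in the cover
 * letter: a small fixed-size batch of page pointers plus an iteration
 * helper.  "struct page" is a stand-in; PAGEVEC_SIZE matches the
 * 15-entry limit mentioned above. */
#include <assert.h>
#include <stddef.h>

#define PAGEVEC_SIZE 15

struct page { unsigned long index; };

struct pagevec {
	unsigned char nr;		/* pages currently in the batch */
	struct page *pages[PAGEVEC_SIZE];
};

/* Add a page; returns the number of free slots remaining, so the
 * caller knows when the batch is full and must be processed. */
static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page)
{
	pvec->pages[pvec->nr++] = page;
	return PAGEVEC_SIZE - pvec->nr;
}

/* The kind of iteration helper the cover letter says had to be added:
 * visit each page in the batch in order. */
#define pagevec_for_each(pvec, i, page) \
	for ((i) = 0; (i) < (pvec)->nr && ((page) = (pvec)->pages[(i)]); (i)++)

/* Sum of indices, as a stand-in for "issue reads for these pages". */
static unsigned long process_batch(struct pagevec *pvec)
{
	unsigned i;
	struct page *page;
	unsigned long sum = 0;

	pagevec_for_each(pvec, i, page)
		sum += page->index;
	pvec->nr = 0;			/* batch consumed */
	return sum;
}
```

The 15-entry limit criticized in the thread falls directly out of the
fixed array: once it fills, the caller has to flush and start a new
batch, which is where the multiple-BIOs-per-readahead-window concern
comes from.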
On Mon, Jan 13, 2020 at 04:42:10PM +0000, Chris Mason wrote:
> On 13 Jan 2020, at 10:37, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > I think everybody hates the readpages API.  The fundamental
> > problem with it is that it passes the pages to be read on a doubly
> > linked list, using the ->lru list in the struct page.  That means
> > the filesystems have to do the work of calling
> > add_to_page_cache{,_lru,_locked}, and handling failures (because
> > another task is also accessing that chunk of the file, and so it
> > fails).
>
> I've always kind of liked the compromise of sending the lists.  It's
> really good at the common case and doesn't have massive problems
> when things break down.

I think we'll have to disagree on that point.  Linked lists are awful
for the CPU in the common case, and the error handling code for
"things break down" is painful.  I'm pretty sure I spotted three bugs
in the CIFS implementation.

> Just glancing through the patches, the old readpages is called in
> bigger chunks, so for massive reads we can do more effective
> readahead on metadata.  I don't think any of us actually do, but we
> could.
>
> With this new operation, our window is constant, and much smaller.
>
> > The fundamental question is, how do we indicate to the
> > implementation of ->readahead what pages to operate on?  I've gone
> > with passing a pagevec.  This has the obvious advantage that it's
> > a data structure that already exists and is used within filemap
> > for batches of pages.  I had to add a bit of new infrastructure to
> > support iterating over the pages in the pagevec, but with that
> > done, it's quite nice.
> >
> > I think the biggest problem is that the size of the pagevec is
> > limited to 15 pages (60kB).  So that'll mean that if the readahead
> > window bumps all the way up to 256kB, we may end up making 5 BIOs
> > (and merging them) instead of one.  I'd kind of like to be able to
> > allocate variable length pagevecs while allowing regular pagevecs
> > to be allocated on the stack, but I can't figure out a way to do
> > that.  eg this doesn't work:
> >
> > -	struct page *pages[PAGEVEC_SIZE];
> > +	union {
> > +		struct page *pages[PAGEVEC_SIZE];
> > +		struct page *_pages[];
> > +	}
> >
> > and if we just allocate them, useful and wonderful tools are going
> > to point out when pages[16] is accessed that we've overstepped the
> > end of the array.
> >
> > I have considered alternatives to the pagevec like just having the
> > ->readahead implementation look up the pages in the i_pages XArray
> > directly.  That didn't work out too well.
>
> Btrfs basically does this now, honestly iomap isn't that far away.
> Given how sensible iomap is for this, I'd rather see us pile into
> that abstraction than try to pass pagevecs for large ranges.
> Otherwise, if the lists are awkward we can make some helpers to make
> it less error prone?

I did do a couple of helpers for lists for iomap before deciding the
whole thing was too painful.  I didn't look at btrfs until just now,
but, um ...

    int extent_readpages(struct address_space *mapping,
                         struct list_head *pages, unsigned nr_pages)
    ..
        struct page *pagepool[16];
    ..
        while (!list_empty(pages)) {
    ..
            list_del(&page->lru);
            if (add_to_page_cache_lru(page, mapping, page->index,
    ..
            pagepool[nr++] = page;

you're basically doing exactly what i'm proposing to be the new
interface!  OK, you get one extra page per batch ;-P
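The extent_readpages pattern quoted above -- drain the linked list
into a small fixed pool and work a batch at a time -- can be modelled
in user space like this (simplified stand-in list and page types, not
btrfs's real code):

```c
/* User-space sketch of the batching pattern in btrfs's
 * extent_readpages as quoted above: pull pages off a linked list into
 * a 16-slot "pagepool" array and process it one full batch at a time.
 * The singly linked list stands in for the ->lru list linkage. */
#include <assert.h>
#include <stddef.h>

struct page {
	struct page *next;	/* stand-in for the ->lru linkage */
	unsigned long index;
};

/* Process one batch; here we just accumulate indices as a stand-in
 * for building and submitting a bio for the batch. */
static void process_pool(struct page **pool, int nr, unsigned long *total)
{
	for (int i = 0; i < nr; i++)
		*total += pool[i]->index;
}

static unsigned long drain_pages(struct page *head)
{
	struct page *pagepool[16];
	unsigned long total = 0;
	int nr = 0;

	while (head) {
		struct page *page = head;

		head = page->next;	/* list_del() equivalent */
		pagepool[nr++] = page;
		if (nr == 16) {		/* pool full: flush the batch */
			process_pool(pagepool, nr, &total);
			nr = 0;
		}
	}
	process_pool(pagepool, nr, &total);	/* flush the remainder */
	return total;
}
```

This is the sense in which the list-based callers already behave like
the proposed batched interface: the list is only ever consumed a
small, fixed-size array at a time.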
On Mon, Jan 13, 2020 at 04:42:10PM +0000, Chris Mason wrote:
> Btrfs basically does this now, honestly iomap isn't that far away.
> Given how sensible iomap is for this, I'd rather see us pile into
> that abstraction than try to pass pagevecs for large ranges.
> Otherwise, if

I completely misread this at first and thought you were proposing we
pass a bio_vec to ->readahead.  Initially, this is a layering
violation, completely unnecessary to have all these extra offset/size
fields being allocated and passed around.  But ... the bio_vec and the
skb_frag_t are now the same data structure, so both block and network
use it.  It may make sense to have this as the common data structure
for 'unit of IO'.  The bio supports having the bi_vec allocated
externally to the data structure while the skbuff would need to copy
the array.

Maybe we need a more neutral name than bio_vec so as to not upset
people.  page_frag, perhaps [1].

[1] Yes, I know about the one in include/linux/mm_types_task.h
On 13 Jan 2020, at 12:40, Matthew Wilcox wrote:

> On Mon, Jan 13, 2020 at 04:42:10PM +0000, Chris Mason wrote:
>
> I did do a couple of helpers for lists for iomap before deciding the
> whole thing was too painful.  I didn't look at btrfs until just now,
> but, um ...
>
>     int extent_readpages(struct address_space *mapping,
>                          struct list_head *pages, unsigned nr_pages)
>     ..
>         struct page *pagepool[16];
>     ..
>         while (!list_empty(pages)) {
>     ..
>             list_del(&page->lru);
>             if (add_to_page_cache_lru(page, mapping, page->index,
>     ..
>             pagepool[nr++] = page;
>
> you're basically doing exactly what i'm proposing to be the new
> interface!  OK, you get one extra page per batch ;-P

This is true, I didn't explain that part well ;)  Depending on
compression etc we might end up poking the xarray inside the actual IO
functions, but the main difference is that btrfs is building a single
bio.  You're moving the plug so you'll merge into single bio, but I'd
rather build 2MB bios than merge them.

I guess it doesn't feel like enough of a win to justify the churn.  If
we find a way to do much larger pagevecs, I think this makes more
sense.

-chris
On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
> This is true, I didn't explain that part well ;)  Depending on
> compression etc we might end up poking the xarray inside the actual
> IO functions, but the main difference is that btrfs is building a
> single bio.  You're moving the plug so you'll merge into single bio,
> but I'd rather build 2MB bios than merge them.

Why don't we store a bio pointer inside the plug?  You're opencoding
that, iomap is opencoding that, and I bet there's a dozen other places
where we pass a bio around.  Then blk_finish_plug can submit the bio.
On 1/13/20 2:58 PM, Matthew Wilcox wrote:
> On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
>> This is true, I didn't explain that part well ;)  Depending on
>> compression etc we might end up poking the xarray inside the actual
>> IO functions, but the main difference is that btrfs is building a
>> single bio.  You're moving the plug so you'll merge into single
>> bio, but I'd rather build 2MB bios than merge them.
>
> Why don't we store a bio pointer inside the plug?  You're opencoding
> that, iomap is opencoding that, and I bet there's a dozen other
> places where we pass a bio around.  Then blk_finish_plug can submit
> the bio.

Plugs aren't necessarily a bio, they can be callbacks too.
On Mon, Jan 13, 2020 at 03:00:40PM -0700, Jens Axboe wrote:
> On 1/13/20 2:58 PM, Matthew Wilcox wrote:
> > On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
> >> This is true, I didn't explain that part well ;)  Depending on
> >> compression etc we might end up poking the xarray inside the
> >> actual IO functions, but the main difference is that btrfs is
> >> building a single bio.  You're moving the plug so you'll merge
> >> into single bio, but I'd rather build 2MB bios than merge them.
> >
> > Why don't we store a bio pointer inside the plug?  You're
> > opencoding that, iomap is opencoding that, and I bet there's a
> > dozen other places where we pass a bio around.  Then
> > blk_finish_plug can submit the bio.
>
> Plugs aren't necessarily a bio, they can be callbacks too.

I'm thinking something as simple as this:

@@ -1711,6 +1711,7 @@ void blk_start_plug(struct blk_plug *plug)

 	INIT_LIST_HEAD(&plug->mq_list);
 	INIT_LIST_HEAD(&plug->cb_list);
+	plug->bio = NULL;
 	plug->rq_count = 0;
 	plug->multiple_queues = false;

@@ -1786,6 +1787,8 @@ void blk_finish_plug(struct blk_plug *plug)
 {
 	if (plug != current->plug)
 		return;
+	if (plug->bio)
+		submit_bio(plug->bio);
 	blk_flush_plug_list(plug, false);

 	current->plug = NULL;

@@ -1160,6 +1160,7 @@ extern void blk_set_queue_dying(struct request_queue *);
 struct blk_plug {
 	struct list_head mq_list; /* blk-mq requests */
 	struct list_head cb_list; /* md requires an unplug callback */
+	struct bio *bio;
 	unsigned short rq_count;
 	bool multiple_queues;
 };

with accessors to 'get_current_bio()' and 'set_current_bio()'.
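As a rough user-space model of the sketch above (toy types and
signatures -- the real blk_start_plug/blk_finish_plug and submit_bio
look different), the idea is that the plug holds one pending bio,
merges little I/Os into it, and submits it when the plug is finished:

```c
/* Toy model of "store a bio pointer inside the plug": the plug holds
 * one pending bio, small additions merge into it, and finishing the
 * plug submits whatever is held.  Not the real block layer. */
#include <assert.h>
#include <stddef.h>

struct bio { int sectors; };

struct plug {
	struct bio *bio;	/* pending bio, still owned by the builder */
	int submitted;		/* count of submissions, for the model */
};

static void submit_bio(struct plug *plug, struct bio *bio)
{
	(void)bio;		/* a real implementation would issue it */
	plug->submitted++;
}

static void start_plug(struct plug *plug)
{
	plug->bio = NULL;
	plug->submitted = 0;
}

/* Merge into the held bio if one exists -- the "do lots of little
 * IOs and have them combined" behaviour -- else start holding it. */
static void plug_add_bio(struct plug *plug, struct bio *bio)
{
	if (plug->bio) {
		plug->bio->sectors += bio->sectors;	/* pretend they merge */
		return;
	}
	plug->bio = bio;
}

static void finish_plug(struct plug *plug)
{
	if (plug->bio)
		submit_bio(plug, plug->bio);
	plug->bio = NULL;
}
```

The objections that follow in the thread are about exactly the parts
this toy ignores: what happens to the held bio on preemption, and how
it composes with nested plugs.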
On 1/13/20 3:10 PM, Matthew Wilcox wrote:
> On Mon, Jan 13, 2020 at 03:00:40PM -0700, Jens Axboe wrote:
>> On 1/13/20 2:58 PM, Matthew Wilcox wrote:
>>> On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
>>>> This is true, I didn't explain that part well ;)  Depending on
>>>> compression etc we might end up poking the xarray inside the
>>>> actual IO functions, but the main difference is that btrfs is
>>>> building a single bio.  You're moving the plug so you'll merge
>>>> into single bio, but I'd rather build 2MB bios than merge them.
>>>
>>> Why don't we store a bio pointer inside the plug?  You're
>>> opencoding that, iomap is opencoding that, and I bet there's a
>>> dozen other places where we pass a bio around.  Then
>>> blk_finish_plug can submit the bio.
>>
>> Plugs aren't necessarily a bio, they can be callbacks too.
>
> I'm thinking something as simple as this:
>
> @@ -1711,6 +1711,7 @@ void blk_start_plug(struct blk_plug *plug)
>
>  	INIT_LIST_HEAD(&plug->mq_list);
>  	INIT_LIST_HEAD(&plug->cb_list);
> +	plug->bio = NULL;
>  	plug->rq_count = 0;
>  	plug->multiple_queues = false;
>
> @@ -1786,6 +1787,8 @@ void blk_finish_plug(struct blk_plug *plug)
>  {
>  	if (plug != current->plug)
>  		return;
> +	if (plug->bio)
> +		submit_bio(plug->bio);
>  	blk_flush_plug_list(plug, false);
>
>  	current->plug = NULL;
> @@ -1160,6 +1160,7 @@ extern void blk_set_queue_dying(struct request_queue *);
>  struct blk_plug {
>  	struct list_head mq_list; /* blk-mq requests */
>  	struct list_head cb_list; /* md requires an unplug callback */
> +	struct bio *bio;
>  	unsigned short rq_count;
>  	bool multiple_queues;
>  };
>
> with accessors to 'get_current_bio()' and 'set_current_bio()'.

It's a little odd imho, the plugging generally collects requests.
Sounds like what you're looking for is some plug owner private data,
which just happens to be a bio in this case?

Is this over repeated calls to some IO generating helper?  Would it be
more efficient if that helper could generate the full bio in one go,
instead of piecemeal?
On Mon, Jan 13, 2020 at 10:54 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jan 13, 2020 at 04:42:10PM +0000, Chris Mason wrote:
> > Btrfs basically does this now, honestly iomap isn't that far away.
> > Given how sensible iomap is for this, I'd rather see us pile into
> > that abstraction than try to pass pagevecs for large ranges.
> > Otherwise, if
>
> I completely misread this at first and thought you were proposing we
> pass a bio_vec to ->readahead.  Initially, this is a layering
> violation, completely unnecessary to have all these extra
> offset/size fields being allocated and passed around.  But ... the
> bio_vec and the skb_frag_t are now the same data structure, so both
> block and network use it.  It may make sense to have this as the
> common data structure for 'unit of IO'.  The bio supports having the
> bi_vec allocated externally to the data structure while the skbuff
> would need to copy the array.
>
> Maybe we need a more neutral name than bio_vec so as to not upset
> people.  page_frag, perhaps [1].
>
> [1] Yes, I know about the one in include/linux/mm_types_task.h

Note that bio_vecs support page merging, so page fragment isn't very
descriptive.  Not sure what a good name would be, but fragment isn't
it.  But any shared type would imply that others should support that
as well if you're passing it around (or by pointer).
On Mon, Jan 13, 2020 at 03:14:26PM -0700, Jens Axboe wrote:
> On 1/13/20 3:10 PM, Matthew Wilcox wrote:
> > On Mon, Jan 13, 2020 at 03:00:40PM -0700, Jens Axboe wrote:
> >> On 1/13/20 2:58 PM, Matthew Wilcox wrote:
> >>> On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
> >>>> This is true, I didn't explain that part well ;)  Depending on
> >>>> compression etc we might end up poking the xarray inside the
> >>>> actual IO functions, but the main difference is that btrfs is
> >>>> building a single bio.  You're moving the plug so you'll merge
> >>>> into single bio, but I'd rather build 2MB bios than merge them.
> >>>
> >>> Why don't we store a bio pointer inside the plug?  You're
> >>> opencoding that, iomap is opencoding that, and I bet there's a
> >>> dozen other places where we pass a bio around.  Then
> >>> blk_finish_plug can submit the bio.
> >>
> >> Plugs aren't necessarily a bio, they can be callbacks too.
> >
> > I'm thinking something as simple as this:
>
> It's a little odd imho, the plugging generally collect requests.
> Sounds what you're looking for is some plug owner private data,
> which just happens to be a bio in this case?
>
> Is this over repeated calls to some IO generating helper?  Would it
> be more efficient if that helper could generate the full bio in one
> go, instead of piecemeal?

The issue is around ->readpages.  Take a look at how iomap_readpages
works, for example.  We're under a plug (taken in mm/readahead.c), but
we still go through the rigamarole of keeping a pointer to the bio in
ctx.bio and passing ctx around so that we don't end up with many
fragments which have to be recombined into a single bio at the end.

I think what I want is a bio I can reach from current, somehow.  And
the plug feels like a natural place to keep it because it's basically
saying "I want to do lots of little IOs and have them combined".  The
fact that the iomap code has a bio that it precombines fragments into
suggests to me that the existing antifragmentation code in the
plugging mechanism isn't good enough.  So let's make it better by
storing a bio in the plug and then we can get rid of the bio in the
iomap code.
On 1/13/20 3:27 PM, Matthew Wilcox wrote:
> On Mon, Jan 13, 2020 at 03:14:26PM -0700, Jens Axboe wrote:
>> On 1/13/20 3:10 PM, Matthew Wilcox wrote:
>>> On Mon, Jan 13, 2020 at 03:00:40PM -0700, Jens Axboe wrote:
>>>> On 1/13/20 2:58 PM, Matthew Wilcox wrote:
>>>>> On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
>>>>>> This is true, I didn't explain that part well ;)  Depending on
>>>>>> compression etc we might end up poking the xarray inside the
>>>>>> actual IO functions, but the main difference is that btrfs is
>>>>>> building a single bio.  You're moving the plug so you'll merge
>>>>>> into single bio, but I'd rather build 2MB bios than merge them.
>>>>>
>>>>> Why don't we store a bio pointer inside the plug?  You're
>>>>> opencoding that, iomap is opencoding that, and I bet there's a
>>>>> dozen other places where we pass a bio around.  Then
>>>>> blk_finish_plug can submit the bio.
>>>>
>>>> Plugs aren't necessarily a bio, they can be callbacks too.
>>>
>>> I'm thinking something as simple as this:
>>
>> It's a little odd imho, the plugging generally collect requests.
>> Sounds what you're looking for is some plug owner private data,
>> which just happens to be a bio in this case?
>>
>> Is this over repeated calls to some IO generating helper?  Would it
>> be more efficient if that helper could generate the full bio in one
>> go, instead of piecemeal?
>
> The issue is around ->readpages.  Take a look at how iomap_readpages
> works, for example.  We're under a plug (taken in mm/readahead.c),
> but we still go through the rigamarole of keeping a pointer to the
> bio in ctx.bio and passing ctx around so that we don't end up with
> many fragments which have to be recombined into a single bio at the
> end.
>
> I think what I want is a bio I can reach from current, somehow.  And
> the plug feels like a natural place to keep it because it's
> basically saying "I want to do lots of little IOs and have them
> combined".  The fact that the iomap code has a bio that it
> precombines fragments into suggests to me that the existing
> antifragmentation code in the plugging mechanism isn't good enough.
> So let's make it better by storing a bio in the plug and then we can
> get rid of the bio in the iomap code.

But it doesn't fit the plug model very well.  If you get preempted, it
could go away.  Or if you have nested plugs, it also won't work at
all.  I think it's much saner to keep the "current" bio _outside_ of
the plug, and work on it in peace until you want to submit it.
On 13 Jan 2020, at 17:27, Matthew Wilcox wrote:

> On Mon, Jan 13, 2020 at 03:14:26PM -0700, Jens Axboe wrote:
>> On 1/13/20 3:10 PM, Matthew Wilcox wrote:
>>> On Mon, Jan 13, 2020 at 03:00:40PM -0700, Jens Axboe wrote:
>>>> On 1/13/20 2:58 PM, Matthew Wilcox wrote:
>>>>> On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
>>>>>> This is true, I didn't explain that part well ;)  Depending on
>>>>>> compression etc we might end up poking the xarray inside the
>>>>>> actual IO functions, but the main difference is that btrfs is
>>>>>> building a single bio.  You're moving the plug so you'll merge
>>>>>> into single bio, but I'd rather build 2MB bios than merge them.
>>>>>
>>>>> Why don't we store a bio pointer inside the plug?  You're
>>>>> opencoding that, iomap is opencoding that, and I bet there's a
>>>>> dozen other places where we pass a bio around.  Then
>>>>> blk_finish_plug can submit the bio.
>>>>
>>>> Plugs aren't necessarily a bio, they can be callbacks too.
>>>
>>> I'm thinking something as simple as this:
>>
>> It's a little odd imho, the plugging generally collect requests.
>> Sounds what you're looking for is some plug owner private data,
>> which just happens to be a bio in this case?
>>
>> Is this over repeated calls to some IO generating helper?  Would it
>> be more efficient if that helper could generate the full bio in one
>> go, instead of piecemeal?
>
> The issue is around ->readpages.  Take a look at how iomap_readpages
> works, for example.  We're under a plug (taken in mm/readahead.c),
> but we still go through the rigamarole of keeping a pointer to the
> bio in ctx.bio and passing ctx around so that we don't end up with
> many fragments which have to be recombined into a single bio at the
> end.
>
> I think what I want is a bio I can reach from current, somehow.  And
> the plug feels like a natural place to keep it because it's
> basically saying "I want to do lots of little IOs and have them
> combined".  The fact that the iomap code has a bio that it
> precombines fragments into suggests to me that the existing
> antifragmentation code in the plugging mechanism isn't good enough.
> So let's make it better by storing a bio in the plug and then we can
> get rid of the bio in the iomap code.

Both btrfs and xfs do this, we have a bio that we pass around and
build and submit.  We both also do some gymnastics in writepages to
avoid waiting for the bios we've been building to finish while we're
building them.

I love the idea of the plug api having a way to hold that for us, but
sometimes we really are building the bios, and we don't want the plug
to let it go if we happen to schedule.

-chris
On Mon, Jan 13, 2020 at 10:34:16PM +0000, Chris Mason wrote:
> On 13 Jan 2020, at 17:27, Matthew Wilcox wrote:
> > On Mon, Jan 13, 2020 at 03:14:26PM -0700, Jens Axboe wrote:
> >> On 1/13/20 3:10 PM, Matthew Wilcox wrote:
> >>> On Mon, Jan 13, 2020 at 03:00:40PM -0700, Jens Axboe wrote:
> >>>> On 1/13/20 2:58 PM, Matthew Wilcox wrote:
> >>>>> Why don't we store a bio pointer inside the plug?  You're
> >>>>> opencoding that, iomap is opencoding that, and I bet there's a
> >>>>> dozen other places where we pass a bio around.  Then
> >>>>> blk_finish_plug can submit the bio.
> >>>>
> >>>> Plugs aren't necessarily a bio, they can be callbacks too.
> >>
> >> It's a little odd imho, the plugging generally collect requests.
> >> Sounds what you're looking for is some plug owner private data,
> >> which just happens to be a bio in this case?
> >>
> >> Is this over repeated calls to some IO generating helper?  Would
> >> it be more efficient if that helper could generate the full bio
> >> in one go, instead of piecemeal?
> >
> > The issue is around ->readpages.  Take a look at how
> > iomap_readpages works, for example.  We're under a plug (taken in
> > mm/readahead.c), but we still go through the rigamarole of keeping
> > a pointer to the bio in ctx.bio and passing ctx around so that we
> > don't end up with many fragments which have to be recombined into
> > a single bio at the end.
> >
> > I think what I want is a bio I can reach from current, somehow.
> > And the plug feels like a natural place to keep it because it's
> > basically saying "I want to do lots of little IOs and have them
> > combined".  The fact that the iomap code has a bio that it
> > precombines fragments into suggests to me that the existing
> > antifragmentation code in the plugging mechanism isn't good
> > enough.  So let's make it better by storing a bio in the plug and
> > then we can get rid of the bio in the iomap code.
>
> Both btrfs and xfs do this, we have a bio that we pass around and
> build and submit.  We both also do some gymnastics in writepages to
> avoid waiting for the bios we've been building to finish while we're
> building them.
>
> I love the idea of the plug api having a way to hold that for us,
> but sometimes we really are building the bios, and we don't want the
> plug to let it go if we happen to schedule.

The plug wouldn't have to let the bio go.  I appreciate the plug does
let requests go on context switch, but it wouldn't have to let the bio
go.  This bio is being stored on the stack, just as now, so it's still
there.
On 13 Jan 2020, at 20:01, Matthew Wilcox wrote:

> On Mon, Jan 13, 2020 at 10:34:16PM +0000, Chris Mason wrote:
>>> I think what I want is a bio I can reach from current, somehow.
>>> And the plug feels like a natural place to keep it because it's
>>> basically saying "I want to do lots of little IOs and have them
>>> combined".  The fact that the iomap code has a bio that it
>>> precombines fragments into suggests to me that the existing
>>> antifragmentation code in the plugging mechanism isn't good
>>> enough.  So let's make it better by storing a bio in the plug and
>>> then we can get rid of the bio in the iomap code.
>>
>> Both btrfs and xfs do this, we have a bio that we pass around and
>> build and submit.  We both also do some gymnastics in writepages to
>> avoid waiting for the bios we've been building to finish while
>> we're building them.
>>
>> I love the idea of the plug api having a way to hold that for us,
>> but sometimes we really are building the bios, and we don't want
>> the plug to let it go if we happen to schedule.
>
> The plug wouldn't have to let the bio go.  I appreciate the plug
> does let requests go on context switch, but it wouldn't have to let
> the bio go.  This bio is being stored on the stack, just as now, so
> it's still there.

Plugging is great because it's building up state for the block layer
across a wide variety of MM and FS code.  btrfs and xfs are building
state for themselves across a tiny cross section of their own code.
I'd rather not complicate magic state in current in places we can
easily pass the bio down.

-chris
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

I think everybody hates the readpages API.  The fundamental problem
with it is that it passes the pages to be read on a doubly linked
list, using the ->lru list in the struct page.  That means the
filesystems have to do the work of calling
add_to_page_cache{,_lru,_locked}, and handling failures (because
another task is also accessing that chunk of the file, and so it
fails).

This is an attempt to add a ->readahead op to replace ->readpages.
I've converted two users, iomap/xfs and cifs.  The cifs conversion is
lacking fscache support, and that's just because I didn't want to do
that work; I don't believe there's anything fundamental to it.  But I
wanted to do iomap because it is The Infrastructure Of The Future and
cifs because it is the sole remaining user of
add_to_page_cache_locked(), which enables the last two patches in the
series.  By the way, that gives CIFS access to the workingset shadow
infrastructure, which it had to ignore before because it couldn't put
pages onto the lru list at the right time.

The fundamental question is, how do we indicate to the implementation
of ->readahead what pages to operate on?  I've gone with passing a
pagevec.  This has the obvious advantage that it's a data structure
that already exists and is used within filemap for batches of pages.
I had to add a bit of new infrastructure to support iterating over the
pages in the pagevec, but with that done, it's quite nice.

I think the biggest problem is that the size of the pagevec is limited
to 15 pages (60kB).  So that'll mean that if the readahead window
bumps all the way up to 256kB, we may end up making 5 BIOs (and
merging them) instead of one.  I'd kind of like to be able to allocate
variable length pagevecs while allowing regular pagevecs to be
allocated on the stack, but I can't figure out a way to do that.
eg this doesn't work:

-	struct page *pages[PAGEVEC_SIZE];
+	union {
+		struct page *pages[PAGEVEC_SIZE];
+		struct page *_pages[];
+	}

and if we just allocate them, useful and wonderful tools are going to
point out when pages[16] is accessed that we've overstepped the end of
the array.

I have considered alternatives to the pagevec like just having the
->readahead implementation look up the pages in the i_pages XArray
directly.  That didn't work out too well.

Anyway, I want to draw your attention to the diffstat below.  Net 91
lines deleted, and that's with adding all the infrastructure for
->readahead and getting rid of none of the infrastructure for
->readpages.  There's probably a good couple of hundred lines of code
to be deleted there.

Matthew Wilcox (Oracle) (8):
  pagevec: Add an iterator
  mm: Fix the return type of __do_page_cache_readahead
  mm: Use a pagevec for readahead
  mm/fs: Add a_ops->readahead
  iomap,xfs: Convert from readpages to readahead
  cifs: Convert from readpages to readahead
  mm: Remove add_to_page_cache_locked
  mm: Unify all add_to_page_cache variants

 Documentation/filesystems/locking.rst |   8 +-
 Documentation/filesystems/vfs.rst     |   9 ++
 fs/cifs/file.c                        | 125 ++++------------------
 fs/iomap/buffered-io.c                |  60 +++--------
 fs/iomap/trace.h                      |  18 ++--
 fs/xfs/xfs_aops.c                     |  12 +--
 include/linux/fs.h                    |   3 +
 include/linux/iomap.h                 |   4 +-
 include/linux/pagemap.h               |  23 +----
 include/linux/pagevec.h               |  20 ++++
 mm/filemap.c                          |  72 ++++---------
 mm/internal.h                         |   2 +-
 mm/readahead.c                        | 141 +++++++++++++++++---------
 13 files changed, 203 insertions(+), 294 deletions(-)
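On the variable-length pagevec question above: a C99 flexible array
member gives runtime-sized batches, at the cost of exactly the thing
the cover letter wants to keep, namely stack allocation.  A hedged
user-space sketch of that trade-off (hypothetical `vpagevec` type,
not proposed kernel code):

```c
/* Sketch of the variable-length alternative discussed in the cover
 * letter: a pagevec with a flexible array member can be sized at
 * runtime, but it must be heap-allocated -- it can no longer live on
 * the stack, which is the trade-off that makes this awkward. */
#include <assert.h>
#include <stdlib.h>

struct page { unsigned long index; };

struct vpagevec {
	unsigned int nr;		/* pages currently stored */
	unsigned int capacity;		/* slots allocated */
	struct page *pages[];		/* flexible array member */
};

static struct vpagevec *vpagevec_alloc(unsigned int capacity)
{
	struct vpagevec *pvec =
		malloc(sizeof(*pvec) + capacity * sizeof(struct page *));

	if (pvec) {
		pvec->nr = 0;
		pvec->capacity = capacity;
	}
	return pvec;
}

static int vpagevec_add(struct vpagevec *pvec, struct page *page)
{
	if (pvec->nr == pvec->capacity)
		return -1;		/* batch full */
	pvec->pages[pvec->nr++] = page;
	return 0;
}
```

Because the array is bounds-tracked at runtime rather than declared
oversize, the "useful and wonderful tools" objection does not apply to
the heap variant; the unsolved part remains combining this with a
stack-allocated fixed-size form in one type.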