
[RFC] implement orangefs_readahead

Message ID CAOg9mSTQ-zNKXQGBK9QEnwJCvwqh=zFLbLJZy-ibGZwLve4o0w@mail.gmail.com (mailing list archive)
State New, archived
Series: [RFC] implement orangefs_readahead

Commit Message

Mike Marshall Jan. 31, 2021, 10:25 p.m. UTC
A few weeks ago Matthew Wilcox helped me see how
mm/readahead.c/read_pages was dropping down into
some code designed to take over for filesystems
that didn't implement ->readahead, and how this "failover"
code was screwing over the readahead-like code I'd put
into orangefs_readpage.
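
For reference, the shape of read_pages() in mm/readahead.c at the
time looked roughly like this (paraphrased, not verbatim); it shows
both the fallback path and the "clean up the remaining pages"
behavior that comes up again later in this thread:

        static void read_pages(struct readahead_control *rac, ...)
        {
                const struct address_space_operations *aops = rac->mapping->a_ops;
                struct page *page;

                if (aops->readahead) {
                        aops->readahead(rac);
                        /* Clean up the remaining pages. */
                        while ((page = readahead_page(rac))) {
                                unlock_page(page);
                                put_page(page);
                        }
                } else if (aops->readpages) {
                        /* legacy multi-page path */
                        ...
                } else {
                        /* fall back to reading one page at a time */
                        while ((page = readahead_page(rac))) {
                                aops->readpage(rac->file, page);
                                put_page(page);
                        }
                }
        }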

I studied a bunch of readahead code in fs and mm and other
filesystems and came up with this patch that seems to work
in the tests I've done so far. Sometimes code I like is
instead irritating to Linus or Al Viro :-), so hopefully some
of y'all will look over what I've got here. There are a
couple of printk's I've left in orangefs_readpage that don't
belong upstream, though they help me with testing for now...
Besides the diff at the end of this message, the code is
in: https://github.com/hubcapsc/linux/tree/readahead

I wish I knew how to specify _nr_pages in the readahead_control
structure so that all the extra pages I need could be obtained
in readahead_page instead of part there and the rest in my
open-coded stuff in orangefs_readpage. But it looks to me as
if values in the readahead_control structure are set heuristically
outside of my control over in ondemand_readahead?
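
For reference, the fields in question look roughly like this in
include/linux/pagemap.h around 5.11; the leading underscores mark
them as private to the core, meant to be read through the
readahead_*() accessors:

        struct readahead_control {
                struct file *file;
                struct address_space *mapping;
        /* private: use the readahead_* accessors instead */
                pgoff_t _index;
                unsigned int _nr_pages;
                unsigned int _batch_count;
        };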

[root@vm3 linux]# git diff master..readahead

Comments

Matthew Wilcox Feb. 1, 2021, 1:08 p.m. UTC | #1
On Sun, Jan 31, 2021 at 05:25:02PM -0500, Mike Marshall wrote:
> I wish I knew how to specify _nr_pages in the readahead_control
> structure so that all the extra pages I need could be obtained
> in readahead_page instead of part there and the rest in my
> open-coded stuff in orangefs_readpage. But it looks to me as
> if values in the readahead_control structure are set heuristically
> outside of my control over in ondemand_readahead?

That's right (for now).  I pointed you at some code from Dave Howells
that will allow orangefs to enlarge the readahead window beyond that
determined by the core code's algorithms.
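
For reference, the interface from that series (as it eventually
landed) is a single call that a ->readahead() implementation makes to
widen the window before reading; the core may still expand by less
than asked (e.g. if pages are already present), so callers re-check
readahead_count() afterwards:

        void readahead_expand(struct readahead_control *ractl,
                              loff_t new_start, size_t new_len);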

> [root@vm3 linux]# git diff master..readahead
> diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
> index 48f0547d4850..682a968cb82a 100644
> --- a/fs/orangefs/inode.c
> +++ b/fs/orangefs/inode.c
> @@ -244,6 +244,25 @@ static int orangefs_writepages(struct
> address_space *mapping,
> 
>  static int orangefs_launder_page(struct page *);
> 
> +/*
> + * Prefill the page cache with some pages that we're probably
> + * about to need...
> + */
> +static void orangefs_readahead(struct readahead_control *rac)
> +{
> +       pgoff_t index = readahead_index(rac);
> +       struct page *page;
> +
> +       while ((page = readahead_page(rac))) {
> +               prefetchw(&page->flags);
> +               put_page(page);
> +               unlock_page(page);
> +               index++;
> +       }
> +
> +       return;
> +}

This is not the way to do it.  You need to actually kick off readahead in
this routine so that you get pipelining (ie the app is working on pages
0-15 at the same time the server is getting you pages 16-31).  I don't
see much support in orangefs for doing async operations; everything
seems to be modelled on "submit an I/O and wait for it to complete".

I'm happy to help out with pagecache interactions, but adding async
support to orangefs is a little bigger task than I'm willing to put
significant effort into right now.
Mike Marshall Feb. 2, 2021, 3:32 a.m. UTC | #2
>> This is not the way to do it. You need to actually kick
>> off readahead in this routine so that you get pipelining
>> (ie the app is working on pages 0-15 at the same time
>> the server is getting you pages 16-31).

Orangefs isn't very good at reading or writing a few
pages at a time. Its optimal block size is four megabytes.
I'm trying to do IOs big enough to make Orangefs
start flowing like it needs to and then have pages
on hand to fill with the data. Perhaps I can figure
out how to use David Howells' code to control the
readahead window and make adjustments to
how many pages Orangefs reads per IO, and
end up with something that is closer to how
readahead is intended to be used.

This patch is a big performance improvement over
the code that's upstream even though I'm
not using readahead as intended.

>> I don't see much support in orangefs for doing async
>> operations; everything seems to be modelled on
>> "submit an I/O and wait for it to complete".

Yep... when we were polishing up the kernel module to
attempt to go upstream, the code in there for async was
left behind... I might be able to make sense of it now,
I dunno... You've helped me see this place where
it is needed.

>> adding async
>> support to orangefs is a little bigger task than I'm willing to put
>> significant effort into right now.

The effort and help that you're providing is much
appreciated and just what I need, thanks!

-Mike

Mike Marshall March 13, 2021, 3:31 p.m. UTC | #3
Greetings everyone.

I have made another version of orangefs_readahead, without any
of my hand rolled page cache manipulations. I read a bunch of
the source in other filesystems and mm and fs and pagemap.h to
try and get an idea of how to implement readahead so that my
implementation is "with the program".

I have described the flawed code I have upstream now in an
earlier message. My flawed code has no readahead implementation,
yet it is still much faster than this readahead implementation.

If this readahead implementation is "the right idea", I can
use it as a framework to implement an async orangefs read function
and start the read at the beginning of my readahead function
and collect the results at the end after the readahead pages
have been marshaled. Also, once some mechanism like David Howells'
code to control the readahead window goes upstream, I should be
able to take big enough gulps of readahead to make Orangefs do right.
The heuristically chosen 64-page max that I can get now isn't enough.

I hope some of y'all have the time to review this implementation of
readahead...

Thanks!

-Mike

static void orangefs_readahead(struct readahead_control *rac)
{
        struct page **pages;
        unsigned int npages = readahead_count(rac);
        loff_t offset = readahead_pos(rac);
        struct bio_vec *bvs;
        int i;
        struct iov_iter iter;
        struct file *file = rac->file;
        struct inode *inode = file->f_mapping->host;
        int ret;

        /* allocate an array of page pointers. */
        pages = kzalloc(npages * (sizeof(struct page *)), GFP_KERNEL);

        /* Get a batch of pages to read. */
        npages = __readahead_batch(rac, pages, npages);

        /* allocate an array of bio_vecs. */
        bvs = kzalloc(npages * (sizeof(struct bio_vec)), GFP_KERNEL);

        /* hook the bio_vecs to the pages. */
        for (i = 0; i < npages; i++) {
                bvs[i].bv_page = pages[i];
                bvs[i].bv_len = PAGE_SIZE;
                bvs[i].bv_offset = 0;
        }

        iov_iter_bvec(&iter, READ, bvs, npages, npages * PAGE_SIZE);

        /* read in the pages. */
        ret = wait_for_direct_io(ORANGEFS_IO_READ, inode, &offset, &iter,
                                 npages * PAGE_SIZE, inode->i_size, NULL,
                                 NULL, file);

        /* clean up. */
        for (i = 0; i < npages; i++) {
                SetPageUptodate(bvs[i].bv_page);
                unlock_page(bvs[i].bv_page);
                put_page(bvs[i].bv_page);
        }
        kfree(pages);
        kfree(bvs);
}

Mike Marshall March 17, 2021, 3:04 a.m. UTC | #4
So I took my orangefs_readahead patch and
David Howells' patches from his fscache-iter
branch and put them on top of Linux 5.12-rc3
so as to get rid of all that
 RIP: 0010:ttm_bo_release+0x2ea/0x340 [ttm]
stuff that was happening to VMs with
5.12-rc2.  Then I added a readahead_expand
call at the start of orangefs_readahead. I
played with various expansion values, but
the bottom line is: it works GREAT in my
simple tests, speeds reads WAY up.

-Mike

David Howells March 24, 2021, 11:10 a.m. UTC | #5
Mike Marshall <hubcap@omnibond.com> wrote:

> 	/* allocate an array of bio_vecs. */
> 	bvs = kzalloc(npages * (sizeof(struct bio_vec)), GFP_KERNEL);
>

Better to use kcalloc() here as it has overflow checking.
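
That is, something like (just a sketch of the same allocation):

	bvs = kcalloc(npages, sizeof(struct bio_vec), GFP_KERNEL);

kcalloc(n, size, flags) returns NULL rather than wrapping around if
n * size would overflow; the kzalloc() of the page-pointer array
above could get the same treatment.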

> 	/* hook the bio_vecs to the pages. */
> 	for (i = 0; i < npages; i++) {
> 		bvs[i].bv_page = pages[i];
> 		bvs[i].bv_len = PAGE_SIZE;
> 		bvs[i].bv_offset = 0;
> 	}
> 
> 	iov_iter_bvec(&iter, READ, bvs, npages, npages * PAGE_SIZE);
> 
> 	/* read in the pages. */
> 	ret = wait_for_direct_io(ORANGEFS_IO_READ, inode, &offset, &iter,
> 				 npages * PAGE_SIZE, inode->i_size, NULL,
> 				 NULL, file);
> 
> 	/* clean up. */
> 	for (i = 0; i < npages; i++) {
> 		SetPageUptodate(bvs[i].bv_page);
> 		unlock_page(bvs[i].bv_page);
> 		put_page(bvs[i].bv_page);
> 	}
> 	kfree(pages);
> 	kfree(bvs);
> }

Could you try ITER_XARRAY instead of ITER_BVEC:

	https://lore.kernel.org/linux-fsdevel/161653786033.2770958.14154191921867463240.stgit@warthog.procyon.org.uk/T/#u

Setting the iterator looks like:

	iov_iter_xarray(&iter, READ, &mapping->i_pages,
			offset, npages * PAGE_SIZE);

The xarray iterator will handle THPs, but I'm not sure if bvecs will.

Cleanup afterwards would look something like:

	static void afs_file_read_done(struct afs_read *req)
	{
		struct afs_vnode *vnode = req->vnode;
		struct page *page;
		pgoff_t index = req->pos >> PAGE_SHIFT;
		pgoff_t last = index + req->nr_pages - 1;

		XA_STATE(xas, &vnode->vfs_inode.i_mapping->i_pages, index);

		if (iov_iter_count(req->iter) > 0) {
			/* The read was short - clear the excess buffer. */
			_debug("afterclear %zx %zx %llx/%llx",
			       req->iter->iov_offset,
			       iov_iter_count(req->iter),
			       req->actual_len, req->len);
			iov_iter_zero(iov_iter_count(req->iter), req->iter);
		}

		rcu_read_lock();
		xas_for_each(&xas, page, last) {
			page_endio(page, false, 0);
			put_page(page);
		}
		rcu_read_unlock();

		task_io_account_read(req->len);
		req->cleanup = NULL;
	}

David
Mike Marshall March 27, 2021, 2:55 a.m. UTC | #6
Hi David.

I implemented a version with iov_iter_xarray (below).
It appears to be doing "the right thing" when it
gets called, but then I get a backtrace in the kernel
ring buffer, "RIP: 0010:read_pages+0x1a1/0x2c0", which is:

	page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
	------------[ cut here ]------------
	kernel BUG at include/linux/pagemap.h:892!

So it seems that in mm/readahead.c/read_pages I end up
entering the "Clean up the remaining pages" part, and
never make it through even one iteration... it happens
whether I use readahead_expand or not.

I've been looking at it a long time :-), I'll look more
tomorrow... do you see anything obvious?




static void orangefs_readahead_cleanup(struct xarray *i_pages,
                                       pgoff_t index,
                                       unsigned int npages,
                                       struct iov_iter *iter)
{
        pgoff_t last;
        struct page *page;
        XA_STATE(xas, i_pages, index);

        last = npages - 1;

        if (iov_iter_count(iter) > 0)
                iov_iter_zero(iov_iter_count(iter), iter);

        rcu_read_lock();
        xas_for_each(&xas, page, last) {
                page_endio(page, false, 0);
                put_page(page);
        }
        rcu_read_unlock();
}

static void orangefs_readahead(struct readahead_control *rac)
{
        unsigned int npages;
        loff_t offset;
        struct iov_iter iter;
        struct file *file = rac->file;
        struct inode *inode = file->f_mapping->host;
        struct xarray *i_pages;
        pgoff_t index;
        int ret;
        loff_t new_start = readahead_index(rac) * PAGE_SIZE;
        size_t new_len = 524288;

        readahead_expand(rac, new_start, new_len);

        npages = readahead_count(rac);
        offset = readahead_pos(rac);
        i_pages = &file->f_mapping->i_pages;

        iov_iter_xarray(&iter, READ, i_pages, offset, npages * PAGE_SIZE);

        /* read in the pages. */
        ret = wait_for_direct_io(ORANGEFS_IO_READ, inode, &offset, &iter,
                                 npages * PAGE_SIZE, inode->i_size, NULL,
                                 NULL, file);

        /* clean up. */
        index = offset >> PAGE_SHIFT;
        orangefs_readahead_cleanup(i_pages, index, npages, &iter);
}

Matthew Wilcox March 27, 2021, 3:50 a.m. UTC | #7
On Fri, Mar 26, 2021 at 10:55:44PM -0400, Mike Marshall wrote:
> Hi David.
> 
> I implemented a version with iov_iter_xarray (below).
> It appears to be doing "the right thing" when it
> gets called, but then I get a backtrace in the kernel
> ring buffer "RIP: 0010:read_pages+0x1a1/0x2c0" which is
> page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
> ------------[ cut here ]------------
> kernel BUG at include/linux/pagemap.h:892!
> 
> So it seems that in mm/readahead.c/read_pages I end up
> entering the "Clean up the remaining pages" part, and
> never make it through even one iteration... it happens
> whether I use readahead_expand or not.
> 
> I've been looking at it a long time :-), I'll look more
> tomorrow... do you see anything obvious?

Yes; Dave's sample code doesn't consume the pages from the readahead
iterator, so the core code thinks you didn't consume them and unlocks
/ puts the pages for you.  That goes wrong, because you did actually
consume them.  Glad I added the assertions now!

We should probably add something like:

static inline void readahead_consume(struct readahead_control *ractl,
		unsigned int nr)
{
	ractl->_nr_pages -= nr;
	ractl->_index += nr;
}

to indicate that you consumed the pages other than by calling
readahead_page() or readahead_page_batch().  Or maybe Dave can
wrap iov_iter_xarray() in a readahead_iter() macro or something
that takes care of adjusting index & nr_pages for you.
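
Something like this, perhaps, where readahead_iter_xarray() is only a
sketch, not an existing helper (it assumes the hypothetical
readahead_consume() above, plus the ITER_XARRAY iterator from Dave's
series):

	static inline void readahead_iter_xarray(struct readahead_control *ractl,
			struct iov_iter *iter)
	{
		/* Build an ITER_XARRAY iterator over the whole readahead window. */
		iov_iter_xarray(iter, READ, &ractl->mapping->i_pages,
				readahead_pos(ractl), readahead_length(ractl));
		/* Mark the window consumed so the core won't unlock/put
		 * the pages itself.
		 */
		readahead_consume(ractl, readahead_count(ractl));
	}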
David Howells March 27, 2021, 8:31 a.m. UTC | #8
Matthew Wilcox <willy@infradead.org> wrote:

> > I've been looking at it a long time :-), I'll look more
> > tomorrow... do you see anything obvious?
> 
> Yes; Dave's sample code doesn't consume the pages from the readahead
> iterator, so the core code thinks you didn't consume them and unlocks
> / puts the pages for you.  That goes wrong, because you did actually
> consume them.  Glad I added the assertions now!

Yeah...  The cleanup function that I posted potentially happens asynchronously
from the ->readahead function, so the ractl consumption has to be done
elsewhere and may already have happened.  In the case of the code I posted
from, it's actually done in the netfs lib that's in the works:

void netfs_readahead(...)
{
...
	/* Drop the refs on the pages here rather than in the cache or
	 * filesystem.  The locks will be dropped in netfs_rreq_unlock().
	 */
	while ((page = readahead_page(ractl)))
		put_page(page);
...
}

> We should probably add something like:
> 
> static inline void readahead_consume(struct readahead_control *ractl,
> 		unsigned int nr)
> {
> 	ractl->_nr_pages -= nr;
> 	ractl->_index += nr;
> }
> 
> to indicate that you consumed the pages other than by calling
> readahead_page() or readahead_page_batch().  Or maybe Dave can
> wrap iov_iter_xarray() in a readahead_iter() macro or something
> that takes care of adjusting index & nr_pages for you.

I'm not sure either is useful for my case since iov_iter_xarray() for me isn't
being used anywhere that there's an ractl and I still have to drop the page
refs.

However, in Mike's orangefs_readahead_cleanup(), he could replace:

	rcu_read_lock();
	xas_for_each(&xas, page, last) {
		page_endio(page, false, 0);
		put_page(page);
	}
	rcu_read_unlock();

with:

	while ((page = readahead_page(ractl))) {
		page_endio(page, false, 0);
		put_page(page);
	}

maybe?

David
Matthew Wilcox March 27, 2021, 1:56 p.m. UTC | #9
On Sat, Mar 27, 2021 at 08:31:38AM +0000, David Howells wrote:
> However, in Mike's orangefs_readahead_cleanup(), he could replace:
> 
> 	rcu_read_lock();
> 	xas_for_each(&xas, page, last) {
> 		page_endio(page, false, 0);
> 		put_page(page);
> 	}
> 	rcu_read_unlock();
> 
> with:
> 
> 	while ((page = readahead_page(ractl))) {
> 		page_endio(page, false, 0);
> 		put_page(page);
> 	}
> 
> maybe?

I'd rather see that than open-coded use of the XArray.  It's mildly
slower, but given that we're talking about doing I/O, probably not enough
to care about.
Mike Marshall March 27, 2021, 3:40 p.m. UTC | #10
OK... I used David's suggestion, and also put it right in
orangefs_readahead; orangefs_readahead_cleanup is gone.

It seems to work great. I ran it with some printks
in it and watched it do what I think it ought to.

Here's an example of what's upstream in
5.11.8-200.fc33.x86_64:

# dd if=/pvfsmnt/z1 of=/dev/null bs=4194304 count=30
30+0 records in
30+0 records out
125829120 bytes (126 MB, 120 MiB) copied, 5.77943 s, 21.8 MB/s

And here's this version of orangefs_readahead on top of
5.12.0-rc4:

# dd if=/pvfsmnt/z1 of=/dev/null bs=4194304 count=30
30+0 records in
30+0 records out
125829120 bytes (126 MB, 120 MiB) copied, 0.325919 s, 386 MB/s

So now we're getting somewhere :-). I hope readahead_expand
will be upstream soon.

I plan to use inode->i_size and offset to decide how much
expansion is needed on each call to orangefs_readahead,
I hope looking at i_size isn't one of those race condition
things I'm always screwing up on.

If y'all think the orangefs_readahead below is an OK starting
point, I'll add in the i_size/offset logic so I can get
full-sized orangefs gulps of readahead all the way up to
the last odd-sized fragment of the file, and run
xfstests on it to see if it still seems like it is doing right.

One day, when it is possible, I wish I could figure out how to use
huge pages or something; copying 1024 pages at a time out of the orangefs
internal buffer into the page cache is probably slower than if
I could figure out a way to copy 4194304 bytes out of our buffer
into the page cache at once...

Matthew>> but given that we're talking about doing I/O, probably
Matthew>> not enough to care about.

With orangefs, that's almost ALL we care about.

Thanks for your help!

-Mike

static void orangefs_readahead(struct readahead_control *rac)
{
        unsigned int npages;
        loff_t offset;
        struct iov_iter iter;
        struct file *file = rac->file;
        struct inode *inode = file->f_mapping->host;
        struct xarray *i_pages;
        pgoff_t index;
        struct page *page;
        int ret;
        loff_t new_start = readahead_index(rac) * PAGE_SIZE;
        size_t new_len = 524288;

        readahead_expand(rac, new_start, new_len);

        npages = readahead_count(rac);
        offset = readahead_pos(rac);
        i_pages = &file->f_mapping->i_pages;

        iov_iter_xarray(&iter, READ, i_pages, offset, npages * PAGE_SIZE);

        /* read in the pages. */
        ret = wait_for_direct_io(ORANGEFS_IO_READ, inode, &offset, &iter,
                                 npages * PAGE_SIZE, inode->i_size, NULL,
                                 NULL, file);

        /* clean up. */
        while ((page = readahead_page(rac))) {
                page_endio(page, false, 0);
                put_page(page);
        }
}

Matthew Wilcox March 27, 2021, 3:56 p.m. UTC | #11
On Sat, Mar 27, 2021 at 11:40:08AM -0400, Mike Marshall wrote:
> 	int ret;
> 
> 	loff_t new_start = readahead_index(rac) * PAGE_SIZE;

That looks like readahead_pos() to me.

> 	size_t new_len = 524288;
> 	readahead_expand(rac, new_start, new_len);
> 
> 	npages = readahead_count(rac);
> 	offset = readahead_pos(rac);
> 	i_pages = &file->f_mapping->i_pages;
> 
> 	iov_iter_xarray(&iter, READ, i_pages, offset, npages * PAGE_SIZE);

readahead_length()?

> 	/* read in the pages. */
> 	ret = wait_for_direct_io(ORANGEFS_IO_READ, inode, &offset, &iter,
> 				 npages * PAGE_SIZE, inode->i_size, NULL,
> 				 NULL, file);
> 
> 	/* clean up. */
> 	while ((page = readahead_page(rac))) {
> 		page_endio(page, false, 0);
> 		put_page(page);
> 	}
> }

What if wait_for_direct_io() returns an error?  Shouldn't you be calling

	page_endio(page, false, ret)

?

Mike Marshall March 28, 2021, 3:04 a.m. UTC | #12
This seems OK... ?

static void orangefs_readahead(struct readahead_control *rac)
{
        loff_t offset;
        struct iov_iter iter;
        struct file *file = rac->file;
        struct inode *inode = file->f_mapping->host;
        struct xarray *i_pages;
        struct page *page;
        loff_t new_start = readahead_pos(rac);
        int ret;
        size_t new_len = 0;

        loff_t bytes_remaining = inode->i_size - readahead_pos(rac);
        loff_t pages_remaining = bytes_remaining / PAGE_SIZE;

        if (pages_remaining >= 1024)
                new_len = 4194304;
        else if (pages_remaining > readahead_count(rac))
                new_len = bytes_remaining;

        if (new_len)
                readahead_expand(rac, new_start, new_len);

        offset = readahead_pos(rac);
        i_pages = &file->f_mapping->i_pages;

        iov_iter_xarray(&iter, READ, i_pages, offset, readahead_length(rac));

        /* read in the pages. */
        if ((ret = wait_for_direct_io(ORANGEFS_IO_READ, inode,
                        &offset, &iter, readahead_length(rac),
                        inode->i_size, NULL, NULL, file)) < 0)
                gossip_debug(GOSSIP_FILE_DEBUG,
                             "%s: wait_for_direct_io failed. \n", __func__);
        else
                ret = 0;

        /* clean up. */
        while ((page = readahead_page(rac))) {
                page_endio(page, false, ret);
                put_page(page);
        }
}

I need to go remember how to git send-email through the
kernel.org email server; I apologize for the way gmail
unformats my code, even in plain text mode...

-Mike

Mike Marshall March 29, 2021, 1:51 a.m. UTC | #13
I have Linux 5.12-rc4.

On top of that I have "David's patches", an 85-patch
series I peeled off of David Howells' fscache-iter
branch on git.kernel.org a few days ago. These
patches include what I need to use
readahead_expand.

On top of that is my orangefs_readahead patch.

I have run through the xfstests I run, and am
failing generic 75 112 127 & 263, which I
normally pass.

So I got rid of my orangefs_readahead patch.
Still failing 75 112 127 & 263.

Then I got rid of David's patches, so I'm at
generic Linux 5.12-rc4, and am no longer
failing those tests.

Just thought I should mention it... other
than that, I'm real happy with the
orangefs_readahead patch; it is a
giant improvement. Y'all's help
made all the difference...

-Mike

David Howells March 29, 2021, 9:37 a.m. UTC | #14
Mike Marshall <hubcap@omnibond.com> wrote:

> Then I got rid of David's patches, I'm at
> generic Linux 5.12-rc4, and am no longer
> failing those tests.

Um - which patches?

David
Mike Marshall March 29, 2021, 11:25 p.m. UTC | #15
David>> Um - which patches?

fscache-iter from
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git

When the top commit was
cachefiles: Don't shorten if over 1G: ff60d1fc9c7cb93d53d1d081f65490ff4ab2f122

followed by a bunch of commits up to
iov_iter: Add ITER_XARRAY: 153907f0e36425bdefd50182ca9733d8bc8e716a

And after that it was Linus' 5.12-rc2 tree...
Linux 5.12-rc2: a38fd8748464831584a19438cbb3082b5a2dab15

You were posting patches from there on fs-devel, and I was
reading your messages and knew what I needed was included there.
When you asked me to try iov_iter_xarray I looked for it
in your git tree instead of peeling it out of my
gmail. Perhaps I got more stuff than you intended...

I did    git format-patch a38fd874..ff60d1fc
and added that to my Linux 5.12-rc4 tree to make my orangefs_readahead
patch that uses readahead_expand.

-Mike

David Howells April 1, 2021, 1:42 p.m. UTC | #16
Mike Marshall <hubcap@omnibond.com> wrote:

> I did    git format-patch a38fd874..ff60d1fc
> and added that to my Linux 5.12-rc4 tree to make my orangefs_readahead
> patch that uses readahead_expand.

You're using the readahead_expand patch and the iov_iter_xarray patches?

Do you have a public branch I can look at?

David
Mike Marshall April 8, 2021, 8:39 p.m. UTC | #17
Hi David... I've been gone on a motorcycle adventure,
sorry for the delay... here's my public branch...

https://github.com/hubcapsc/linux/tree/readahead_v3

-Mike

David Howells April 13, 2021, 3:08 p.m. UTC | #18
Mike Marshall <hubcap@omnibond.com> wrote:

> Hi David... I've been gone on a motorcycle adventure,
> sorry for the delay... here's my public branch...
> 
> https://github.com/hubcapsc/linux/tree/readahead_v3

That seems to have all of my fscache-iter branch in it.  I thought you'd said
you'd dropped them because they were causing problems.

Anyway, I've distilled the basic netfs lib patches, including the readahead
expansion patch and ITER_XARRAY patch here:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-lib

if that's of use to you?

If you're using any of these patches, would it be possible to get a Tested-by
for them that I can add?

David
Mike Marshall April 16, 2021, 2:36 p.m. UTC | #19
Hi David...

I got your netfs-lib branch as of Apr 15.

I added my orangefs_readahead on top of it and ran through
xfstests. I failed generic 75, 112, 127, & 263, which I
don't usually fail.

I took off my orangefs_readahead patch and ran xfstests with
your untouched netfs-lib branch. No regressions.

I git-reset back to 5.12.0-rc4 (I think your netfs-lib branch is based
on a linus-tree-of-the-day?) and ran xfstests... no regressions.

So... I think all your stuff is working well from my perspective
and that I need to figure out why my orangefs_readahead patch
is causing the regressions I listed. My readahead implementation (via your
readahead_expand) is really fast, but it is bare-bones... I'm probably
leaving out some important stuff... I see other filesystems'
readahead implementations doing stuff like avoiding readahead
on pages that have yet to be written back, for example.

The top two commits at https://github.com/hubcapsc/linux/tree/readahead_v3
are the current state of my readahead implementation.

Please do add
Tested-by: Mike Marshall <hubcap@omnibond.com>

-Mike

Matthew Wilcox April 16, 2021, 3:14 p.m. UTC | #20
On Fri, Apr 16, 2021 at 10:36:52AM -0400, Mike Marshall wrote:
> So... I think all your stuff is working well from my perspective
> and that I need to figure out why my orangefs_readahead patch
> is causing the regressions I listed. My readahead implementation (via your
> readahead_expand) is really fast, but it is bare-bones... I'm probably
> leaving out some important stuff... I see other filesystem's
> readahead implementations doing stuff like avoiding doing readahead
> on pages that have yet to be written back for example.

You do?!  Actual readahead implementations, or people still implementing
the old ->readpages() method?  The ->readahead() method is _only_ called
for pages which are freshly allocated, Locked and !Uptodate.  If you ever
see a page which is Dirty or Writeback, something has gone very wrong.
Could you tell me which filesystem you saw that bogosity in?

David Howells April 16, 2021, 3:41 p.m. UTC | #21
In orangefs_readahead():

	loff_t bytes_remaining = inode->i_size - readahead_pos(rac);
	loff_t pages_remaining = bytes_remaining / PAGE_SIZE;

What happens if bytes_remaining < PAGE_SIZE?  Or even what happens if
bytes_remaining % PAGE_SIZE != 0?

	if ((ret = wait_for_direct_io(ORANGEFS_IO_READ, inode,
			&offset, &iter, readahead_length(rac),
			inode->i_size, NULL, NULL, file)) < 0)

I wonder if you should use iov_length(&iter) rather than
readahead_length(rac).  They *should* be the same.

Also, should you cache inode->i_size lest it change under you due to truncate?

David
Mike Marshall April 25, 2021, 1:43 a.m. UTC | #22
>> What happens if bytes_remaining < PAGE_SIZE?

I think on a call where that occurs, new_len won't get set
and readahead_expand won't get called. I don't see how that's
not correct, but I question me more than I question you :-) ...

>> what happens if bytes_remaining % PAGE_SIZE != 0

I think bytes_remaining % PAGE_SIZE worth of bytes won't get read on
that call, but that the readahead callout keeps getting called until all
the bytes are read? After you asked this, I thought about adding 1 to
new_len in such cases, and did some tests that way; it seems to me like it
works out as is.
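
If it did need handling, a rough, untested sketch might round the
page count up so a partial tail page is still covered, and sample
i_size just once while at it (which also speaks to the truncate
question below):

	loff_t i_size = i_size_read(inode);	/* sample the size once */
	loff_t bytes_remaining = i_size - readahead_pos(rac);
	/* round up so a partial tail page still gets read */
	loff_t pages_remaining = DIV_ROUND_UP(bytes_remaining, PAGE_SIZE);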

>> I wonder if you should use iov_length(&iter)

iov_length has two arguments. The first one would maybe be iter.iov and
the second one would be... ?

>> should you cache inode->i_size lest it change under you due to truncate

That seems important, but I can't return an error from the void
readahead callout. Would I react by somehow returning the pages
back to their original condition, or ?

Anywho... I see that you've force pushed a new netfs... I think you
have it applied to a linus-tree-of-the-day on top of 5.12-rc4?
I have taken these patches from
git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git (netfs-lib)

0001-iov_iter-Add-ITER_XARRAY.patch
0002-mm-Add-set-end-wait-functions-for-PG_private_2.patch
0003-mm-filemap-Pass-the-file_ra_state-in-the-ractl.patch
0004-fs-Document-file_ra_state.patch
0005-mm-readahead-Handle-ractl-nr_pages-being-modified.patch
0006-mm-Implement-readahead_control-pageset-expansion.patch
0007-netfs-Make-a-netfs-helper-module.patch
0008-netfs-Documentation-for-helper-library.patch
0009-netfs-mm-Move-PG_fscache-helper-funcs-to-linux-netfs.patch
0010-netfs-mm-Add-set-end-wait_on_page_fscache-aliases.patch
0011-netfs-Provide-readahead-and-readpage-netfs-helpers.patch
0012-netfs-Add-tracepoints.patch
0013-netfs-Gather-stats.patch
0014-netfs-Add-write_begin-helper.patch
0015-netfs-Define-an-interface-to-talk-to-a-cache.patch
0016-netfs-Add-a-tracepoint-to-log-failures-that-would-be.patch
0017-fscache-cachefiles-Add-alternate-API-to-use-kiocb-fo.patch

... and added them on top of Linux 5.12-rc8 and added my
readahead patch to that.

Now I fail one extra xfstest: generic/075, generic/112,
generic/127, generic/263 and generic/438. I haven't found an
obvious (to me) problem with my patch and I can't claim to understand
everything that is going on in the patches I have of yours...

I'll keep looking...

-Mike

Mike Marshall April 25, 2021, 1:51 a.m. UTC | #23
>>You do?!  Actual readahead implementations, or
>>people still implementing the old ->readpages() method?

No :-) I grabbed that as an example off the top of
my head of the kind of thing I saw while reading
readahead code, but that I didn't try to handle
in my simple implementation of readahead. I'm
guessing that since I have some xfstest regressions
maybe my implementation overlooks one or
more important details...

-Mike

David Howells April 25, 2021, 7:49 a.m. UTC | #24
Mike Marshall <hubcap@omnibond.com> wrote:

> >> I wonder if you should use iov_length(&iter)
> 
> iov_length has two arguments. The first one would maybe be iter.iov and
> the second one would be... ?

Sorry, I meant iov_iter_count(&iter).

I'll look at the other things next week.  Is it easy to set up an orangefs
client and server?

David
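Spelled out against the hunk quoted earlier, that substitution
would read (iov_iter_count() takes the iterator itself and returns
the bytes remaining in it; everything else is unchanged):

        if ((ret = wait_for_direct_io(ORANGEFS_IO_READ, inode,
                        &offset, &iter, iov_iter_count(&iter),
                        inode->i_size, NULL, NULL, file)) < 0)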
David Howells April 26, 2021, 8:37 a.m. UTC | #25
Mike Marshall <hubcap@omnibond.com> wrote:

> Anywho... I see that you've force-pushed a new netfs... I think you
> have it applied to a linus-tree-of-the-day on top of 5.12-rc4?
> I have taken these patches from
> git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git (netfs-lib)
> 
> 0001-iov_iter-Add-ITER_XARRAY.patch
> 0002-mm-Add-set-end-wait-functions-for-PG_private_2.patch
> 0003-mm-filemap-Pass-the-file_ra_state-in-the-ractl.patch
> 0004-fs-Document-file_ra_state.patch
> 0005-mm-readahead-Handle-ractl-nr_pages-being-modified.patch
> 0006-mm-Implement-readahead_control-pageset-expansion.patch
> 0007-netfs-Make-a-netfs-helper-module.patch
> 0008-netfs-Documentation-for-helper-library.patch
> 0009-netfs-mm-Move-PG_fscache-helper-funcs-to-linux-netfs.patch
> 0010-netfs-mm-Add-set-end-wait_on_page_fscache-aliases.patch
> 0011-netfs-Provide-readahead-and-readpage-netfs-helpers.patch
> 0012-netfs-Add-tracepoints.patch
> 0013-netfs-Gather-stats.patch
> 0014-netfs-Add-write_begin-helper.patch
> 0015-netfs-Define-an-interface-to-talk-to-a-cache.patch
> 0016-netfs-Add-a-tracepoint-to-log-failures-that-would-be.patch
> 0017-fscache-cachefiles-Add-alternate-API-to-use-kiocb-fo.patch

Can you add this patch also:

https://lore.kernel.org/r/3545034.1619392490@warthog.procyon.org.uk/
[PATCH] iov_iter: Four fixes for ITER_XARRAY

David
Mike Marshall April 26, 2021, 2:53 p.m. UTC | #26
>> Is it easy to set up an orangefs client and server?

I think it is easy to set up a test system on a single VM,
but then, I do it all the time. I souped up the build details
in Documentation/filesystems/orangefs.rst not too long
ago; I hope it is useful.

Your VM would need to be a "developer" VM,
with all the autotools stuff and such, if you build
from source. I also worked on the configure stuff
so that you would learn about any packages you
lack at configure time; I hope that is also still good.

I read your message about trying again with the
"Four fixes for ITER_XARRAY" patch; I'll report
on how that goes...

-Mike

On Sun, Apr 25, 2021 at 3:49 AM David Howells <dhowells@redhat.com> wrote:
>
> Mike Marshall <hubcap@omnibond.com> wrote:
>
> > >> I wonder if you should use iov_length(&iter)
> >
> > iov_length has two arguments. The first one would maybe be iter.iov and
> > the second one would be... ?
>
> Sorry, I meant iov_iter_count(&iter).
>
> I'll look at the other things next week.  Is it easy to set up an orangefs
> client and server?
>
> David
>
Mike Marshall April 26, 2021, 7:01 p.m. UTC | #27
I added the "Four fixes for ITER_XARRAY" patch and got things
running again.

I ran the tests I had regressions on by themselves, and I still fail
generic/075, generic/112, generic/127 and generic/263.

generic/438 passes now.

I was analyzing what test 075 was doing when it was failing, and I
found it to be doing this:

/home/hubcap/xfstests-dev/ltp/fsx -d -N 1000 -S 0 -P /home/hubcap/xfstests-dev /pvfsmnt/whatever

The above used to fail every time... now it works every time.

Progress :-).

I'm about to launch the whole suite of tests; I'll report back
on what happens later...

-Mike

On Mon, Apr 26, 2021 at 10:53 AM Mike Marshall <hubcap@omnibond.com> wrote:
>
> >> Is it easy to set up an orangefs client and server?
>
> I think it is easy to set up a test system on a single VM,
> but then, I do it all the time. I souped up the build details
> in Documentation/filesystems/orangefs.rst not too long
> ago; I hope it is useful.
>
> Your VM would need to be a "developer" VM,
> with all the autotools stuff and such, if you build
> from source. I also worked on the configure stuff
> so that you would learn about any packages you
> lack at configure time; I hope that is also still good.
>
> I read your message about trying again with the
> "Four fixes for ITER_XARRAY" patch; I'll report
> on how that goes...
>
> -Mike
>
> On Sun, Apr 25, 2021 at 3:49 AM David Howells <dhowells@redhat.com> wrote:
> >
> > Mike Marshall <hubcap@omnibond.com> wrote:
> >
> > > >> I wonder if you should use iov_length(&iter)
> > >
> > > iov_length has two arguments. The first one would maybe be iter.iov and
> > > the second one would be... ?
> >
> > Sorry, I meant iov_iter_count(&iter).
> >
> > I'll look at the other things next week.  Is it easy to set up an orangefs
> > client and server?
> >
> > David
> >
David Howells April 26, 2021, 8:01 p.m. UTC | #28
Mike Marshall <hubcap@omnibond.com> wrote:

> Progress :-).

Yay! :-)

David
diff mbox series

Patch

diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index 48f0547d4850..682a968cb82a 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -244,6 +244,25 @@ static int orangefs_writepages(struct address_space *mapping,

 static int orangefs_launder_page(struct page *);

+/*
+ * Prefill the page cache with some pages that we're probably
+ * about to need...
+ */
+static void orangefs_readahead(struct readahead_control *rac)
+{
+       pgoff_t index = readahead_index(rac);
+       struct page *page;
+
+       while ((page = readahead_page(rac))) {
+               prefetchw(&page->flags);
+               put_page(page);
+               unlock_page(page);
+               index++;
+       }
+
+       return;
+}
+
 static int orangefs_readpage(struct file *file, struct page *page)
 {
        struct inode *inode = page->mapping->host;
@@ -260,11 +279,16 @@ static int orangefs_readpage(struct file *file, struct page *page)
        int remaining;

        /*
-        * Get up to this many bytes from Orangefs at a time and try
-        * to fill them into the page cache at once. Tests with dd made
-        * this seem like a reasonable static number, if there was
-        * interest perhaps this number could be made setable through
-        * sysfs...
+        * Orangefs isn't a good fit for reading files one page at
+        * a time. Get up to "read_size" bytes from Orangefs at a time and
+        * try to fill them into the page cache at once. Readahead code in
+        * mm already got us some extra pages by calling orangefs_readahead,
+        * but it doesn't know how many we actually wanted, so we'll
+        * get some more after we use up the extra ones we got from
+        * orangefs_readahead. Tests with dd made "read_size" seem
+        * like a reasonable static number of bytes to get from orangefs;
+        * if there was interest, perhaps "read_size" could be made
+        * settable through sysfs or something...
         */
        read_size = 524288;

@@ -302,31 +326,19 @@ static int orangefs_readpage(struct file *file, struct page *page)
                slot_index = 0;
                while ((remaining - PAGE_SIZE) >= PAGE_SIZE) {
                        remaining -= PAGE_SIZE;
-                       /*
-                        * It is an optimization to try and fill more than one
-                        * page... by now we've already gotten the single
-                        * page we were after, if stuff doesn't seem to
-                        * be going our way at this point just return
-                        * and hope for the best.
-                        *
-                        * If we look for pages and they're already there is
-                        * one reason to give up, and if they're not there
-                        * and we can't create them is another reason.
-                        */

                        index++;
                        slot_index++;
-                       next_page = find_get_page(inode->i_mapping, index);
+                       next_page = find_lock_page(inode->i_mapping, index);
                        if (next_page) {
-                               gossip_debug(GOSSIP_FILE_DEBUG,
-                                       "%s: found next page, quitting\n",
-                                       __func__);
-                               put_page(next_page);
-                               goto out;
+                               printk("%s: found readahead page\n", __func__);
+                       } else {
+                               next_page =
+                                       find_or_create_page(inode->i_mapping,
+                                                               index,
+                                                               GFP_KERNEL);
+                               printk("%s: alloced my own page\n", __func__);
                        }
-                       next_page = find_or_create_page(inode->i_mapping,
-                                                       index,
-                                                       GFP_KERNEL);
                        /*
                         * I've never hit this, leave it as a printk for
                         * now so it will be obvious.
@@ -659,6 +671,7 @@ static ssize_t orangefs_direct_IO(struct kiocb *iocb,
 /** ORANGEFS2 implementation of address space operations */
 static const struct address_space_operations orangefs_address_operations = {
        .writepage = orangefs_writepage,
+       .readahead = orangefs_readahead,
        .readpage = orangefs_readpage,
        .writepages = orangefs_writepages,
        .set_page_dirty = __set_page_dirty_nobuffers,
diff --git a/fs/orangefs/orangefs-mod.c b/fs/orangefs/orangefs-mod.c
index 74a3d6337ef4..cd7297815f91 100644
--- a/fs/orangefs/orangefs-mod.c
+++ b/fs/orangefs/orangefs-mod.c
@@ -31,7 +31,7 @@ static ulong module_parm_debug_mask;
 __u64 orangefs_gossip_debug_mask;
 int op_timeout_secs = ORANGEFS_DEFAULT_OP_TIMEOUT_SECS;
 int slot_timeout_secs = ORANGEFS_DEFAULT_SLOT_TIMEOUT_SECS;
-int orangefs_cache_timeout_msecs = 50;
+int orangefs_cache_timeout_msecs = 500;
 int orangefs_dcache_timeout_msecs = 50;
 int orangefs_getattr_timeout_msecs = 50;