
[v6,05/13] NFS: Improve algorithm for falling back to uncached readdir

Message ID 20220221160851.15508-6-trondmy@kernel.org (mailing list archive)
State New, archived
Series Readdir improvements

Commit Message

Trond Myklebust Feb. 21, 2022, 4:08 p.m. UTC
From: Trond Myklebust <trond.myklebust@hammerspace.com>

When reading a very large directory, we want to try to keep the page
cache up to date if doing so is inexpensive. Right now, we will try to
refill the page cache if it is non-empty, irrespective of whether or not
doing so is going to take a long time.

Replace that algorithm with something that looks at how many times we've
refilled the page cache without seeing a cache hit.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c           | 51 +++++++++++++++++++++---------------------
 include/linux/nfs_fs.h |  1 +
 2 files changed, 27 insertions(+), 25 deletions(-)
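The heuristic the commit message describes — count consecutive page-cache fills that happen without a cache hit, and give up on the page cache once that count passes a threshold — can be sketched as a small userspace model. This is illustrative only, not the kernel code from the patch: the struct, the function names, and the exact counter placement are mine; the threshold of 5 comes from the discussion below.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace sketch of the fallback heuristic -- NOT the actual patch.
 * The real counter lives in NFS per-open-directory state; these names
 * are hypothetical.
 */
#define READDIR_CACHE_MISS_THRESHOLD 5	/* "5 fills" per the thread */

struct readdir_ctx {
	unsigned int page_cache_misses;	/* fills done without a cache hit */
	bool force_uncached;		/* fall back to uncached readdir */
};

/* Called when the reader finds the page it wants already in the cache. */
static void readdir_note_cache_hit(struct readdir_ctx *ctx)
{
	ctx->page_cache_misses = 0;	/* the cache is paying off: reset */
}

/* Called when the reader has to fill a missing page itself. */
static void readdir_note_cache_miss(struct readdir_ctx *ctx)
{
	if (++ctx->page_cache_misses >= READDIR_CACHE_MISS_THRESHOLD)
		ctx->force_uncached = true;	/* stop refilling the cache */
}
```

The point of resetting on a hit is that a reader which keeps benefiting from pages someone else filled never falls back; only a reader doing all the filling itself, with nothing to show for it, switches to uncached mode.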

Comments

Benjamin Coddington Feb. 21, 2022, 4:45 p.m. UTC | #1
On 21 Feb 2022, at 11:08, trondmy@kernel.org wrote:

> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>
> When reading a very large directory, we want to try to keep the page
> cache up to date if doing so is inexpensive. Right now, we will try to
> refill the page cache if it is non-empty, irrespective of whether or not
> doing so is going to take a long time.
>
> Replace that algorithm with something that looks at how many times we've
> refilled the page cache without seeing a cache hit.

Hi Trond, I've been following your work here - thanks for it.

I'm wondering if there might be a regression on this patch for the case
where two or more directory readers are part way through a large directory
when the pagecache is truncated.  If I'm reading this correctly, those
readers will stop caching after 5 fills and finish the remainder of their
directory reads in the uncached mode.

Isn't there an OP amplification per reader in this case?

Ben
Trond Myklebust Feb. 21, 2022, 7:58 p.m. UTC | #2
On Mon, 2022-02-21 at 11:45 -0500, Benjamin Coddington wrote:
> On 21 Feb 2022, at 11:08, trondmy@kernel.org wrote:
> 
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > 
> > When reading a very large directory, we want to try to keep the
> > page
> > cache up to date if doing so is inexpensive. Right now, we will try
> > to
> > refill the page cache if it is non-empty, irrespective of whether
> > or not
> > doing so is going to take a long time.
> > 
> > Replace that algorithm with something that looks at how many times
> > we've
> > refilled the page cache without seeing a cache hit.
> 
> Hi Trond, I've been following your work here - thanks for it.
> 
> I'm wondering if there might be a regression on this patch for the
> case
> where two or more directory readers are part way through a large
> directory
> when the pagecache is truncated.  If I'm reading this correctly,
> those
> readers will stop caching after 5 fills and finish the remainder of
> their
> directory reads in the uncached mode.
> 
> Isn't there an OP amplification per reader in this case?
> 

Depends... In the old case, we basically stopped doing uncached readdir
if a third process starts filling the page cache again. In particular,
this means we were vulnerable to restarting over and over once page
reclaim starts to kick in for very large directories.

In this new one, we have each process give it a try (5 fills each), and
then fall back to uncached. Yes, there will be corner cases where this
will perform less well than the old algorithm, but it should also be
more deterministic.

I am open to suggestions for better ways to determine when to cut over
to uncached readdir. This is one way that I think is better than what
we have; however, I'm sure it can be improved upon.
Benjamin Coddington Feb. 21, 2022, 8:22 p.m. UTC | #3
On 21 Feb 2022, at 14:58, Trond Myklebust wrote:

> On Mon, 2022-02-21 at 11:45 -0500, Benjamin Coddington wrote:
>> On 21 Feb 2022, at 11:08, trondmy@kernel.org wrote:
>>
>>> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>>>
>>> When reading a very large directory, we want to try to keep the
>>> page
>>> cache up to date if doing so is inexpensive. Right now, we will try
>>> to
>>> refill the page cache if it is non-empty, irrespective of whether
>>> or not
>>> doing so is going to take a long time.
>>>
>>> Replace that algorithm with something that looks at how many times
>>> we've
>>> refilled the page cache without seeing a cache hit.
>>
>> Hi Trond, I've been following your work here - thanks for it.
>>
>> I'm wondering if there might be a regression on this patch for the
>> case
>> where two or more directory readers are part way through a large
>> directory
>> when the pagecache is truncated.  If I'm reading this correctly,
>> those
>> readers will stop caching after 5 fills and finish the remainder of
>> their
>> directory reads in the uncached mode.
>>
>> Isn't there an OP amplification per reader in this case?
>>
>
> Depends... In the old case, we basically stopped doing uncached readdir
> if a third process starts filling the page cache again. In particular,
> this means we were vulnerable to restarting over and over once page
> reclaim starts to kick in for very large directories.
>
> In this new one, we have each process give it a try (5 fills each), and
> then fallback to uncached. Yes, there will be corner cases where this
> will perform less well than the old algorithm, but it should also be
> more deterministic.
>
> I am open to suggestions for better ways to determine when to cut over
> to uncached readdir. This is one way, that I think is better than what
> we have, however I'm sure it can be improved upon.

I still have old patches that allow each page to be "versioned" with the
change attribute, page_index, and cookie.  This allows the page cache to be
culled page-by-page, and multiple fillers can continue to fill pages at
"headless" page offsets that match their original cookie and page_index
pair.  This change would mean readers don't have to start over filling the
page cache when the cache is dropped, so we wouldn't need to worry about
when to cut over to the uncached mode - it makes the problem go away.

I felt there wasn't much interest in this work, and our most vocal customer
was happy enough with last winter's readdir improvements (thanks!) that I
didn't follow up, but I can refresh those patches and send them along again.

Ben
Trond Myklebust Feb. 21, 2022, 8:55 p.m. UTC | #4
On Mon, 2022-02-21 at 15:22 -0500, Benjamin Coddington wrote:
> On 21 Feb 2022, at 14:58, Trond Myklebust wrote:
> 
> > On Mon, 2022-02-21 at 11:45 -0500, Benjamin Coddington wrote:
> > > On 21 Feb 2022, at 11:08, trondmy@kernel.org wrote:
> > > 
> > > > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > > > 
> > > > When reading a very large directory, we want to try to keep the
> > > > page
> > > > cache up to date if doing so is inexpensive. Right now, we will
> > > > try
> > > > to
> > > > refill the page cache if it is non-empty, irrespective of
> > > > whether
> > > > or not
> > > > doing so is going to take a long time.
> > > > 
> > > > Replace that algorithm with something that looks at how many
> > > > times
> > > > we've
> > > > refilled the page cache without seeing a cache hit.
> > > 
> > > Hi Trond, I've been following your work here - thanks for it.
> > > 
> > > I'm wondering if there might be a regression on this patch for
> > > the
> > > case
> > > where two or more directory readers are part way through a large
> > > directory
> > > when the pagecache is truncated.  If I'm reading this correctly,
> > > those
> > > readers will stop caching after 5 fills and finish the remainder
> > > of
> > > their
> > > directory reads in the uncached mode.
> > > 
> > > Isn't there an OP amplification per reader in this case?
> > > 
> > 
> > Depends... In the old case, we basically stopped doing uncached
> > readdir
> > if a third process starts filling the page cache again. In
> > particular,
> > this means we were vulnerable to restarting over and over once page
> > reclaim starts to kick in for very large directories.
> > 
> > In this new one, we have each process give it a try (5 fills each),
> > and
> > then fallback to uncached. Yes, there will be corner cases where
> > this
> > will perform less well than the old algorithm, but it should also
> > be
> > more deterministic.
> > 
> > I am open to suggestions for better ways to determine when to cut
> > over
> > to uncached readdir. This is one way, that I think is better than
> > what
> > we have, however I'm sure it can be improved upon.
> 
> I still have old patches that allow each page to be "versioned" with
> the
> change attribute, page_index, and cookie.  This allows the page cache
> to be
> culled page-by-page, and multiple fillers can continue to fill pages
> at
> "headless" page offsets that match their original cookie and
> page_index
> pair.  This change would mean readers don't have to start over
> filling the
> page cache when the cache is dropped, so we wouldn't need to worry
> about
> when to cut over to the uncached mode - it makes the problem go away.
> 
> I felt there wasn't much interest in this work, and our most vocal
> customer
> was happy enough with last winter's readdir improvements (thanks!)
> that I
> didn't follow up, but I can refresh those patches and send them along
> again.
> 

We will always need the ability to cut over to uncached readdir.

If the cookie is no longer returned by the server because one or more
files were deleted, then we need to resolve the situation somehow (IOW:
the 'rm *' case). The new algorithm _does_ improve performance in those
situations, because it no longer requires us to read the entire
directory before switching over: we try 5 times, then fail over.
Benjamin Coddington Feb. 21, 2022, 9:10 p.m. UTC | #5
On 21 Feb 2022, at 15:55, Trond Myklebust wrote:
>
> We will always need the ability to cut over to uncached readdir.

Yes.

> If the cookie is no longer returned by the server because one or more
> files were deleted then we need to resolve the situation somehow (IOW:
> the 'rm *' case). The new algorithm _does_ improve performance on those
> situations, because it no longer requires us to read the entire
> directory before switching over: we try 5 times, then fail over.

Yes, using per-page validation doesn't remove the need for uncached readdir.
It does allow a reader to simply resume filling the cache where it left
off.  There's no need to try 5 times and fail over.  And there's no need to
make a trade-off and make the situation worse in certain scenarios.

I thought I'd point that out and make an offer to re-submit it.  Any
interest?

Ben
Trond Myklebust Feb. 21, 2022, 11:20 p.m. UTC | #6
On Mon, 2022-02-21 at 16:10 -0500, Benjamin Coddington wrote:
> On 21 Feb 2022, at 15:55, Trond Myklebust wrote:
> > 
> > We will always need the ability to cut over to uncached readdir.
> 
> Yes.
> 
> > If the cookie is no longer returned by the server because one or
> > more
> > files were deleted then we need to resolve the situation somehow
> > (IOW:
> > the 'rm *' case). The new algorithm _does_ improve performance on
> > those
> > situations, because it no longer requires us to read the entire
> > directory before switching over: we try 5 times, then fail over.
> 
> Yes, using per-page validation doesn't remove the need for uncached
> readdir.
> It does allow a reader to simply resume filling the cache where it
> left
> off.  There's no need to try 5 times and fail over.  And there's no
> need to
> make a trade-off and make the situation worse in certain scenarios.
> 
> I thought I'd point that out and make an offer to re-submit it.  Any
> interest?
> 

As I recall, I had concerns about that approach. Can you explain again
how it will work?

A few of the concerns I have revolve around telldir()/seekdir(). If the
platform is 32-bit, then we cannot use cookies as the telldir() output,
and so our only option is to use offsets into the page cache (this is
why this patch carves out an exception if desc->dir_cookie == 0). How
would that work with your scheme?

Even in the 64-bit case where we are able to use cookies for
telldir()/seekdir(), how do we determine an appropriate page index
after a seekdir()?
Benjamin Coddington Feb. 22, 2022, 12:50 p.m. UTC | #7
On 21 Feb 2022, at 18:20, Trond Myklebust wrote:

> On Mon, 2022-02-21 at 16:10 -0500, Benjamin Coddington wrote:
>> On 21 Feb 2022, at 15:55, Trond Myklebust wrote:
>>>
>>> We will always need the ability to cut over to uncached readdir.
>>
>> Yes.
>>
>>> If the cookie is no longer returned by the server because one or more
>>> files were deleted then we need to resolve the situation somehow (IOW:
>>> the 'rm *' case). The new algorithm _does_ improve performance on those
>>> situations, because it no longer requires us to read the entire
>>> directory before switching over: we try 5 times, then fail over.
>>
>> Yes, using per-page validation doesn't remove the need for uncached
>> readdir.  It does allow a reader to simply resume filling the cache where
>> it left off.  There's no need to try 5 times and fail over.  And there's
>> no need to make a trade-off and make the situation worse in certain
>> scenarios.
>>
>> I thought I'd point that out and make an offer to re-submit it.  Any
>> interest?
>>
>
> As I recall, I had concerns about that approach. Can you explain again
> how it will work?

Every page of readdir results has the directory's change attr stored on the
page.  That, along with the page's index and the first cookie is enough
information to determine if the page's data can be used by another process.

Which means that when the pagecache is dropped, fillers don't have to restart
filling the cache at page index 0, they can continue to fill at whatever
index they were at previously.  If another process finds a page that doesn't
match its page index, cookie, and the current directory's change attr, the
page is dropped and refilled from that process' indexing.
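The validation triple Ben describes (change attribute, page index, first cookie) can be modelled with a short userspace sketch. The struct and function below are hypothetical names of my own, not the actual patches — the real series stores this state differently — but they show the reuse check being discussed:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of per-page validation -- hypothetical userspace model.
 * Each readdir page is tagged at fill time with the directory's change
 * attribute, the page index the filler used, and the first cookie the
 * page holds.
 */
struct readdir_page_tag {
	uint64_t change_attr;	/* directory change attr at fill time */
	uint64_t index;		/* page index the filler used */
	uint64_t first_cookie;	/* cookie of the first entry on the page */
};

/*
 * A reader may reuse a cached page only if all three values match what
 * it expects; otherwise the page is dropped and refilled using the
 * reader's own indexing.
 */
static bool readdir_page_valid(const struct readdir_page_tag *tag,
			       uint64_t cur_change_attr,
			       uint64_t want_index, uint64_t want_cookie)
{
	return tag->change_attr == cur_change_attr &&
	       tag->index == want_index &&
	       tag->first_cookie == want_cookie;
}
```

Any directory modification bumps the change attribute, so a stale page fails the check for every reader; two readers walking the directory linearly from the start compute the same (index, cookie) pairs and can share pages.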

> A few of the concerns I have revolve around telldir()/seekdir(). If the
> platform is 32-bit, then we cannot use cookies as the telldir() output,
> and so our only option is to use offsets into the page cache (this is
> why this patch carves out an exception if desc->dir_cookie == 0). How
> would that work with your scheme?

For 32-bit seekdir, pages are filled starting at index 0.  This is very
unlikely to match other readers (unless they also do the _same_ seekdir).

> Even in the 64-bit case where we are able to use cookies for
> telldir()/seekdir(), how do we determine an appropriate page index
> after a seekdir()?

We don't.  Instead we start filling at index 0.  Again, the pagecache will
only be useful to other processes that have done the same llseek.

This approach optimizes the pagecache for processes that are doing similar
work, and has the added benefit of scaling well for large directory listings
under memory pressure.  Also a number of classes of directory modifications
(such as renames, or insertions/deletions at locations a reader has moved
beyond) are no longer a reason to re-fill the pagecache from scratch.

Ben
Trond Myklebust Feb. 22, 2022, 8:11 p.m. UTC | #8
On Tue, 2022-02-22 at 07:50 -0500, Benjamin Coddington wrote:
> On 21 Feb 2022, at 18:20, Trond Myklebust wrote:
> 
> > On Mon, 2022-02-21 at 16:10 -0500, Benjamin Coddington wrote:
> > > On 21 Feb 2022, at 15:55, Trond Myklebust wrote:
> > > > 
> > > > We will always need the ability to cut over to uncached
> > > > readdir.
> > > 
> > > Yes.
> > > 
> > > > If the cookie is no longer returned by the server because one
> > > > or more
> > > > files were deleted then we need to resolve the situation
> > > > somehow (IOW:
> > > > the 'rm *' case). The new algorithm _does_ improve performance
> > > > on those
> > > > situations, because it no longer requires us to read the entire
> > > > directory before switching over: we try 5 times, then fail
> > > > over.
> > > 
> > > Yes, using per-page validation doesn't remove the need for
> > > uncached
> > > readdir.  It does allow a reader to simply resume filling the
> > > cache where
> > > it left off.  There's no need to try 5 times and fail over.  And
> > > there's
> > > no need to make a trade-off and make the situation worse in
> > > certain
> > > scenarios.
> > > 
> > > I thought I'd point that out and make an offer to re-submit it. 
> > > Any
> > > interest?
> > > 
> > 
> > As I recall, I had concerns about that approach. Can you explain
> > again
> > how it will work?
> 
> Every page of readdir results has the directory's change attr stored
> on the
> page.  That, along with the page's index and the first cookie is
> enough
> information to determine if the page's data can be used by another
> process.
> 
> Which means that when the pagecache is dropped, fillers don't have to
> restart
> filling the cache at page index 0, they can continue to fill at
> whatever
> index they were at previously.  If another process finds a page that
> doesn't
> match its page index, cookie, and the current directory's change
> attr, the
> page is dropped and refilled from that process' indexing.
> 
> > A few of the concerns I have revolve around telldir()/seekdir(). If
> > the
> > platform is 32-bit, then we cannot use cookies as the telldir()
> > output,
> > and so our only option is to use offsets into the page cache (this
> > is
> > why this patch carves out an exception if desc->dir_cookie == 0).
> > How
> > would that work with your scheme?
> 
> For 32-bit seekdir, pages are filled starting at index 0.  This is
> very
> unlikely to match other readers (unless they also do the _same_
> seekdir).
> 
> > Even in the 64-bit case where we are able to use cookies for
> > telldir()/seekdir(), how do we determine an appropriate page index
> > after a seekdir()?
> 
> We don't.  Instead we start filling at index 0.  Again, the pagecache
> will
> only be useful to other processes that have done the same llseek.
> 
> This approach optimizes the pagecache for processes that are doing
> similar
> work, and has the added benefit of scaling well for large directory
> listings
> under memory pressure.  Also a number of classes of directory
> modifications
> (such as renames, or insertions/deletions at locations a reader has
> moved
> beyond) are no longer a reason to re-fill the pagecache from scratch.
> 

OK, you've got me more or less sold on it.

I'd still like to figure out how to improve the performance for seekdir
(since I do have an interest in re-exporting NFS) but I've been playing
around with a couple of patches that implement your concept and they do
seem to work well for the common case of a linear read through the
directory.
Benjamin Coddington Feb. 22, 2022, 8:21 p.m. UTC | #9
On 22 Feb 2022, at 15:11, Trond Myklebust wrote:

> On Tue, 2022-02-22 at 07:50 -0500, Benjamin Coddington wrote:
>> On 21 Feb 2022, at 18:20, Trond Myklebust wrote:
>>
>>> On Mon, 2022-02-21 at 16:10 -0500, Benjamin Coddington wrote:
>>>> On 21 Feb 2022, at 15:55, Trond Myklebust wrote:
>>>>>
>>>>> We will always need the ability to cut over to uncached
>>>>> readdir.
>>>>
>>>> Yes.
>>>>
>>>>> If the cookie is no longer returned by the server because one
>>>>> or more
>>>>> files were deleted then we need to resolve the situation
>>>>> somehow (IOW:
>>>>> the 'rm *' case). The new algorithm _does_ improve performance
>>>>> on those
>>>>> situations, because it no longer requires us to read the entire
>>>>> directory before switching over: we try 5 times, then fail
>>>>> over.
>>>>
>>>> Yes, using per-page validation doesn't remove the need for
>>>> uncached
>>>> readdir.  It does allow a reader to simply resume filling the
>>>> cache where
>>>> it left off.  There's no need to try 5 times and fail over.  And
>>>> there's
>>>> no need to make a trade-off and make the situation worse in
>>>> certain
>>>> scenarios.
>>>>
>>>> I thought I'd point that out and make an offer to re-submit it. 
>>>> Any
>>>> interest?
>>>>
>>>
>>> As I recall, I had concerns about that approach. Can you explain
>>> again
>>> how it will work?
>>
>> Every page of readdir results has the directory's change attr stored
>> on the
>> page.  That, along with the page's index and the first cookie is
>> enough
>> information to determine if the page's data can be used by another
>> process.
>>
>> Which means that when the pagecache is dropped, fillers don't have to
>> restart
>> filling the cache at page index 0, they can continue to fill at
>> whatever
>> index they were at previously.  If another process finds a page that
>> doesn't
>> match its page index, cookie, and the current directory's change
>> attr, the
>> page is dropped and refilled from that process' indexing.
>>
>>> A few of the concerns I have revolve around telldir()/seekdir(). If
>>> the
>>> platform is 32-bit, then we cannot use cookies as the telldir()
>>> output,
>>> and so our only option is to use offsets into the page cache (this
>>> is
>>> why this patch carves out an exception if desc->dir_cookie == 0).
>>> How
>>> would that work with your scheme?
>>
>> For 32-bit seekdir, pages are filled starting at index 0.  This is
>> very
>> unlikely to match other readers (unless they also do the _same_
>> seekdir).
>>
>>> Even in the 64-bit case where we are able to use cookies for
>>> telldir()/seekdir(), how do we determine an appropriate page index
>>> after a seekdir()?
>>
>> We don't.  Instead we start filling at index 0.  Again, the pagecache
>> will
>> only be useful to other processes that have done the same llseek.
>>
>> This approach optimizes the pagecache for processes that are doing
>> similar
>> work, and has the added benefit of scaling well for large directory
>> listings
>> under memory pressure.  Also a number of classes of directory
>> modifications
>> (such as renames, or insertions/deletions at locations a reader has
>> moved
>> beyond) are no longer a reason to re-fill the pagecache from scratch.
>>
>
> OK, you've got me more or less sold on it.
>
> I'd still like to figure out how to improve the performance for seekdir
> (since I do have an interest in re-exporting NFS) but I've been playing
> around with a couple of patches that implement your concept and they do
> seem to work well for the common case of a linear read through the
> directory.

Nice.  I have another version from the one I originally posted:
https://lore.kernel.org/linux-nfs/cover.1611160120.git.bcodding@redhat.com/

.. but I don't remember exactly the changes and it needs rebasing.  Should I
rebase it against your testing branch and send the result?

Ben
Trond Myklebust Feb. 23, 2022, 12:17 p.m. UTC | #10
On Tue, 2022-02-22 at 15:21 -0500, Benjamin Coddington wrote:
> On 22 Feb 2022, at 15:11, Trond Myklebust wrote:
> 
> > On Tue, 2022-02-22 at 07:50 -0500, Benjamin Coddington wrote:
> > > On 21 Feb 2022, at 18:20, Trond Myklebust wrote:
> > > 
> > > > On Mon, 2022-02-21 at 16:10 -0500, Benjamin Coddington wrote:
> > > > > On 21 Feb 2022, at 15:55, Trond Myklebust wrote:
> > > > > > 
> > > > > > We will always need the ability to cut over to uncached
> > > > > > readdir.
> > > > > 
> > > > > Yes.
> > > > > 
> > > > > > If the cookie is no longer returned by the server because
> > > > > > one
> > > > > > or more
> > > > > > files were deleted then we need to resolve the situation
> > > > > > somehow (IOW:
> > > > > > the 'rm *' case). The new algorithm _does_ improve
> > > > > > performance
> > > > > > on those
> > > > > > situations, because it no longer requires us to read the
> > > > > > entire
> > > > > > directory before switching over: we try 5 times, then fail
> > > > > > over.
> > > > > 
> > > > > Yes, using per-page validation doesn't remove the need for
> > > > > uncached
> > > > > readdir.  It does allow a reader to simply resume filling the
> > > > > cache where
> > > > > it left off.  There's no need to try 5 times and fail over. 
> > > > > And
> > > > > there's
> > > > > no need to make a trade-off and make the situation worse in
> > > > > certain
> > > > > scenarios.
> > > > > 
> > > > > I thought I'd point that out and make an offer to re-submit
> > > > > it. 
> > > > > Any
> > > > > interest?
> > > > > 
> > > > 
> > > > As I recall, I had concerns about that approach. Can you
> > > > explain
> > > > again
> > > > how it will work?
> > > 
> > > Every page of readdir results has the directory's change attr
> > > stored
> > > on the
> > > page.  That, along with the page's index and the first cookie is
> > > enough
> > > information to determine if the page's data can be used by
> > > another
> > > process.
> > > 
> > > Which means that when the pagecache is dropped, fillers don't
> > > have to
> > > restart
> > > filling the cache at page index 0, they can continue to fill at
> > > whatever
> > > index they were at previously.  If another process finds a page
> > > that
> > > doesn't
> > > match its page index, cookie, and the current directory's change
> > > attr, the
> > > page is dropped and refilled from that process' indexing.
> > > 
> > > > A few of the concerns I have revolve around
> > > > telldir()/seekdir(). If
> > > > the
> > > > platform is 32-bit, then we cannot use cookies as the telldir()
> > > > output,
> > > > and so our only option is to use offsets into the page cache
> > > > (this
> > > > is
> > > > why this patch carves out an exception if desc->dir_cookie ==
> > > > 0).
> > > > How
> > > > would that work with your scheme?
> > > 
> > > For 32-bit seekdir, pages are filled starting at index 0.  This
> > > is
> > > very
> > > unlikely to match other readers (unless they also do the _same_
> > > seekdir).
> > > 
> > > > Even in the 64-bit case where we are able to use cookies for
> > > > telldir()/seekdir(), how do we determine an appropriate page
> > > > index
> > > > after a seekdir()?
> > > 
> > > We don't.  Instead we start filling at index 0.  Again, the
> > > pagecache
> > > will
> > > only be useful to other processes that have done the same llseek.
> > > 
> > > This approach optimizes the pagecache for processes that are
> > > doing
> > > similar
> > > work, and has the added benefit of scaling well for large
> > > directory
> > > listings
> > > under memory pressure.  Also a number of classes of directory
> > > modifications
> > > (such as renames, or insertions/deletions at locations a reader
> > > has
> > > moved
> > > beyond) are no longer a reason to re-fill the pagecache from
> > > scratch.
> > > 
> > 
> > OK, you've got me more or less sold on it.
> > 
> > I'd still like to figure out how to improve the performance for
> > seekdir
> > (since I do have an interest in re-exporting NFS) but I've been
> > playing
> > around with a couple of patches that implement your concept and
> > they do
> > seem to work well for the common case of a linear read through the
> > directory.
> 
> Nice.  I have another version from the one I originally posted:
> https://lore.kernel.org/linux-nfs/cover.1611160120.git.bcodding@redhat.com/
> 
> .. but I don't remember exactly the changes and it needs rebasing. 
> Should I
> rebase it against your testing branch and send the result?

My 2 patches did something slightly different to yours, storing the
change attribute in the array header instead of in page_private. That
makes for a slightly smaller change.

Having further slept on the idea, I think I know how to solve the
seekdir() issue, at least for the cases where we can use cookies as
telldir()/seekdir() offsets. If we ditch the linear index, and instead
use a hash of the desc->last_cookie as the index, then we can make
random access to the directories work with your idea.
There are a couple of further problems with this concept, but I think
those are solvable. The issues I have identified so far are:

 * invalidate_inode_pages2_range() no longer works to clear out the
   page cache in a predictable way for the readdirplus heuristic.
 * duplicate cookie detection needs to be changed.

I think those are solvable problems.
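Trond's suggestion — ditch the linear page index and derive the index from a hash of desc->last_cookie — can be sketched as follows. This is a userspace illustration under my own assumptions: the kernel would use something like hash_64() from include/linux/hash.h, and this sketch borrows its golden-ratio multiplier; the function name is hypothetical.

```c
#include <stdint.h>

/*
 * Sketch: map a readdir cookie to a page-cache index -- illustrative,
 * not the kernel implementation. 2^64 / phi, as used by hash_64().
 */
#define GOLDEN_RATIO_64 0x61C8864680B583EBull

static uint64_t cookie_to_page_index(uint64_t last_cookie, unsigned int bits)
{
	/* Multiplicative hash; keep the top 'bits' bits as the index. */
	return (last_cookie * GOLDEN_RATIO_64) >> (64 - bits);
}
```

Because the index is a pure function of the cookie, a process that seekdir()s to an arbitrary cookie lands on the same page index as any other reader resuming from that cookie, so random access can share cache pages — at the cost of the index space no longer being linear, which is why invalidate_inode_pages2_range() over a contiguous range and duplicate-cookie detection both need rethinking.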
Benjamin Coddington Feb. 23, 2022, 1:34 p.m. UTC | #11
On 23 Feb 2022, at 7:17, Trond Myklebust wrote:

> On Tue, 2022-02-22 at 15:21 -0500, Benjamin Coddington wrote:
>> On 22 Feb 2022, at 15:11, Trond Myklebust wrote:
>>
>>> On Tue, 2022-02-22 at 07:50 -0500, Benjamin Coddington wrote:
>>>> On 21 Feb 2022, at 18:20, Trond Myklebust wrote:
>>>>
>>>>> On Mon, 2022-02-21 at 16:10 -0500, Benjamin Coddington wrote:
>>>>>> On 21 Feb 2022, at 15:55, Trond Myklebust wrote:
>>>>>>>
>>>>>>> We will always need the ability to cut over to uncached
>>>>>>> readdir.
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>> If the cookie is no longer returned by the server because
>>>>>>> one
>>>>>>> or more
>>>>>>> files were deleted then we need to resolve the situation
>>>>>>> somehow (IOW:
>>>>>>> the 'rm *' case). The new algorithm _does_ improve
>>>>>>> performance
>>>>>>> on those
>>>>>>> situations, because it no longer requires us to read the
>>>>>>> entire
>>>>>>> directory before switching over: we try 5 times, then fail
>>>>>>> over.
>>>>>>
>>>>>> Yes, using per-page validation doesn't remove the need for
>>>>>> uncached
>>>>>> readdir.  It does allow a reader to simply resume filling the
>>>>>> cache where
>>>>>> it left off.  There's no need to try 5 times and fail over. 
>>>>>> And
>>>>>> there's
>>>>>> no need to make a trade-off and make the situation worse in
>>>>>> certain
>>>>>> scenarios.
>>>>>>
>>>>>> I thought I'd point that out and make an offer to re-submit
>>>>>> it. 
>>>>>> Any
>>>>>> interest?
>>>>>>
>>>>>
>>>>> As I recall, I had concerns about that approach. Can you
>>>>> explain
>>>>> again
>>>>> how it will work?
>>>>
>>>> Every page of readdir results has the directory's change attr
>>>> stored
>>>> on the
>>>> page.  That, along with the page's index and the first cookie is
>>>> enough
>>>> information to determine if the page's data can be used by
>>>> another
>>>> process.
>>>>
>>>> Which means that when the pagecache is dropped, fillers don't
>>>> have to
>>>> restart
>>>> filling the cache at page index 0, they can continue to fill at
>>>> whatever
>>>> index they were at previously.  If another process finds a page
>>>> that
>>>> doesn't
>>>> match its page index, cookie, and the current directory's change
>>>> attr, the
>>>> page is dropped and refilled from that process' indexing.
>>>>
>>>>> A few of the concerns I have revolve around
>>>>> telldir()/seekdir(). If
>>>>> the
>>>>> platform is 32-bit, then we cannot use cookies as the telldir()
>>>>> output,
>>>>> and so our only option is to use offsets into the page cache
>>>>> (this
>>>>> is
>>>>> why this patch carves out an exception if desc->dir_cookie ==
>>>>> 0).
>>>>> How
>>>>> would that work with your scheme?
>>>>
>>>> For 32-bit seekdir, pages are filled starting at index 0.  This
>>>> is
>>>> very
>>>> unlikely to match other readers (unless they also do the _same_
>>>> seekdir).
>>>>
>>>>> Even in the 64-bit case where we are able to use cookies for
>>>>> telldir()/seekdir(), how do we determine an appropriate page
>>>>> index
>>>>> after a seekdir()?
>>>>
>>>> We don't.  Instead we start filling at index 0.  Again, the
>>>> pagecache
>>>> will
>>>> only be useful to other processes that have done the same llseek.
>>>>
>>>> This approach optimizes the pagecache for processes that are
>>>> doing
>>>> similar
>>>> work, and has the added benefit of scaling well for large
>>>> directory
>>>> listings
>>>> under memory pressure.  Also a number of classes of directory
>>>> modifications
>>>> (such as renames, or insertions/deletions at locations a reader
>>>> has
>>>> moved
>>>> beyond) are no longer a reason to re-fill the pagecache from
>>>> scratch.
>>>>
>>>
>>> OK, you've got me more or less sold on it.
>>>
>>> I'd still like to figure out how to improve the performance for
>>> seekdir
>>> (since I do have an interest in re-exporting NFS) but I've been
>>> playing
>>> around with a couple of patches that implement your concept and
>>> they do
>>> seem to work well for the common case of a linear read through the
>>> directory.
>>
>> Nice.  I have another version from the one I originally posted:
>> https://lore.kernel.org/linux-nfs/cover.1611160120.git.bcodding@redhat.com/
>>
>> .. but I don't remember exactly the changes and it needs rebasing. 
>> Should I
>> rebase it against your testing branch and send the result?
>
> My 2 patches did something slightly different to yours, storing the
> change attribute in the array header instead of in page_private. That
> makes for a slightly smaller change.

I worried that increasing the size of the array header wouldn't allow us 
to
store as many entries per page.

> Having further slept on the idea, I think I know how to solve the
> seekdir() issue, at least for the cases where we can use cookies as
> telldir()/seekdir() offsets. If we ditch the linear index, and instead
> use a hash of the desc->last_cookie as the index, then we can make
> random access to the directories work with your idea.

Yes!  Nice!
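
[Editor's note: a minimal userspace sketch of the hashed-cookie indexing Trond proposes above. The multiplier is the kernel's GOLDEN_RATIO_64 constant; in-kernel code would simply call hash_64() from <linux/hash.h> rather than open-coding this.]

```c
#include <stdint.h>

/* Map a seekdir()/last_cookie value to a page cache index by hashing,
 * so random access lands on a stable, shareable index instead of
 * always restarting at index 0.  Illustrative only. */
static inline uint64_t cookie_to_page_index(uint64_t cookie, unsigned int bits)
{
	/* Multiplicative hash with the 64-bit golden ratio constant,
	 * keeping the top 'bits' bits of the product. */
	return (cookie * 0x61C8864680B583EBULL) >> (64 - bits);
}
```

Two readers that seek to the same cookie hash to the same page index, so they can share cached pages; the cost is that linear adjacency of pages is lost, which is what breaks range invalidation (the first issue Trond lists below).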

> There are a couple of further problems with this concept, but I think
> those are solvable. The issues I have identified so far are:
>
>  * invalidate_inode_pages2_range() no longer works to clear out the
>    page cache in a predictable way for the readdirplus heuristic.
>  * duplicate cookie detection needs to be changed.

I don't understand those problems, yet.  I have a few other things I 
have to
do, but after those I will come back and give them some attention.

FWIW, I rebased that old series of page-validation work on your
e85f86150b32 NFS: Trace effects of the readdirplus heuristic

I'll send it as a reply to [ v6 13/13] in this posting, but I've only
compile-tested it so far.

Ben
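
[Editor's note: a sketch of the per-page validation Ben describes earlier in the thread (change attribute + first cookie + page index must all match for a cached page to be reusable). The struct and field names here are illustrative, not the actual kernel layout.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-page tag, stored when the page is filled. */
struct page_tag {
	uint64_t change_attr;  /* directory change attribute at fill time */
	uint64_t first_cookie; /* cookie of the first entry on the page */
	uint64_t index;        /* page index the filler used */
};

/* A cached page is usable by a reader only if all three values match
 * what that reader expects; otherwise the page is dropped and refilled
 * from the reader's own indexing. */
static bool page_is_valid(const struct page_tag *tag, uint64_t change_attr,
			  uint64_t first_cookie, uint64_t index)
{
	return tag->change_attr == change_attr &&
	       tag->first_cookie == first_cookie &&
	       tag->index == index;
}
```

With this check, a filler interrupted by cache truncation can resume at its previous index rather than restarting from page 0, since any stale or mismatched page it encounters is simply invalidated on contact.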
Trond Myklebust Feb. 23, 2022, 9:31 p.m. UTC | #12
On Wed, 2022-02-23 at 08:34 -0500, Benjamin Coddington wrote:
> On 23 Feb 2022, at 7:17, Trond Myklebust wrote:
> 
> > On Tue, 2022-02-22 at 15:21 -0500, Benjamin Coddington wrote:
> > > On 22 Feb 2022, at 15:11, Trond Myklebust wrote:
> > > 
> > > > On Tue, 2022-02-22 at 07:50 -0500, Benjamin Coddington wrote:
> > > > > On 21 Feb 2022, at 18:20, Trond Myklebust wrote:
> > > > > 
> > > > > > On Mon, 2022-02-21 at 16:10 -0500, Benjamin Coddington
> > > > > > wrote:
> > > > > > > On 21 Feb 2022, at 15:55, Trond Myklebust wrote:
> > > > > > > > 
> > > > > > > > We will always need the ability to cut over to uncached
> > > > > > > > readdir.
> > > > > > > 
> > > > > > > Yes.
> > > > > > > 
> > > > > > > > If the cookie is no longer returned by the server
> > > > > > > > because
> > > > > > > > one
> > > > > > > > or more
> > > > > > > > files were deleted then we need to resolve the
> > > > > > > > situation
> > > > > > > > somehow (IOW:
> > > > > > > > the 'rm *' case). The new algorithm _does_ improve
> > > > > > > > performance
> > > > > > > > on those
> > > > > > > > situations, because it no longer requires us to read
> > > > > > > > the
> > > > > > > > entire
> > > > > > > > directory before switching over: we try 5 times, then
> > > > > > > > fail
> > > > > > > > over.
> > > > > > > 
> > > > > > > Yes, using per-page validation doesn't remove the need
> > > > > > > for
> > > > > > > uncached
> > > > > > > readdir.  It does allow a reader to simply resume filling
> > > > > > > the
> > > > > > > cache where
> > > > > > > it left off.  There's no need to try 5 times and fail
> > > > > > > over. 
> > > > > > > And
> > > > > > > there's
> > > > > > > no need to make a trade-off and make the situation worse
> > > > > > > in
> > > > > > > certain
> > > > > > > scenarios.
> > > > > > > 
> > > > > > > I thought I'd point that out and make an offer to re-
> > > > > > > submit
> > > > > > > it. 
> > > > > > > Any
> > > > > > > interest?
> > > > > > > 
> > > > > > 
> > > > > > As I recall, I had concerns about that approach. Can you
> > > > > > explain
> > > > > > again
> > > > > > how it will work?
> > > > > 
> > > > > Every page of readdir results has the directory's change attr
> > > > > stored
> > > > > on the
> > > > > page.  That, along with the page's index and the first cookie
> > > > > is
> > > > > enough
> > > > > information to determine if the page's data can be used by
> > > > > another
> > > > > process.
> > > > > 
> > > > > Which means that when the pagecache is dropped, fillers don't
> > > > > have to
> > > > > restart
> > > > > filling the cache at page index 0, they can continue to fill
> > > > > at
> > > > > whatever
> > > > > index they were at previously.  If another process finds a
> > > > > page
> > > > > that
> > > > > doesn't
> > > > > match its page index, cookie, and the current directory's
> > > > > change
> > > > > attr, the
> > > > > page is dropped and refilled from that process' indexing.
> > > > > 
> > > > > > A few of the concerns I have revolve around
> > > > > > telldir()/seekdir(). If
> > > > > > the
> > > > > > platform is 32-bit, then we cannot use cookies as the
> > > > > > telldir()
> > > > > > output,
> > > > > > and so our only option is to use offsets into the page
> > > > > > cache
> > > > > > (this
> > > > > > is
> > > > > > why this patch carves out an exception if desc->dir_cookie
> > > > > > ==
> > > > > > 0).
> > > > > > How
> > > > > > would that work with your scheme?
> > > > > 
> > > > > For 32-bit seekdir, pages are filled starting at index 0. 
> > > > > This
> > > > > is
> > > > > very
> > > > > unlikely to match other readers (unless they also do the
> > > > > _same_
> > > > > seekdir).
> > > > > 
> > > > > > Even in the 64-bit case where we are able to use cookies for
> > > > > > telldir()/seekdir(), how do we determine an appropriate
> > > > > > page
> > > > > > index
> > > > > > after a seekdir()?
> > > > > 
> > > > > We don't.  Instead we start filling at index 0.  Again, the
> > > > > pagecache
> > > > > will
> > > > > only be useful to other processes that have done the same
> > > > > llseek.
> > > > > 
> > > > > This approach optimizes the pagecache for processes that are
> > > > > doing
> > > > > similar
> > > > > work, and has the added benefit of scaling well for large
> > > > > directory
> > > > > listings
> > > > > under memory pressure.  Also a number of classes of directory
> > > > > modifications
> > > > > (such as renames, or insertions/deletions at locations a
> > > > > reader
> > > > > has
> > > > > moved
> > > > > beyond) are no longer a reason to re-fill the pagecache from
> > > > > scratch.
> > > > > 
> > > > 
> > > > OK, you've got me more or less sold on it.
> > > > 
> > > > I'd still like to figure out how to improve the performance for
> > > > seekdir
> > > > (since I do have an interest in re-exporting NFS) but I've been
> > > > playing
> > > > around with a couple of patches that implement your concept and
> > > > they do
> > > > seem to work well for the common case of a linear read through
> > > > the
> > > > directory.
> > > 
> > > Nice.  I have another version from the one I originally posted:
> > > https://lore.kernel.org/linux-nfs/cover.1611160120.git.bcodding@redhat.com/
> > > 
> > > .. but I don't remember exactly the changes and it needs
> > > rebasing. 
> > > Should I
> > > rebase it against your testing branch and send the result?
> > 
> > My 2 patches did something slightly different to yours, storing the
> > change attribute in the array header instead of in page_private.
> > That
> > makes for a slightly smaller change.
> 
> I worried that increasing the size of the array header wouldn't allow
> us 
> to
> store as many entries per page.

The struct nfs_cache_array header is 24 bytes long with the change
attribute (as opposed to 16 bytes without it). This size is independent
of the architecture, assuming that 'unsigned int' is 32 bits and
'unsigned char' is 8 bits (as is always the case on Linux).

On a 64-bit system, a single struct nfs_cache_array_entry ends up being
32 bytes long. So in a standard 4k page with a 24- or 16-byte
header you will be able to cache exactly 127 cache array entries.

On a 32-bit system, the cache entry is 28 bytes long (the difference
being the pointer length), and you can pack 145 entries in the 4k page.

IOW: The change in header size makes no difference to the number of
entries you can cache, because in both cases, the header remains
smaller than a single entry.
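
[Editor's note: a back-of-the-envelope check of the entry counts quoted above. The header and entry sizes are the figures Trond states, not values computed from the real struct nfs_cache_array definitions.]

```c
#include <stddef.h>

enum {
	PAGE_SZ        = 4096,	/* standard page size */
	HDR_WITHOUT_CA = 16,	/* array header without the change attribute */
	HDR_WITH_CA    = 24,	/* array header with the change attribute */
	ENTRY_64BIT    = 32,	/* nfs_cache_array_entry size, 64-bit arch */
	ENTRY_32BIT    = 28,	/* nfs_cache_array_entry size, 32-bit arch */
};

/* Entries that fit in one page after the header. */
static size_t entries_per_page(size_t header, size_t entry)
{
	return (PAGE_SZ - header) / entry;
}
```

Both header sizes give 127 entries per page on 64-bit (4072/32 and 4080/32 both truncate to 127), which is Trond's point: because the header grows by less than one entry size, the larger header costs nothing in capacity.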

Patch

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index a7f25fef890a..d0707e63ce45 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -71,19 +71,16 @@  const struct address_space_operations nfs_dir_aops = {
 
 #define NFS_INIT_DTSIZE PAGE_SIZE
 
-static struct nfs_open_dir_context *alloc_nfs_open_dir_context(struct inode *dir)
+static struct nfs_open_dir_context *
+alloc_nfs_open_dir_context(struct inode *dir)
 {
 	struct nfs_inode *nfsi = NFS_I(dir);
 	struct nfs_open_dir_context *ctx;
-	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
 	if (ctx != NULL) {
-		ctx->duped = 0;
 		ctx->attr_gencount = nfsi->attr_gencount;
-		ctx->dir_cookie = 0;
-		ctx->dup_cookie = 0;
-		ctx->page_index = 0;
 		ctx->dtsize = NFS_INIT_DTSIZE;
-		ctx->eof = false;
 		spin_lock(&dir->i_lock);
 		if (list_empty(&nfsi->open_files) &&
 		    (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER))
@@ -170,6 +167,7 @@  struct nfs_readdir_descriptor {
 	unsigned long	timestamp;
 	unsigned long	gencount;
 	unsigned long	attr_gencount;
+	unsigned int	page_fill_misses;
 	unsigned int	cache_entry_index;
 	unsigned int	buffer_fills;
 	unsigned int	dtsize;
@@ -925,6 +923,18 @@  nfs_readdir_page_get_cached(struct nfs_readdir_descriptor *desc)
 					   desc->last_cookie);
 }
 
+#define NFS_READDIR_PAGE_FILL_MISS_MAX 5
+/*
+ * If we've refilled the page cache NFS_READDIR_PAGE_FILL_MISS_MAX
+ * times without finding our cookie, then we should stop and fall
+ * back to uncached readdir
+ */
+static bool nfs_readdir_may_fill_pagecache(struct nfs_readdir_descriptor *desc)
+{
+	return desc->dir_cookie == 0 ||
+	       desc->page_fill_misses < NFS_READDIR_PAGE_FILL_MISS_MAX;
+}
+
 /*
  * Returns 0 if desc->dir_cookie was found on page desc->page_index
  * and locks the page to prevent removal from the page cache.
@@ -940,6 +950,8 @@  static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
 	if (!desc->page)
 		return -ENOMEM;
 	if (nfs_readdir_page_needs_filling(desc->page)) {
+		if (!nfs_readdir_may_fill_pagecache(desc))
+			return -EBADCOOKIE;
 		desc->page_index_max = desc->page_index;
 		res = nfs_readdir_xdr_to_array(desc, nfsi->cookieverf, verf,
 					       &desc->page, 1);
@@ -958,36 +970,22 @@  static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
 		if (desc->page_index == 0)
 			memcpy(nfsi->cookieverf, verf,
 			       sizeof(nfsi->cookieverf));
+		desc->page_fill_misses++;
 	}
 	res = nfs_readdir_search_array(desc);
-	if (res == 0)
+	if (res == 0) {
+		desc->page_fill_misses = 0;
 		return 0;
+	}
 	nfs_readdir_page_unlock_and_put_cached(desc);
 	return res;
 }
 
-static bool nfs_readdir_dont_search_cache(struct nfs_readdir_descriptor *desc)
-{
-	struct address_space *mapping = desc->file->f_mapping;
-	struct inode *dir = file_inode(desc->file);
-	unsigned int dtsize = NFS_SERVER(dir)->dtsize;
-	loff_t size = i_size_read(dir);
-
-	/*
-	 * Default to uncached readdir if the page cache is empty, and
-	 * we're looking for a non-zero cookie in a large directory.
-	 */
-	return desc->dir_cookie != 0 && mapping->nrpages == 0 && size > dtsize;
-}
-
 /* Search for desc->dir_cookie from the beginning of the page cache */
 static int readdir_search_pagecache(struct nfs_readdir_descriptor *desc)
 {
 	int res;
 
-	if (nfs_readdir_dont_search_cache(desc))
-		return -EBADCOOKIE;
-
 	do {
 		if (desc->page_index == 0) {
 			desc->current_index = 0;
@@ -1149,6 +1147,7 @@  static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	page_index = dir_ctx->page_index;
 	desc->attr_gencount = dir_ctx->attr_gencount;
 	desc->eof = dir_ctx->eof;
+	desc->page_fill_misses = dir_ctx->page_fill_misses;
 	nfs_set_dtsize(desc, dir_ctx->dtsize);
 	memcpy(desc->verf, dir_ctx->verf, sizeof(desc->verf));
 	spin_unlock(&file->f_lock);
@@ -1204,6 +1203,7 @@  static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	dir_ctx->duped = desc->duped;
 	dir_ctx->attr_gencount = desc->attr_gencount;
 	dir_ctx->page_index = desc->page_index;
+	dir_ctx->page_fill_misses = desc->page_fill_misses;
 	dir_ctx->eof = desc->eof;
 	dir_ctx->dtsize = desc->dtsize;
 	memcpy(dir_ctx->verf, desc->verf, sizeof(dir_ctx->verf));
@@ -1247,6 +1247,7 @@  static loff_t nfs_llseek_dir(struct file *filp, loff_t offset, int whence)
 			dir_ctx->dir_cookie = offset;
 		else
 			dir_ctx->dir_cookie = 0;
+		dir_ctx->page_fill_misses = 0;
 		if (offset == 0)
 			memset(dir_ctx->verf, 0, sizeof(dir_ctx->verf));
 		dir_ctx->duped = 0;
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index d27f7e788624..3165927048e4 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -106,6 +106,7 @@  struct nfs_open_dir_context {
 	__u64 dir_cookie;
 	__u64 dup_cookie;
 	pgoff_t page_index;
+	unsigned int page_fill_misses;
 	unsigned int dtsize;
 	signed char duped;
 	bool eof;