diff mbox series

[09/11] NFS: Improve performance of listing directories being modified

Message ID 1604325011-29427-10-git-send-email-dwysocha@redhat.com (mailing list archive)
State New
Headers show
Series Add NFS readdir tracepoints and improve performance of reading directories | expand

Commit Message

David Wysochanski Nov. 2, 2020, 1:50 p.m. UTC
A process can hang forever to 'ls -l' a directory while the directory
is being modified such as another NFS client adding files to the
directory.  The problem is seen specifically with larger directories
(I tested with 1 million) and/or slower NFS server responses to
READDIR.  If a combination of the NFS directory size, the NFS server
responses to READDIR is such that the 'ls' process gets partially
through the listing before the attribute cache expires (time
exceeds acdirmax), we drop the pagecache and have to re-fill it,
and as a result, the process may never complete.  One could argue
for larger directories the acdirmin/acdirmax should be increased,
but it's not always possible to tune this effectively.

The root cause of this problem is due to how the NFS readdir cache
currently works.  The main search function, readdir_search_pagecache(),
always starts searching at page_index and cookie == 0, and for any
page not in the cache, fills in the page with entries obtained in
a READDIR NFS call.  If a page already exists, we proceed to
nfs_readdir_search_for_cookie(), which searches for the cookie
(pos) of the readdir call.  The search is O(n), where n is the
directory size before the cookie in question is found, and every
entry to nfs_readdir() pays this penalty, irrespective of the
current directory position (dir_context.pos).  The search is
expensive due to the opaque nature of readdir cookies, and the fact
that no mapping (hash) exists from cookies to pages.  In the case
of a directory being modified, the above behavior can become an
excessive penalty, since the same process is forced to fill pages it
may be no longer interested in (the entries were passed in a previous
nfs_readdir call), and this can essentially lead no forward progress.

To fix this problem, at the end of nfs_readdir(), save the page_index
corresponding to the directory position (cookie) inside the process's
nfs_open_dir_context.  Then at the next entry of nfs_readdir(), use
the saved page_index as the starting search point rather than starting
at page_index == 0.  Not only does this fix the problem of listing
a directory being modified, it also significantly improves performance
in the unmodified case since no extra search penalty is paid at each
entry to nfs_readdir().

In the case of lseek, since there is no hash or other mapping from a
cookie value to the page->index, just reset nfs_open_dir_context.page_index
to 0, which will reset the search to the old behavior.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
---
 fs/nfs/dir.c           | 8 +++++++-
 include/linux/nfs_fs.h | 1 +
 2 files changed, 8 insertions(+), 1 deletion(-)

Comments

Trond Myklebust Nov. 2, 2020, 4:21 p.m. UTC | #1
On Mon, 2020-11-02 at 08:50 -0500, Dave Wysochanski wrote:
> A process can hang forever to 'ls -l' a directory while the directory
> is being modified such as another NFS client adding files to the
> directory.  The problem is seen specifically with larger directories
> (I tested with 1 million) and/or slower NFS server responses to
> READDIR.  If a combination of the NFS directory size, the NFS server
> responses to READDIR is such that the 'ls' process gets partially
> through the listing before the attribute cache expires (time
> exceeds acdirmax), we drop the pagecache and have to re-fill it,
> and as a result, the process may never complete.  One could argue
> for larger directories the acdirmin/acdirmax should be increased,
> but it's not always possible to tune this effectively.
> 
> The root cause of this problem is due to how the NFS readdir cache
> currently works.  The main search function,
> readdir_search_pagecache(),
> always starts searching at page_index and cookie == 0, and for any
> page not in the cache, fills in the page with entries obtained in
> a READDIR NFS call.  If a page already exists, we proceed to
> nfs_readdir_search_for_cookie(), which searches for the cookie
> (pos) of the readdir call.  The search is O(n), where n is the
> directory size before the cookie in question is found, and every
> entry to nfs_readdir() pays this penalty, irrespective of the
> current directory position (dir_context.pos).  The search is
> expensive due to the opaque nature of readdir cookies, and the fact
> that no mapping (hash) exists from cookies to pages.  In the case
> of a directory being modified, the above behavior can become an
> excessive penalty, since the same process is forced to fill pages it
> may be no longer interested in (the entries were passed in a previous
> nfs_readdir call), and this can essentially lead no forward progress.
> 
> To fix this problem, at the end of nfs_readdir(), save the page_index
> corresponding to the directory position (cookie) inside the process's
> nfs_open_dir_context.  Then at the next entry of nfs_readdir(), use
> the saved page_index as the starting search point rather than
> starting
> at page_index == 0.  Not only does this fix the problem of listing
> a directory being modified, it also significantly improves
> performance
> in the unmodified case since no extra search penalty is paid at each
> entry to nfs_readdir().
> 
> In the case of lseek, since there is no hash or other mapping from a
> cookie value to the page->index, just reset
> nfs_open_dir_context.page_index
> to 0, which will reset the search to the old behavior.
> 
> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
> ---
>  fs/nfs/dir.c           | 8 +++++++-
>  include/linux/nfs_fs.h | 1 +
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 52e06c8fc7cd..b266f505b521 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -78,6 +78,7 @@ static struct nfs_open_dir_context
> *alloc_nfs_open_dir_context(struct inode *dir
>                 ctx->attr_gencount = nfsi->attr_gencount;
>                 ctx->dir_cookie = 0;
>                 ctx->dup_cookie = 0;
> +               ctx->page_index = 0;
>                 ctx->cred = get_cred(cred);
>                 spin_lock(&dir->i_lock);
>                 if (list_empty(&nfsi->open_files) &&
> @@ -763,7 +764,7 @@ int
> find_and_lock_cache_page(nfs_readdir_descriptor_t *desc)
>         return res;
>  }
>  
> -/* Search for desc->dir_cookie from the beginning of the page cache
> */
> +/* Search for desc->dir_cookie starting at desc->page_index */
>  static inline
>  int readdir_search_pagecache(nfs_readdir_descriptor_t *desc)
>  {
> @@ -885,6 +886,8 @@ static int nfs_readdir(struct file *file, struct
> dir_context *ctx)
>                 .ctx = ctx,
>                 .dir_cookie = &dir_ctx->dir_cookie,
>                 .plus = nfs_use_readdirplus(inode, ctx),
> +               .page_index = dir_ctx->page_index,
> +               .last_cookie = nfs_readdir_use_cookie(file) ? ctx-
> >pos : 0,
>         },
>                         *desc = &my_desc;
>         int res = 0;
> @@ -938,6 +941,7 @@ static int nfs_readdir(struct file *file, struct
> dir_context *ctx)
>  out:
>         if (res > 0)
>                 res = 0;
> +       dir_ctx->page_index = desc->page_index;
>         trace_nfs_readdir_exit(inode, ctx->pos, dir_ctx->dir_cookie,
>                                NFS_SERVER(inode)->dtsize,
> my_desc.plus, res);
>         return res;
> @@ -975,6 +979,8 @@ static loff_t nfs_llseek_dir(struct file *filp,
> loff_t offset, int whence)
>                 else
>                         dir_ctx->dir_cookie = 0;
>                 dir_ctx->duped = 0;
> +               /* Force readdir_search_pagecache to start over */
> +               dir_ctx->page_index = 0;
>         }
>         inode_unlock(inode);
>         return offset;
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index a2c6455ea3fa..0e55c0154ccd 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -93,6 +93,7 @@ struct nfs_open_dir_context {
>         __u64 dir_cookie;
>         __u64 dup_cookie;
>         signed char duped;
> +       unsigned long   page_index;
>  };
>  
>  /*

NACK. It makes no sense to store the page index as a cursor.
David Wysochanski Nov. 2, 2020, 4:26 p.m. UTC | #2
On Mon, Nov 2, 2020 at 11:22 AM Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> On Mon, 2020-11-02 at 08:50 -0500, Dave Wysochanski wrote:
> > A process can hang forever to 'ls -l' a directory while the directory
> > is being modified such as another NFS client adding files to the
> > directory.  The problem is seen specifically with larger directories
> > (I tested with 1 million) and/or slower NFS server responses to
> > READDIR.  If a combination of the NFS directory size, the NFS server
> > responses to READDIR is such that the 'ls' process gets partially
> > through the listing before the attribute cache expires (time
> > exceeds acdirmax), we drop the pagecache and have to re-fill it,
> > and as a result, the process may never complete.  One could argue
> > for larger directories the acdirmin/acdirmax should be increased,
> > but it's not always possible to tune this effectively.
> >
> > The root cause of this problem is due to how the NFS readdir cache
> > currently works.  The main search function,
> > readdir_search_pagecache(),
> > always starts searching at page_index and cookie == 0, and for any
> > page not in the cache, fills in the page with entries obtained in
> > a READDIR NFS call.  If a page already exists, we proceed to
> > nfs_readdir_search_for_cookie(), which searches for the cookie
> > (pos) of the readdir call.  The search is O(n), where n is the
> > directory size before the cookie in question is found, and every
> > entry to nfs_readdir() pays this penalty, irrespective of the
> > current directory position (dir_context.pos).  The search is
> > expensive due to the opaque nature of readdir cookies, and the fact
> > that no mapping (hash) exists from cookies to pages.  In the case
> > of a directory being modified, the above behavior can become an
> > excessive penalty, since the same process is forced to fill pages it
> > may be no longer interested in (the entries were passed in a previous
> > nfs_readdir call), and this can essentially lead no forward progress.
> >
> > To fix this problem, at the end of nfs_readdir(), save the page_index
> > corresponding to the directory position (cookie) inside the process's
> > nfs_open_dir_context.  Then at the next entry of nfs_readdir(), use
> > the saved page_index as the starting search point rather than
> > starting
> > at page_index == 0.  Not only does this fix the problem of listing
> > a directory being modified, it also significantly improves
> > performance
> > in the unmodified case since no extra search penalty is paid at each
> > entry to nfs_readdir().
> >
> > In the case of lseek, since there is no hash or other mapping from a
> > cookie value to the page->index, just reset
> > nfs_open_dir_context.page_index
> > to 0, which will reset the search to the old behavior.
> >
> > Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
> > ---
> >  fs/nfs/dir.c           | 8 +++++++-
> >  include/linux/nfs_fs.h | 1 +
> >  2 files changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > index 52e06c8fc7cd..b266f505b521 100644
> > --- a/fs/nfs/dir.c
> > +++ b/fs/nfs/dir.c
> > @@ -78,6 +78,7 @@ static struct nfs_open_dir_context
> > *alloc_nfs_open_dir_context(struct inode *dir
> >                 ctx->attr_gencount = nfsi->attr_gencount;
> >                 ctx->dir_cookie = 0;
> >                 ctx->dup_cookie = 0;
> > +               ctx->page_index = 0;
> >                 ctx->cred = get_cred(cred);
> >                 spin_lock(&dir->i_lock);
> >                 if (list_empty(&nfsi->open_files) &&
> > @@ -763,7 +764,7 @@ int
> > find_and_lock_cache_page(nfs_readdir_descriptor_t *desc)
> >         return res;
> >  }
> >
> > -/* Search for desc->dir_cookie from the beginning of the page cache
> > */
> > +/* Search for desc->dir_cookie starting at desc->page_index */
> >  static inline
> >  int readdir_search_pagecache(nfs_readdir_descriptor_t *desc)
> >  {
> > @@ -885,6 +886,8 @@ static int nfs_readdir(struct file *file, struct
> > dir_context *ctx)
> >                 .ctx = ctx,
> >                 .dir_cookie = &dir_ctx->dir_cookie,
> >                 .plus = nfs_use_readdirplus(inode, ctx),
> > +               .page_index = dir_ctx->page_index,
> > +               .last_cookie = nfs_readdir_use_cookie(file) ? ctx-
> > >pos : 0,
> >         },
> >                         *desc = &my_desc;
> >         int res = 0;
> > @@ -938,6 +941,7 @@ static int nfs_readdir(struct file *file, struct
> > dir_context *ctx)
> >  out:
> >         if (res > 0)
> >                 res = 0;
> > +       dir_ctx->page_index = desc->page_index;
> >         trace_nfs_readdir_exit(inode, ctx->pos, dir_ctx->dir_cookie,
> >                                NFS_SERVER(inode)->dtsize,
> > my_desc.plus, res);
> >         return res;
> > @@ -975,6 +979,8 @@ static loff_t nfs_llseek_dir(struct file *filp,
> > loff_t offset, int whence)
> >                 else
> >                         dir_ctx->dir_cookie = 0;
> >                 dir_ctx->duped = 0;
> > +               /* Force readdir_search_pagecache to start over */
> > +               dir_ctx->page_index = 0;
> >         }
> >         inode_unlock(inode);
> >         return offset;
> > diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> > index a2c6455ea3fa..0e55c0154ccd 100644
> > --- a/include/linux/nfs_fs.h
> > +++ b/include/linux/nfs_fs.h
> > @@ -93,6 +93,7 @@ struct nfs_open_dir_context {
> >         __u64 dir_cookie;
> >         __u64 dup_cookie;
> >         signed char duped;
> > +       unsigned long   page_index;
> >  };
> >
> >  /*
>
> NACK. It makes no sense to store the page index as a cursor.
>

A similar thing was done recently with:
227823d2074d nfs: optimise readdir cache page invalidation
Trond Myklebust Nov. 2, 2020, 5:31 p.m. UTC | #3
On Mon, 2020-11-02 at 11:26 -0500, David Wysochanski wrote:
> On Mon, Nov 2, 2020 at 11:22 AM Trond Myklebust <
> trondmy@hammerspace.com> wrote:
> > 
> > On Mon, 2020-11-02 at 08:50 -0500, Dave Wysochanski wrote:
> > > A process can hang forever to 'ls -l' a directory while the
> > > directory
> > > is being modified such as another NFS client adding files to the
> > > directory.  The problem is seen specifically with larger
> > > directories
> > > (I tested with 1 million) and/or slower NFS server responses to
> > > READDIR.  If a combination of the NFS directory size, the NFS
> > > server
> > > responses to READDIR is such that the 'ls' process gets partially
> > > through the listing before the attribute cache expires (time
> > > exceeds acdirmax), we drop the pagecache and have to re-fill it,
> > > and as a result, the process may never complete.  One could argue
> > > for larger directories the acdirmin/acdirmax should be increased,
> > > but it's not always possible to tune this effectively.
> > > 
> > > The root cause of this problem is due to how the NFS readdir
> > > cache
> > > currently works.  The main search function,
> > > readdir_search_pagecache(),
> > > always starts searching at page_index and cookie == 0, and for
> > > any
> > > page not in the cache, fills in the page with entries obtained in
> > > a READDIR NFS call.  If a page already exists, we proceed to
> > > nfs_readdir_search_for_cookie(), which searches for the cookie
> > > (pos) of the readdir call.  The search is O(n), where n is the
> > > directory size before the cookie in question is found, and every
> > > entry to nfs_readdir() pays this penalty, irrespective of the
> > > current directory position (dir_context.pos).  The search is
> > > expensive due to the opaque nature of readdir cookies, and the
> > > fact
> > > that no mapping (hash) exists from cookies to pages.  In the case
> > > of a directory being modified, the above behavior can become an
> > > excessive penalty, since the same process is forced to fill pages
> > > it
> > > may be no longer interested in (the entries were passed in a
> > > previous
> > > nfs_readdir call), and this can essentially lead no forward
> > > progress.
> > > 
> > > To fix this problem, at the end of nfs_readdir(), save the
> > > page_index
> > > corresponding to the directory position (cookie) inside the
> > > process's
> > > nfs_open_dir_context.  Then at the next entry of nfs_readdir(),
> > > use
> > > the saved page_index as the starting search point rather than
> > > starting
> > > at page_index == 0.  Not only does this fix the problem of
> > > listing
> > > a directory being modified, it also significantly improves
> > > performance
> > > in the unmodified case since no extra search penalty is paid at
> > > each
> > > entry to nfs_readdir().
> > > 
> > > In the case of lseek, since there is no hash or other mapping
> > > from a
> > > cookie value to the page->index, just reset
> > > nfs_open_dir_context.page_index
> > > to 0, which will reset the search to the old behavior.
> > > 
> > > Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
> > > ---
> > >  fs/nfs/dir.c           | 8 +++++++-
> > >  include/linux/nfs_fs.h | 1 +
> > >  2 files changed, 8 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > index 52e06c8fc7cd..b266f505b521 100644
> > > --- a/fs/nfs/dir.c
> > > +++ b/fs/nfs/dir.c
> > > @@ -78,6 +78,7 @@ static struct nfs_open_dir_context
> > > *alloc_nfs_open_dir_context(struct inode *dir
> > >                 ctx->attr_gencount = nfsi->attr_gencount;
> > >                 ctx->dir_cookie = 0;
> > >                 ctx->dup_cookie = 0;
> > > +               ctx->page_index = 0;
> > >                 ctx->cred = get_cred(cred);
> > >                 spin_lock(&dir->i_lock);
> > >                 if (list_empty(&nfsi->open_files) &&
> > > @@ -763,7 +764,7 @@ int
> > > find_and_lock_cache_page(nfs_readdir_descriptor_t *desc)
> > >         return res;
> > >  }
> > > 
> > > -/* Search for desc->dir_cookie from the beginning of the page
> > > cache
> > > */
> > > +/* Search for desc->dir_cookie starting at desc->page_index */
> > >  static inline
> > >  int readdir_search_pagecache(nfs_readdir_descriptor_t *desc)
> > >  {
> > > @@ -885,6 +886,8 @@ static int nfs_readdir(struct file *file,
> > > struct
> > > dir_context *ctx)
> > >                 .ctx = ctx,
> > >                 .dir_cookie = &dir_ctx->dir_cookie,
> > >                 .plus = nfs_use_readdirplus(inode, ctx),
> > > +               .page_index = dir_ctx->page_index,
> > > +               .last_cookie = nfs_readdir_use_cookie(file) ?
> > > ctx-
> > > > pos : 0,
> > >         },
> > >                         *desc = &my_desc;
> > >         int res = 0;
> > > @@ -938,6 +941,7 @@ static int nfs_readdir(struct file *file,
> > > struct
> > > dir_context *ctx)
> > >  out:
> > >         if (res > 0)
> > >                 res = 0;
> > > +       dir_ctx->page_index = desc->page_index;
> > >         trace_nfs_readdir_exit(inode, ctx->pos, dir_ctx-
> > > >dir_cookie,
> > >                                NFS_SERVER(inode)->dtsize,
> > > my_desc.plus, res);
> > >         return res;
> > > @@ -975,6 +979,8 @@ static loff_t nfs_llseek_dir(struct file
> > > *filp,
> > > loff_t offset, int whence)
> > >                 else
> > >                         dir_ctx->dir_cookie = 0;
> > >                 dir_ctx->duped = 0;
> > > +               /* Force readdir_search_pagecache to start over
> > > */
> > > +               dir_ctx->page_index = 0;
> > >         }
> > >         inode_unlock(inode);
> > >         return offset;
> > > diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> > > index a2c6455ea3fa..0e55c0154ccd 100644
> > > --- a/include/linux/nfs_fs.h
> > > +++ b/include/linux/nfs_fs.h
> > > @@ -93,6 +93,7 @@ struct nfs_open_dir_context {
> > >         __u64 dir_cookie;
> > >         __u64 dup_cookie;
> > >         signed char duped;
> > > +       unsigned long   page_index;
> > >  };
> > > 
> > >  /*
> > 
> > NACK. It makes no sense to store the page index as a cursor.
> > 
> 
> A similar thing was done recently with:
> 227823d2074d nfs: optimise readdir cache page invalidation
> 

That's a very different thing. It is about discarding page data in
order to force a re-read of the contents into cache.

What you're doing is basically trying to guess where the data is
located. which might work in some cases where the directory is
completely static, but if it shrinks (e.g. due to a few unlink() or
rename() calls) so that you overshoot the cookie, then you can end up
reading all the way to the end of the directory before doing an
uncached readdir.

IOW: This will have a detrimental effect for some workloads, which
needs to be weighed up against the benefits. I saw that you've tested
with large directories, but what workloads were you testing on those
directories?
David Wysochanski Nov. 2, 2020, 7:45 p.m. UTC | #4
On Mon, Nov 2, 2020 at 12:31 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> On Mon, 2020-11-02 at 11:26 -0500, David Wysochanski wrote:
> > On Mon, Nov 2, 2020 at 11:22 AM Trond Myklebust <
> > trondmy@hammerspace.com> wrote:
> > >
> > > On Mon, 2020-11-02 at 08:50 -0500, Dave Wysochanski wrote:
> > > > A process can hang forever to 'ls -l' a directory while the
> > > > directory
> > > > is being modified such as another NFS client adding files to the
> > > > directory.  The problem is seen specifically with larger
> > > > directories
> > > > (I tested with 1 million) and/or slower NFS server responses to
> > > > READDIR.  If a combination of the NFS directory size, the NFS
> > > > server
> > > > responses to READDIR is such that the 'ls' process gets partially
> > > > through the listing before the attribute cache expires (time
> > > > exceeds acdirmax), we drop the pagecache and have to re-fill it,
> > > > and as a result, the process may never complete.  One could argue
> > > > for larger directories the acdirmin/acdirmax should be increased,
> > > > but it's not always possible to tune this effectively.
> > > >
> > > > The root cause of this problem is due to how the NFS readdir
> > > > cache
> > > > currently works.  The main search function,
> > > > readdir_search_pagecache(),
> > > > always starts searching at page_index and cookie == 0, and for
> > > > any
> > > > page not in the cache, fills in the page with entries obtained in
> > > > a READDIR NFS call.  If a page already exists, we proceed to
> > > > nfs_readdir_search_for_cookie(), which searches for the cookie
> > > > (pos) of the readdir call.  The search is O(n), where n is the
> > > > directory size before the cookie in question is found, and every
> > > > entry to nfs_readdir() pays this penalty, irrespective of the
> > > > current directory position (dir_context.pos).  The search is
> > > > expensive due to the opaque nature of readdir cookies, and the
> > > > fact
> > > > that no mapping (hash) exists from cookies to pages.  In the case
> > > > of a directory being modified, the above behavior can become an
> > > > excessive penalty, since the same process is forced to fill pages
> > > > it
> > > > may be no longer interested in (the entries were passed in a
> > > > previous
> > > > nfs_readdir call), and this can essentially lead no forward
> > > > progress.
> > > >
> > > > To fix this problem, at the end of nfs_readdir(), save the
> > > > page_index
> > > > corresponding to the directory position (cookie) inside the
> > > > process's
> > > > nfs_open_dir_context.  Then at the next entry of nfs_readdir(),
> > > > use
> > > > the saved page_index as the starting search point rather than
> > > > starting
> > > > at page_index == 0.  Not only does this fix the problem of
> > > > listing
> > > > a directory being modified, it also significantly improves
> > > > performance
> > > > in the unmodified case since no extra search penalty is paid at
> > > > each
> > > > entry to nfs_readdir().
> > > >
> > > > In the case of lseek, since there is no hash or other mapping
> > > > from a
> > > > cookie value to the page->index, just reset
> > > > nfs_open_dir_context.page_index
> > > > to 0, which will reset the search to the old behavior.
> > > >
> > > > Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
> > > > ---
> > > >  fs/nfs/dir.c           | 8 +++++++-
> > > >  include/linux/nfs_fs.h | 1 +
> > > >  2 files changed, 8 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > > index 52e06c8fc7cd..b266f505b521 100644
> > > > --- a/fs/nfs/dir.c
> > > > +++ b/fs/nfs/dir.c
> > > > @@ -78,6 +78,7 @@ static struct nfs_open_dir_context
> > > > *alloc_nfs_open_dir_context(struct inode *dir
> > > >                 ctx->attr_gencount = nfsi->attr_gencount;
> > > >                 ctx->dir_cookie = 0;
> > > >                 ctx->dup_cookie = 0;
> > > > +               ctx->page_index = 0;
> > > >                 ctx->cred = get_cred(cred);
> > > >                 spin_lock(&dir->i_lock);
> > > >                 if (list_empty(&nfsi->open_files) &&
> > > > @@ -763,7 +764,7 @@ int
> > > > find_and_lock_cache_page(nfs_readdir_descriptor_t *desc)
> > > >         return res;
> > > >  }
> > > >
> > > > -/* Search for desc->dir_cookie from the beginning of the page
> > > > cache
> > > > */
> > > > +/* Search for desc->dir_cookie starting at desc->page_index */
> > > >  static inline
> > > >  int readdir_search_pagecache(nfs_readdir_descriptor_t *desc)
> > > >  {
> > > > @@ -885,6 +886,8 @@ static int nfs_readdir(struct file *file,
> > > > struct
> > > > dir_context *ctx)
> > > >                 .ctx = ctx,
> > > >                 .dir_cookie = &dir_ctx->dir_cookie,
> > > >                 .plus = nfs_use_readdirplus(inode, ctx),
> > > > +               .page_index = dir_ctx->page_index,
> > > > +               .last_cookie = nfs_readdir_use_cookie(file) ?
> > > > ctx-
> > > > > pos : 0,
> > > >         },
> > > >                         *desc = &my_desc;
> > > >         int res = 0;
> > > > @@ -938,6 +941,7 @@ static int nfs_readdir(struct file *file,
> > > > struct
> > > > dir_context *ctx)
> > > >  out:
> > > >         if (res > 0)
> > > >                 res = 0;
> > > > +       dir_ctx->page_index = desc->page_index;
> > > >         trace_nfs_readdir_exit(inode, ctx->pos, dir_ctx-
> > > > >dir_cookie,
> > > >                                NFS_SERVER(inode)->dtsize,
> > > > my_desc.plus, res);
> > > >         return res;
> > > > @@ -975,6 +979,8 @@ static loff_t nfs_llseek_dir(struct file
> > > > *filp,
> > > > loff_t offset, int whence)
> > > >                 else
> > > >                         dir_ctx->dir_cookie = 0;
> > > >                 dir_ctx->duped = 0;
> > > > +               /* Force readdir_search_pagecache to start over
> > > > */
> > > > +               dir_ctx->page_index = 0;
> > > >         }
> > > >         inode_unlock(inode);
> > > >         return offset;
> > > > diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> > > > index a2c6455ea3fa..0e55c0154ccd 100644
> > > > --- a/include/linux/nfs_fs.h
> > > > +++ b/include/linux/nfs_fs.h
> > > > @@ -93,6 +93,7 @@ struct nfs_open_dir_context {
> > > >         __u64 dir_cookie;
> > > >         __u64 dup_cookie;
> > > >         signed char duped;
> > > > +       unsigned long   page_index;
> > > >  };
> > > >
> > > >  /*
> > >
> > > NACK. It makes no sense to store the page index as a cursor.
> > >
> >
> > A similar thing was done recently with:
> > 227823d2074d nfs: optimise readdir cache page invalidation
> >
>
> That's a very different thing. It is about discarding page data in
> order to force a re-read of the contents into cache.
>
Right - I only pointed it out because it is in effect a cursor about
the last access into the cache but it's on a global basis, not
process context.

> What you're doing is basically trying to guess where the data is
> located. which might work in some cases where the directory is
> completely static, but if it shrinks (e.g. due to a few unlink() or
> rename() calls) so that you overshoot the cookie, then you can end up
> reading all the way to the end of the directory before doing an
> uncached readdir.
>
First, consider the unmodified (idle directory) scenario.  Today the
performance is bad the larger the directory goes - do you see why?
I tried to explain in the cover letter and header but maybe it's not clear?

Second, the modified scenario today the performance is very bad
because of the same problem - the cookie is reset and the process
needs to start over at cookie 0, repeating READDIRs.  But maybe
there's a specific scenario I'm not thinking about.

The way I thought about this is that if you're in a heavily modified
scenario with a large directory and you're past the 'acdirmax' time,
you have to make the choice of either:
a) ignoring 'acdirmax' (this is what the NFSv3 patch did) and even
though you know the cache expired you keep going as though it
did not (at least until a different process starts a listing)
b) honoring 'acdirmax' (drop the pagecache), but keep going the
best you can based on the previous information and don't try to
rebuild the cache before continuing.

> IOW: This will have a detrimental effect for some workloads, which
> needs to be weighed up against the benefits. I saw that you've tested
> with large directories, but what workloads were you testing on those
> directories?
>
I can definitely do further testing and any scenario you want to try to
break it or find a pathological scenario. So far I've tested the
reader ("ls -lf") in parallel with one of the two writers:
1) random add a file every 0.1s:
while true; do i=$((1 + RANDOM % $NUM_FILES)); echo $i; touch
$MNT2/file$i.bin; builtin sleep 0.1; done > /dev/null 2>&1 &
2) random delete a file every 0.1 s:
while true; do i=$((1 + RANDOM % $NUM_FILES)); echo $i; rm -f
$MNT2/file$i; builtin sleep 0.1; done > /dev/null 2>&1 &

In no case did I see it take a longer time or ops vs vanilla 5.9, the idle
and modified performance is better (measured in seconds and ops)
with this patch.  Below is a short summary.  Note that the first time and
ops is with an idle directory, and the second one is the modified.

5.9 (vanilla): random delete a file every 0.1 s:
Ops increased from 4734 to 8834
Time increased from 23 to 44

5.9 (this patch): random delete a file every 0.1 s:
Ops increased from 4697 to 4696
Time increased from 20 to 30


5.9 (vanilla): random add a file every 0.1s:
Ops increased from 4734 to 9168
Time increased from 23 to 43

5.9 (this patch): random add a file every 0.1s:
Ops increased from 4697 to 4702
Time increased from 21 to 32


> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
>
>
Trond Myklebust Nov. 2, 2020, 9:30 p.m. UTC | #5
On Mon, 2020-11-02 at 14:45 -0500, David Wysochanski wrote:
> On Mon, Nov 2, 2020 at 12:31 PM Trond Myklebust <
> trondmy@hammerspace.com> wrote:
> > 
> > On Mon, 2020-11-02 at 11:26 -0500, David Wysochanski wrote:
> > > On Mon, Nov 2, 2020 at 11:22 AM Trond Myklebust <
> > > trondmy@hammerspace.com> wrote:
> > > > 
> > > > On Mon, 2020-11-02 at 08:50 -0500, Dave Wysochanski wrote:
> > > > > A process can hang forever to 'ls -l' a directory while the
> > > > > directory
> > > > > is being modified such as another NFS client adding files to
> > > > > the
> > > > > directory.  The problem is seen specifically with larger
> > > > > directories
> > > > > (I tested with 1 million) and/or slower NFS server responses
> > > > > to
> > > > > READDIR.  If a combination of the NFS directory size, the NFS
> > > > > server
> > > > > responses to READDIR is such that the 'ls' process gets
> > > > > partially
> > > > > through the listing before the attribute cache expires (time
> > > > > exceeds acdirmax), we drop the pagecache and have to re-fill
> > > > > it,
> > > > > and as a result, the process may never complete.  One could
> > > > > argue
> > > > > for larger directories the acdirmin/acdirmax should be
> > > > > increased,
> > > > > but it's not always possible to tune this effectively.
> > > > > 
> > > > > The root cause of this problem is due to how the NFS readdir
> > > > > cache
> > > > > currently works.  The main search function,
> > > > > readdir_search_pagecache(),
> > > > > always starts searching at page_index and cookie == 0, and
> > > > > for
> > > > > any
> > > > > page not in the cache, fills in the page with entries
> > > > > obtained in
> > > > > a READDIR NFS call.  If a page already exists, we proceed to
> > > > > nfs_readdir_search_for_cookie(), which searches for the
> > > > > cookie
> > > > > (pos) of the readdir call.  The search is O(n), where n is
> > > > > the
> > > > > directory size before the cookie in question is found, and
> > > > > every
> > > > > entry to nfs_readdir() pays this penalty, irrespective of the
> > > > > current directory position (dir_context.pos).  The search is
> > > > > expensive due to the opaque nature of readdir cookies, and
> > > > > the
> > > > > fact
> > > > > that no mapping (hash) exists from cookies to pages.  In the
> > > > > case
> > > > > of a directory being modified, the above behavior can become
> > > > > an
> > > > > excessive penalty, since the same process is forced to fill
> > > > > pages
> > > > > it
> > > > > may be no longer interested in (the entries were passed in a
> > > > > previous
> > > > > nfs_readdir call), and this can essentially lead no forward
> > > > > progress.
> > > > > 
> > > > > To fix this problem, at the end of nfs_readdir(), save the
> > > > > page_index
> > > > > corresponding to the directory position (cookie) inside the
> > > > > process's
> > > > > nfs_open_dir_context.  Then at the next entry of
> > > > > nfs_readdir(),
> > > > > use
> > > > > the saved page_index as the starting search point rather than
> > > > > starting
> > > > > at page_index == 0.  Not only does this fix the problem of
> > > > > listing
> > > > > a directory being modified, it also significantly improves
> > > > > performance
> > > > > in the unmodified case since no extra search penalty is paid
> > > > > at
> > > > > each
> > > > > entry to nfs_readdir().
> > > > > 
> > > > > In the case of lseek, since there is no hash or other mapping
> > > > > from a
> > > > > cookie value to the page->index, just reset
> > > > > nfs_open_dir_context.page_index
> > > > > to 0, which will reset the search to the old behavior.
> > > > > 
> > > > > Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
> > > > > ---
> > > > >  fs/nfs/dir.c           | 8 +++++++-
> > > > >  include/linux/nfs_fs.h | 1 +
> > > > >  2 files changed, 8 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > > > index 52e06c8fc7cd..b266f505b521 100644
> > > > > --- a/fs/nfs/dir.c
> > > > > +++ b/fs/nfs/dir.c
> > > > > @@ -78,6 +78,7 @@ static struct nfs_open_dir_context
> > > > > *alloc_nfs_open_dir_context(struct inode *dir
> > > > >                 ctx->attr_gencount = nfsi->attr_gencount;
> > > > >                 ctx->dir_cookie = 0;
> > > > >                 ctx->dup_cookie = 0;
> > > > > +               ctx->page_index = 0;
> > > > >                 ctx->cred = get_cred(cred);
> > > > >                 spin_lock(&dir->i_lock);
> > > > >                 if (list_empty(&nfsi->open_files) &&
> > > > > @@ -763,7 +764,7 @@ int
> > > > > find_and_lock_cache_page(nfs_readdir_descriptor_t *desc)
> > > > >         return res;
> > > > >  }
> > > > > 
> > > > > -/* Search for desc->dir_cookie from the beginning of the
> > > > > page
> > > > > cache
> > > > > */
> > > > > +/* Search for desc->dir_cookie starting at desc->page_index
> > > > > */
> > > > >  static inline
> > > > >  int readdir_search_pagecache(nfs_readdir_descriptor_t *desc)
> > > > >  {
> > > > > @@ -885,6 +886,8 @@ static int nfs_readdir(struct file *file,
> > > > > struct
> > > > > dir_context *ctx)
> > > > >                 .ctx = ctx,
> > > > >                 .dir_cookie = &dir_ctx->dir_cookie,
> > > > >                 .plus = nfs_use_readdirplus(inode, ctx),
> > > > > +               .page_index = dir_ctx->page_index,
> > > > > +               .last_cookie = nfs_readdir_use_cookie(file) ?
> > > > > ctx-
> > > > > > pos : 0,
> > > > >         },
> > > > >                         *desc = &my_desc;
> > > > >         int res = 0;
> > > > > @@ -938,6 +941,7 @@ static int nfs_readdir(struct file *file,
> > > > > struct
> > > > > dir_context *ctx)
> > > > >  out:
> > > > >         if (res > 0)
> > > > >                 res = 0;
> > > > > +       dir_ctx->page_index = desc->page_index;
> > > > >         trace_nfs_readdir_exit(inode, ctx->pos, dir_ctx-
> > > > > > dir_cookie,
> > > > >                                NFS_SERVER(inode)->dtsize,
> > > > > my_desc.plus, res);
> > > > >         return res;
> > > > > @@ -975,6 +979,8 @@ static loff_t nfs_llseek_dir(struct file
> > > > > *filp,
> > > > > loff_t offset, int whence)
> > > > >                 else
> > > > >                         dir_ctx->dir_cookie = 0;
> > > > >                 dir_ctx->duped = 0;
> > > > > +               /* Force readdir_search_pagecache to start
> > > > > over
> > > > > */
> > > > > +               dir_ctx->page_index = 0;
> > > > >         }
> > > > >         inode_unlock(inode);
> > > > >         return offset;
> > > > > diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> > > > > index a2c6455ea3fa..0e55c0154ccd 100644
> > > > > --- a/include/linux/nfs_fs.h
> > > > > +++ b/include/linux/nfs_fs.h
> > > > > @@ -93,6 +93,7 @@ struct nfs_open_dir_context {
> > > > >         __u64 dir_cookie;
> > > > >         __u64 dup_cookie;
> > > > >         signed char duped;
> > > > > +       unsigned long   page_index;
> > > > >  };
> > > > > 
> > > > >  /*
> > > > 
> > > > NACK. It makes no sense to store the page index as a cursor.
> > > > 
> > > 
> > > A similar thing was done recently with:
> > > 227823d2074d nfs: optimise readdir cache page invalidation
> > > 
> > 
> > That's a very different thing. It is about discarding page data in
> > order to force a re-read of the contents into cache.
> > 
> Right - I only pointed it out because it is in effect a cursor about
> the last access into the cache but it's on a global basis, not
> process context.
> 
> > What you're doing is basically trying to guess where the data is
> > located. which might work in some cases where the directory is
> > completely static, but if it shrinks (e.g. due to a few unlink() or
> > rename() calls) so that you overshoot the cookie, then you can end
> > up
> > reading all the way to the end of the directory before doing an
> > uncached readdir.
> > 
> First, consider the unmodified (idle directory) scenario.  Today the
> performance is bad the larger the directory goes - do you see why?
> I tried to explain in the cover letter and header but maybe it's not
> clear?
> 
> Second, the modified scenario today the performance is very bad
> because of the same problem - the cookie is reset and the process
> needs to start over at cookie 0, repeating READDIRs.  But maybe
> there's a specific scenario I'm not thinking about.
> 
> The way I thought about this is that if you're in a heavily modified
> scenario with a large directory and you're past the 'acdirmax' time,
> you have to make the choice of either:
> a) ignoring 'acdirmax' (this is what the NFSv3 patch did) and even
> though you know the cache expired you keep going as though it
> did not (at least until a different process starts a listing)
> b) honoring 'acdirmax' (drop the pagecache), but keep going the
> best you can based on the previous information and don't try to
> rebuild the cache before continuing.
> 
> > IOW: This will have a detrimental effect for some workloads, which
> > needs to be weighed up against the benefits. I saw that you've
> > tested
> > with large directories, but what workloads were you testing on
> > those
> > directories?
> > 
> I can definitely do further testing and any scenario you want to try
> to
> break it or find a pathological scenario. So far I've tested the
> reader ("ls -lf") in parallel with one of the two writers:
> 1) random add a file every 0.1s:
> while true; do i=$((1 + RANDOM % $NUM_FILES)); echo $i; touch
> $MNT2/file$i.bin; builtin sleep 0.1; done > /dev/null 2>&1 &
> 2) random delete a file every 0.1 s:
> while true; do i=$((1 + RANDOM % $NUM_FILES)); echo $i; rm -f
> $MNT2/file$i; builtin sleep 0.1; done > /dev/null 2>&1 &
> 
> In no case did I see it take a longer time or ops vs vanilla 5.9, the
> idle
> and modified performance is better (measured in seconds and ops)
> with this patch.  Below is a short summary.  Note that the first time
> and
> ops is with an idle directory, and the second one is the modified.
> 
> 5.9 (vanilla): random delete a file every 0.1 s:
> Ops increased from 4734 to 8834
> Time increased from 23 to 44
> 
> 5.9 (this patch): random delete a file every 0.1 s:
> Ops increased from 4697 to 4696
> Time increased from 20 to 30
> 
> 
> 5.9 (vanilla): random add a file every 0.1s:
> Ops increased from 4734 to 9168
> Time increased from 23 to 43
> 
> 5.9 (this patch): random add a file every 0.1s:
> Ops increased from 4697 to 4702
> Time increased from 21 to 32
> 

If you're not seeing any change in number of ops then those numbers are
basically telling you that you're not seeing any cache invalidation.
You should be seeing cache invalidation when you are creating and
deleting files and are doing simultaneous readdirs.
David Wysochanski Nov. 2, 2020, 10:05 p.m. UTC | #6
On Mon, Nov 2, 2020 at 4:30 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> On Mon, 2020-11-02 at 14:45 -0500, David Wysochanski wrote:
> > On Mon, Nov 2, 2020 at 12:31 PM Trond Myklebust <
> > trondmy@hammerspace.com> wrote:
> > >
> > > On Mon, 2020-11-02 at 11:26 -0500, David Wysochanski wrote:
> > > > On Mon, Nov 2, 2020 at 11:22 AM Trond Myklebust <
> > > > trondmy@hammerspace.com> wrote:
> > > > >
> > > > > On Mon, 2020-11-02 at 08:50 -0500, Dave Wysochanski wrote:
> > > > > > A process can hang forever to 'ls -l' a directory while the
> > > > > > directory
> > > > > > is being modified such as another NFS client adding files to
> > > > > > the
> > > > > > directory.  The problem is seen specifically with larger
> > > > > > directories
> > > > > > (I tested with 1 million) and/or slower NFS server responses
> > > > > > to
> > > > > > READDIR.  If a combination of the NFS directory size, the NFS
> > > > > > server
> > > > > > responses to READDIR is such that the 'ls' process gets
> > > > > > partially
> > > > > > through the listing before the attribute cache expires (time
> > > > > > exceeds acdirmax), we drop the pagecache and have to re-fill
> > > > > > it,
> > > > > > and as a result, the process may never complete.  One could
> > > > > > argue
> > > > > > for larger directories the acdirmin/acdirmax should be
> > > > > > increased,
> > > > > > but it's not always possible to tune this effectively.
> > > > > >
> > > > > > The root cause of this problem is due to how the NFS readdir
> > > > > > cache
> > > > > > currently works.  The main search function,
> > > > > > readdir_search_pagecache(),
> > > > > > always starts searching at page_index and cookie == 0, and
> > > > > > for
> > > > > > any
> > > > > > page not in the cache, fills in the page with entries
> > > > > > obtained in
> > > > > > a READDIR NFS call.  If a page already exists, we proceed to
> > > > > > nfs_readdir_search_for_cookie(), which searches for the
> > > > > > cookie
> > > > > > (pos) of the readdir call.  The search is O(n), where n is
> > > > > > the
> > > > > > directory size before the cookie in question is found, and
> > > > > > every
> > > > > > entry to nfs_readdir() pays this penalty, irrespective of the
> > > > > > current directory position (dir_context.pos).  The search is
> > > > > > expensive due to the opaque nature of readdir cookies, and
> > > > > > the
> > > > > > fact
> > > > > > that no mapping (hash) exists from cookies to pages.  In the
> > > > > > case
> > > > > > of a directory being modified, the above behavior can become
> > > > > > an
> > > > > > excessive penalty, since the same process is forced to fill
> > > > > > pages
> > > > > > it
> > > > > > may be no longer interested in (the entries were passed in a
> > > > > > previous
> > > > > > nfs_readdir call), and this can essentially lead no forward
> > > > > > progress.
> > > > > >
> > > > > > To fix this problem, at the end of nfs_readdir(), save the
> > > > > > page_index
> > > > > > corresponding to the directory position (cookie) inside the
> > > > > > process's
> > > > > > nfs_open_dir_context.  Then at the next entry of
> > > > > > nfs_readdir(),
> > > > > > use
> > > > > > the saved page_index as the starting search point rather than
> > > > > > starting
> > > > > > at page_index == 0.  Not only does this fix the problem of
> > > > > > listing
> > > > > > a directory being modified, it also significantly improves
> > > > > > performance
> > > > > > in the unmodified case since no extra search penalty is paid
> > > > > > at
> > > > > > each
> > > > > > entry to nfs_readdir().
> > > > > >
> > > > > > In the case of lseek, since there is no hash or other mapping
> > > > > > from a
> > > > > > cookie value to the page->index, just reset
> > > > > > nfs_open_dir_context.page_index
> > > > > > to 0, which will reset the search to the old behavior.
> > > > > >
> > > > > > Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
> > > > > > ---
> > > > > >  fs/nfs/dir.c           | 8 +++++++-
> > > > > >  include/linux/nfs_fs.h | 1 +
> > > > > >  2 files changed, 8 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > > > > index 52e06c8fc7cd..b266f505b521 100644
> > > > > > --- a/fs/nfs/dir.c
> > > > > > +++ b/fs/nfs/dir.c
> > > > > > @@ -78,6 +78,7 @@ static struct nfs_open_dir_context
> > > > > > *alloc_nfs_open_dir_context(struct inode *dir
> > > > > >                 ctx->attr_gencount = nfsi->attr_gencount;
> > > > > >                 ctx->dir_cookie = 0;
> > > > > >                 ctx->dup_cookie = 0;
> > > > > > +               ctx->page_index = 0;
> > > > > >                 ctx->cred = get_cred(cred);
> > > > > >                 spin_lock(&dir->i_lock);
> > > > > >                 if (list_empty(&nfsi->open_files) &&
> > > > > > @@ -763,7 +764,7 @@ int
> > > > > > find_and_lock_cache_page(nfs_readdir_descriptor_t *desc)
> > > > > >         return res;
> > > > > >  }
> > > > > >
> > > > > > -/* Search for desc->dir_cookie from the beginning of the
> > > > > > page
> > > > > > cache
> > > > > > */
> > > > > > +/* Search for desc->dir_cookie starting at desc->page_index
> > > > > > */
> > > > > >  static inline
> > > > > >  int readdir_search_pagecache(nfs_readdir_descriptor_t *desc)
> > > > > >  {
> > > > > > @@ -885,6 +886,8 @@ static int nfs_readdir(struct file *file,
> > > > > > struct
> > > > > > dir_context *ctx)
> > > > > >                 .ctx = ctx,
> > > > > >                 .dir_cookie = &dir_ctx->dir_cookie,
> > > > > >                 .plus = nfs_use_readdirplus(inode, ctx),
> > > > > > +               .page_index = dir_ctx->page_index,
> > > > > > +               .last_cookie = nfs_readdir_use_cookie(file) ?
> > > > > > ctx-
> > > > > > > pos : 0,
> > > > > >         },
> > > > > >                         *desc = &my_desc;
> > > > > >         int res = 0;
> > > > > > @@ -938,6 +941,7 @@ static int nfs_readdir(struct file *file,
> > > > > > struct
> > > > > > dir_context *ctx)
> > > > > >  out:
> > > > > >         if (res > 0)
> > > > > >                 res = 0;
> > > > > > +       dir_ctx->page_index = desc->page_index;
> > > > > >         trace_nfs_readdir_exit(inode, ctx->pos, dir_ctx-
> > > > > > > dir_cookie,
> > > > > >                                NFS_SERVER(inode)->dtsize,
> > > > > > my_desc.plus, res);
> > > > > >         return res;
> > > > > > @@ -975,6 +979,8 @@ static loff_t nfs_llseek_dir(struct file
> > > > > > *filp,
> > > > > > loff_t offset, int whence)
> > > > > >                 else
> > > > > >                         dir_ctx->dir_cookie = 0;
> > > > > >                 dir_ctx->duped = 0;
> > > > > > +               /* Force readdir_search_pagecache to start
> > > > > > over
> > > > > > */
> > > > > > +               dir_ctx->page_index = 0;
> > > > > >         }
> > > > > >         inode_unlock(inode);
> > > > > >         return offset;
> > > > > > diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> > > > > > index a2c6455ea3fa..0e55c0154ccd 100644
> > > > > > --- a/include/linux/nfs_fs.h
> > > > > > +++ b/include/linux/nfs_fs.h
> > > > > > @@ -93,6 +93,7 @@ struct nfs_open_dir_context {
> > > > > >         __u64 dir_cookie;
> > > > > >         __u64 dup_cookie;
> > > > > >         signed char duped;
> > > > > > +       unsigned long   page_index;
> > > > > >  };
> > > > > >
> > > > > >  /*
> > > > >
> > > > > NACK. It makes no sense to store the page index as a cursor.
> > > > >
> > > >
> > > > A similar thing was done recently with:
> > > > 227823d2074d nfs: optimise readdir cache page invalidation
> > > >
> > >
> > > That's a very different thing. It is about discarding page data in
> > > order to force a re-read of the contents into cache.
> > >
> > Right - I only pointed it out because it is in effect a cursor about
> > the last access into the cache but it's on a global basis, not
> > process context.
> >
> > > What you're doing is basically trying to guess where the data is
> > > located. which might work in some cases where the directory is
> > > completely static, but if it shrinks (e.g. due to a few unlink() or
> > > rename() calls) so that you overshoot the cookie, then you can end
> > > up
> > > reading all the way to the end of the directory before doing an
> > > uncached readdir.
> > >
> > First, consider the unmodified (idle directory) scenario.  Today the
> > performance is bad the larger the directory goes - do you see why?
> > I tried to explain in the cover letter and header but maybe it's not
> > clear?
> >
> > Second, the modified scenario today the performance is very bad
> > because of the same problem - the cookie is reset and the process
> > needs to start over at cookie 0, repeating READDIRs.  But maybe
> > there's a specific scenario I'm not thinking about.
> >
> > The way I thought about this is that if you're in a heavily modified
> > scenario with a large directory and you're past the 'acdirmax' time,
> > you have to make the choice of either:
> > a) ignoring 'acdirmax' (this is what the NFSv3 patch did) and even
> > though you know the cache expired you keep going as though it
> > did not (at least until a different process starts a listing)
> > b) honoring 'acdirmax' (drop the pagecache), but keep going the
> > best you can based on the previous information and don't try to
> > rebuild the cache before continuing.
> >
> > > IOW: This will have a detrimental effect for some workloads, which
> > > needs to be weighed up against the benefits. I saw that you've
> > > tested
> > > with large directories, but what workloads were you testing on
> > > those
> > > directories?
> > >
> > I can definitely do further testing and any scenario you want to try
> > to
> > break it or find a pathological scenario. So far I've tested the
> > reader ("ls -lf") in parallel with one of the two writers:
> > 1) random add a file every 0.1s:
> > while true; do i=$((1 + RANDOM % $NUM_FILES)); echo $i; touch
> > $MNT2/file$i.bin; builtin sleep 0.1; done > /dev/null 2>&1 &
> > 2) random delete a file every 0.1 s:
> > while true; do i=$((1 + RANDOM % $NUM_FILES)); echo $i; rm -f
> > $MNT2/file$i; builtin sleep 0.1; done > /dev/null 2>&1 &
> >
> > In no case did I see it take a longer time or ops vs vanilla 5.9, the
> > idle
> > and modified performance is better (measured in seconds and ops)
> > with this patch.  Below is a short summary.  Note that the first time
> > and
> > ops is with an idle directory, and the second one is the modified.
> >
> > 5.9 (vanilla): random delete a file every 0.1 s:
> > Ops increased from 4734 to 8834
> > Time increased from 23 to 44
> >
> > 5.9 (this patch): random delete a file every 0.1 s:
> > Ops increased from 4697 to 4696
> > Time increased from 20 to 30
> >
> >
> > 5.9 (vanilla): random add a file every 0.1s:
> > Ops increased from 4734 to 9168
> > Time increased from 23 to 43
> >
> > 5.9 (this patch): random add a file every 0.1s:
> > Ops increased from 4697 to 4702
> > Time increased from 21 to 32
> >
>
> If you're not seeing any change in number of ops then those numbers are
> basically telling you that you're not seeing any cache invalidation.
> You should be seeing cache invalidation when you are creating and
> deleting files and are doing simultaneous readdirs.
>

No I think you're misunderstanding or we're not on the same page.
The difference is, with 5.9 vanilla when the invalidation occurs, the
reader process has to go back to cookie 0 and pays the penalty
to refill all the cache pages up to cookie 'N'.  With the patch this
is not the case - it continues on with cookie 'N' and does not pay the
penalty - the next reader does.

> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
>
>
Frank van der Linden Nov. 3, 2020, 12:09 a.m. UTC | #7
Hi Dave,

Thanks for looking at this.

On Mon, Nov 02, 2020 at 02:45:09PM -0500, David Wysochanski wrote:
> On Mon, Nov 2, 2020 at 12:31 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
> >
> > On Mon, 2020-11-02 at 11:26 -0500, David Wysochanski wrote:
> > > On Mon, Nov 2, 2020 at 11:22 AM Trond Myklebust <
> > > trondmy@hammerspace.com> wrote:
> > > >
> > > > On Mon, 2020-11-02 at 08:50 -0500, Dave Wysochanski wrote:
> > > > > A process can hang forever to 'ls -l' a directory while the
> > > > > directory
> > > > > is being modified such as another NFS client adding files to the
> > > > > directory.  The problem is seen specifically with larger
> > > > > directories
> > > > > (I tested with 1 million) and/or slower NFS server responses to
> > > > > READDIR.  If a combination of the NFS directory size, the NFS
> > > > > server
> > > > > responses to READDIR is such that the 'ls' process gets partially
> > > > > through the listing before the attribute cache expires (time
> > > > > exceeds acdirmax), we drop the pagecache and have to re-fill it,
> > > > > and as a result, the process may never complete.  One could argue
> > > > > for larger directories the acdirmin/acdirmax should be increased,
> > > > > but it's not always possible to tune this effectively.
> > > > >
> > > > > The root cause of this problem is due to how the NFS readdir
> > > > > cache
> > > > > currently works.  The main search function,
> > > > > readdir_search_pagecache(),
> > > > > always starts searching at page_index and cookie == 0, and for
> > > > > any
> > > > > page not in the cache, fills in the page with entries obtained in
> > > > > a READDIR NFS call.  If a page already exists, we proceed to
> > > > > nfs_readdir_search_for_cookie(), which searches for the cookie
> > > > > (pos) of the readdir call.  The search is O(n), where n is the
> > > > > directory size before the cookie in question is found, and every
> > > > > entry to nfs_readdir() pays this penalty, irrespective of the
> > > > > current directory position (dir_context.pos).  The search is
> > > > > expensive due to the opaque nature of readdir cookies, and the
> > > > > fact
> > > > > that no mapping (hash) exists from cookies to pages.  In the case
> > > > > of a directory being modified, the above behavior can become an
> > > > > excessive penalty, since the same process is forced to fill pages
> > > > > it
> > > > > may be no longer interested in (the entries were passed in a
> > > > > previous
> > > > > nfs_readdir call), and this can essentially lead no forward
> > > > > progress.
> > > > >
> > > > > To fix this problem, at the end of nfs_readdir(), save the
> > > > > page_index
> > > > > corresponding to the directory position (cookie) inside the
> > > > > process's
> > > > > nfs_open_dir_context.  Then at the next entry of nfs_readdir(),
> > > > > use
> > > > > the saved page_index as the starting search point rather than
> > > > > starting
> > > > > at page_index == 0.  Not only does this fix the problem of
> > > > > listing
> > > > > a directory being modified, it also significantly improves
> > > > > performance
> > > > > in the unmodified case since no extra search penalty is paid at
> > > > > each
> > > > > entry to nfs_readdir().
> > > > >
> > > > > In the case of lseek, since there is no hash or other mapping
> > > > > from a
> > > > > cookie value to the page->index, just reset
> > > > > nfs_open_dir_context.page_index
> > > > > to 0, which will reset the search to the old behavior.
> > > > >
> > > > > Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
> > > > > ---
> > > > >  fs/nfs/dir.c           | 8 +++++++-
> > > > >  include/linux/nfs_fs.h | 1 +
> > > > >  2 files changed, 8 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > > > index 52e06c8fc7cd..b266f505b521 100644
> > > > > --- a/fs/nfs/dir.c
> > > > > +++ b/fs/nfs/dir.c
> > > > > @@ -78,6 +78,7 @@ static struct nfs_open_dir_context
> > > > > *alloc_nfs_open_dir_context(struct inode *dir
> > > > >                 ctx->attr_gencount = nfsi->attr_gencount;
> > > > >                 ctx->dir_cookie = 0;
> > > > >                 ctx->dup_cookie = 0;
> > > > > +               ctx->page_index = 0;
> > > > >                 ctx->cred = get_cred(cred);
> > > > >                 spin_lock(&dir->i_lock);
> > > > >                 if (list_empty(&nfsi->open_files) &&
> > > > > @@ -763,7 +764,7 @@ int
> > > > > find_and_lock_cache_page(nfs_readdir_descriptor_t *desc)
> > > > >         return res;
> > > > >  }
> > > > >
> > > > > -/* Search for desc->dir_cookie from the beginning of the page
> > > > > cache
> > > > > */
> > > > > +/* Search for desc->dir_cookie starting at desc->page_index */
> > > > >  static inline
> > > > >  int readdir_search_pagecache(nfs_readdir_descriptor_t *desc)
> > > > >  {
> > > > > @@ -885,6 +886,8 @@ static int nfs_readdir(struct file *file,
> > > > > struct
> > > > > dir_context *ctx)
> > > > >                 .ctx = ctx,
> > > > >                 .dir_cookie = &dir_ctx->dir_cookie,
> > > > >                 .plus = nfs_use_readdirplus(inode, ctx),
> > > > > +               .page_index = dir_ctx->page_index,
> > > > > +               .last_cookie = nfs_readdir_use_cookie(file) ?
> > > > > ctx-
> > > > > > pos : 0,
> > > > >         },
> > > > >                         *desc = &my_desc;
> > > > >         int res = 0;
> > > > > @@ -938,6 +941,7 @@ static int nfs_readdir(struct file *file,
> > > > > struct
> > > > > dir_context *ctx)
> > > > >  out:
> > > > >         if (res > 0)
> > > > >                 res = 0;
> > > > > +       dir_ctx->page_index = desc->page_index;
> > > > >         trace_nfs_readdir_exit(inode, ctx->pos, dir_ctx-
> > > > > >dir_cookie,
> > > > >                                NFS_SERVER(inode)->dtsize,
> > > > > my_desc.plus, res);
> > > > >         return res;
> > > > > @@ -975,6 +979,8 @@ static loff_t nfs_llseek_dir(struct file
> > > > > *filp,
> > > > > loff_t offset, int whence)
> > > > >                 else
> > > > >                         dir_ctx->dir_cookie = 0;
> > > > >                 dir_ctx->duped = 0;
> > > > > +               /* Force readdir_search_pagecache to start over
> > > > > */
> > > > > +               dir_ctx->page_index = 0;
> > > > >         }
> > > > >         inode_unlock(inode);
> > > > >         return offset;
> > > > > diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> > > > > index a2c6455ea3fa..0e55c0154ccd 100644
> > > > > --- a/include/linux/nfs_fs.h
> > > > > +++ b/include/linux/nfs_fs.h
> > > > > @@ -93,6 +93,7 @@ struct nfs_open_dir_context {
> > > > >         __u64 dir_cookie;
> > > > >         __u64 dup_cookie;
> > > > >         signed char duped;
> > > > > +       unsigned long   page_index;
> > > > >  };
> > > > >
> > > > >  /*
> > > >
> > > > NACK. It makes no sense to store the page index as a cursor.
> > > >
> > >
> > > A similar thing was done recently with:
> > > 227823d2074d nfs: optimise readdir cache page invalidation
> > >
> >
> > That's a very different thing. It is about discarding page data in
> > order to force a re-read of the contents into cache.
> >
> Right - I only pointed it out because it is in effect a cursor about
> the last access into the cache but it's on a global basis, not
> process context.
> 
> > What you're doing is basically trying to guess where the data is
> > located. which might work in some cases where the directory is
> > completely static, but if it shrinks (e.g. due to a few unlink() or
> > rename() calls) so that you overshoot the cookie, then you can end up
> > reading all the way to the end of the directory before doing an
> > uncached readdir.
> >
> First, consider the unmodified (idle directory) scenario.  Today the
> performance is bad the larger the directory goes - do you see why?
> I tried to explain in the cover letter and header but maybe it's not clear?

I agree with you that the patch is good for that case. It seems
like a bug that, if the cache is not invalidated in the meantime,
nfs_readdir would start at offset 0 of the cache while searching.

But, as Trond said, page_index isn't a good cursor once the
page cache was invalidated, you don't know what data you are
leaving out inadvertently.

> 
> Second, the modified scenario today the performance is very bad
> because of the same problem - the cookie is reset and the process
> needs to start over at cookie 0, repeating READDIRs.  But maybe
> there's a specific scenario I'm not thinking about.
> 
> The way I thought about this is that if you're in a heavily modified
> scenario with a large directory and you're past the 'acdirmax' time,
> you have to make the choice of either:
> a) ignoring 'acdirmax' (this is what the NFSv3 patch did) and even
> though you know the cache expired you keep going as though it
> did not (at least until a different process starts a listing)
> b) honoring 'acdirmax' (drop the pagecache), but keep going the
> best you can based on the previous information and don't try to
> rebuild the cache before continuing.

Yeah, it seems like those are your options. The only other way
of doing this that I can think of would be to add logic that
says something like:

	if (pos != 0 and the cache is invalid)
		uncached_readdir()

..which is, by the looks of it, easily done by having
readdir_search_pagecache() return EBADCOOKIE if
there's no cache, and the offset is not 0. E.g. it
shouldn't try to fill the cache in that case, but
just let uncached reads take over.

The idea here is that the only information that is possibly
still valid is the cookie - and that the server will know
how to use the cookie to keep going where you left off.

Of course, this mean that, in the case of another client
modifying the directory, you'll end up doing mostly / all
uncached readdirs after an invalidate. Same for a scenario
where a client did: open llseek(<nonzero_offset), getdents().
Until another process starts at offset 0 and re-fills the
cache, of course.

I'm not sure if that's all that bad, though, it depends
on common usage patterns.

- Frank
Trond Myklebust Nov. 3, 2020, 3:38 a.m. UTC | #8
On Mon, 2020-11-02 at 17:05 -0500, David Wysochanski wrote:
> On Mon, Nov 2, 2020 at 4:30 PM Trond Myklebust <
> trondmy@hammerspace.com> wrote:
> > 
> > On Mon, 2020-11-02 at 14:45 -0500, David Wysochanski wrote:
> > > On Mon, Nov 2, 2020 at 12:31 PM Trond Myklebust <
> > > trondmy@hammerspace.com> wrote:
> > > > 
> > > > On Mon, 2020-11-02 at 11:26 -0500, David Wysochanski wrote:
> > > > > On Mon, Nov 2, 2020 at 11:22 AM Trond Myklebust <
> > > > > trondmy@hammerspace.com> wrote:
> > > > > > 
> > > > > > On Mon, 2020-11-02 at 08:50 -0500, Dave Wysochanski wrote:
> > > > > > > A process can hang forever to 'ls -l' a directory while
> > > > > > > the
> > > > > > > directory
> > > > > > > is being modified such as another NFS client adding files
> > > > > > > to
> > > > > > > the
> > > > > > > directory.  The problem is seen specifically with larger
> > > > > > > directories
> > > > > > > (I tested with 1 million) and/or slower NFS server
> > > > > > > responses
> > > > > > > to
> > > > > > > READDIR.  If a combination of the NFS directory size, the
> > > > > > > NFS
> > > > > > > server
> > > > > > > responses to READDIR is such that the 'ls' process gets
> > > > > > > partially
> > > > > > > through the listing before the attribute cache expires
> > > > > > > (time
> > > > > > > exceeds acdirmax), we drop the pagecache and have to re-
> > > > > > > fill
> > > > > > > it,
> > > > > > > and as a result, the process may never complete.  One
> > > > > > > could
> > > > > > > argue
> > > > > > > for larger directories the acdirmin/acdirmax should be
> > > > > > > increased,
> > > > > > > but it's not always possible to tune this effectively.
> > > > > > > 
> > > > > > > The root cause of this problem is due to how the NFS
> > > > > > > readdir
> > > > > > > cache
> > > > > > > currently works.  The main search function,
> > > > > > > readdir_search_pagecache(),
> > > > > > > always starts searching at page_index and cookie == 0,
> > > > > > > and
> > > > > > > for
> > > > > > > any
> > > > > > > page not in the cache, fills in the page with entries
> > > > > > > obtained in
> > > > > > > a READDIR NFS call.  If a page already exists, we proceed
> > > > > > > to
> > > > > > > nfs_readdir_search_for_cookie(), which searches for the
> > > > > > > cookie
> > > > > > > (pos) of the readdir call.  The search is O(n), where n
> > > > > > > is
> > > > > > > the
> > > > > > > directory size before the cookie in question is found,
> > > > > > > and
> > > > > > > every
> > > > > > > entry to nfs_readdir() pays this penalty, irrespective of
> > > > > > > the
> > > > > > > current directory position (dir_context.pos).  The search
> > > > > > > is
> > > > > > > expensive due to the opaque nature of readdir cookies,
> > > > > > > and
> > > > > > > the
> > > > > > > fact
> > > > > > > that no mapping (hash) exists from cookies to pages.  In
> > > > > > > the
> > > > > > > case
> > > > > > > of a directory being modified, the above behavior can
> > > > > > > become
> > > > > > > an
> > > > > > > excessive penalty, since the same process is forced to
> > > > > > > fill
> > > > > > > pages
> > > > > > > it
> > > > > > > may be no longer interested in (the entries were passed
> > > > > > > in a
> > > > > > > previous
> > > > > > > nfs_readdir call), and this can essentially lead no
> > > > > > > forward
> > > > > > > progress.
> > > > > > > 
> > > > > > > To fix this problem, at the end of nfs_readdir(), save
> > > > > > > the
> > > > > > > page_index
> > > > > > > corresponding to the directory position (cookie) inside
> > > > > > > the
> > > > > > > process's
> > > > > > > nfs_open_dir_context.  Then at the next entry of
> > > > > > > nfs_readdir(),
> > > > > > > use
> > > > > > > the saved page_index as the starting search point rather
> > > > > > > than
> > > > > > > starting
> > > > > > > at page_index == 0.  Not only does this fix the problem
> > > > > > > of
> > > > > > > listing
> > > > > > > a directory being modified, it also significantly
> > > > > > > improves
> > > > > > > performance
> > > > > > > in the unmodified case since no extra search penalty is
> > > > > > > paid
> > > > > > > at
> > > > > > > each
> > > > > > > entry to nfs_readdir().
> > > > > > > 
> > > > > > > In the case of lseek, since there is no hash or other
> > > > > > > mapping
> > > > > > > from a
> > > > > > > cookie value to the page->index, just reset
> > > > > > > nfs_open_dir_context.page_index
> > > > > > > to 0, which will reset the search to the old behavior.
> > > > > > > 
> > > > > > > Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
> > > > > > > ---
> > > > > > >  fs/nfs/dir.c           | 8 +++++++-
> > > > > > >  include/linux/nfs_fs.h | 1 +
> > > > > > >  2 files changed, 8 insertions(+), 1 deletion(-)
> > > > > > > 
> > > > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > > > > > index 52e06c8fc7cd..b266f505b521 100644
> > > > > > > --- a/fs/nfs/dir.c
> > > > > > > +++ b/fs/nfs/dir.c
> > > > > > > @@ -78,6 +78,7 @@ static struct nfs_open_dir_context
> > > > > > > *alloc_nfs_open_dir_context(struct inode *dir
> > > > > > >                 ctx->attr_gencount = nfsi->attr_gencount;
> > > > > > >                 ctx->dir_cookie = 0;
> > > > > > >                 ctx->dup_cookie = 0;
> > > > > > > +               ctx->page_index = 0;
> > > > > > >                 ctx->cred = get_cred(cred);
> > > > > > >                 spin_lock(&dir->i_lock);
> > > > > > >                 if (list_empty(&nfsi->open_files) &&
> > > > > > > @@ -763,7 +764,7 @@ int
> > > > > > > find_and_lock_cache_page(nfs_readdir_descriptor_t *desc)
> > > > > > >         return res;
> > > > > > >  }
> > > > > > > 
> > > > > > > -/* Search for desc->dir_cookie from the beginning of the
> > > > > > > page
> > > > > > > cache
> > > > > > > */
> > > > > > > +/* Search for desc->dir_cookie starting at desc-
> > > > > > > >page_index
> > > > > > > */
> > > > > > >  static inline
> > > > > > >  int readdir_search_pagecache(nfs_readdir_descriptor_t
> > > > > > > *desc)
> > > > > > >  {
> > > > > > > @@ -885,6 +886,8 @@ static int nfs_readdir(struct file
> > > > > > > *file,
> > > > > > > struct
> > > > > > > dir_context *ctx)
> > > > > > >                 .ctx = ctx,
> > > > > > >                 .dir_cookie = &dir_ctx->dir_cookie,
> > > > > > >                 .plus = nfs_use_readdirplus(inode, ctx),
> > > > > > > +               .page_index = dir_ctx->page_index,
> > > > > > > +               .last_cookie =
> > > > > > > nfs_readdir_use_cookie(file) ?
> > > > > > > ctx-
> > > > > > > > pos : 0,
> > > > > > >         },
> > > > > > >                         *desc = &my_desc;
> > > > > > >         int res = 0;
> > > > > > > @@ -938,6 +941,7 @@ static int nfs_readdir(struct file
> > > > > > > *file,
> > > > > > > struct
> > > > > > > dir_context *ctx)
> > > > > > >  out:
> > > > > > >         if (res > 0)
> > > > > > >                 res = 0;
> > > > > > > +       dir_ctx->page_index = desc->page_index;
> > > > > > >         trace_nfs_readdir_exit(inode, ctx->pos, dir_ctx-
> > > > > > > > dir_cookie,
> > > > > > >                                NFS_SERVER(inode)->dtsize,
> > > > > > > my_desc.plus, res);
> > > > > > >         return res;
> > > > > > > @@ -975,6 +979,8 @@ static loff_t nfs_llseek_dir(struct
> > > > > > > file
> > > > > > > *filp,
> > > > > > > loff_t offset, int whence)
> > > > > > >                 else
> > > > > > >                         dir_ctx->dir_cookie = 0;
> > > > > > >                 dir_ctx->duped = 0;
> > > > > > > +               /* Force readdir_search_pagecache to
> > > > > > > start
> > > > > > > over
> > > > > > > */
> > > > > > > +               dir_ctx->page_index = 0;
> > > > > > >         }
> > > > > > >         inode_unlock(inode);
> > > > > > >         return offset;
> > > > > > > diff --git a/include/linux/nfs_fs.h
> > > > > > > b/include/linux/nfs_fs.h
> > > > > > > index a2c6455ea3fa..0e55c0154ccd 100644
> > > > > > > --- a/include/linux/nfs_fs.h
> > > > > > > +++ b/include/linux/nfs_fs.h
> > > > > > > @@ -93,6 +93,7 @@ struct nfs_open_dir_context {
> > > > > > >         __u64 dir_cookie;
> > > > > > >         __u64 dup_cookie;
> > > > > > >         signed char duped;
> > > > > > > +       unsigned long   page_index;
> > > > > > >  };
> > > > > > > 
> > > > > > >  /*
> > > > > > 
> > > > > > NACK. It makes no sense to store the page index as a
> > > > > > cursor.
> > > > > > 
> > > > > 
> > > > > A similar thing was done recently with:
> > > > > 227823d2074d nfs: optimise readdir cache page invalidation
> > > > > 
> > > > 
> > > > That's a very different thing. It is about discarding page data
> > > > in
> > > > order to force a re-read of the contents into cache.
> > > > 
> > > Right - I only pointed it out because it is in effect a cursor
> > > about
> > > the last access into the cache but it's on a global basis, not
> > > process context.
> > > 
> > > > What you're doing is basically trying to guess where the data
> > > > is
> > > > located. which might work in some cases where the directory is
> > > > completely static, but if it shrinks (e.g. due to a few
> > > > unlink() or
> > > > rename() calls) so that you overshoot the cookie, then you can
> > > > end
> > > > up
> > > > reading all the way to the end of the directory before doing an
> > > > uncached readdir.
> > > > 
> > > First, consider the unmodified (idle directory) scenario.  Today
> > > the
> > > performance is bad the larger the directory goes - do you see
> > > why?
> > > I tried to explain in the cover letter and header but maybe it's
> > > not
> > > clear?
> > > 
> > > Second, the modified scenario today the performance is very bad
> > > because of the same problem - the cookie is reset and the process
> > > needs to start over at cookie 0, repeating READDIRs.  But maybe
> > > there's a specific scenario I'm not thinking about.
> > > 
> > > The way I thought about this is that if you're in a heavily
> > > modified
> > > scenario with a large directory and you're past the 'acdirmax'
> > > time,
> > > you have to make the choice of either:
> > > a) ignoring 'acdirmax' (this is what the NFSv3 patch did) and
> > > even
> > > though you know the cache expired you keep going as though it
> > > did not (at least until a different process starts a listing)
> > > b) honoring 'acdirmax' (drop the pagecache), but keep going the
> > > best you can based on the previous information and don't try to
> > > rebuild the cache before continuing.
> > > 
> > > > IOW: This will have a detrimental effect for some workloads,
> > > > which
> > > > needs to be weighed up against the benefits. I saw that you've
> > > > tested
> > > > with large directories, but what workloads were you testing on
> > > > those
> > > > directories?
> > > > 
> > > I can definitely do further testing and any scenario you want to
> > > try
> > > to
> > > break it or find a pathological scenario. So far I've tested the
> > > reader ("ls -lf") in parallel with one of the two writers:
> > > 1) random add a file every 0.1s:
> > > while true; do i=$((1 + RANDOM % $NUM_FILES)); echo $i; touch
> > > $MNT2/file$i.bin; builtin sleep 0.1; done > /dev/null 2>&1 &
> > > 2) random delete a file every 0.1 s:
> > > while true; do i=$((1 + RANDOM % $NUM_FILES)); echo $i; rm -f
> > > $MNT2/file$i; builtin sleep 0.1; done > /dev/null 2>&1 &
> > > 
> > > In no case did I see it take a longer time or ops vs vanilla 5.9,
> > > the
> > > idle
> > > and modified performance is better (measured in seconds and ops)
> > > with this patch.  Below is a short summary.  Note that the first
> > > time
> > > and
> > > ops is with an idle directory, and the second one is the
> > > modified.
> > > 
> > > 5.9 (vanilla): random delete a file every 0.1 s:
> > > Ops increased from 4734 to 8834
> > > Time increased from 23 to 44
> > > 
> > > 5.9 (this patch): random delete a file every 0.1 s:
> > > Ops increased from 4697 to 4696
> > > Time increased from 20 to 30
> > > 
> > > 
> > > 5.9 (vanilla): random add a file every 0.1s:
> > > Ops increased from 4734 to 9168
> > > Time increased from 23 to 43
> > > 
> > > 5.9 (this patch): random add a file every 0.1s:
> > > Ops increased from 4697 to 4702
> > > Time increased from 21 to 32
> > > 
> > 
> > If you're not seeing any change in number of ops then those numbers
> > are
> > basically telling you that you're not seeing any cache
> > invalidation.
> > You should be seeing cache invalidation when you are creating and
> > deleting files and are doing simultaneous readdirs.
> > 
> 
> No I think you're misunderstanding or we're not on the same page.
> The difference is, with 5.9 vanilla when the invalidation occurs, the
> reader process has to go back to cookie 0 and pays the penalty
> to refill all the cache pages up to cookie 'N'.  With the patch this
> is not the case - it continues on with cookie 'N' and does not pay
> the
> penalty - the next reader does
> 

Then I still don't see how you can avoid corrupting the page cache when
you're in effect just starting filling in cookies at a random page
cache index + offset that bears no relation to where the cookie would
end up if the reader started from 0.

However leaving aside that issue, what's the point of using the page
cache if the next reader has to clear out the data and start afresh
anyway? Why not just let all the threads avoid the page cache and just
do uncached readdir?
David Wysochanski Nov. 3, 2020, 1:29 p.m. UTC | #9
On Mon, Nov 2, 2020 at 10:39 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> On Mon, 2020-11-02 at 17:05 -0500, David Wysochanski wrote:
> > On Mon, Nov 2, 2020 at 4:30 PM Trond Myklebust <
> > trondmy@hammerspace.com> wrote:
> > >
> > > On Mon, 2020-11-02 at 14:45 -0500, David Wysochanski wrote:
> > > > On Mon, Nov 2, 2020 at 12:31 PM Trond Myklebust <
> > > > trondmy@hammerspace.com> wrote:
> > > > >
> > > > > On Mon, 2020-11-02 at 11:26 -0500, David Wysochanski wrote:
> > > > > > On Mon, Nov 2, 2020 at 11:22 AM Trond Myklebust <
> > > > > > trondmy@hammerspace.com> wrote:
> > > > > > >
> > > > > > > On Mon, 2020-11-02 at 08:50 -0500, Dave Wysochanski wrote:
> > > > > > > > A process can hang forever to 'ls -l' a directory while
> > > > > > > > the
> > > > > > > > directory
> > > > > > > > is being modified such as another NFS client adding files
> > > > > > > > to
> > > > > > > > the
> > > > > > > > directory.  The problem is seen specifically with larger
> > > > > > > > directories
> > > > > > > > (I tested with 1 million) and/or slower NFS server
> > > > > > > > responses
> > > > > > > > to
> > > > > > > > READDIR.  If a combination of the NFS directory size, the
> > > > > > > > NFS
> > > > > > > > server
> > > > > > > > responses to READDIR is such that the 'ls' process gets
> > > > > > > > partially
> > > > > > > > through the listing before the attribute cache expires
> > > > > > > > (time
> > > > > > > > exceeds acdirmax), we drop the pagecache and have to re-
> > > > > > > > fill
> > > > > > > > it,
> > > > > > > > and as a result, the process may never complete.  One
> > > > > > > > could
> > > > > > > > argue
> > > > > > > > for larger directories the acdirmin/acdirmax should be
> > > > > > > > increased,
> > > > > > > > but it's not always possible to tune this effectively.
> > > > > > > >
> > > > > > > > The root cause of this problem is due to how the NFS
> > > > > > > > readdir
> > > > > > > > cache
> > > > > > > > currently works.  The main search function,
> > > > > > > > readdir_search_pagecache(),
> > > > > > > > always starts searching at page_index and cookie == 0,
> > > > > > > > and
> > > > > > > > for
> > > > > > > > any
> > > > > > > > page not in the cache, fills in the page with entries
> > > > > > > > obtained in
> > > > > > > > a READDIR NFS call.  If a page already exists, we proceed
> > > > > > > > to
> > > > > > > > nfs_readdir_search_for_cookie(), which searches for the
> > > > > > > > cookie
> > > > > > > > (pos) of the readdir call.  The search is O(n), where n
> > > > > > > > is
> > > > > > > > the
> > > > > > > > directory size before the cookie in question is found,
> > > > > > > > and
> > > > > > > > every
> > > > > > > > entry to nfs_readdir() pays this penalty, irrespective of
> > > > > > > > the
> > > > > > > > current directory position (dir_context.pos).  The search
> > > > > > > > is
> > > > > > > > expensive due to the opaque nature of readdir cookies,
> > > > > > > > and
> > > > > > > > the
> > > > > > > > fact
> > > > > > > > that no mapping (hash) exists from cookies to pages.  In
> > > > > > > > the
> > > > > > > > case
> > > > > > > > of a directory being modified, the above behavior can
> > > > > > > > become
> > > > > > > > an
> > > > > > > > excessive penalty, since the same process is forced to
> > > > > > > > fill
> > > > > > > > pages
> > > > > > > > it
> > > > > > > > may be no longer interested in (the entries were passed
> > > > > > > > in a
> > > > > > > > previous
> > > > > > > > nfs_readdir call), and this can essentially lead no
> > > > > > > > forward
> > > > > > > > progress.
> > > > > > > >
> > > > > > > > To fix this problem, at the end of nfs_readdir(), save
> > > > > > > > the
> > > > > > > > page_index
> > > > > > > > corresponding to the directory position (cookie) inside
> > > > > > > > the
> > > > > > > > process's
> > > > > > > > nfs_open_dir_context.  Then at the next entry of
> > > > > > > > nfs_readdir(),
> > > > > > > > use
> > > > > > > > the saved page_index as the starting search point rather
> > > > > > > > than
> > > > > > > > starting
> > > > > > > > at page_index == 0.  Not only does this fix the problem
> > > > > > > > of
> > > > > > > > listing
> > > > > > > > a directory being modified, it also significantly
> > > > > > > > improves
> > > > > > > > performance
> > > > > > > > in the unmodified case since no extra search penalty is
> > > > > > > > paid
> > > > > > > > at
> > > > > > > > each
> > > > > > > > entry to nfs_readdir().
> > > > > > > >
> > > > > > > > In the case of lseek, since there is no hash or other
> > > > > > > > mapping
> > > > > > > > from a
> > > > > > > > cookie value to the page->index, just reset
> > > > > > > > nfs_open_dir_context.page_index
> > > > > > > > to 0, which will reset the search to the old behavior.
> > > > > > > >
> > > > > > > > Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
> > > > > > > > ---
> > > > > > > >  fs/nfs/dir.c           | 8 +++++++-
> > > > > > > >  include/linux/nfs_fs.h | 1 +
> > > > > > > >  2 files changed, 8 insertions(+), 1 deletion(-)
> > > > > > > >
> > > > > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > > > > > > index 52e06c8fc7cd..b266f505b521 100644
> > > > > > > > --- a/fs/nfs/dir.c
> > > > > > > > +++ b/fs/nfs/dir.c
> > > > > > > > @@ -78,6 +78,7 @@ static struct nfs_open_dir_context
> > > > > > > > *alloc_nfs_open_dir_context(struct inode *dir
> > > > > > > >                 ctx->attr_gencount = nfsi->attr_gencount;
> > > > > > > >                 ctx->dir_cookie = 0;
> > > > > > > >                 ctx->dup_cookie = 0;
> > > > > > > > +               ctx->page_index = 0;
> > > > > > > >                 ctx->cred = get_cred(cred);
> > > > > > > >                 spin_lock(&dir->i_lock);
> > > > > > > >                 if (list_empty(&nfsi->open_files) &&
> > > > > > > > @@ -763,7 +764,7 @@ int
> > > > > > > > find_and_lock_cache_page(nfs_readdir_descriptor_t *desc)
> > > > > > > >         return res;
> > > > > > > >  }
> > > > > > > >
> > > > > > > > -/* Search for desc->dir_cookie from the beginning of the
> > > > > > > > page
> > > > > > > > cache
> > > > > > > > */
> > > > > > > > +/* Search for desc->dir_cookie starting at desc-
> > > > > > > > >page_index
> > > > > > > > */
> > > > > > > >  static inline
> > > > > > > >  int readdir_search_pagecache(nfs_readdir_descriptor_t
> > > > > > > > *desc)
> > > > > > > >  {
> > > > > > > > @@ -885,6 +886,8 @@ static int nfs_readdir(struct file
> > > > > > > > *file,
> > > > > > > > struct
> > > > > > > > dir_context *ctx)
> > > > > > > >                 .ctx = ctx,
> > > > > > > >                 .dir_cookie = &dir_ctx->dir_cookie,
> > > > > > > >                 .plus = nfs_use_readdirplus(inode, ctx),
> > > > > > > > +               .page_index = dir_ctx->page_index,
> > > > > > > > +               .last_cookie =
> > > > > > > > nfs_readdir_use_cookie(file) ?
> > > > > > > > ctx-
> > > > > > > > > pos : 0,
> > > > > > > >         },
> > > > > > > >                         *desc = &my_desc;
> > > > > > > >         int res = 0;
> > > > > > > > @@ -938,6 +941,7 @@ static int nfs_readdir(struct file
> > > > > > > > *file,
> > > > > > > > struct
> > > > > > > > dir_context *ctx)
> > > > > > > >  out:
> > > > > > > >         if (res > 0)
> > > > > > > >                 res = 0;
> > > > > > > > +       dir_ctx->page_index = desc->page_index;
> > > > > > > >         trace_nfs_readdir_exit(inode, ctx->pos, dir_ctx-
> > > > > > > > > dir_cookie,
> > > > > > > >                                NFS_SERVER(inode)->dtsize,
> > > > > > > > my_desc.plus, res);
> > > > > > > >         return res;
> > > > > > > > @@ -975,6 +979,8 @@ static loff_t nfs_llseek_dir(struct
> > > > > > > > file
> > > > > > > > *filp,
> > > > > > > > loff_t offset, int whence)
> > > > > > > >                 else
> > > > > > > >                         dir_ctx->dir_cookie = 0;
> > > > > > > >                 dir_ctx->duped = 0;
> > > > > > > > +               /* Force readdir_search_pagecache to
> > > > > > > > start
> > > > > > > > over
> > > > > > > > */
> > > > > > > > +               dir_ctx->page_index = 0;
> > > > > > > >         }
> > > > > > > >         inode_unlock(inode);
> > > > > > > >         return offset;
> > > > > > > > diff --git a/include/linux/nfs_fs.h
> > > > > > > > b/include/linux/nfs_fs.h
> > > > > > > > index a2c6455ea3fa..0e55c0154ccd 100644
> > > > > > > > --- a/include/linux/nfs_fs.h
> > > > > > > > +++ b/include/linux/nfs_fs.h
> > > > > > > > @@ -93,6 +93,7 @@ struct nfs_open_dir_context {
> > > > > > > >         __u64 dir_cookie;
> > > > > > > >         __u64 dup_cookie;
> > > > > > > >         signed char duped;
> > > > > > > > +       unsigned long   page_index;
> > > > > > > >  };
> > > > > > > >
> > > > > > > >  /*
> > > > > > >
> > > > > > > NACK. It makes no sense to store the page index as a
> > > > > > > cursor.
> > > > > > >
> > > > > >
> > > > > > A similar thing was done recently with:
> > > > > > 227823d2074d nfs: optimise readdir cache page invalidation
> > > > > >
> > > > >
> > > > > That's a very different thing. It is about discarding page data
> > > > > in
> > > > > order to force a re-read of the contents into cache.
> > > > >
> > > > Right - I only pointed it out because it is in effect a cursor
> > > > about
> > > > the last access into the cache but it's on a global basis, not
> > > > process context.
> > > >
> > > > > What you're doing is basically trying to guess where the data
> > > > > is
> > > > > located. which might work in some cases where the directory is
> > > > > completely static, but if it shrinks (e.g. due to a few
> > > > > unlink() or
> > > > > rename() calls) so that you overshoot the cookie, then you can
> > > > > end
> > > > > up
> > > > > reading all the way to the end of the directory before doing an
> > > > > uncached readdir.
> > > > >
> > > > First, consider the unmodified (idle directory) scenario.  Today
> > > > the
> > > > performance is bad the larger the directory goes - do you see
> > > > why?
> > > > I tried to explain in the cover letter and header but maybe it's
> > > > not
> > > > clear?
> > > >
> > > > Second, the modified scenario today the performance is very bad
> > > > because of the same problem - the cookie is reset and the process
> > > > needs to start over at cookie 0, repeating READDIRs.  But maybe
> > > > there's a specific scenario I'm not thinking about.
> > > >
> > > > The way I thought about this is that if you're in a heavily
> > > > modified
> > > > scenario with a large directory and you're past the 'acdirmax'
> > > > time,
> > > > you have to make the choice of either:
> > > > a) ignoring 'acdirmax' (this is what the NFSv3 patch did) and
> > > > even
> > > > though you know the cache expired you keep going as though it
> > > > did not (at least until a different process starts a listing)
> > > > b) honoring 'acdirmax' (drop the pagecache), but keep going the
> > > > best you can based on the previous information and don't try to
> > > > rebuild the cache before continuing.
> > > >
> > > > > IOW: This will have a detrimental effect for some workloads,
> > > > > which
> > > > > needs to be weighed up against the benefits. I saw that you've
> > > > > tested
> > > > > with large directories, but what workloads were you testing on
> > > > > those
> > > > > directories?
> > > > >
> > > > I can definitely do further testing and any scenario you want to
> > > > try
> > > > to
> > > > break it or find a pathological scenario. So far I've tested the
> > > > reader ("ls -lf") in parallel with one of the two writers:
> > > > 1) random add a file every 0.1s:
> > > > while true; do i=$((1 + RANDOM % $NUM_FILES)); echo $i; touch
> > > > $MNT2/file$i.bin; builtin sleep 0.1; done > /dev/null 2>&1 &
> > > > 2) random delete a file every 0.1 s:
> > > > while true; do i=$((1 + RANDOM % $NUM_FILES)); echo $i; rm -f
> > > > $MNT2/file$i; builtin sleep 0.1; done > /dev/null 2>&1 &
> > > >
> > > > In no case did I see it take a longer time or ops vs vanilla 5.9,
> > > > the
> > > > idle
> > > > and modified performance is better (measured in seconds and ops)
> > > > with this patch.  Below is a short summary.  Note that the first
> > > > time
> > > > and
> > > > ops is with an idle directory, and the second one is the
> > > > modified.
> > > >
> > > > 5.9 (vanilla): random delete a file every 0.1 s:
> > > > Ops increased from 4734 to 8834
> > > > Time increased from 23 to 44
> > > >
> > > > 5.9 (this patch): random delete a file every 0.1 s:
> > > > Ops increased from 4697 to 4696
> > > > Time increased from 20 to 30
> > > >
> > > >
> > > > 5.9 (vanilla): random add a file every 0.1s:
> > > > Ops increased from 4734 to 9168
> > > > Time increased from 23 to 43
> > > >
> > > > 5.9 (this patch): random add a file every 0.1s:
> > > > Ops increased from 4697 to 4702
> > > > Time increased from 21 to 32
> > > >
> > >
> > > If you're not seeing any change in number of ops then those numbers
> > > are
> > > basically telling you that you're not seeing any cache
> > > invalidation.
> > > You should be seeing cache invalidation when you are creating and
> > > deleting files and are doing simultaneous readdirs.
> > >
> >
> > No I think you're misunderstanding or we're not on the same page.
> > The difference is, with 5.9 vanilla when the invalidation occurs, the
> > reader process has to go back to cookie 0 and pays the penalty
> > to refill all the cache pages up to cookie 'N'.  With the patch this
> > is not the case - it continues on with cookie 'N' and does not pay
> > the
> > penalty - the next reader does
> >
>
> Then I still don't see how you can avoid corrupting the page cache when
> you're in effect just starting filling in cookies at a random page
> cache index + offset that bears no relation to where the cookie would
> end up if the reader started from 0.
>
I was concerned about this too.

I don't have absolute proof of the pagecache state, but I did do
some tests in live crash listing the pages on the directory inode
while I had deleted a portion of the files in the directory and a
listing was going to see if the pagecache got messed up.  I
think we may end up with pages that get in effect "temporarily orphaned"
by the one reader that runs into this condition, but the next reader
will start at 0 and so those pages will only be temporary and not
findable by anyone else.  If the directory is constantly changing,
then the pagecache will get dumped again soon.  But as you point
out it probably is not a good approach.

> However leaving aside that issue, what's the point of using the page
> cache if the next reader has to clear out the data and start afresh
> anyway? Why not just let all the threads avoid the page cache and just
> do uncached readdir?
>
Yes I agree.  The problem I'm working on finding a solution for is
such that the cache is not valuable anymore - basically we have
timers (acdirmin/acdirmax) that govern how long the cache is valid
and we blow past those when doing one pass through the directory.
In that instance we mainly want to ensure forward progress, so
it would make sense to shift to uncached operation.



> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
>
>
David Wysochanski Nov. 3, 2020, 5:49 p.m. UTC | #10
On Mon, Nov 2, 2020 at 7:09 PM Frank van der Linden <fllinden@amazon.com> wrote:
>
> Hi Dave,
>
> Thanks for looking at this.
>
> On Mon, Nov 02, 2020 at 02:45:09PM -0500, David Wysochanski wrote:
> > On Mon, Nov 2, 2020 at 12:31 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
> > >
> > > On Mon, 2020-11-02 at 11:26 -0500, David Wysochanski wrote:
> > > > On Mon, Nov 2, 2020 at 11:22 AM Trond Myklebust <
> > > > trondmy@hammerspace.com> wrote:
> > > > >
> > > > > On Mon, 2020-11-02 at 08:50 -0500, Dave Wysochanski wrote:
> > > > > > A process can hang forever to 'ls -l' a directory while the
> > > > > > directory
> > > > > > is being modified such as another NFS client adding files to the
> > > > > > directory.  The problem is seen specifically with larger
> > > > > > directories
> > > > > > (I tested with 1 million) and/or slower NFS server responses to
> > > > > > READDIR.  If a combination of the NFS directory size, the NFS
> > > > > > server
> > > > > > responses to READDIR is such that the 'ls' process gets partially
> > > > > > through the listing before the attribute cache expires (time
> > > > > > exceeds acdirmax), we drop the pagecache and have to re-fill it,
> > > > > > and as a result, the process may never complete.  One could argue
> > > > > > for larger directories the acdirmin/acdirmax should be increased,
> > > > > > but it's not always possible to tune this effectively.
> > > > > >
> > > > > > The root cause of this problem is due to how the NFS readdir
> > > > > > cache
> > > > > > currently works.  The main search function,
> > > > > > readdir_search_pagecache(),
> > > > > > always starts searching at page_index and cookie == 0, and for
> > > > > > any
> > > > > > page not in the cache, fills in the page with entries obtained in
> > > > > > a READDIR NFS call.  If a page already exists, we proceed to
> > > > > > nfs_readdir_search_for_cookie(), which searches for the cookie
> > > > > > (pos) of the readdir call.  The search is O(n), where n is the
> > > > > > directory size before the cookie in question is found, and every
> > > > > > entry to nfs_readdir() pays this penalty, irrespective of the
> > > > > > current directory position (dir_context.pos).  The search is
> > > > > > expensive due to the opaque nature of readdir cookies, and the
> > > > > > fact
> > > > > > that no mapping (hash) exists from cookies to pages.  In the case
> > > > > > of a directory being modified, the above behavior can become an
> > > > > > excessive penalty, since the same process is forced to fill pages
> > > > > > it
> > > > > > may be no longer interested in (the entries were passed in a
> > > > > > previous
> > > > > > nfs_readdir call), and this can essentially lead no forward
> > > > > > progress.
> > > > > >
> > > > > > To fix this problem, at the end of nfs_readdir(), save the
> > > > > > page_index
> > > > > > corresponding to the directory position (cookie) inside the
> > > > > > process's
> > > > > > nfs_open_dir_context.  Then at the next entry of nfs_readdir(),
> > > > > > use
> > > > > > the saved page_index as the starting search point rather than
> > > > > > starting
> > > > > > at page_index == 0.  Not only does this fix the problem of
> > > > > > listing
> > > > > > a directory being modified, it also significantly improves
> > > > > > performance
> > > > > > in the unmodified case since no extra search penalty is paid at
> > > > > > each
> > > > > > entry to nfs_readdir().
> > > > > >
> > > > > > In the case of lseek, since there is no hash or other mapping
> > > > > > from a
> > > > > > cookie value to the page->index, just reset
> > > > > > nfs_open_dir_context.page_index
> > > > > > to 0, which will reset the search to the old behavior.
> > > > > >
> > > > > > Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
> > > > > > ---
> > > > > >  fs/nfs/dir.c           | 8 +++++++-
> > > > > >  include/linux/nfs_fs.h | 1 +
> > > > > >  2 files changed, 8 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > > > > index 52e06c8fc7cd..b266f505b521 100644
> > > > > > --- a/fs/nfs/dir.c
> > > > > > +++ b/fs/nfs/dir.c
> > > > > > @@ -78,6 +78,7 @@ static struct nfs_open_dir_context
> > > > > > *alloc_nfs_open_dir_context(struct inode *dir
> > > > > >                 ctx->attr_gencount = nfsi->attr_gencount;
> > > > > >                 ctx->dir_cookie = 0;
> > > > > >                 ctx->dup_cookie = 0;
> > > > > > +               ctx->page_index = 0;
> > > > > >                 ctx->cred = get_cred(cred);
> > > > > >                 spin_lock(&dir->i_lock);
> > > > > >                 if (list_empty(&nfsi->open_files) &&
> > > > > > @@ -763,7 +764,7 @@ int
> > > > > > find_and_lock_cache_page(nfs_readdir_descriptor_t *desc)
> > > > > >         return res;
> > > > > >  }
> > > > > >
> > > > > > -/* Search for desc->dir_cookie from the beginning of the page
> > > > > > cache
> > > > > > */
> > > > > > +/* Search for desc->dir_cookie starting at desc->page_index */
> > > > > >  static inline
> > > > > >  int readdir_search_pagecache(nfs_readdir_descriptor_t *desc)
> > > > > >  {
> > > > > > @@ -885,6 +886,8 @@ static int nfs_readdir(struct file *file,
> > > > > > struct
> > > > > > dir_context *ctx)
> > > > > >                 .ctx = ctx,
> > > > > >                 .dir_cookie = &dir_ctx->dir_cookie,
> > > > > >                 .plus = nfs_use_readdirplus(inode, ctx),
> > > > > > +               .page_index = dir_ctx->page_index,
> > > > > > +               .last_cookie = nfs_readdir_use_cookie(file) ?
> > > > > > ctx-
> > > > > > > pos : 0,
> > > > > >         },
> > > > > >                         *desc = &my_desc;
> > > > > >         int res = 0;
> > > > > > @@ -938,6 +941,7 @@ static int nfs_readdir(struct file *file,
> > > > > > struct
> > > > > > dir_context *ctx)
> > > > > >  out:
> > > > > >         if (res > 0)
> > > > > >                 res = 0;
> > > > > > +       dir_ctx->page_index = desc->page_index;
> > > > > >         trace_nfs_readdir_exit(inode, ctx->pos, dir_ctx-
> > > > > > >dir_cookie,
> > > > > >                                NFS_SERVER(inode)->dtsize,
> > > > > > my_desc.plus, res);
> > > > > >         return res;
> > > > > > @@ -975,6 +979,8 @@ static loff_t nfs_llseek_dir(struct file
> > > > > > *filp,
> > > > > > loff_t offset, int whence)
> > > > > >                 else
> > > > > >                         dir_ctx->dir_cookie = 0;
> > > > > >                 dir_ctx->duped = 0;
> > > > > > +               /* Force readdir_search_pagecache to start over
> > > > > > */
> > > > > > +               dir_ctx->page_index = 0;
> > > > > >         }
> > > > > >         inode_unlock(inode);
> > > > > >         return offset;
> > > > > > diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> > > > > > index a2c6455ea3fa..0e55c0154ccd 100644
> > > > > > --- a/include/linux/nfs_fs.h
> > > > > > +++ b/include/linux/nfs_fs.h
> > > > > > @@ -93,6 +93,7 @@ struct nfs_open_dir_context {
> > > > > >         __u64 dir_cookie;
> > > > > >         __u64 dup_cookie;
> > > > > >         signed char duped;
> > > > > > +       unsigned long   page_index;
> > > > > >  };
> > > > > >
> > > > > >  /*
> > > > >
> > > > > NACK. It makes no sense to store the page index as a cursor.
> > > > >
> > > >
> > > > A similar thing was done recently with:
> > > > 227823d2074d nfs: optimise readdir cache page invalidation
> > > >
> > >
> > > That's a very different thing. It is about discarding page data in
> > > order to force a re-read of the contents into cache.
> > >
> > Right - I only pointed it out because it is in effect a cursor about
> > the last access into the cache but it's on a global basis, not
> > process context.
> >
> > > What you're doing is basically trying to guess where the data is
> > > located. which might work in some cases where the directory is
> > > completely static, but if it shrinks (e.g. due to a few unlink() or
> > > rename() calls) so that you overshoot the cookie, then you can end up
> > > reading all the way to the end of the directory before doing an
> > > uncached readdir.
> > >
> > First, consider the unmodified (idle directory) scenario.  Today the
> > performance is bad the larger the directory goes - do you see why?
> > I tried to explain in the cover letter and header but maybe it's not clear?
>
> I agree with you that the patch is good for that case. It seems
> like a bug that, if the cache is not invalidated in the meantime,
> nfs_readdir would start at offset 0 of the cache while searching.
>
> But, as Trond said, page_index isn't a good cursor once the
> page cache was invalidated, you don't know what data you are
> leaving out inadvertently.
>

Hey Frank - thanks for looking at this too.  Yes I see it this is probably
not a good approach especially since you and Trond seem to come
to the same conclusion.


> >
> > Second, the modified scenario today the performance is very bad
> > because of the same problem - the cookie is reset and the process
> > needs to start over at cookie 0, repeating READDIRs.  But maybe
> > there's a specific scenario I'm not thinking about.
> >
> > The way I thought about this is that if you're in a heavily modified
> > scenario with a large directory and you're past the 'acdirmax' time,
> > you have to make the choice of either:
> > a) ignoring 'acdirmax' (this is what the NFSv3 patch did) and even
> > though you know the cache expired you keep going as though it
> > did not (at least until a different process starts a listing)
> > b) honoring 'acdirmax' (drop the pagecache), but keep going the
> > best you can based on the previous information and don't try to
> > rebuild the cache before continuing.
>
> Yeah, it seems like those are your options. The only other way
> of doing this that I can think of would be to add logic that
> says something like:
>
>         if (pos != 0 and the cache is invalid)
>                 uncached_readdir()
>
> ..which is, by the looks of it, easily done by having
> readdir_search_pagecache() return EBADCOOKIE if
> there's no cache, and the offset is not 0. E.g. it
> shouldn't try to fill the cache in that case, but
> just let uncached reads take over.
>
I thought about other logic, but this seems simple
enough - just check for 0 and probably the number
of pages in the cache to see if it's invalid.

> The idea here is that the only information that is possibly
> still valid is the cookie - and that the server will know
> how to use the cookie to keep going where you left off.
>
Makes sense.

> Of course, this mean that, in the case of another client
> modifying the directory, you'll end up doing mostly / all
> uncached readdirs after an invalidate. Same for a scenario
> where a client did: open llseek(<nonzero_offset), getdents().
> Until another process starts at offset 0 and re-fills the
> cache, of course.
>
> I'm not sure if that's all that bad, though, it depends
> on common usage patterns.
>

Yeah I was wondering about this too - what side effects would
happen when going uncached and how far should we go
to try to detect we're in this pattern.

I guess I was hoping to avoid going uncached because that
means we have to ask the server again for the same info,
which could very well be a lot.  But then again if we're blowing
past acdirmax then we're making the best of a bad situation
and just need to make forward progress.

I also considered whether it would be possible to improve the readdir
cache data structure and search to handle this, especially now that it
looks like Trond's latest patches have given us 16 bytes of free space in
each page.  I was thinking for example of adding a 'next_page_index'
into nfs_cache_array but then I think we get into a mess of overlap of
entries, etc. So far I don't see a clean way forward on that, and the
added complexity does not seem worthwhile.  So trying uncached
like you and Trond suggest may be the best approach.


> - Frank
>
diff mbox series

Patch

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 52e06c8fc7cd..b266f505b521 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -78,6 +78,7 @@  static struct nfs_open_dir_context *alloc_nfs_open_dir_context(struct inode *dir
 		ctx->attr_gencount = nfsi->attr_gencount;
 		ctx->dir_cookie = 0;
 		ctx->dup_cookie = 0;
+		ctx->page_index = 0;
 		ctx->cred = get_cred(cred);
 		spin_lock(&dir->i_lock);
 		if (list_empty(&nfsi->open_files) &&
@@ -763,7 +764,7 @@  int find_and_lock_cache_page(nfs_readdir_descriptor_t *desc)
 	return res;
 }
 
-/* Search for desc->dir_cookie from the beginning of the page cache */
+/* Search for desc->dir_cookie starting at desc->page_index */
 static inline
 int readdir_search_pagecache(nfs_readdir_descriptor_t *desc)
 {
@@ -885,6 +886,8 @@  static int nfs_readdir(struct file *file, struct dir_context *ctx)
 		.ctx = ctx,
 		.dir_cookie = &dir_ctx->dir_cookie,
 		.plus = nfs_use_readdirplus(inode, ctx),
+		.page_index = dir_ctx->page_index,
+		.last_cookie = nfs_readdir_use_cookie(file) ? ctx->pos : 0,
 	},
 			*desc = &my_desc;
 	int res = 0;
@@ -938,6 +941,7 @@  static int nfs_readdir(struct file *file, struct dir_context *ctx)
 out:
 	if (res > 0)
 		res = 0;
+	dir_ctx->page_index = desc->page_index;
 	trace_nfs_readdir_exit(inode, ctx->pos, dir_ctx->dir_cookie,
 			       NFS_SERVER(inode)->dtsize, my_desc.plus, res);
 	return res;
@@ -975,6 +979,8 @@  static loff_t nfs_llseek_dir(struct file *filp, loff_t offset, int whence)
 		else
 			dir_ctx->dir_cookie = 0;
 		dir_ctx->duped = 0;
+		/* Force readdir_search_pagecache to start over */
+		dir_ctx->page_index = 0;
 	}
 	inode_unlock(inode);
 	return offset;
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index a2c6455ea3fa..0e55c0154ccd 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -93,6 +93,7 @@  struct nfs_open_dir_context {
 	__u64 dir_cookie;
 	__u64 dup_cookie;
 	signed char duped;
+	unsigned long	page_index;
 };
 
 /*