Hang/soft lockup in d_invalidate with simultaneous calls
diff mbox

Message ID CACGdZYJBAYz3MXb18i=T91hD5TNFvCXUJH7ByvnJC1BTxjq6pA@mail.gmail.com
State New
Headers show

Commit Message

Khazhismel Kumykov May 16, 2017, 12:05 a.m. UTC
Hi,

I'm seeing behavior in d_invalidate, if multiple threads call d_invalidate on
the same tree at the same, behavior time blows up and all the calls hang with
large enough trees/enough simultaneous callers. (e.g. a directory w/ 100k
entries in d_subdir, and 5 or so threads calling d_invalidate was able to hang
my test VMs)

This seems to be due to thrashing on the dentry locks in multiple threads
tightly looping calling d_walk waiting for a shrink_dentry_list call (also
thrashing the locks) to complete. (d_invalidate loops calling d_walk so long as
any dentry in the tree is in a dcache shrink list).

There was a similar report recently "soft lockup in d_invalidate()" that
proposed in the d_invalidate d_walk to ignore dentries marked as in a shrink
list already, which does prevent this hang/lockup in this case as well, although
I'm not sure it doesn't violate the contract for d_invalidate. (Although the
entries in a shrink list should be dropped eventually, not necessarily by the
time d_invalidate returns).

If someone more familiar with the dcache could provide input I would appreciate.

A reliable repro on mainline is:
 - create a mountpoint with DCACHE_OP_REVALIDATE, e.g. fuse passthrough
 - create a directory and populate with ~100k files with content to
populate dcache
 - create some background processes opening/reading files in this folder (5
      while true; cat $file was enough to get an indefinite hang for me)
 - cause the directory to need to be invalidated (e.g., make_bad_inode on the
    directory)

This results in the background processes all entering d_invalidate and hanging,
while with just one process in d_invalidate (e.g., stat'ing a file in the dir)
things go pretty quickly as expected.


(The proposed patch from other thread)

                         d_lru_del(dentry);


khazhy

Comments

Khazhismel Kumykov May 17, 2017, 9:58 p.m. UTC | #1
On Mon, May 15, 2017 at 5:05 PM, Khazhismel Kumykov <khazhy@google.com> wrote:
> Hi,
>
> I'm seeing behavior in d_invalidate, if multiple threads call d_invalidate on
> the same tree at the same, behavior time blows up and all the calls hang with
> large enough trees/enough simultaneous callers. (e.g. a directory w/ 100k
> entries in d_subdir, and 5 or so threads calling d_invalidate was able to hang
> my test VMs)
>
> This seems to be due to thrashing on the dentry locks in multiple threads
> tightly looping calling d_walk waiting for a shrink_dentry_list call (also
> thrashing the locks) to complete. (d_invalidate loops calling d_walk so long as
> any dentry in the tree is in a dcache shrink list).
>
> There was a similar report recently "soft lockup in d_invalidate()" that
> proposed in the d_invalidate d_walk to ignore dentries marked as in a shrink
> list already, which does prevent this hang/lockup in this case as well, although
> I'm not sure it doesn't violate the contract for d_invalidate. (Although the
> entries in a shrink list should be dropped eventually, not necessarily by the
> time d_invalidate returns).
>
> If someone more familiar with the dcache could provide input I would appreciate.
>
> A reliable repro on mainline is:
>  - create a mountpoint with DCACHE_OP_REVALIDATE, e.g. fuse passthrough
>  - create a directory and populate with ~100k files with content to
> populate dcache
>  - create some background processes opening/reading files in this folder (5
>       while true; cat $file was enough to get an indefinite hang for me)
>  - cause the directory to need to be invalidated (e.g., make_bad_inode on the
>     directory)
>
> This results in the background processes all entering d_invalidate and hanging,
> while with just one process in d_invalidate (e.g., stat'ing a file in the dir)
> things go pretty quickly as expected.
>
>
> (The proposed patch from other thread)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 7b8feb6..3a3b0f37 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1364,7 +1364,7 @@ static enum d_walk_ret select_collect(void *_data,
> struct dentry *dentry)
>                  goto out;
>
>          if (dentry->d_flags & DCACHE_SHRINK_LIST) {
> -               data->found++;
> +               goto out;
>          } else {
>                  if (dentry->d_flags & DCACHE_LRU_LIST)
>                          d_lru_del(dentry);
>
>
> khazhy

Would this change actually violate any guarantees? select_collect
looks like it used to ignore unused dentries that were part of a
shrink list before fe91522a7ba82ca1a51b07e19954b3825e4aaa22 (found
would not be incremented). Once the dentry is on a shrink list would
it be unreachable anyways, so returning from d_invalidate before all
the shrink lists finish processing would be ok?

khazhy
Khazhismel Kumykov May 22, 2017, 6:18 p.m. UTC | #2
On Wed, May 17, 2017 at 2:58 PM Khazhismel Kumykov <khazhy@google.com> wrote:
>
> On Mon, May 15, 2017 at 5:05 PM, Khazhismel Kumykov <khazhy@google.com> wrote:
> > Hi,
> >
> > I'm seeing behavior in d_invalidate, if multiple threads call d_invalidate on
> > the same tree at the same, behavior time blows up and all the calls hang with
> > large enough trees/enough simultaneous callers. (e.g. a directory w/ 100k
> > entries in d_subdir, and 5 or so threads calling d_invalidate was able to hang
> > my test VMs)
> >
> > This seems to be due to thrashing on the dentry locks in multiple threads
> > tightly looping calling d_walk waiting for a shrink_dentry_list call (also
> > thrashing the locks) to complete. (d_invalidate loops calling d_walk so long as
> > any dentry in the tree is in a dcache shrink list).
> >
> > There was a similar report recently "soft lockup in d_invalidate()" that
> > proposed in the d_invalidate d_walk to ignore dentries marked as in a shrink
> > list already, which does prevent this hang/lockup in this case as well, although
> > I'm not sure it doesn't violate the contract for d_invalidate. (Although the
> > entries in a shrink list should be dropped eventually, not necessarily by the
> > time d_invalidate returns).
> >
> > If someone more familiar with the dcache could provide input I would appreciate.
> >
> > A reliable repro on mainline is:
> >  - create a mountpoint with DCACHE_OP_REVALIDATE, e.g. fuse passthrough
> >  - create a directory and populate with ~100k files with content to
> > populate dcache
> >  - create some background processes opening/reading files in this folder (5
> >       while true; cat $file was enough to get an indefinite hang for me)
> >  - cause the directory to need to be invalidated (e.g., make_bad_inode on the
> >     directory)
> >
> > This results in the background processes all entering d_invalidate and hanging,
> > while with just one process in d_invalidate (e.g., stat'ing a file in the dir)
> > things go pretty quickly as expected.
> >
> >
> > (The proposed patch from other thread)
> >
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index 7b8feb6..3a3b0f37 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -1364,7 +1364,7 @@ static enum d_walk_ret select_collect(void *_data,
> > struct dentry *dentry)
> >                  goto out;
> >
> >          if (dentry->d_flags & DCACHE_SHRINK_LIST) {
> > -               data->found++;
> > +               goto out;
> >          } else {
> >                  if (dentry->d_flags & DCACHE_LRU_LIST)
> >                          d_lru_del(dentry);
> >
> >
> > khazhy
>
> Would this change actually violate any guarantees? select_collect
> looks like it used to ignore unused dentries that were part of a
> shrink list before fe91522a7ba82ca1a51b07e19954b3825e4aaa22 (found
> would not be incremented). Once the dentry is on a shrink list would
> it be unreachable anyways, so returning from d_invalidate before all
> the shrink lists finish processing would be ok?
>
> khazhy

Pinging in case this got missed, would appreciate thoughts from someone more
familiar with the dcache.

My previous email wasn't completely correct, while before fe91522a7ba8
("don't remove from shrink list in select_collect()") d_invalidate
would not busy
wait for other workers calling shrink_list to compete, it would return -EBUSY,
rather than success, so a change to return success without waiting would not
be equivalent behavior before. Currently, we will loop calling d_walk repeatedly
causing the extreme slowdown observed.

I still want to understand better, in d_invalidate if we can return
without pruning
in-use dentries safely, would returning before unused dentries are
pruned, so long
as we know they will be pruned by another task (are on a shrink list),
be safe as well?
If not, would it make more sense to have have a mutual exclusion on calling
d_invalidate on the same dentries simultaneously?

khazhy
Khazhismel Kumykov May 25, 2017, 10:31 p.m. UTC | #3
On Mon, May 22, 2017 at 11:18 AM, Khazhismel Kumykov <khazhy@google.com> wrote:
> On Wed, May 17, 2017 at 2:58 PM Khazhismel Kumykov <khazhy@google.com> wrote:
>>
>> On Mon, May 15, 2017 at 5:05 PM, Khazhismel Kumykov <khazhy@google.com> wrote:
>> > Hi,
>> >
>> > I'm seeing behavior in d_invalidate, if multiple threads call d_invalidate on
>> > the same tree at the same, behavior time blows up and all the calls hang with
>> > large enough trees/enough simultaneous callers. (e.g. a directory w/ 100k
>> > entries in d_subdir, and 5 or so threads calling d_invalidate was able to hang
>> > my test VMs)
>> >
>> > This seems to be due to thrashing on the dentry locks in multiple threads
>> > tightly looping calling d_walk waiting for a shrink_dentry_list call (also
>> > thrashing the locks) to complete. (d_invalidate loops calling d_walk so long as
>> > any dentry in the tree is in a dcache shrink list).
>> >
>> > There was a similar report recently "soft lockup in d_invalidate()" that
>> > proposed in the d_invalidate d_walk to ignore dentries marked as in a shrink
>> > list already, which does prevent this hang/lockup in this case as well, although
>> > I'm not sure it doesn't violate the contract for d_invalidate. (Although the
>> > entries in a shrink list should be dropped eventually, not necessarily by the
>> > time d_invalidate returns).
>> >
>> > If someone more familiar with the dcache could provide input I would appreciate.
>> >
>> > A reliable repro on mainline is:
>> >  - create a mountpoint with DCACHE_OP_REVALIDATE, e.g. fuse passthrough
>> >  - create a directory and populate with ~100k files with content to
>> > populate dcache
>> >  - create some background processes opening/reading files in this folder (5
>> >       while true; cat $file was enough to get an indefinite hang for me)
>> >  - cause the directory to need to be invalidated (e.g., make_bad_inode on the
>> >     directory)
>> >
>> > This results in the background processes all entering d_invalidate and hanging,
>> > while with just one process in d_invalidate (e.g., stat'ing a file in the dir)
>> > things go pretty quickly as expected.
>> >
>> >
>> > (The proposed patch from other thread)
>> >
>> > diff --git a/fs/dcache.c b/fs/dcache.c
>> > index 7b8feb6..3a3b0f37 100644
>> > --- a/fs/dcache.c
>> > +++ b/fs/dcache.c
>> > @@ -1364,7 +1364,7 @@ static enum d_walk_ret select_collect(void *_data,
>> > struct dentry *dentry)
>> >                  goto out;
>> >
>> >          if (dentry->d_flags & DCACHE_SHRINK_LIST) {
>> > -               data->found++;
>> > +               goto out;
>> >          } else {
>> >                  if (dentry->d_flags & DCACHE_LRU_LIST)
>> >                          d_lru_del(dentry);
>> >
>> >
>> > khazhy
>>
>> Would this change actually violate any guarantees? select_collect
>> looks like it used to ignore unused dentries that were part of a
>> shrink list before fe91522a7ba82ca1a51b07e19954b3825e4aaa22 (found
>> would not be incremented). Once the dentry is on a shrink list would
>> it be unreachable anyways, so returning from d_invalidate before all
>> the shrink lists finish processing would be ok?
>>
>> khazhy
>
> Pinging in case this got missed, would appreciate thoughts from someone more
> familiar with the dcache.
>
> My previous email wasn't completely correct, while before fe91522a7ba8
> ("don't remove from shrink list in select_collect()") d_invalidate
> would not busy
> wait for other workers calling shrink_list to compete, it would return -EBUSY,
> rather than success, so a change to return success without waiting would not
> be equivalent behavior before. Currently, we will loop calling d_walk repeatedly
> causing the extreme slowdown observed.
>
> I still want to understand better, in d_invalidate if we can return
> without pruning
> in-use dentries safely, would returning before unused dentries are
> pruned, so long
> as we know they will be pruned by another task (are on a shrink list),
> be safe as well?
> If not, would it make more sense to have have a mutual exclusion on calling
> d_invalidate on the same dentries simultaneously?
>
> khazhy

ping

Again to summarize:
With a directory with large number of children, which makes a dentry with a
large number dentries in d_subdir, simultaneous calls to d_invalidate on that
dentry take a very long time. As an example, a directory with 160k files,
where a single d_invalidate call took ~1.4s, having 6 tasks call d_invalidate
simultaneously took ~6.5 hours to resolve, about 10000x longer.

dir/
  dir/file-0
  ...
  dir/file-160000

This seems to be due to contention between d_walk and shrink_dentry_list,
and particularly d_invalidate tightly looping d_walk so long as any dentry
it finds is in a shrink list. With this particular directory
structure, d_walk will
hold d_lock for "dir" while walking over d_subdir, meanwhile
shrink_dentry_list will release and regrab "dir"s d_lock every iteration,
also throwing it at the back of the queue.

One proposed solution is in select_collect, do not increment the
number of found unused descendants when we find a dentry in a
shrink list. This proposal will result in d_invalidate and shrink_dcache_parent
returning before necessarily all unused children are pruned, but they will
be pruned at some time soon by the other task currently shrinking, which
I am not sure if that is OK or not.

A quick and dirty fix would be adding a mutex around the shrink loops,
so simultaneous tasks don't thrash with each other.

There are probably also other solutions here, I would appreciate a look
from someone more familiar with this area.

khazhy

Patch
diff mbox

diff --git a/fs/dcache.c b/fs/dcache.c
index 7b8feb6..3a3b0f37 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1364,7 +1364,7 @@  static enum d_walk_ret select_collect(void *_data,
struct dentry *dentry)
                 goto out;

         if (dentry->d_flags & DCACHE_SHRINK_LIST) {
-               data->found++;
+               goto out;
         } else {
                 if (dentry->d_flags & DCACHE_LRU_LIST)