diff mbox

[2/3] Improve fairness when locking the per-superblock s_anon list

Message ID 151019772760.30101.8513274540570798315.stgit@noble (mailing list archive)
State New, archived
Headers show

Commit Message

NeilBrown Nov. 9, 2017, 3:22 a.m. UTC
bit-spin-locks, as used for dcache hash chains, are not fair.
This is not a problem for the dcache hash table as different CPUs are
likely to access different entries in the hash table so high contention
is not expected.
However anonymous dentryies (created by NFSD) all live on a single hash
chain "s_anon" and the bitlock on this can be highly contended, resulting
in soft-lockup warnings.

So introduce a per-sb (fair) spinlock and take it before grabing the
bitlock on s_anon.  This provides fairness and makes the warnings go away.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/dcache.c        |   11 ++++++++++-
 fs/super.c         |    1 +
 include/linux/fs.h |    1 +
 3 files changed, 12 insertions(+), 1 deletion(-)

Comments

Linus Torvalds Nov. 9, 2017, 7:52 p.m. UTC | #1
On Wed, Nov 8, 2017 at 7:22 PM, NeilBrown <neilb@suse.com> wrote:
> bit-spin-locks, as used for dcache hash chains, are not fair.
> This is not a problem for the dcache hash table as different CPUs are
> likely to access different entries in the hash table so high contention
> is not expected.
> However anonymous dentryies (created by NFSD) all live on a single hash
> chain "s_anon" and the bitlock on this can be highly contended, resulting
> in soft-lockup warnings.

This really seems idiotic.

Let me rephrase this commit message so that you can see why I think it's wrong:

 "NFSd uses a single hash chain for all dentries, which can cause
horrible lock contention, in ways that the normal hashing does not.

  This horrendous contention can cause the machine to have bad latency
spikes, which can cause soft lockup warnings.

  Instead of just fixing the bad design decision that causes this
horrible contention, we'll just special-case this code and use an
additional lock that hides the problem by serializing the actual
contending access".

Honestly, looking at the code, the whole s_anon thing seems entirely
broken. There doesn't even seem to be much reason for it. In pretty
much all cases, we could just hash the damn dentry,

The only reason for actually having s_anon seems to be that we want
some per-superblock list of these unconnected dentries for
shrink_dcache_for_umount().

Everything else would actually be *much* happier with just having the
dentry on the regular hash table. It would entirely get rid of this
stupid performance problem, and it would actually simplify all the
code elsewhere, because it would remove special cases like this

                if (unlikely(IS_ROOT(dentry)))
                        b = &dentry->d_sb->s_anon;
                else
                        b = d_hash(dentry->d_name.hash);

and just turn them into

                b = d_hash(dentry->d_name.hash);

so I really wonder if we could just get rid of s_anon entirely.

Yes, getting rid of s_anon might involve crazy things like "let's just
walk all the dentries at umount time", but honestly, that sounds
preferable. Especially if we can just then do something like

 - set a special flag in the superblock if we ever use __d_obtain_alias()

 - only scan all the dentries on umount if that flag is set.

Hmm?

The other alternative would be to just try to make the bitlocks
fairer. I would find that much less distasteful than this nasty hack.

I could imagine, for example, just turning the bitlocks into "spinlock
on hashed array". That's basically what we did with the page lock,
which _used_ to be basically a blocking bitlock, and where there was
no possible way to not have contention on some pages.

We could afford to have two bits for the bitlock (lock and contention)
to make something like that more viable.

But I really get the feeling that the problem here is not the locking
primitive, but that whole s_anon decision. And my gut feeling is that
it really should be fixable.

Because wouldn't it be much nicer if the horrible nfsd contention just
went away rather than be worked around?

But maybe I'm missing some _other_ reason why s_anon has to be its own
separate list?

          Linus
Al Viro Nov. 9, 2017, 8:50 p.m. UTC | #2
On Thu, Nov 09, 2017 at 11:52:48AM -0800, Linus Torvalds wrote:
> Honestly, looking at the code, the whole s_anon thing seems entirely
> broken. There doesn't even seem to be much reason for it. In pretty
> much all cases, we could just hash the damn dentry,
> 
> The only reason for actually having s_anon seems to be that we want
> some per-superblock list of these unconnected dentries for
> shrink_dcache_for_umount().
> 
> Everything else would actually be *much* happier with just having the
> dentry on the regular hash table. It would entirely get rid of this
> stupid performance problem, and it would actually simplify all the
> code elsewhere, because it would remove special cases like this
> 
>                 if (unlikely(IS_ROOT(dentry)))
>                         b = &dentry->d_sb->s_anon;
>                 else
>                         b = d_hash(dentry->d_name.hash);
> 
> and just turn them into
> 
>                 b = d_hash(dentry->d_name.hash);
> 
> so I really wonder if we could just get rid of s_anon entirely.
> 
> Yes, getting rid of s_anon might involve crazy things like "let's just
> walk all the dentries at umount time", but honestly, that sounds
> preferable. Especially if we can just then do something like
> 
>  - set a special flag in the superblock if we ever use __d_obtain_alias()

Automatically set for a lot of NFS mounts (whenever you mount more than one
tree from the same server, IIRC)...

>  - only scan all the dentries on umount if that flag is set.
> 
> Hmm?

That looks like a bloody painful approach, IMO.  I'm not saying I like
Neil's patch, but I doubt that "let's just walk the entire dcache on
umount" is a good idea.

I wonder if separating the d_obtain_alias() and d_obtain_root() would be
a good idea; the former outnumber the latter by many orders of magnitude.
The tricky part is that we could have a disconnected directory from
d_obtain_alias() with children already connected to it (and thus normally
hashed by d_splice_alias()) and fail to connect the whole thing to parent.

That leaves us with an orphaned tree that might stick around past the
time when we drop all references to dentries in it.  And we want to
get those hunted down and shot on umount.  Could we
	* make s_anon hold d_obtain_root ones + orphans from such
failed reconnects
	* make final dput() treat hashed IS_ROOT as "don't retain it"
	* have d_obtain_alias() put into normal hash, leaving the
"move to s_anon" part to reconnect failures.
	* keep umount side of things unchanged.

I agree that temporary insertions into ->s_anon are bogus; hell, I'm not
even sure we want to put it on _any_ list initially - we want it to look
like it's hashed, so we could set ->next to NULL and have ->pprev point
to itself.  Then normal case for d_obtain_alias() would not bother
the hash chains at all at allocation time, then have it put into the
right hash chain on reconnect.  And on reconnect failure the caller
would've moved it to orphan list (i.e. ->s_anon).
NeilBrown Nov. 9, 2017, 11:09 p.m. UTC | #3
On Thu, Nov 09 2017, Al Viro wrote:

> On Thu, Nov 09, 2017 at 11:52:48AM -0800, Linus Torvalds wrote:
>> Honestly, looking at the code, the whole s_anon thing seems entirely
>> broken. There doesn't even seem to be much reason for it. In pretty
>> much all cases, we could just hash the damn dentry,
>> 
>> The only reason for actually having s_anon seems to be that we want
>> some per-superblock list of these unconnected dentries for
>> shrink_dcache_for_umount().
>> 
>> Everything else would actually be *much* happier with just having the
>> dentry on the regular hash table. It would entirely get rid of this
>> stupid performance problem, and it would actually simplify all the
>> code elsewhere, because it would remove special cases like this
>> 
>>                 if (unlikely(IS_ROOT(dentry)))
>>                         b = &dentry->d_sb->s_anon;
>>                 else
>>                         b = d_hash(dentry->d_name.hash);
>> 
>> and just turn them into
>> 
>>                 b = d_hash(dentry->d_name.hash);
>> 
>> so I really wonder if we could just get rid of s_anon entirely.
>> 
>> Yes, getting rid of s_anon might involve crazy things like "let's just
>> walk all the dentries at umount time", but honestly, that sounds
>> preferable. Especially if we can just then do something like
>> 
>>  - set a special flag in the superblock if we ever use __d_obtain_alias()
>
> Automatically set for a lot of NFS mounts (whenever you mount more than one
> tree from the same server, IIRC)...
>
>>  - only scan all the dentries on umount if that flag is set.
>> 
>> Hmm?
>
> That looks like a bloody painful approach, IMO.  I'm not saying I like
> Neil's patch, but I doubt that "let's just walk the entire dcache on
> umount" is a good idea.
>
> I wonder if separating the d_obtain_alias() and d_obtain_root() would be
> a good idea; the former outnumber the latter by many orders of magnitude.
> The tricky part is that we could have a disconnected directory from
> d_obtain_alias() with children already connected to it (and thus normally
> hashed by d_splice_alias()) and fail to connect the whole thing to parent.
>
> That leaves us with an orphaned tree that might stick around past the
> time when we drop all references to dentries in it.  And we want to
> get those hunted down and shot on umount.  Could we
> 	* make s_anon hold d_obtain_root ones + orphans from such
> failed reconnects
> 	* make final dput() treat hashed IS_ROOT as "don't retain it"
> 	* have d_obtain_alias() put into normal hash, leaving the
> "move to s_anon" part to reconnect failures.
> 	* keep umount side of things unchanged.
>
> I agree that temporary insertions into ->s_anon are bogus; hell, I'm not
> even sure we want to put it on _any_ list initially - we want it to look
> like it's hashed, so we could set ->next to NULL and have ->pprev point
> to itself.  Then normal case for d_obtain_alias() would not bother
> the hash chains at all at allocation time, then have it put into the
> right hash chain on reconnect.  And on reconnect failure the caller
> would've moved it to orphan list (i.e. ->s_anon).

I looked back at the original bug report, and it was d_obtain_alias()
that was particularly suffering.  As this holds two spinlocks while
waiting for the bl_lock, a delay can affect other code that doesn't
touch s_anon.
So improving d_obtain_alias() would be a good goal (as long as we don't
just move the problem of course).

It isn't only "reconnect failure" that leaves things on s_anon.
We normally don't bother trying to connect non-directories.
So if an NFS server is getting lots of read/write request without opens
or other pathname lookups, it could easily have lots of disconnected
files being repeatedly accessed.  Keeping the dentries on d_anon means
we don't need to keep allocating new ones for every request.

So I'm not keen on dropping an IS_ROOT() dentry at final dput(), but
it might make sense to add the dentry to the per-fs list of IS_ROOT
dentries at that time.

One possible approach would be to use d_child rather than d_hash to link
together dentries that don't have a parent.
We could assign a random number to d_name.hash so it could appear to be
hashed without imposing on any one hash chain.  We would still need a
spinlock in the superblock to manage the d_anon list that links the
d_child's together...
I might try to see how the code looks.

Thanks,
NeilBrown
Al Viro Nov. 9, 2017, 11:19 p.m. UTC | #4
On Fri, Nov 10, 2017 at 10:09:09AM +1100, NeilBrown wrote:

> So if an NFS server is getting lots of read/write request without opens
> or other pathname lookups, it could easily have lots of disconnected
> files being repeatedly accessed.  Keeping the dentries on d_anon means
> we don't need to keep allocating new ones for every request.
> 
> So I'm not keen on dropping an IS_ROOT() dentry at final dput(), but
> it might make sense to add the dentry to the per-fs list of IS_ROOT
> dentries at that time.

Watch out for dput() fast path (see fast_dput()) if you go that way.

> One possible approach would be to use d_child rather than d_hash to link
> together dentries that don't have a parent.
> We could assign a random number to d_name.hash so it could appear to be
> hashed without imposing on any one hash chain.  We would still need a
> spinlock in the superblock to manage the d_anon list that links the
> d_child's together...
> I might try to see how the code looks.

Keep in mind that d_hash() includes bits of ->d_parent, aka. dentry itself.
So no need for fake ->d_name.hash; you'll get spread from that part.

->d_child is... delicate.  There are very interesting games around d_walk
vs. dput already; I'd be very careful with that one.
Linus Torvalds Nov. 10, 2017, 12:02 a.m. UTC | #5
On Thu, Nov 9, 2017 at 12:50 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> That looks like a bloody painful approach, IMO.  I'm not saying I like
> Neil's patch, but I doubt that "let's just walk the entire dcache on
> umount" is a good idea.

There may be other approaches, but how often do you umount a
filesystem that you have NFS-exported? Not often.

So making that slower doesn't really sound horrible.

> I wonder if separating the d_obtain_alias() and d_obtain_root() would be
> a good idea; the former outnumber the latter by many orders of magnitude.
> The tricky part is that we could have a disconnected directory from
> d_obtain_alias() with children already connected to it (and thus normally
> hashed by d_splice_alias()) and fail to connect the whole thing to parent.

Well, the real issue (I think) is that we abuse that hash list for the
s_anon thing. I'd like to get rid of that part.

Now, the reason we do that is because we want to have some way to find
dentries that come from that filesystem, but aren't connected.

So I really would like to get rid of s_anon entirely.

We could easily use _another_ list entirely for holding the
"unconnected dentries". Because we do have such a list: the dentry
sibling list.

So what if we instead of using s_anon, we create a "disconnected
root", and make all the disconnected dentries be children of that
disconnected root instead?

Wouldn't that be nicer? The "reconnect" would be basically a "move
from disconnected parent to right place".

Maybe I'm missing something, but that sounds much more logical than
the current s_anon list.

> That leaves us with an orphaned tree that might stick around past the
> time when we drop all references to dentries in it.  And we want to
> get those hunted down and shot on umount.  Could we
>         * make s_anon hold d_obtain_root ones + orphans from such
> failed reconnects
>         * make final dput() treat hashed IS_ROOT as "don't retain it"
>         * have d_obtain_alias() put into normal hash, leaving the
> "move to s_anon" part to reconnect failures.
>         * keep umount side of things unchanged.

I guess that would work too, but I'm not seeing why s_anon is so wonderful.

If we want to make those entries look hashed, let's just hash them,
and use a special parent for it (that "disconnected root"). Why
wouldn't that work?

That would make the umount side _simpler_, because the disconnected
root would be handled exactly like we currently handle the real root.

So we'd just do that same "do_one_tree()" on the disconnected root,
the same way we do on the real one.

That sounds _so_ straightforward that I'm probably missing something
important. And if the "move from disconnected state" is very common,
then maybe the disconnected root lock ends up being a horrible
choke-point instead, simply because we'd take that parent lock all the
time for the add/move operations..

So maybe it's a bad idea. But it sounds clean.

             Linus
Christoph Hellwig Nov. 10, 2017, 8:50 a.m. UTC | #6
On Thu, Nov 09, 2017 at 11:52:48AM -0800, Linus Torvalds wrote:
> Yes, getting rid of s_anon might involve crazy things like "let's just
> walk all the dentries at umount time", but honestly, that sounds
> preferable. Especially if we can just then do something like
> 
>  - set a special flag in the superblock if we ever use __d_obtain_alias()
> 
>  - only scan all the dentries on umount if that flag is set.
> 
> Hmm?

The scan would be extremely painful with our current global dcache.

But then again the global dcache isn't exactly a good idea to start
with.  If we had a per-sb dcache using e.g. the resizable hash table
the networking people seem to be fond off your idea would work really
well.
diff mbox

Patch

diff --git a/fs/dcache.c b/fs/dcache.c
index f90141387f01..d5952306206b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -51,6 +51,8 @@ 
  *   - the dcache hash table
  * s_anon bl list spinlock protects:
  *   - the s_anon list (see __d_drop)
+ * dentry->d_sb->s_anon_lock protect:
+ *   - the s_anon bl bitlock, ensuring fairness.
  * dentry->d_sb->s_dentry_lru_lock protects:
  *   - the dcache lru lists and counters
  * d_lock protects:
@@ -484,7 +486,12 @@  void __d_drop(struct dentry *dentry)
 		else
 			b = d_hash(dentry->d_name.hash);
 
-		hlist_bl_lock(b);
+		if (b == &dentry->d_sb->s_anon) {
+			spin_lock(&dentry->d_sb->s_anon_lock);
+			hlist_bl_lock(b);
+			spin_unlock(&dentry->d_sb->s_anon_lock);
+		} else
+			hlist_bl_lock(b);
 		__hlist_bl_del(&dentry->d_hash);
 		dentry->d_hash.pprev = NULL;
 		hlist_bl_unlock(b);
@@ -1965,7 +1972,9 @@  static struct dentry *__d_obtain_alias(struct inode *inode, int disconnected)
 	spin_lock(&tmp->d_lock);
 	__d_set_inode_and_type(tmp, inode, add_flags);
 	hlist_add_head(&tmp->d_u.d_alias, &inode->i_dentry);
+	spin_lock(&tmp->d_sb->s_anon_lock);
 	hlist_bl_lock(&tmp->d_sb->s_anon);
+	spin_unlock(&tmp->d_sb->s_anon_lock);
 	hlist_bl_add_head(&tmp->d_hash, &tmp->d_sb->s_anon);
 	hlist_bl_unlock(&tmp->d_sb->s_anon);
 	spin_unlock(&tmp->d_lock);
diff --git a/fs/super.c b/fs/super.c
index 994db21f59bf..af644ae93445 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -209,6 +209,7 @@  static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	if (s->s_user_ns != &init_user_ns)
 		s->s_iflags |= SB_I_NODEV;
 	INIT_HLIST_NODE(&s->s_instances);
+	spin_lock_init(&s->s_anon_lock);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	mutex_init(&s->s_sync_lock);
 	INIT_LIST_HEAD(&s->s_inodes);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 885266aae2d7..9df9bace53fb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1359,6 +1359,7 @@  struct super_block {
 
 	const struct fscrypt_operations	*s_cop;
 
+	spinlock_t		s_anon_lock;	/* needed for fairness */
 	struct hlist_bl_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
 	struct block_device	*s_bdev;