[0/4] locks: avoid thundering-herd wake-ups

Message ID 153369219467.12605.13472423449508444601.stgit@noble (mailing list archive)
Series locks: avoid thundering-herd wake-ups

Message

NeilBrown Aug. 8, 2018, 1:51 a.m. UTC
If you have a many-core machine, and have many threads all wanting to
briefly lock a given file (udev is known to do this), you can get quite
poor performance.

When one thread releases a lock, it wakes up all other threads that
are waiting (classic thundering-herd) - one will get the lock and the
others go to sleep.
When you have few cores, this is not very noticeable: by the time the
4th or 5th thread gets enough CPU time to try to claim the lock, the
earlier threads have claimed it, done what was needed, and released.
With 50+ cores, the contention can easily be measured.
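
For illustration, the pattern that triggers this is roughly the
following user-space sketch (not taken from udev; the filename and
iteration count are arbitrary) - run many copies of it concurrently
against the same file:

/* many processes each briefly take and drop the same POSIX lock */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/tmp/lockfile", O_RDWR | O_CREAT, 0600);
	struct flock fl = {
		.l_whence = SEEK_SET,
		.l_start  = 0,
		.l_len    = 0,		/* 0 means "whole file" */
	};

	for (int i = 0; i < 100000; i++) {
		fl.l_type = F_WRLCK;
		fcntl(fd, F_SETLKW, &fl);	/* block until we own the lock */
		/* ... brief critical section ... */
		fl.l_type = F_UNLCK;
		fcntl(fd, F_SETLK, &fl);	/* unlock wakes every waiter */
	}
	close(fd);
	return 0;
}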

This patchset creates a tree of pending lock requests in which siblings
don't conflict and each lock request does conflict with its parent.
When a lock is released, only requests which don't conflict with each
other are woken.
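
To make the structure concrete, here is a toy user-space model of the
idea (illustrative only - this is not the code in the patches, and the
struct and field names are invented).  Ranges are written as
(offset, length):

#include <stdio.h>

struct request {
	long off, len;			/* (offset, length) of the request */
	struct request *parent;		/* the request this one waits on */
	struct request *children[8];	/* requests queued behind this one */
	int nchildren;
};

static int conflicts(const struct request *a, const struct request *b)
{
	return a->off < b->off + b->len && b->off < a->off + a->len;
}

/* Queue rq behind blocker: if it conflicts with an existing child,
 * push it further down; otherwise it becomes a new sibling. */
static void enqueue(struct request *blocker, struct request *rq)
{
	for (int i = 0; i < blocker->nchildren; i++) {
		if (conflicts(blocker->children[i], rq)) {
			enqueue(blocker->children[i], rq);
			return;
		}
	}
	rq->parent = blocker;
	blocker->children[blocker->nchildren++] = rq;
}

/* On unlock, only the direct children are woken; each child's own
 * subtree stays attached and is reconsidered when that child retries. */
static void unlock(struct request *blocker)
{
	for (int i = 0; i < blocker->nchildren; i++)
		printf("wake (%ld,%ld)\n",
		       blocker->children[i]->off, blocker->children[i]->len);
}

int main(void)
{
	struct request held = { 0, 4 };	/* currently-held lock (0,4) */
	struct request a = { 2, 2 }, b = { 1, 2 }, c = { 0, 2 };

	enqueue(&held, &a);	/* (2,2) waits on (0,4) */
	enqueue(&held, &b);	/* (1,2) conflicts with (2,2): goes below it */
	enqueue(&held, &c);	/* (0,2) doesn't conflict with (2,2): sibling */
	unlock(&held);		/* wakes only (2,2) and (0,2) */
	return 0;
}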

Testing shows that lock-acquisitions-per-second is now fairly stable even
as the number of contending processes grows to 1000.  Without this patch,
locks-per-second drops off steeply after a few tens of processes.

There is a small cost to this extra complexity.
With 20 processes running a particular test on 72 cores, lock
acquisitions per second drop from 1.8 million to 1.4 million with
this patch.  With 100 processes, this patch still delivers 1.4 million,
while without it the rate is about 700,000.

NeilBrown

---

NeilBrown (4):
      fs/locks: rename some lists and pointers.
      fs/locks: allow a lock request to block other requests.
      fs/locks: change all *_conflict() functions to return bool.
      fs/locks: create a tree of dependent requests.


 fs/cifs/file.c                  |    2 -
 fs/locks.c                      |  142 +++++++++++++++++++++++++--------------
 include/linux/fs.h              |    5 +
 include/trace/events/filelock.h |   16 ++--
 4 files changed, 103 insertions(+), 62 deletions(-)

--
Signature

Comments

Jeff Layton Aug. 8, 2018, 4:47 p.m. UTC | #1
On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote:
> If you have a many-core machine, and have many threads all wanting to
> briefly lock a give file (udev is known to do this), you can get quite
> poor performance.
> 
> When one thread releases a lock, it wakes up all other threads that
> are waiting (classic thundering-herd) - one will get the lock and the
> others go to sleep.
> When you have few cores, this is not very noticeable: by the time the
> 4th or 5th thread gets enough CPU time to try to claim the lock, the
> earlier threads have claimed it, done what was needed, and released.
> With 50+ cores, the contention can easily be measured.
> 
> This patchset creates a tree of pending lock request in which siblings
> don't conflict and each lock request does conflict with its parent.
> When a lock is released, only requests which don't conflict with each
> other a woken.
> 
> Testing shows that lock-acquisitions-per-second is now fairly stable even
> as number of contending process goes to 1000.  Without this patch,
> locks-per-second drops off steeply after a few 10s of processes.
> 
> There is a small cost to this extra complexity.
> At 20 processes running a particular test on 72 cores, the lock
> acquisitions per second drops from 1.8 million to 1.4 million with
> this patch.  For 100 processes, this patch still provides 1.4 million
> while without this patch there are about 700,000.
> 
> NeilBrown
> 
> ---
> 
> NeilBrown (4):
>       fs/locks: rename some lists and pointers.
>       fs/locks: allow a lock request to block other requests.
>       fs/locks: change all *_conflict() functions to return bool.
>       fs/locks: create a tree of dependent requests.
> 
> 
>  fs/cifs/file.c                  |    2 -
>  fs/locks.c                      |  142 +++++++++++++++++++++++++--------------
>  include/linux/fs.h              |    5 +
>  include/trace/events/filelock.h |   16 ++--
>  4 files changed, 103 insertions(+), 62 deletions(-)
> 

Nice work! I looked over this and I think it looks good.

I made an attempt to fix this issue several years ago, but my method
sucked as it ended up penalizing the unlocking task too much. This is
much cleaner and should scale well overall, I think.

I'll put this in -next soon and we can aim for merge in v4.20.

Thanks,
J. Bruce Fields Aug. 8, 2018, 6:29 p.m. UTC | #2
On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote:
> On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote:
> > If you have a many-core machine, and have many threads all wanting to
> > briefly lock a give file (udev is known to do this), you can get quite
> > poor performance.
> > 
> > When one thread releases a lock, it wakes up all other threads that
> > are waiting (classic thundering-herd) - one will get the lock and the
> > others go to sleep.
> > When you have few cores, this is not very noticeable: by the time the
> > 4th or 5th thread gets enough CPU time to try to claim the lock, the
> > earlier threads have claimed it, done what was needed, and released.
> > With 50+ cores, the contention can easily be measured.
> > 
> > This patchset creates a tree of pending lock request in which siblings
> > don't conflict and each lock request does conflict with its parent.
> > When a lock is released, only requests which don't conflict with each
> > other a woken.
> > 
> > Testing shows that lock-acquisitions-per-second is now fairly stable even
> > as number of contending process goes to 1000.  Without this patch,
> > locks-per-second drops off steeply after a few 10s of processes.
> > 
> > There is a small cost to this extra complexity.
> > At 20 processes running a particular test on 72 cores, the lock
> > acquisitions per second drops from 1.8 million to 1.4 million with
> > this patch.  For 100 processes, this patch still provides 1.4 million
> > while without this patch there are about 700,000.
> > 
> > NeilBrown
> > 
> > ---
> > 
> > NeilBrown (4):
> >       fs/locks: rename some lists and pointers.
> >       fs/locks: allow a lock request to block other requests.
> >       fs/locks: change all *_conflict() functions to return bool.
> >       fs/locks: create a tree of dependent requests.
> > 
> > 
> >  fs/cifs/file.c                  |    2 -
> >  fs/locks.c                      |  142 +++++++++++++++++++++++++--------------
> >  include/linux/fs.h              |    5 +
> >  include/trace/events/filelock.h |   16 ++--
> >  4 files changed, 103 insertions(+), 62 deletions(-)
> > 
> 
> Nice work! I looked over this and I think it looks good.
> 
> I made an attempt to fix this issue several years ago, but my method
> sucked as it ended up penalizing the unlocking task too much. This is
> much cleaner and should scale well overall, I think.

I think I also took a crack at this at one point while I was at UM/CITI
and never got anything I was happy with.  Looks like good work!

I remember one main obstacle was that I felt like I never had a good
benchmark....

How did you choose this workload and hardware?  Was it in fact udev
(booting a large machine?), or was there some other motivation?

Not that I'm likely to do it any time soon, but could you share
sufficient details for someone else to reproduce your results?

--b.
J. Bruce Fields Aug. 8, 2018, 7:54 p.m. UTC | #3
On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote:
> If you have a many-core machine, and have many threads all wanting to
> briefly lock a give file (udev is known to do this), you can get quite
> poor performance.
> 
> When one thread releases a lock, it wakes up all other threads that
> are waiting (classic thundering-herd) - one will get the lock and the
> others go to sleep.
> When you have few cores, this is not very noticeable: by the time the
> 4th or 5th thread gets enough CPU time to try to claim the lock, the
> earlier threads have claimed it, done what was needed, and released.
> With 50+ cores, the contention can easily be measured.
> 
> This patchset creates a tree of pending lock request in which siblings
> don't conflict and each lock request does conflict with its parent.
> When a lock is released, only requests which don't conflict with each
> other a woken.

Are you sure you aren't depending on the (incorrect) assumption that "X
blocks Y" is a transitive relation?

OK I should be able to answer that question myself, my patience for
code-reading is at a real low this afternoon....

--b.

> 
> Testing shows that lock-acquisitions-per-second is now fairly stable even
> as number of contending process goes to 1000.  Without this patch,
> locks-per-second drops off steeply after a few 10s of processes.
> 
> There is a small cost to this extra complexity.
> At 20 processes running a particular test on 72 cores, the lock
> acquisitions per second drops from 1.8 million to 1.4 million with
> this patch.  For 100 processes, this patch still provides 1.4 million
> while without this patch there are about 700,000.
> 
> NeilBrown
> 
> ---
> 
> NeilBrown (4):
>       fs/locks: rename some lists and pointers.
>       fs/locks: allow a lock request to block other requests.
>       fs/locks: change all *_conflict() functions to return bool.
>       fs/locks: create a tree of dependent requests.
> 
> 
>  fs/cifs/file.c                  |    2 -
>  fs/locks.c                      |  142 +++++++++++++++++++++++++--------------
>  include/linux/fs.h              |    5 +
>  include/trace/events/filelock.h |   16 ++--
>  4 files changed, 103 insertions(+), 62 deletions(-)
> 
> --
> Signature
J. Bruce Fields Aug. 8, 2018, 8:09 p.m. UTC | #4
On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote:
> On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote:
> > If you have a many-core machine, and have many threads all wanting to
> > briefly lock a give file (udev is known to do this), you can get quite
> > poor performance.
> > 
> > When one thread releases a lock, it wakes up all other threads that
> > are waiting (classic thundering-herd) - one will get the lock and the
> > others go to sleep.
> > When you have few cores, this is not very noticeable: by the time the
> > 4th or 5th thread gets enough CPU time to try to claim the lock, the
> > earlier threads have claimed it, done what was needed, and released.
> > With 50+ cores, the contention can easily be measured.
> > 
> > This patchset creates a tree of pending lock request in which siblings
> > don't conflict and each lock request does conflict with its parent.
> > When a lock is released, only requests which don't conflict with each
> > other a woken.
> 
> Are you sure you aren't depending on the (incorrect) assumption that "X
> blocks Y" is a transitive relation?
> 
> OK I should be able to answer that question myself, my patience for
> code-reading is at a real low this afternoon....

In other words, is there the possibility of a tree of, say, exclusive
locks with (offset, length) like:

	(0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4)

and when waking (0, 4) you could wake up (2, 2) but not (0, 2), leaving
a process waiting without there being an actual conflict.

--b.
Frank Filz Aug. 8, 2018, 9:15 p.m. UTC | #5
> On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote:
> > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote:
> > > If you have a many-core machine, and have many threads all wanting
> > > to briefly lock a give file (udev is known to do this), you can get
> > > quite poor performance.
> > >
> > > When one thread releases a lock, it wakes up all other threads that
> > > are waiting (classic thundering-herd) - one will get the lock and
> > > the others go to sleep.
> > > When you have few cores, this is not very noticeable: by the time
> > > the 4th or 5th thread gets enough CPU time to try to claim the lock,
> > > the earlier threads have claimed it, done what was needed, and
> > > released.
> > > With 50+ cores, the contention can easily be measured.
> > >
> > > This patchset creates a tree of pending lock request in which
> > > siblings don't conflict and each lock request does conflict with its
> > > parent.
> > > When a lock is released, only requests which don't conflict with
> > > each other a woken.
> >
> > Are you sure you aren't depending on the (incorrect) assumption that
> > "X blocks Y" is a transitive relation?
> >
> > OK I should be able to answer that question myself, my patience for
> > code-reading is at a real low this afternoon....
> 
> In other words, is there the possibility of a tree of, say, exclusive
> locks with (offset, length) like:
> 
> 	(0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4)
> 
> and when waking (0, 4) you could wake up (2, 2) but not (0, 2), leaving
> a process waiting without there being an actual conflict.

That implies that the order the locks were received in was:

(0,4)
(2,2)
(1,2)
(0,2)

But couldn't (0,2) have been made only dependent on (0,4)? Of course then
(1,2) is dependent on BOTH (2,2) and (0,2). Does this tree logic handle that
case?

On the other hand, there might be a fairness reason to make (0,2) wait for
(1,2) even though it could have been granted concurrently with (2,2) since
this dependency tree also preserves some of the order of lock requests.

Frank
J. Bruce Fields Aug. 8, 2018, 9:28 p.m. UTC | #6
On Wed, Aug 08, 2018 at 04:09:12PM -0400, J. Bruce Fields wrote:
> On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote:
> > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote:
> > > If you have a many-core machine, and have many threads all wanting to
> > > briefly lock a give file (udev is known to do this), you can get quite
> > > poor performance.
> > > 
> > > When one thread releases a lock, it wakes up all other threads that
> > > are waiting (classic thundering-herd) - one will get the lock and the
> > > others go to sleep.
> > > When you have few cores, this is not very noticeable: by the time the
> > > 4th or 5th thread gets enough CPU time to try to claim the lock, the
> > > earlier threads have claimed it, done what was needed, and released.
> > > With 50+ cores, the contention can easily be measured.
> > > 
> > > This patchset creates a tree of pending lock request in which siblings
> > > don't conflict and each lock request does conflict with its parent.
> > > When a lock is released, only requests which don't conflict with each
> > > other a woken.
> > 
> > Are you sure you aren't depending on the (incorrect) assumption that "X
> > blocks Y" is a transitive relation?
> > 
> > OK I should be able to answer that question myself, my patience for
> > code-reading is at a real low this afternoon....
> 
> In other words, is there the possibility of a tree of, say, exclusive
> locks with (offset, length) like:
> 
> 	(0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4)
> 
> and when waking (0, 4) you could wake up (2, 2) but not (0, 2), leaving
> a process waiting without there being an actual conflict.

After batting it back and forth with Jeff on IRC....  So do I understand
right that when we wake a waiter, we leave its own tree of waiters
intact, and when it wakes, if it finds a conflict, it just adds its lock
(with its tree of waiters) into the tree of the conflicting lock?

If so then yes I think that depends on the transitivity
assumption--you're assuming that finding a conflict between the root of
the tree and a lock proves that all the other members of the tree also
conflict.

So maybe this example works.  (All locks are exclusive and written
(offset, length), X->Y means X is waiting on Y.)

	process acquires (0,3)
	2nd process requests (1,2), is put to sleep.
	3rd process requests (0,2), is put to sleep.

	The tree of waiters now looks like (0,2)->(1,2)->(0,3)

	(0,3) is unlocked.
	A 4th process races in and locks (2,2).
	The 2nd process wakes up, sees this new conflict, and waits on
	(2,2).  Now the tree looks like (0,2)->(1,2)->(2,2), and (0,2)
	is waiting for no reason.

?

--b.
NeilBrown Aug. 8, 2018, 10:34 p.m. UTC | #7
On Wed, Aug 08 2018, Frank Filz wrote:

>> On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote:
>> > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote:
>> > > If you have a many-core machine, and have many threads all wanting
>> > > to briefly lock a give file (udev is known to do this), you can get
>> > > quite poor performance.
>> > >
>> > > When one thread releases a lock, it wakes up all other threads that
>> > > are waiting (classic thundering-herd) - one will get the lock and
>> > > the others go to sleep.
>> > > When you have few cores, this is not very noticeable: by the time
>> > > the 4th or 5th thread gets enough CPU time to try to claim the lock,
>> > > the earlier threads have claimed it, done what was needed, and
>> > > released.
>> > > With 50+ cores, the contention can easily be measured.
>> > >
>> > > This patchset creates a tree of pending lock request in which
>> > > siblings don't conflict and each lock request does conflict with its
>> > > parent.
>> > > When a lock is released, only requests which don't conflict with
>> > > each other a woken.
>> >
>> > Are you sure you aren't depending on the (incorrect) assumption that
>> > "X blocks Y" is a transitive relation?
>> >
>> > OK I should be able to answer that question myself, my patience for
>> > code-reading is at a real low this afternoon....
>> 
>> In other words, is there the possibility of a tree of, say, exclusive
>> locks with (offset, length) like:
>> 
>> 	(0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4)
>> 
>> and when waking (0, 4) you could wake up (2, 2) but not (0, 2), leaving
>> a process waiting without there being an actual conflict.
>
> That implies that the order the locks were received in was:
>
> (0,4)
> (2,2)
> (1,2)
> (0,2)
>
> But couldn't (0,2) have been made only dependent on (0,4)?

Correct.  (0,2) would be a child of (0,4), but a sibling of (2,2).

>    Of course then
> (1,2) is dependent on BOTH (2,2) and (0,2). Does this tree logic handle that
> case?

No, there is no support for double dependencies.  It is still possible
for a lock request to be woken up which still cannot be satisfied.
When one blocking lock is unlocked, a pending request might then be
queued against a different blocking lock.

>
> On the other hand, there might be a fairness reason to make (0,2) wait for
> (1,2) even though it could have been granted concurrently with (2,2) since
> this dependency tree also preserves some of the order of lock requests.

The locking API doesn't promise fairness, and I don't think we should
try too hard to achieve it.  Certainly we shouldn't actively fight it
(so no LIFO queuing) but if we try harder than that I suspect it would
just be needless complexity.

Thanks,
NeilBrown


>
> Frank
NeilBrown Aug. 8, 2018, 10:39 p.m. UTC | #8
On Wed, Aug 08 2018, J. Bruce Fields wrote:

> On Wed, Aug 08, 2018 at 04:09:12PM -0400, J. Bruce Fields wrote:
>> On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote:
>> > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote:
>> > > If you have a many-core machine, and have many threads all wanting to
>> > > briefly lock a give file (udev is known to do this), you can get quite
>> > > poor performance.
>> > > 
>> > > When one thread releases a lock, it wakes up all other threads that
>> > > are waiting (classic thundering-herd) - one will get the lock and the
>> > > others go to sleep.
>> > > When you have few cores, this is not very noticeable: by the time the
>> > > 4th or 5th thread gets enough CPU time to try to claim the lock, the
>> > > earlier threads have claimed it, done what was needed, and released.
>> > > With 50+ cores, the contention can easily be measured.
>> > > 
>> > > This patchset creates a tree of pending lock request in which siblings
>> > > don't conflict and each lock request does conflict with its parent.
>> > > When a lock is released, only requests which don't conflict with each
>> > > other a woken.
>> > 
>> > Are you sure you aren't depending on the (incorrect) assumption that "X
>> > blocks Y" is a transitive relation?
>> > 
>> > OK I should be able to answer that question myself, my patience for
>> > code-reading is at a real low this afternoon....
>> 
>> In other words, is there the possibility of a tree of, say, exclusive
>> locks with (offset, length) like:
>> 
>> 	(0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4)
>> 
>> and when waking (0, 4) you could wake up (2, 2) but not (0, 2), leaving
>> a process waiting without there being an actual conflict.
>
> After batting it back and forth with Jeff on IRC....  So do I understand
> right that when we wake a waiter, we leave its own tree of waiters
> intact, and when it wakes if it finds a conflict it just adds it lock
> (with tree of waiters) in to the tree of the conflicting lock?
>
> If so then yes I think that depends on the transitivity
> assumption--you're assuming that finding a conflict between the root of
> the tree and a lock proves that all the other members of the tree also
> conflict.

Ahhh... I see what you are getting at.  When lock requests are added
individually, they will always be blocked by all ancestors in the tree.
But when they are added as a group, that might not be the case.
So we might need to re-add every request individually.
In the (common) case where a lock request is blocked across its whole
range, we can just attach the whole tree beneath the blocker.  In other
cases we need a finer grained approach.

I'll have a look and see how to make the code work for this case.

Thanks for the thorough review!

NeilBrown

>
> So maybe this example works.  (All locks are exclusive and written
> (offset, length), X->Y means X is waiting on Y.)
>
> 	process acquires (0,3)
> 	2nd process requests (1,2), is put to sleep.
> 	3rd process requests (0,2), is put to sleep.
>
> 	The tree of waiters now looks like (0,2)->(1,2)->(0,3)
>
> 	(0,3) is unlocked.
> 	A 4th process races in and locks (2,2).
> 	The 2nd process wakes up, sees this new conflict, and waits on
> 	(2,2).  Now the tree looks like (0,2)->(1,2)->(2,2), and (0,2)
> 	is waiting for no reason.
>
> ?
>
> --b.
Jeff Layton Aug. 8, 2018, 10:50 p.m. UTC | #9
On Wed, 2018-08-08 at 17:28 -0400, J. Bruce Fields wrote:
> On Wed, Aug 08, 2018 at 04:09:12PM -0400, J. Bruce Fields wrote:
> > On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote:
> > > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote:
> > > > If you have a many-core machine, and have many threads all wanting to
> > > > briefly lock a give file (udev is known to do this), you can get quite
> > > > poor performance.
> > > > 
> > > > When one thread releases a lock, it wakes up all other threads that
> > > > are waiting (classic thundering-herd) - one will get the lock and the
> > > > others go to sleep.
> > > > When you have few cores, this is not very noticeable: by the time the
> > > > 4th or 5th thread gets enough CPU time to try to claim the lock, the
> > > > earlier threads have claimed it, done what was needed, and released.
> > > > With 50+ cores, the contention can easily be measured.
> > > > 
> > > > This patchset creates a tree of pending lock request in which siblings
> > > > don't conflict and each lock request does conflict with its parent.
> > > > When a lock is released, only requests which don't conflict with each
> > > > other a woken.
> > > 
> > > Are you sure you aren't depending on the (incorrect) assumption that "X
> > > blocks Y" is a transitive relation?
> > > 
> > > OK I should be able to answer that question myself, my patience for
> > > code-reading is at a real low this afternoon....
> > 
> > In other words, is there the possibility of a tree of, say, exclusive
> > locks with (offset, length) like:
> > 
> > 	(0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4)
> > 
> > and when waking (0, 4) you could wake up (2, 2) but not (0, 2), leaving
> > a process waiting without there being an actual conflict.
> 
> After batting it back and forth with Jeff on IRC....  So do I understand
> right that when we wake a waiter, we leave its own tree of waiters
> intact, and when it wakes if it finds a conflict it just adds it lock
> (with tree of waiters) in to the tree of the conflicting lock?
> 
> If so then yes I think that depends on the transitivity
> assumption--you're assuming that finding a conflict between the root of
> the tree and a lock proves that all the other members of the tree also
> conflict.
> 
> So maybe this example works.  (All locks are exclusive and written
> (offset, length), X->Y means X is waiting on Y.)
> 
> 	process acquires (0,3)
> 	2nd process requests (1,2), is put to sleep.
> 	3rd process requests (0,2), is put to sleep.
> 
> 	The tree of waiters now looks like (0,2)->(1,2)->(0,3)
> 
> 	(0,3) is unlocked.
> 	A 4th process races in and locks (2,2).
> 	The 2nd process wakes up, sees this new conflict, and waits on
> 	(2,2).  Now the tree looks like (0,2)->(1,2)->(2,2), and (0,2)
> 	is waiting for no reason.
> 

That seems like a legit problem.

One possible fix might be to have the waiter on (1,2) walk down the
entire subtree and wake up any waiter that is waiting on a lock that
doesn't conflict with the lock on which it's waiting.

So, before the task waiting on 1,2 goes back to sleep to wait on 2,2, it
could walk down its entire fl_blocked subtree and wake up anything
waiting on a lock that doesn't conflict with (2,2).

That's potentially an expensive operation, but:

a) the task is going back to sleep anyway, so letting it do a little
extra work before that should be no big deal

b) it's probably still cheaper than waking up the whole herd
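
A toy user-space sketch of that walk (purely illustrative; the struct
and names here are invented, not the kernel's):

#include <stdio.h>

struct req {
	long off, len;			/* (offset, length) */
	struct req *children[8];	/* waiters queued behind this request */
	int nchildren;
};

static int conflicts(const struct req *a, const struct req *b)
{
	return a->off < b->off + b->len && b->off < a->off + a->len;
}

/* Before "me" goes back to sleep behind new_blocker, walk my subtree
 * and wake every waiter that does not conflict with new_blocker.
 * Waiters that do conflict stay queued, so keep descending past them. */
static void wake_non_conflicting(struct req *me, const struct req *new_blocker)
{
	for (int i = 0; i < me->nchildren; i++) {
		struct req *child = me->children[i];

		if (!conflicts(child, new_blocker))
			printf("wake (%ld,%ld)\n", child->off, child->len);
		else
			wake_non_conflicting(child, new_blocker);
	}
}

int main(void)
{
	/* Bruce's example: the tree is (0,2)->(1,2), and (1,2) has just
	 * found the new conflicting lock (2,2). */
	struct req c = { 0, 2 };
	struct req me = { 1, 2, { &c }, 1 };
	struct req blocker = { 2, 2 };

	wake_non_conflicting(&me, &blocker);	/* prints: wake (0,2) */
	return 0;
}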
Frank Filz Aug. 8, 2018, 11:34 p.m. UTC | #10
> On Wed, 2018-08-08 at 17:28 -0400, J. Bruce Fields wrote:
> > On Wed, Aug 08, 2018 at 04:09:12PM -0400, J. Bruce Fields wrote:
> > > On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote:
> > > > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote:
> > > > > If you have a many-core machine, and have many threads all
> > > > > wanting to briefly lock a give file (udev is known to do this),
> > > > > you can get quite poor performance.
> > > > >
> > > > > When one thread releases a lock, it wakes up all other threads
> > > > > that are waiting (classic thundering-herd) - one will get the
> > > > > lock and the others go to sleep.
> > > > > When you have few cores, this is not very noticeable: by the
> > > > > time the 4th or 5th thread gets enough CPU time to try to claim
> > > > > the lock, the earlier threads have claimed it, done what was needed, and
> > > > > released.
> > > > > With 50+ cores, the contention can easily be measured.
> > > > >
> > > > > This patchset creates a tree of pending lock request in which
> > > > > siblings don't conflict and each lock request does conflict with its parent.
> > > > > When a lock is released, only requests which don't conflict with
> > > > > each other a woken.
> > > >
> > > > Are you sure you aren't depending on the (incorrect) assumption
> > > > that "X blocks Y" is a transitive relation?
> > > >
> > > > OK I should be able to answer that question myself, my patience
> > > > for code-reading is at a real low this afternoon....
> > >
> > > In other words, is there the possibility of a tree of, say,
> > > exclusive locks with (offset, length) like:
> > >
> > > 	(0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4)
> > >
> > > and when waking (0, 4) you could wake up (2, 2) but not (0, 2),
> > > leaving a process waiting without there being an actual conflict.
> >
> > After batting it back and forth with Jeff on IRC....  So do I
> > understand right that when we wake a waiter, we leave its own tree of
> > waiters intact, and when it wakes if it finds a conflict it just adds
> > it lock (with tree of waiters) in to the tree of the conflicting lock?
> >
> > If so then yes I think that depends on the transitivity
> > assumption--you're assuming that finding a conflict between the root
> > of the tree and a lock proves that all the other members of the tree
> > also conflict.
> >
> > So maybe this example works.  (All locks are exclusive and written
> > (offset, length), X->Y means X is waiting on Y.)
> >
> > 	process acquires (0,3)
> > 	2nd process requests (1,2), is put to sleep.
> > 	3rd process requests (0,2), is put to sleep.
> >
> > 	The tree of waiters now looks like (0,2)->(1,2)->(0,3)
> >
> > 	(0,3) is unlocked.
> > 	A 4th process races in and locks (2,2).
> > 	The 2nd process wakes up, sees this new conflict, and waits on
> > 	(2,2).  Now the tree looks like (0,2)->(1,2)->(2,2), and (0,2)
> > 	is waiting for no reason.
> >
> 
> That seems like a legit problem.
> 
> One possible fix might be to have the waiter on (1,2) walk down the entire
> subtree and wake up any waiter that is waiting on a lock that doesn't conflict
> with the lock on which it's waiting.
> 
> So, before the task waiting on 1,2 goes back to sleep to wait on 2,2, it could
> walk down its entire fl_blocked subtree and wake up anything waiting on a lock
> that doesn't conflict with (2,2).
> 
> That's potentially an expensive operation, but:
> 
> a) the task is going back to sleep anyway, so letting it do a little extra work
> before that should be no big deal
> 
> b) it's probably still cheaper than waking up the whole herd

Yea, I think so.

Now here's another question... How does this new logic play with Open
File Description Locks? Should still be ok since there's a thread
waiting on each of those.

Frank
NeilBrown Aug. 9, 2018, 12:58 a.m. UTC | #11
On Wed, Aug 08 2018, J. Bruce Fields wrote:

> On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote:
>> On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote:
>> > If you have a many-core machine, and have many threads all wanting to
>> > briefly lock a give file (udev is known to do this), you can get quite
>> > poor performance.
>> > 
>> > When one thread releases a lock, it wakes up all other threads that
>> > are waiting (classic thundering-herd) - one will get the lock and the
>> > others go to sleep.
>> > When you have few cores, this is not very noticeable: by the time the
>> > 4th or 5th thread gets enough CPU time to try to claim the lock, the
>> > earlier threads have claimed it, done what was needed, and released.
>> > With 50+ cores, the contention can easily be measured.
>> > 
>> > This patchset creates a tree of pending lock request in which siblings
>> > don't conflict and each lock request does conflict with its parent.
>> > When a lock is released, only requests which don't conflict with each
>> > other a woken.
>> > 
>> > Testing shows that lock-acquisitions-per-second is now fairly stable even
>> > as number of contending process goes to 1000.  Without this patch,
>> > locks-per-second drops off steeply after a few 10s of processes.
>> > 
>> > There is a small cost to this extra complexity.
>> > At 20 processes running a particular test on 72 cores, the lock
>> > acquisitions per second drops from 1.8 million to 1.4 million with
>> > this patch.  For 100 processes, this patch still provides 1.4 million
>> > while without this patch there are about 700,000.
>> > 
>> > NeilBrown
>> > 
>> > ---
>> > 
>> > NeilBrown (4):
>> >       fs/locks: rename some lists and pointers.
>> >       fs/locks: allow a lock request to block other requests.
>> >       fs/locks: change all *_conflict() functions to return bool.
>> >       fs/locks: create a tree of dependent requests.
>> > 
>> > 
>> >  fs/cifs/file.c                  |    2 -
>> >  fs/locks.c                      |  142 +++++++++++++++++++++++++--------------
>> >  include/linux/fs.h              |    5 +
>> >  include/trace/events/filelock.h |   16 ++--
>> >  4 files changed, 103 insertions(+), 62 deletions(-)
>> > 
>> 
>> Nice work! I looked over this and I think it looks good.
>> 
>> I made an attempt to fix this issue several years ago, but my method
>> sucked as it ended up penalizing the unlocking task too much. This is
>> much cleaner and should scale well overall, I think.
>
> I think I also took a crack at this at one point while I was at UM/CITI
> and never got anything I was happy with.  Looks like good work!
>
> I remember one main obstacle that I felt like I never had a good
> benchmark....
>
> How did you choose this workload and hardware?  Was it in fact udev
> (booting a large machine?), or was there some other motivation?

I'm hoping Martin will chime in here - he identified the problem and
did most of the testing...

NeilBrown


>
> Not that I'm likely to do it any time soon, but could you share
> sufficient details for someone else to reproduce your results?
>
> --b.
NeilBrown Aug. 9, 2018, 2:52 a.m. UTC | #12
On Wed, Aug 08 2018, Frank Filz wrote:

>
> Now here's another question... How does this new logic play with Open
> File Description Locks? Should still be ok since there's a thread
> waiting on each of those.

At this level POSIX locks and OFD locks are almost identical - just a
different owner.
So this enhancement should work equally well for POSIX, flock, OFD and
even leases.
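
As a minimal example, an OFD lock request goes through the same fcntl()
byte-range interface, just with the F_OFD_* commands and l_pid set to 0
(this assumes Linux 3.15+ and a glibc that exposes F_OFD_SETLK*; error
handling omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("testfile", O_RDWR | O_CREAT, 0600);
	struct flock fl = {
		.l_type   = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start  = 0,
		.l_len    = 0,
		.l_pid    = 0,		/* must be 0 for OFD locks */
	};

	fcntl(fd, F_OFD_SETLKW, &fl);	/* blocks, like F_SETLKW */
	/* ... */
	fl.l_type = F_UNLCK;
	fcntl(fd, F_OFD_SETLK, &fl);
	close(fd);
	return 0;
}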

Thanks,
NeilBrown

>
> Frank
J. Bruce Fields Aug. 9, 2018, 1 p.m. UTC | #13
On Wed, Aug 08, 2018 at 06:50:06PM -0400, Jeff Layton wrote:
> That seems like a legit problem.
> 
> One possible fix might be to have the waiter on (1,2) walk down the
> entire subtree and wake up any waiter that is waiting on a lock that
> doesn't conflict with the lock on which it's waiting.
> 
> So, before the task waiting on 1,2 goes back to sleep to wait on 2,2, it
> could walk down its entire fl_blocked subtree and wake up anything
> waiting on a lock that doesn't conflict with (2,2).
> 
> That's potentially an expensive operation, but:
> 
> a) the task is going back to sleep anyway, so letting it do a little
> extra work before that should be no big deal

I don't understand why cpu used by a process going to sleep is cheaper
than cpu used in any other situation.

> b) it's probably still cheaper than waking up the whole herd

Yeah, I'd like to understand this.

I feel like Neil's addressing two different performance costs:

	- the cost of waking up all the waiters
	- the cost of walking the list of waiters

Are they equally important?

If we only cared about the former, and only in simple cases, we could
walk the entire list and skip waking up only the locks that conflict
with the first one we wake.  We wouldn't need the tree.
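
Roughly, the idea in user-space sketch form (illustrative only; names
and structures are invented, not the kernel's):

#include <stdio.h>

struct req { long off, len; };		/* (offset, length) */

static int conflicts(const struct req *a, const struct req *b)
{
	return a->off < b->off + b->len && b->off < a->off + a->len;
}

/* Wake the first waiter, then every other waiter that does not
 * conflict with it.  No tree needed, but conflicts() runs over the
 * whole list on every unlock. */
static void wake_waiters(const struct req *waiters, int n)
{
	if (n == 0)
		return;
	printf("wake (%ld,%ld)\n", waiters[0].off, waiters[0].len);
	for (int i = 1; i < n; i++)
		if (!conflicts(&waiters[i], &waiters[0]))
			printf("wake (%ld,%ld)\n",
			       waiters[i].off, waiters[i].len);
}

int main(void)
{
	struct req waiters[] = { { 0, 2 }, { 1, 2 }, { 2, 2 } };

	wake_waiters(waiters, 3);	/* wakes (0,2) and (2,2), skips (1,2) */
	return 0;
}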

--b.
Jeff Layton Aug. 9, 2018, 2:49 p.m. UTC | #14
On Thu, 2018-08-09 at 09:00 -0400, J. Bruce Fields wrote:
> On Wed, Aug 08, 2018 at 06:50:06PM -0400, Jeff Layton wrote:
> > That seems like a legit problem.
> > 
> > One possible fix might be to have the waiter on (1,2) walk down the
> > entire subtree and wake up any waiter that is waiting on a lock that
> > doesn't conflict with the lock on which it's waiting.
> > 
> > So, before the task waiting on 1,2 goes back to sleep to wait on 2,2, it
> > could walk down its entire fl_blocked subtree and wake up anything
> > waiting on a lock that doesn't conflict with (2,2).
> > 
> > That's potentially an expensive operation, but:
> > 
> > a) the task is going back to sleep anyway, so letting it do a little
> > extra work before that should be no big deal
> 
> I don't understand why cpu used by a process going to sleep is cheaper
> than cpu used in any other situation.
> 

It's not any cheaper in that sense. It's just that this task is not
slated to be doing any useful work anyway as it's going to sleep, so we
aren't delaying any "real work" by this task by having it do this 
before returning to userland. It's already scheduled and holds the
appropriate lock.

The alternative would be to do this in the context of a different task,
but that means extra context switching and spinlocking, etc.

> > b) it's probably still cheaper than waking up the whole herd
> 
> Yeah, I'd like to understand this.
> 
> I feel like Neil's addressing two different performance costs:
> 
> 	- the cost of waking up all the waiters
> 	- the cost of walking the list of waiters
> 
> Are they equally important?
> 
> If we only cared about the former, and only in simple cases, we could
> walk the entire list and skip waking up only the locks that conflict
> with the first one we wake.  We wouldn't need the tree.
NeilBrown Aug. 9, 2018, 11:56 p.m. UTC | #15
On Thu, Aug 09 2018, J. Bruce Fields wrote:

> On Wed, Aug 08, 2018 at 06:50:06PM -0400, Jeff Layton wrote:
>> That seems like a legit problem.
>> 
>> One possible fix might be to have the waiter on (1,2) walk down the
>> entire subtree and wake up any waiter that is waiting on a lock that
>> doesn't conflict with the lock on which it's waiting.
>> 
>> So, before the task waiting on 1,2 goes back to sleep to wait on 2,2, it
>> could walk down its entire fl_blocked subtree and wake up anything
>> waiting on a lock that doesn't conflict with (2,2).
>> 
>> That's potentially an expensive operation, but:
>> 
>> a) the task is going back to sleep anyway, so letting it do a little
>> extra work before that should be no big deal
>
> I don't understand why cpu used by a process going to sleep is cheaper
> than cpu used in any other situation.
>
>> b) it's probably still cheaper than waking up the whole herd
>
> Yeah, I'd like to understand this.
>
> I feel like Neil's addressing two different performance costs:
>
> 	- the cost of waking up all the waiters
> 	- the cost of walking the list of waiters
>
> Are they equally important?

Probably not.  Reducing wakeups is likely the most important.

>
> If we only cared about the former, and only in simple cases, we could
> walk the entire list and skip waking up only the locks that conflict
> with the first one we wake.  We wouldn't need the tree.

Yes, we could do that. It would probably make the code simpler.
It would mean doing the conflicts() tests when performing wake-up rather
than prior to going to sleep.  In general it would mean doing the tests
more often, as the tree effectively records the results of the previous
time conflicts() was run.
You might get a quadratic effect though ... wouldn't you want to
skip anything that conflicts with anything that has been woken?

If the tree-management code turns out to be too complex, it would
certainly be an option.  I think the tree approach should result in less
total CPU usage.

Thanks for the thought - taking this simple approach really hadn't
occurred to me. :-(

NeilBrown
J. Bruce Fields Aug. 10, 2018, 1:05 a.m. UTC | #16
On Fri, Aug 10, 2018 at 09:56:07AM +1000, NeilBrown wrote:
> On Thu, Aug 09 2018, J. Bruce Fields wrote:
> > If we only cared about the former, and only in simple cases, we could
> > walk the entire list and skip waking up only the locks that conflict
> > with the first one we wake.  We wouldn't need the tree.
> 
> Yes, we could do that. It would probably make the code simpler.
> It would mean doing the conflicts() tests when performing wake-up rather
> than prior to going to sleep.  In general it would mean doing the tests
> more often, as the tree effectively records the results of the previous
> time conflicts() was run.
> You might get a quadratic effect though ... wouldn't you want to
> skip anything that conflicts with anything that has been woken?

I was thinking it'd be simplest just to check for conflict with the
first thing that you decide to wake.  That might be all that's necessary
for typical cases.

If you wanted to keep a running list of the locks you've chosen to wake
so far and check each one against that list, I guess you could.

> If the tree-management code turns out to be too complex, it would
> certainly be an option.  I think the tree approach should result in less
> total CPU usage..

Have we ever looked into the CPU usage of deadlock detection?  Might
experiment with turning that off.  But I may be biased by my white-hot
hatred of it.

--b.
Martin Wilck Aug. 20, 2018, 11:02 a.m. UTC | #17
On Wed, 2018-08-08 at 14:29 -0400, J. Bruce Fields wrote:
> On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote:
> > On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote:
> > > If you have a many-core machine, and have many threads all
> > > wanting to
> > > briefly lock a give file (udev is known to do this), you can get
> > > quite
> > > poor performance.
> > > 
> > > When one thread releases a lock, it wakes up all other threads
> > > that
> > > are waiting (classic thundering-herd) - one will get the lock and
> > > the
> > > others go to sleep.
> > > When you have few cores, this is not very noticeable: by the time
> > > the
> > > 4th or 5th thread gets enough CPU time to try to claim the lock,
> > > the
> > > earlier threads have claimed it, done what was needed, and
> > > released.
> > > With 50+ cores, the contention can easily be measured.
> > > 
> > > This patchset creates a tree of pending lock request in which
> > > siblings
> > > don't conflict and each lock request does conflict with its
> > > parent.
> > > When a lock is released, only requests which don't conflict with
> > > each
> > > other a woken.
> > > 
> > > Testing shows that lock-acquisitions-per-second is now fairly
> > > stable even
> > > as number of contending process goes to 1000.  Without this
> > > patch,
> > > locks-per-second drops off steeply after a few 10s of processes.
> > > 
> > > There is a small cost to this extra complexity.
> > > At 20 processes running a particular test on 72 cores, the lock
> > > acquisitions per second drops from 1.8 million to 1.4 million
> > > with
> > > this patch.  For 100 processes, this patch still provides 1.4
> > > million
> > > while without this patch there are about 700,000.
> > > 
> > > NeilBrown
> > > 
> > > ---
> > > 
> > > NeilBrown (4):
> > >       fs/locks: rename some lists and pointers.
> > >       fs/locks: allow a lock request to block other requests.
> > >       fs/locks: change all *_conflict() functions to return bool.
> > >       fs/locks: create a tree of dependent requests.
> > > 
> > > 
> > >  fs/cifs/file.c                  |    2 -
> > >  fs/locks.c                      |  142
> > > +++++++++++++++++++++++++--------------
> > >  include/linux/fs.h              |    5 +
> > >  include/trace/events/filelock.h |   16 ++--
> > >  4 files changed, 103 insertions(+), 62 deletions(-)
> > > 
> > 
> > Nice work! I looked over this and I think it looks good.
> > 
> > I made an attempt to fix this issue several years ago, but my
> > method
> > sucked as it ended up penalizing the unlocking task too much. This
> > is
> > much cleaner and should scale well overall, I think.
> 
> I think I also took a crack at this at one point while I was at
> UM/CITI
> and never got anything I was happy with.  Looks like good work!
> 
> I remember one main obstacle that I felt like I never had a good
> benchmark....
> 
> How did you choose this workload and hardware?  Was it in fact udev
> (booting a large machine?), or was there some other motivation?

Some details can be found here:

https://github.com/systemd/systemd/pull/9551

https://github.com/systemd/systemd/pull/8667#issuecomment-385520335
and comments further down. 8667 has been superseded by 9551.

The original problem was that the symlink "/dev/disk/by-
partlabel/primary" may be claimed by _many_ devices on big systems
under certain distributions, which use older versions of parted for
partition creation on GPT disk labels. I've seen systems with literally
thousands of contenders for this symlink. 

We found that with current systemd, this can cause a boot-time race
where the wrong device is eventually assigned the "best" contender for
the symlink (e.g. a partition on multipath member rather than a
partition on the multipath map itself). I extended the udev test suite,
creating a test that makes this race easily reproducible, at least on
systems with many CPUs (the test host I used most had 72 cores).

I created a udev patch that would use systemd's built-in fcntl-based
locking to avoid this race, but I found that it would massively slow
down the system, and found the contention to be in the spin locks in
posix_lock_common(). (I therefore added more patches to systemd to
make the locking scale better, but that's irrelevant for the kernel-
side discussion).

I further created an artificial test just for the scaling of
fcntl(F_OFD_SETLKW) and flock(), with which I could reproduce the
scaling problem easily and do some quantitative experiments. My tests
didn't use any byte ranges, only "full" locking of 0-byte files.
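
The core of that kind of test is essentially a loop like the sketch
below (a sketch of the pattern described above, not the actual test
program; error handling omitted), run concurrently from N processes
against the same empty file while counting iterations per second:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("lockfile", O_RDWR | O_CREAT, 0600);
	struct flock fl = {
		.l_whence = SEEK_SET,
		.l_start  = 0,
		.l_len    = 0,		/* whole (0-byte) file */
		.l_pid    = 0,		/* required for OFD locks */
	};

	for (long i = 0; i < 1000000; i++) {
		fl.l_type = F_WRLCK;
		fcntl(fd, F_OFD_SETLKW, &fl);
		fl.l_type = F_UNLCK;
		fcntl(fd, F_OFD_SETLK, &fl);
	}
	close(fd);
	return 0;
}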

> Not that I'm likely to do it any time soon, but could you share
> sufficient details for someone else to reproduce your results?
> 
> --b.

The udev test code can be found in the above links. It adds a new
script "test/sd-script.py" that would be run after "test/sys-script.py" 
using a numeric argument indicating the number of contenders for the
test link to be created, such as "python test/sd-script.py test 1000".
Next step would be running "test/udev-test.pl 152" e.g. under perf (152
is the test ID of the scaling test).

Of course I can also share my other test program if you desire so.

Regards,
Martin
J. Bruce Fields Aug. 20, 2018, 8:02 p.m. UTC | #18
On Mon, Aug 20, 2018 at 01:02:21PM +0200, Martin Wilck wrote:
> On Wed, 2018-08-08 at 14:29 -0400, J. Bruce Fields wrote:
> > On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote:
> > > On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote:
> > > > If you have a many-core machine, and have many threads all
> > > > wanting to
> > > > briefly lock a give file (udev is known to do this), you can get
> > > > quite
> > > > poor performance.
> > > > 
> > > > When one thread releases a lock, it wakes up all other threads
> > > > that
> > > > are waiting (classic thundering-herd) - one will get the lock and
> > > > the
> > > > others go to sleep.
> > > > When you have few cores, this is not very noticeable: by the time
> > > > the
> > > > 4th or 5th thread gets enough CPU time to try to claim the lock,
> > > > the
> > > > earlier threads have claimed it, done what was needed, and
> > > > released.
> > > > With 50+ cores, the contention can easily be measured.
> > > > 
> > > > This patchset creates a tree of pending lock request in which
> > > > siblings
> > > > don't conflict and each lock request does conflict with its
> > > > parent.
> > > > When a lock is released, only requests which don't conflict with
> > > > each
> > > > other a woken.
> > > > 
> > > > Testing shows that lock-acquisitions-per-second is now fairly
> > > > stable even
> > > > as number of contending process goes to 1000.  Without this
> > > > patch,
> > > > locks-per-second drops off steeply after a few 10s of processes.
> > > > 
> > > > There is a small cost to this extra complexity.
> > > > At 20 processes running a particular test on 72 cores, the lock
> > > > acquisitions per second drops from 1.8 million to 1.4 million
> > > > with
> > > > this patch.  For 100 processes, this patch still provides 1.4
> > > > million
> > > > while without this patch there are about 700,000.
> > > > 
> > > > NeilBrown
> > > > 
> > > > ---
> > > > 
> > > > NeilBrown (4):
> > > >       fs/locks: rename some lists and pointers.
> > > >       fs/locks: allow a lock request to block other requests.
> > > >       fs/locks: change all *_conflict() functions to return bool.
> > > >       fs/locks: create a tree of dependent requests.
> > > > 
> > > > 
> > > >  fs/cifs/file.c                  |    2 -
> > > >  fs/locks.c                      |  142
> > > > +++++++++++++++++++++++++--------------
> > > >  include/linux/fs.h              |    5 +
> > > >  include/trace/events/filelock.h |   16 ++--
> > > >  4 files changed, 103 insertions(+), 62 deletions(-)
> > > > 
> > > 
> > > Nice work! I looked over this and I think it looks good.
> > > 
> > > I made an attempt to fix this issue several years ago, but my
> > > method
> > > sucked as it ended up penalizing the unlocking task too much. This
> > > is
> > > much cleaner and should scale well overall, I think.
> > 
> > I think I also took a crack at this at one point while I was at
> > UM/CITI
> > and never got anything I was happy with.  Looks like good work!
> > 
> > I remember one main obstacle that I felt like I never had a good
> > benchmark....
> > 
> > How did you choose this workload and hardware?  Was it in fact udev
> > (booting a large machine?), or was there some other motivation?
> 
> Some details can be found here:
> 
> https://github.com/systemd/systemd/pull/9551
> 
> https://github.com/systemd/systemd/pull/8667#issuecomment-385520335
> and comments further down. 8667 has been superseded by 9551.
> 
> The original problem was that the symlink "/dev/disk/by-
> partlabel/primary" may be claimed by _many_ devices on big systems
> under certain distributions, which use older versions of parted for
> partition creation on GPT disk labels. I've seen systems with literally
> thousands of contenders for this symlink. 
> 
> We found that with current systemd, this can cause a boot-time race
> where the wrong device is eventually assigned the "best" contender for
> the symlink (e.g. a partition on multipath member rather than a
> partition on the multipath map itself). I extended the udev test suite,
> creating a test that makes this race easily reproducible, at least on
> systems with many CPUs (the test host I used most had 72 cores).
> 
> I created an udev patch that would use systemd's built in fcntl-based
> locking to avoid this race, but I found that it would massively slow
> down the system, and found the contention to be in the spin locks in
> posix_lock_common(). (I therefore added more the systemd patches to
> make the locking scale better, but that's irrelevant for the kernel-
> side discussion).
> 
> I further created an artificial test just for the scaling of
> fcntl(F_OFD_SETLKW) and flock(), with which I could reproduce the
> scaling problem easily, and do some quantitive experiments. My tests
> didn't use any byte ranges, only "full" locking of 0-byte files.

Thanks for the explanation!

I wonder whether there's also anything we could do to keep every waiter
from having to take the same spinlock.

--b.

> 
> > Not that I'm likely to do it any time soon, but could you share
> > sufficient details for someone else to reproduce your results?
> > 
> > --b.
> 
> The udev test code can be found in the above links. It adds a new
> script "test/sd-script.py" that would be run after "test/sys-script.py" 
> using a numeric argument indicating the number of contenders for the
> test link to be created, such as "python test/sd-script.py test 1000".
> Next step would be running "test/udev-test.pl 152" e.g. under perf (152
> is the test ID of the scaling test).
> 
> Of course I can also share my other test program if you desire so.
Martin Wilck Aug. 20, 2018, 8:06 p.m. UTC | #19
On Mon, 2018-08-20 at 16:02 -0400, J. Bruce Fields wrote:
> On Mon, Aug 20, 2018 at 01:02:21PM +0200, Martin Wilck wrote:
> > On Wed, 2018-08-08 at 14:29 -0400, J. Bruce Fields wrote:
> > > On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote:
> > > > On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote:
> > > > > If you have a many-core machine, and have many threads all
> > > > > wanting to
> > > > > briefly lock a give file (udev is known to do this), you can
> > > > > get
> > > > > quite
> > > > > poor performance.
> > > > > 
> > > > > When one thread releases a lock, it wakes up all other
> > > > > threads
> > > > > that
> > > > > are waiting (classic thundering-herd) - one will get the lock
> > > > > and
> > > > > the
> > > > > others go to sleep.
> > > > > When you have few cores, this is not very noticeable: by the
> > > > > time
> > > > > the
> > > > > 4th or 5th thread gets enough CPU time to try to claim the
> > > > > lock,
> > > > > the
> > > > > earlier threads have claimed it, done what was needed, and
> > > > > released.
> > > > > With 50+ cores, the contention can easily be measured.
> > > > > 
> > > > > This patchset creates a tree of pending lock request in which
> > > > > siblings
> > > > > don't conflict and each lock request does conflict with its
> > > > > parent.
> > > > > When a lock is released, only requests which don't conflict
> > > > > with
> > > > > each
> > > > > other a woken.
> > > > > 
> > > > > Testing shows that lock-acquisitions-per-second is now fairly
> > > > > stable even
> > > > > as number of contending process goes to 1000.  Without this
> > > > > patch,
> > > > > locks-per-second drops off steeply after a few 10s of
> > > > > processes.
> > > > > 
> > > > > There is a small cost to this extra complexity.
> > > > > At 20 processes running a particular test on 72 cores, the
> > > > > lock
> > > > > acquisitions per second drops from 1.8 million to 1.4 million
> > > > > with
> > > > > this patch.  For 100 processes, this patch still provides 1.4
> > > > > million
> > > > > while without this patch there are about 700,000.
> > > > > 
> > > > > NeilBrown
> > > > > 
> > > > > ---
> > > > > 
> > > > > NeilBrown (4):
> > > > >       fs/locks: rename some lists and pointers.
> > > > >       fs/locks: allow a lock request to block other requests.
> > > > >       fs/locks: change all *_conflict() functions to return
> > > > > bool.
> > > > >       fs/locks: create a tree of dependent requests.
> > > > > 
> > > > > 
> > > > >  fs/cifs/file.c                  |    2 -
> > > > >  fs/locks.c                      |  142
> > > > > +++++++++++++++++++++++++--------------
> > > > >  include/linux/fs.h              |    5 +
> > > > >  include/trace/events/filelock.h |   16 ++--
> > > > >  4 files changed, 103 insertions(+), 62 deletions(-)
> > > > > 
> > > > 
> > > > Nice work! I looked over this and I think it looks good.
> > > > 
> > > > I made an attempt to fix this issue several years ago, but my
> > > > method
> > > > sucked as it ended up penalizing the unlocking task too much.
> > > > This
> > > > is
> > > > much cleaner and should scale well overall, I think.
> > > 
> > > I think I also took a crack at this at one point while I was at
> > > UM/CITI
> > > and never got anything I was happy with.  Looks like good work!
> > > 
> > > I remember one main obstacle that I felt like I never had a good
> > > benchmark....
> > > 
> > > How did you choose this workload and hardware?  Was it in fact
> > > udev
> > > (booting a large machine?), or was there some other motivation?
> > 
> > Some details can be found here:
> > 
> > https://github.com/systemd/systemd/pull/9551
> > 
> > https://github.com/systemd/systemd/pull/8667#issuecomment-385520335
> > and comments further down. 8667 has been superseded by 9551.
> > 
> > The original problem was that the symlink "/dev/disk/by-
> > partlabel/primary" may be claimed by _many_ devices on big systems
> > under certain distributions, which use older versions of parted for
> > partition creation on GPT disk labels. I've seen systems with
> > literally
> > thousands of contenders for this symlink. 
> > 
> > We found that with current systemd, this can cause a boot-time race
> > where the wrong device is eventually assigned the "best" contender
> > for
> > the symlink (e.g. a partition on multipath member rather than a
> > partition on the multipath map itself). I extended the udev test
> > suite,
> > creating a test that makes this race easily reproducible, at least
> > on
> > systems with many CPUs (the test host I used most had 72 cores).
> > 
> > I created an udev patch that would use systemd's built in fcntl-
> > based
> > locking to avoid this race, but I found that it would massively
> > slow
> > down the system, and found the contention to be in the spin locks
> > in
> > posix_lock_common(). (I therefore added more the systemd patches to
> > make the locking scale better, but that's irrelevant for the
> > kernel-
> > side discussion).
> > 
> > I further created an artificial test just for the scaling of
> > fcntl(F_OFD_SETLKW) and flock(), with which I could reproduce the
> > scaling problem easily, and do some quantitive experiments. My
> > tests
> > didn't use any byte ranges, only "full" locking of 0-byte files.
> 
> Thanks for the explanation!
> 
> I wonder whether there's also anything we could do to keep every
> waiter
> from having to take the same spinlock.

That was my line of thinking initially, but Neil's patches that just
avoid the thundering herd wakeup made both tests scale quite nicely, so
at least for my use case, no further optimizations are required.

Martin