Message ID: 153369219467.12605.13472423449508444601.stgit@noble
Series: locks: avoid thundering-herd wake-ups
On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote: > If you have a many-core machine, and have many threads all wanting to > briefly lock a give file (udev is known to do this), you can get quite > poor performance. > > When one thread releases a lock, it wakes up all other threads that > are waiting (classic thundering-herd) - one will get the lock and the > others go to sleep. > When you have few cores, this is not very noticeable: by the time the > 4th or 5th thread gets enough CPU time to try to claim the lock, the > earlier threads have claimed it, done what was needed, and released. > With 50+ cores, the contention can easily be measured. > > This patchset creates a tree of pending lock request in which siblings > don't conflict and each lock request does conflict with its parent. > When a lock is released, only requests which don't conflict with each > other a woken. > > Testing shows that lock-acquisitions-per-second is now fairly stable even > as number of contending process goes to 1000. Without this patch, > locks-per-second drops off steeply after a few 10s of processes. > > There is a small cost to this extra complexity. > At 20 processes running a particular test on 72 cores, the lock > acquisitions per second drops from 1.8 million to 1.4 million with > this patch. For 100 processes, this patch still provides 1.4 million > while without this patch there are about 700,000. > > NeilBrown > > --- > > NeilBrown (4): > fs/locks: rename some lists and pointers. > fs/locks: allow a lock request to block other requests. > fs/locks: change all *_conflict() functions to return bool. > fs/locks: create a tree of dependent requests. > > > fs/cifs/file.c | 2 - > fs/locks.c | 142 +++++++++++++++++++++++++-------------- > include/linux/fs.h | 5 + > include/trace/events/filelock.h | 16 ++-- > 4 files changed, 103 insertions(+), 62 deletions(-) > Nice work! I looked over this and I think it looks good. I made an attempt to fix this issue several years ago, but my method sucked as it ended up penalizing the unlocking task too much. This is much cleaner and should scale well overall, I think. I'll put this in -next soon and we can aim for merge in v4.20. Thanks,
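As a rough illustration of the structure the cover letter describes (siblings never conflict, each request conflicts with its parent, and an unlock wakes only the direct children), a toy user-space model might look like the following. All names here, such as struct req, conflicts(), queue_blocked() and wake(), are invented for illustration; this is not the code from the patch series.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a pending lock request; not the kernel's file_lock. */
struct req {
    long start, end;            /* locked byte range, inclusive   */
    bool exclusive;
    struct req *blocker;        /* the request/lock we wait on    */
    struct req *children;       /* requests waiting on us         */
    struct req *next_sibling;   /* next waiter on the same parent */
};

void wake(struct req *r);       /* placeholder: wake r's owner    */

static bool conflicts(const struct req *a, const struct req *b)
{
    if (a->end < b->start || b->end < a->start)
        return false;                 /* ranges do not overlap */
    return a->exclusive || b->exclusive;  /* overlap plus a writer */
}

/*
 * Queue a new blocked request: descend while it conflicts with an
 * existing waiter, so that siblings never conflict and every request
 * conflicts with its parent.  (If it conflicts with several waiters,
 * only the first is recorded; that is the "no double dependencies"
 * point discussed later in the thread.)
 */
static void queue_blocked(struct req *parent, struct req *new)
{
    struct req *c = parent->children;

    while (c) {
        if (conflicts(c, new)) {
            parent = c;               /* descend one level */
            c = parent->children;
        } else {
            c = c->next_sibling;
        }
    }
    new->blocker = parent;
    new->next_sibling = parent->children;
    parent->children = new;
}

/* On unlock, wake only the direct children; each keeps its own subtree. */
static void wake_children(struct req *released)
{
    struct req *c, *next;

    for (c = released->children; c; c = next) {
        next = c->next_sibling;
        c->next_sibling = NULL;
        c->blocker = NULL;
        wake(c);
    }
    released->children = NULL;
}
```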
On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote: > On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote: > > If you have a many-core machine, and have many threads all wanting to > > briefly lock a give file (udev is known to do this), you can get quite > > poor performance. > > > > When one thread releases a lock, it wakes up all other threads that > > are waiting (classic thundering-herd) - one will get the lock and the > > others go to sleep. > > When you have few cores, this is not very noticeable: by the time the > > 4th or 5th thread gets enough CPU time to try to claim the lock, the > > earlier threads have claimed it, done what was needed, and released. > > With 50+ cores, the contention can easily be measured. > > > > This patchset creates a tree of pending lock request in which siblings > > don't conflict and each lock request does conflict with its parent. > > When a lock is released, only requests which don't conflict with each > > other a woken. > > > > Testing shows that lock-acquisitions-per-second is now fairly stable even > > as number of contending process goes to 1000. Without this patch, > > locks-per-second drops off steeply after a few 10s of processes. > > > > There is a small cost to this extra complexity. > > At 20 processes running a particular test on 72 cores, the lock > > acquisitions per second drops from 1.8 million to 1.4 million with > > this patch. For 100 processes, this patch still provides 1.4 million > > while without this patch there are about 700,000. > > > > NeilBrown > > > > --- > > > > NeilBrown (4): > > fs/locks: rename some lists and pointers. > > fs/locks: allow a lock request to block other requests. > > fs/locks: change all *_conflict() functions to return bool. > > fs/locks: create a tree of dependent requests. > > > > > > fs/cifs/file.c | 2 - > > fs/locks.c | 142 +++++++++++++++++++++++++-------------- > > include/linux/fs.h | 5 + > > include/trace/events/filelock.h | 16 ++-- > > 4 files changed, 103 insertions(+), 62 deletions(-) > > > > Nice work! I looked over this and I think it looks good. > > I made an attempt to fix this issue several years ago, but my method > sucked as it ended up penalizing the unlocking task too much. This is > much cleaner and should scale well overall, I think. I think I also took a crack at this at one point while I was at UM/CITI and never got anything I was happy with. Looks like good work! I remember one main obstacle that I felt like I never had a good benchmark.... How did you choose this workload and hardware? Was it in fact udev (booting a large machine?), or was there some other motivation? Not that I'm likely to do it any time soon, but could you share sufficient details for someone else to reproduce your results? --b.
On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote: > If you have a many-core machine, and have many threads all wanting to > briefly lock a give file (udev is known to do this), you can get quite > poor performance. > > When one thread releases a lock, it wakes up all other threads that > are waiting (classic thundering-herd) - one will get the lock and the > others go to sleep. > When you have few cores, this is not very noticeable: by the time the > 4th or 5th thread gets enough CPU time to try to claim the lock, the > earlier threads have claimed it, done what was needed, and released. > With 50+ cores, the contention can easily be measured. > > This patchset creates a tree of pending lock request in which siblings > don't conflict and each lock request does conflict with its parent. > When a lock is released, only requests which don't conflict with each > other a woken. Are you sure you aren't depending on the (incorrect) assumption that "X blocks Y" is a transitive relation? OK I should be able to answer that question myself, my patience for code-reading is at a real low this afternoon.... --b. > > Testing shows that lock-acquisitions-per-second is now fairly stable even > as number of contending process goes to 1000. Without this patch, > locks-per-second drops off steeply after a few 10s of processes. > > There is a small cost to this extra complexity. > At 20 processes running a particular test on 72 cores, the lock > acquisitions per second drops from 1.8 million to 1.4 million with > this patch. For 100 processes, this patch still provides 1.4 million > while without this patch there are about 700,000. > > NeilBrown > > --- > > NeilBrown (4): > fs/locks: rename some lists and pointers. > fs/locks: allow a lock request to block other requests. > fs/locks: change all *_conflict() functions to return bool. > fs/locks: create a tree of dependent requests. > > > fs/cifs/file.c | 2 - > fs/locks.c | 142 +++++++++++++++++++++++++-------------- > include/linux/fs.h | 5 + > include/trace/events/filelock.h | 16 ++-- > 4 files changed, 103 insertions(+), 62 deletions(-) > > -- > Signature
On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote: > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote: > > If you have a many-core machine, and have many threads all wanting to > > briefly lock a give file (udev is known to do this), you can get quite > > poor performance. > > > > When one thread releases a lock, it wakes up all other threads that > > are waiting (classic thundering-herd) - one will get the lock and the > > others go to sleep. > > When you have few cores, this is not very noticeable: by the time the > > 4th or 5th thread gets enough CPU time to try to claim the lock, the > > earlier threads have claimed it, done what was needed, and released. > > With 50+ cores, the contention can easily be measured. > > > > This patchset creates a tree of pending lock request in which siblings > > don't conflict and each lock request does conflict with its parent. > > When a lock is released, only requests which don't conflict with each > > other a woken. > > Are you sure you aren't depending on the (incorrect) assumption that "X > blocks Y" is a transitive relation? > > OK I should be able to answer that question myself, my patience for > code-reading is at a real low this afternoon.... In other words, is there the possibility of a tree of, say, exclusive locks with (offset, length) like: (0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4) and when waking (0, 4) you could wake up (2, 2) but not (0, 2), leaving a process waiting without there being an actual conflict. --b.
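Bruce's point that blocking is not transitive can be checked directly with the ranges from his example: (0,2) conflicts with (1,2), and (1,2) conflicts with (2,2), yet (0,2) and (2,2) do not overlap at all. A small stand-alone check, with ranges written (offset, length) and all locks exclusive; nothing here is taken from the patch itself:

```c
#include <assert.h>
#include <stdbool.h>

struct range { long off, len; };

/* Exclusive locks conflict iff their byte ranges overlap. */
static bool conflict(struct range a, struct range b)
{
    return a.off < b.off + b.len && b.off < a.off + a.len;
}

int main(void)
{
    struct range a = {0, 2}, b = {1, 2}, c = {2, 2};

    assert(conflict(a, b));    /* (0,2) overlaps (1,2)           */
    assert(conflict(b, c));    /* (1,2) overlaps (2,2)           */
    assert(!conflict(a, c));   /* but (0,2) does not touch (2,2) */
    return 0;
}
```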
> On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote: > > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote: > > > If you have a many-core machine, and have many threads all wanting > > > to briefly lock a give file (udev is known to do this), you can get > > > quite poor performance. > > > > > > When one thread releases a lock, it wakes up all other threads that > > > are waiting (classic thundering-herd) - one will get the lock and > > > the others go to sleep. > > > When you have few cores, this is not very noticeable: by the time > > > the 4th or 5th thread gets enough CPU time to try to claim the lock, > > > the earlier threads have claimed it, done what was needed, and released. > > > With 50+ cores, the contention can easily be measured. > > > > > > This patchset creates a tree of pending lock request in which > > > siblings don't conflict and each lock request does conflict with its parent. > > > When a lock is released, only requests which don't conflict with > > > each other a woken. > > > > Are you sure you aren't depending on the (incorrect) assumption that > > "X blocks Y" is a transitive relation? > > > > OK I should be able to answer that question myself, my patience for > > code-reading is at a real low this afternoon.... > > In other words, is there the possibility of a tree of, say, exclusive locks with > (offset, length) like: > > (0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4) > > and when waking (0, 4) you could wake up (2, 2) but not (0, 2), leaving a process > waiting without there being an actual conflict. That implies that the order the locks were received in was: (0,4) (2,2) (1,2) (0,2) But couldn't (0,2) have been made only dependent on (0,4)? Of course then (1,2) is dependent on BOTH (2,2) and (0,2). Does this tree logic handle that case? On the other hand, there might be a fairness reason to make (0,2) wait for (1,2) even though it could have been granted concurrently with (2,2) since this dependency tree also preserves some of the order of lock requests. Frank
On Wed, Aug 08, 2018 at 04:09:12PM -0400, J. Bruce Fields wrote: > On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote: > > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote: > > > If you have a many-core machine, and have many threads all wanting to > > > briefly lock a give file (udev is known to do this), you can get quite > > > poor performance. > > > > > > When one thread releases a lock, it wakes up all other threads that > > > are waiting (classic thundering-herd) - one will get the lock and the > > > others go to sleep. > > > When you have few cores, this is not very noticeable: by the time the > > > 4th or 5th thread gets enough CPU time to try to claim the lock, the > > > earlier threads have claimed it, done what was needed, and released. > > > With 50+ cores, the contention can easily be measured. > > > > > > This patchset creates a tree of pending lock request in which siblings > > > don't conflict and each lock request does conflict with its parent. > > > When a lock is released, only requests which don't conflict with each > > > other a woken. > > > > Are you sure you aren't depending on the (incorrect) assumption that "X > > blocks Y" is a transitive relation? > > > > OK I should be able to answer that question myself, my patience for > > code-reading is at a real low this afternoon.... > > In other words, is there the possibility of a tree of, say, exclusive > locks with (offset, length) like: > > (0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4) > > and when waking (0, 4) you could wake up (2, 2) but not (0, 2), leaving > a process waiting without there being an actual conflict. After batting it back and forth with Jeff on IRC.... So do I understand right that when we wake a waiter, we leave its own tree of waiters intact, and when it wakes if it finds a conflict it just adds it lock (with tree of waiters) in to the tree of the conflicting lock? If so then yes I think that depends on the transitivity assumption--you're assuming that finding a conflict between the root of the tree and a lock proves that all the other members of the tree also conflict. So maybe this example works. (All locks are exclusive and written (offset, length), X->Y means X is waiting on Y.) process acquires (0,3) 2nd process requests (1,2), is put to sleep. 3rd process requests (0,2), is put to sleep. The tree of waiters now looks like (0,2)->(1,2)->(0,3) (0,3) is unlocked. A 4th process races in and locks (2,2). The 2nd process wakes up, sees this new conflict, and waits on (2,2). Now the tree looks like (0,2)->(1,2)->(2,2), and (0,2) is waiting for no reason. ? --b.
On Wed, Aug 08 2018, Frank Filz wrote: >> On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote: >> > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote: >> > > If you have a many-core machine, and have many threads all wanting >> > > to briefly lock a give file (udev is known to do this), you can get >> > > quite poor performance. >> > > >> > > When one thread releases a lock, it wakes up all other threads that >> > > are waiting (classic thundering-herd) - one will get the lock and >> > > the others go to sleep. >> > > When you have few cores, this is not very noticeable: by the time >> > > the 4th or 5th thread gets enough CPU time to try to claim the lock, >> > > the earlier threads have claimed it, done what was needed, and > released. >> > > With 50+ cores, the contention can easily be measured. >> > > >> > > This patchset creates a tree of pending lock request in which >> > > siblings don't conflict and each lock request does conflict with its > parent. >> > > When a lock is released, only requests which don't conflict with >> > > each other a woken. >> > >> > Are you sure you aren't depending on the (incorrect) assumption that >> > "X blocks Y" is a transitive relation? >> > >> > OK I should be able to answer that question myself, my patience for >> > code-reading is at a real low this afternoon.... >> >> In other words, is there the possibility of a tree of, say, exclusive > locks with >> (offset, length) like: >> >> (0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4) >> >> and when waking (0, 4) you could wake up (2, 2) but not (0, 2), leaving a > process >> waiting without there being an actual conflict. > > That implies that the order the locks were received in was: > > (0,4) > (2,2) > (1,2) > (0,2) > > But couldn't (0,2) have been made only dependent on (0,4)? Correct. (0,2) would be a child of (0,4), but a sibling of (2,2). > Of course then > (1,2) is dependent on BOTH (2,2) and (0,2). Does this tree logic handle that > case? No, there is no support for double dependencies. It is still possible for a lock request to be woken up which still cannot be satisfied. When one blocking lock is unlocked, a pending request might then be queued against a different blocking lock. > > On the other hand, there might be a fairness reason to make (0,2) wait for > (1,2) even though it could have been granted concurrently with (2,2) since > this dependency tree also preserves some of the order of lock requests. The locking API doesn't promise fairness, and I don't think we should try too hard to achieve it. Certainly we shouldn't actively fight it (so no LIFO queuing) but if we try harder than that I suspect it would just be needless complexity. Thanks, NeilBrown > > Frank
On Wed, Aug 08 2018, J. Bruce Fields wrote: > On Wed, Aug 08, 2018 at 04:09:12PM -0400, J. Bruce Fields wrote: >> On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote: >> > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote: >> > > If you have a many-core machine, and have many threads all wanting to >> > > briefly lock a give file (udev is known to do this), you can get quite >> > > poor performance. >> > > >> > > When one thread releases a lock, it wakes up all other threads that >> > > are waiting (classic thundering-herd) - one will get the lock and the >> > > others go to sleep. >> > > When you have few cores, this is not very noticeable: by the time the >> > > 4th or 5th thread gets enough CPU time to try to claim the lock, the >> > > earlier threads have claimed it, done what was needed, and released. >> > > With 50+ cores, the contention can easily be measured. >> > > >> > > This patchset creates a tree of pending lock request in which siblings >> > > don't conflict and each lock request does conflict with its parent. >> > > When a lock is released, only requests which don't conflict with each >> > > other a woken. >> > >> > Are you sure you aren't depending on the (incorrect) assumption that "X >> > blocks Y" is a transitive relation? >> > >> > OK I should be able to answer that question myself, my patience for >> > code-reading is at a real low this afternoon.... >> >> In other words, is there the possibility of a tree of, say, exclusive >> locks with (offset, length) like: >> >> (0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4) >> >> and when waking (0, 4) you could wake up (2, 2) but not (0, 2), leaving >> a process waiting without there being an actual conflict. > > After batting it back and forth with Jeff on IRC.... So do I understand > right that when we wake a waiter, we leave its own tree of waiters > intact, and when it wakes if it finds a conflict it just adds it lock > (with tree of waiters) in to the tree of the conflicting lock? > > If so then yes I think that depends on the transitivity > assumption--you're assuming that finding a conflict between the root of > the tree and a lock proves that all the other members of the tree also > conflict. Ahhh... I see what you are getting at. When lock requests are added individually, they will always be blocked by all ancestors in the tree. But when they are added as a group, that might not be the case. So we might need to re-add every request individually. In the (common) case where a lock request is blocked across its whole range, we can just attach the whole tree beneath the blocker. In other cases we need a finer grained approach. I'll have a look and see how to make the code work for this case. Thanks for the thorough review! NeilBrown > > So maybe this example works. (All locks are exclusive and written > (offset, length), X->Y means X is waiting on Y.) > > process acquires (0,3) > 2nd process requests (1,2), is put to sleep. > 3rd process requests (0,2), is put to sleep. > > The tree of waiters now looks like (0,2)->(1,2)->(0,3) > > (0,3) is unlocked. > A 4th process races in and locks (2,2). > The 2nd process wakes up, sees this new conflict, and waits on > (2,2). Now the tree looks like (0,2)->(1,2)->(2,2), and (0,2) > is waiting for no reason. > > ? > > --b.
On Wed, 2018-08-08 at 17:28 -0400, J. Bruce Fields wrote: > On Wed, Aug 08, 2018 at 04:09:12PM -0400, J. Bruce Fields wrote: > > On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote: > > > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote: > > > > If you have a many-core machine, and have many threads all wanting to > > > > briefly lock a give file (udev is known to do this), you can get quite > > > > poor performance. > > > > > > > > When one thread releases a lock, it wakes up all other threads that > > > > are waiting (classic thundering-herd) - one will get the lock and the > > > > others go to sleep. > > > > When you have few cores, this is not very noticeable: by the time the > > > > 4th or 5th thread gets enough CPU time to try to claim the lock, the > > > > earlier threads have claimed it, done what was needed, and released. > > > > With 50+ cores, the contention can easily be measured. > > > > > > > > This patchset creates a tree of pending lock request in which siblings > > > > don't conflict and each lock request does conflict with its parent. > > > > When a lock is released, only requests which don't conflict with each > > > > other a woken. > > > > > > Are you sure you aren't depending on the (incorrect) assumption that "X > > > blocks Y" is a transitive relation? > > > > > > OK I should be able to answer that question myself, my patience for > > > code-reading is at a real low this afternoon.... > > > > In other words, is there the possibility of a tree of, say, exclusive > > locks with (offset, length) like: > > > > (0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4) > > > > and when waking (0, 4) you could wake up (2, 2) but not (0, 2), leaving > > a process waiting without there being an actual conflict. > > After batting it back and forth with Jeff on IRC.... So do I understand > right that when we wake a waiter, we leave its own tree of waiters > intact, and when it wakes if it finds a conflict it just adds it lock > (with tree of waiters) in to the tree of the conflicting lock? > > If so then yes I think that depends on the transitivity > assumption--you're assuming that finding a conflict between the root of > the tree and a lock proves that all the other members of the tree also > conflict. > > So maybe this example works. (All locks are exclusive and written > (offset, length), X->Y means X is waiting on Y.) > > process acquires (0,3) > 2nd process requests (1,2), is put to sleep. > 3rd process requests (0,2), is put to sleep. > > The tree of waiters now looks like (0,2)->(1,2)->(0,3) > > (0,3) is unlocked. > A 4th process races in and locks (2,2). > The 2nd process wakes up, sees this new conflict, and waits on > (2,2). Now the tree looks like (0,2)->(1,2)->(2,2), and (0,2) > is waiting for no reason. > That seems like a legit problem. One possible fix might be to have the waiter on (1,2) walk down the entire subtree and wake up any waiter that is waiting on a lock that doesn't conflict with the lock on which it's waiting. So, before the task waiting on 1,2 goes back to sleep to wait on 2,2, it could walk down its entire fl_blocked subtree and wake up anything waiting on a lock that doesn't conflict with (2,2). That's potentially an expensive operation, but: a) the task is going back to sleep anyway, so letting it do a little extra work before that should be no big deal b) it's probably still cheaper than waking up the whole herd
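One way to read Jeff's suggestion, again in terms of the toy model sketched earlier rather than the real fs/locks.c structures: when a woken request finds a new conflicting lock and is about to block on it, it first walks its own subtree of waiters and wakes every waiter whose request does not conflict with that new lock, since no lock actually held prevents those from being granted. What happens to the woken waiters' own subtrees is glossed over here.

```c
/*
 * Sketch only: 'tree' is the request about to go back to sleep,
 * 'new_blocker' the lock it found to conflict with.  Uses the toy
 * struct req, conflicts() and wake() from the earlier sketch.
 */
static void wake_non_conflicting_waiters(struct req *tree,
                                         const struct req *new_blocker)
{
    struct req **link = &tree->children;

    while (*link) {
        struct req *child = *link;

        if (!conflicts(child, new_blocker)) {
            /* nothing actually held blocks this waiter: unlink
             * it and let it retry the lock itself */
            *link = child->next_sibling;
            child->next_sibling = NULL;
            child->blocker = NULL;
            wake(child);
        } else {
            /* still genuinely blocked, but its own waiters may
             * not be; keep walking down */
            wake_non_conflicting_waiters(child, new_blocker);
            link = &child->next_sibling;
        }
    }
}
```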
> On Wed, 2018-08-08 at 17:28 -0400, J. Bruce Fields wrote: > > On Wed, Aug 08, 2018 at 04:09:12PM -0400, J. Bruce Fields wrote: > > > On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote: > > > > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote: > > > > > If you have a many-core machine, and have many threads all > > > > > wanting to briefly lock a give file (udev is known to do this), > > > > > you can get quite poor performance. > > > > > > > > > > When one thread releases a lock, it wakes up all other threads > > > > > that are waiting (classic thundering-herd) - one will get the > > > > > lock and the others go to sleep. > > > > > When you have few cores, this is not very noticeable: by the > > > > > time the 4th or 5th thread gets enough CPU time to try to claim > > > > > the lock, the earlier threads have claimed it, done what was needed, and > released. > > > > > With 50+ cores, the contention can easily be measured. > > > > > > > > > > This patchset creates a tree of pending lock request in which > > > > > siblings don't conflict and each lock request does conflict with its parent. > > > > > When a lock is released, only requests which don't conflict with > > > > > each other a woken. > > > > > > > > Are you sure you aren't depending on the (incorrect) assumption > > > > that "X blocks Y" is a transitive relation? > > > > > > > > OK I should be able to answer that question myself, my patience > > > > for code-reading is at a real low this afternoon.... > > > > > > In other words, is there the possibility of a tree of, say, > > > exclusive locks with (offset, length) like: > > > > > > (0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4) > > > > > > and when waking (0, 4) you could wake up (2, 2) but not (0, 2), > > > leaving a process waiting without there being an actual conflict. > > > > After batting it back and forth with Jeff on IRC.... So do I > > understand right that when we wake a waiter, we leave its own tree of > > waiters intact, and when it wakes if it finds a conflict it just adds > > it lock (with tree of waiters) in to the tree of the conflicting lock? > > > > If so then yes I think that depends on the transitivity > > assumption--you're assuming that finding a conflict between the root > > of the tree and a lock proves that all the other members of the tree > > also conflict. > > > > So maybe this example works. (All locks are exclusive and written > > (offset, length), X->Y means X is waiting on Y.) > > > > process acquires (0,3) > > 2nd process requests (1,2), is put to sleep. > > 3rd process requests (0,2), is put to sleep. > > > > The tree of waiters now looks like (0,2)->(1,2)->(0,3) > > > > (0,3) is unlocked. > > A 4th process races in and locks (2,2). > > The 2nd process wakes up, sees this new conflict, and waits on > > (2,2). Now the tree looks like (0,2)->(1,2)->(2,2), and (0,2) > > is waiting for no reason. > > > > That seems like a legit problem. > > One possible fix might be to have the waiter on (1,2) walk down the entire > subtree and wake up any waiter that is waiting on a lock that doesn't conflict > with the lock on which it's waiting. > > So, before the task waiting on 1,2 goes back to sleep to wait on 2,2, it could > walk down its entire fl_blocked subtree and wake up anything waiting on a lock > that doesn't conflict with (2,2). 
> > That's potentially an expensive operation, but: > > a) the task is going back to sleep anyway, so letting it do a little extra work > before that should be no big deal > > b) it's probably still cheaper than waking up the whole herd Yea, I think so. Now here's another question... How does this new logic play with Open File Description Locks? Should still be ok since there's a thread waiting on each of those. Frank
On Wed, Aug 08 2018, J. Bruce Fields wrote: > On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote: >> On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote: >> > If you have a many-core machine, and have many threads all wanting to >> > briefly lock a give file (udev is known to do this), you can get quite >> > poor performance. >> > >> > When one thread releases a lock, it wakes up all other threads that >> > are waiting (classic thundering-herd) - one will get the lock and the >> > others go to sleep. >> > When you have few cores, this is not very noticeable: by the time the >> > 4th or 5th thread gets enough CPU time to try to claim the lock, the >> > earlier threads have claimed it, done what was needed, and released. >> > With 50+ cores, the contention can easily be measured. >> > >> > This patchset creates a tree of pending lock request in which siblings >> > don't conflict and each lock request does conflict with its parent. >> > When a lock is released, only requests which don't conflict with each >> > other a woken. >> > >> > Testing shows that lock-acquisitions-per-second is now fairly stable even >> > as number of contending process goes to 1000. Without this patch, >> > locks-per-second drops off steeply after a few 10s of processes. >> > >> > There is a small cost to this extra complexity. >> > At 20 processes running a particular test on 72 cores, the lock >> > acquisitions per second drops from 1.8 million to 1.4 million with >> > this patch. For 100 processes, this patch still provides 1.4 million >> > while without this patch there are about 700,000. >> > >> > NeilBrown >> > >> > --- >> > >> > NeilBrown (4): >> > fs/locks: rename some lists and pointers. >> > fs/locks: allow a lock request to block other requests. >> > fs/locks: change all *_conflict() functions to return bool. >> > fs/locks: create a tree of dependent requests. >> > >> > >> > fs/cifs/file.c | 2 - >> > fs/locks.c | 142 +++++++++++++++++++++++++-------------- >> > include/linux/fs.h | 5 + >> > include/trace/events/filelock.h | 16 ++-- >> > 4 files changed, 103 insertions(+), 62 deletions(-) >> > >> >> Nice work! I looked over this and I think it looks good. >> >> I made an attempt to fix this issue several years ago, but my method >> sucked as it ended up penalizing the unlocking task too much. This is >> much cleaner and should scale well overall, I think. > > I think I also took a crack at this at one point while I was at UM/CITI > and never got anything I was happy with. Looks like good work! > > I remember one main obstacle that I felt like I never had a good > benchmark.... > > How did you choose this workload and hardware? Was it in fact udev > (booting a large machine?), or was there some other motivation? I'm hoping Martin will chime in here - he identified the problem and did most of the testing... NeilBrown > > Not that I'm likely to do it any time soon, but could you share > sufficient details for someone else to reproduce your results? > > --b.
On Wed, Aug 08 2018, Frank Filz wrote: > > Now here's another question... How does this new logic play with Open > File Description Locks? Should still be ok since there's a thread > waiting on each of those. At this level Posix locks and OFD locks are almost identical - just a different owner. So this enhancement should work equally for posix, flock, ofd and even leases. Thanks, NeilBrown > > Frank
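For reference, the two flavours being compared differ mainly in who owns the lock: a classic POSIX record lock (F_SETLKW) is owned by the process, while an OFD lock (F_OFD_SETLKW, Linux-specific, requires l_pid set to 0) is owned by the open file description, as are flock() locks. A minimal user-space example taking one of each on non-overlapping ranges; the file path is an arbitrary choice:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/lock-demo", O_RDWR | O_CREAT, 0600);
    struct flock fl;

    if (fd < 0) {
        perror("open");
        return 1;
    }

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;          /* exclusive lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 10;                /* bytes 0-9      */

    /* Traditional, process-owned POSIX record lock. */
    if (fcntl(fd, F_SETLKW, &fl) == -1)
        perror("F_SETLKW");

    /* OFD lock, owned by the open file description; it has a
     * different owner, so it would conflict with the lock above if
     * the ranges overlapped, hence bytes 10-19 here. */
    fl.l_start = 10;
    fl.l_pid = 0;                 /* required for OFD locks */
    if (fcntl(fd, F_OFD_SETLKW, &fl) == -1)
        perror("F_OFD_SETLKW");

    close(fd);
    return 0;
}
```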
On Wed, Aug 08, 2018 at 06:50:06PM -0400, Jeff Layton wrote: > That seems like a legit problem. > > One possible fix might be to have the waiter on (1,2) walk down the > entire subtree and wake up any waiter that is waiting on a lock that > doesn't conflict with the lock on which it's waiting. > > So, before the task waiting on 1,2 goes back to sleep to wait on 2,2, it > could walk down its entire fl_blocked subtree and wake up anything > waiting on a lock that doesn't conflict with (2,2). > > That's potentially an expensive operation, but: > > a) the task is going back to sleep anyway, so letting it do a little > extra work before that should be no big deal I don't understand why cpu used by a process going to sleep is cheaper than cpu used in any other situation. > b) it's probably still cheaper than waking up the whole herd Yeah, I'd like to understand this. I feel like Neil's addressing two different performance costs: - the cost of waking up all the waiters - the cost of walking the list of waiters Are they equally important? If we only cared about the former, and only in simple cases, we could walk the entire list and skip waking up only the locks that conflict with the first one we wake. We wouldn't need the tree. --b.
On Thu, 2018-08-09 at 09:00 -0400, J. Bruce Fields wrote: > On Wed, Aug 08, 2018 at 06:50:06PM -0400, Jeff Layton wrote: > > That seems like a legit problem. > > > > One possible fix might be to have the waiter on (1,2) walk down the > > entire subtree and wake up any waiter that is waiting on a lock that > > doesn't conflict with the lock on which it's waiting. > > > > So, before the task waiting on 1,2 goes back to sleep to wait on 2,2, it > > could walk down its entire fl_blocked subtree and wake up anything > > waiting on a lock that doesn't conflict with (2,2). > > > > That's potentially an expensive operation, but: > > > > a) the task is going back to sleep anyway, so letting it do a little > > extra work before that should be no big deal > > I don't understand why cpu used by a process going to sleep is cheaper > than cpu used in any other situation. > It's not any cheaper in that sense. It's just that this task is not slated to be doing any useful work anyway as it's going to sleep, so we aren't delaying any "real work" by this task by having it do this before returning to userland. It's already scheduled and holds the appropriate lock. The alternative would be to do this in the context of a different task, but that means extra context switching and spinlocking, etc. > > b) it's probably still cheaper than waking up the whole herd > > Yeah, I'd like to understand this. > > I feel like Neil's addressing two different performance costs: > > - the cost of waking up all the waiters > - the cost of walking the list of waiters > > Are they equally important? > > If we only cared about the former, and only in simple cases, we could > walk the entire list and skip waking up only the locks that conflict > with the first one we wake. We wouldn't need the tree.
On Thu, Aug 09 2018, J. Bruce Fields wrote: > On Wed, Aug 08, 2018 at 06:50:06PM -0400, Jeff Layton wrote: >> That seems like a legit problem. >> >> One possible fix might be to have the waiter on (1,2) walk down the >> entire subtree and wake up any waiter that is waiting on a lock that >> doesn't conflict with the lock on which it's waiting. >> >> So, before the task waiting on 1,2 goes back to sleep to wait on 2,2, it >> could walk down its entire fl_blocked subtree and wake up anything >> waiting on a lock that doesn't conflict with (2,2). >> >> That's potentially an expensive operation, but: >> >> a) the task is going back to sleep anyway, so letting it do a little >> extra work before that should be no big deal > > I don't understand why cpu used by a process going to sleep is cheaper > than cpu used in any other situation. > >> b) it's probably still cheaper than waking up the whole herd > > Yeah, I'd like to understand this. > > I feel like Neil's addressing two different performance costs: > > - the cost of waking up all the waiters > - the cost of walking the list of waiters > > Are they equally important? Probably not. Reducing wakeups is likely the most important. > > If we only cared about the former, and only in simple cases, we could > walk the entire list and skip waking up only the locks that conflict > with the first one we wake. We wouldn't need the tree. Yes, we could do that. It would probably make the code simpler. It would mean doing the conflicts() tests when performing wake-up rather than prior to going to sleep. In general it would mean doing the tests more often, as the tree effectively records the results of the previous time conflicts() was run. You might get a quadratic effect though ... wouldn't you want to skip anything that conflicts with anything that has been woken? If the tree-management code turns out to be too complex, it would certainly be an option. I think the tree approach should result in less total CPU usage.. Thanks for the thought - taking this simple approach really hadn't occurred to me. :-( NeilBrown
On Fri, Aug 10, 2018 at 09:56:07AM +1000, NeilBrown wrote: > On Thu, Aug 09 2018, J. Bruce Fields wrote: > > If we only cared about the former, and only in simple cases, we could > > walk the entire list and skip waking up only the locks that conflict > > with the first one we wake. We wouldn't need the tree. > > Yes, we could do that. It would probably make the code simpler. > It would mean doing the conflicts() tests when performing wake-up rather > than prior to going to sleep. In general it would mean doing the tests > more often, as the tree effectively records the results of the previous > time conflicts() was run. > You might get a quadratic effect though ... wouldn't you want to > skip anything that conflicts with anything that has been woken? I was thinking it'd be simplest just to check for conflict with the first thing that you decide to wake. That might be all that's necessary for typical cases. If you wanted to keep a running list of the locks you've chosen to wake so far and check each one against that list, I guess you could. > If the tree-management code turns out to be too complex, it would > certainly be an option. I think the tree approach should result in less > total CPU usage.. Have we ever looked into the CPU usage of deadlock detection? Might experiment with turning that off. But I may be biased by my white-hot hatred of it. --b.
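The simpler scheme Bruce describes (no tree, just a flat list of waiters, waking everything that does not conflict with the first waiter chosen) might look roughly like this in the same toy terms as before. The refinement of checking against every already-woken waiter instead of just the first would avoid more spurious wake-ups, at the cost of the quadratic walk Neil mentions:

```c
/* Sketch of the flat-list alternative; toy types from the earlier
 * sketches, not kernel code.  'waiters' is a plain singly linked list
 * of pending requests on the file. */
static void wake_flat(struct req *waiters)
{
    struct req *first = waiters;
    struct req *w;

    if (!first)
        return;
    wake(first);
    for (w = first->next_sibling; w; w = w->next_sibling) {
        if (!conflicts(w, first))
            wake(w);
        /* waiters that conflict with 'first' stay asleep and are
         * reconsidered at the next unlock */
    }
}
```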
On Wed, 2018-08-08 at 14:29 -0400, J. Bruce Fields wrote: > On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote: > > On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote: > > > If you have a many-core machine, and have many threads all > > > wanting to > > > briefly lock a give file (udev is known to do this), you can get > > > quite > > > poor performance. > > > > > > When one thread releases a lock, it wakes up all other threads > > > that > > > are waiting (classic thundering-herd) - one will get the lock and > > > the > > > others go to sleep. > > > When you have few cores, this is not very noticeable: by the time > > > the > > > 4th or 5th thread gets enough CPU time to try to claim the lock, > > > the > > > earlier threads have claimed it, done what was needed, and > > > released. > > > With 50+ cores, the contention can easily be measured. > > > > > > This patchset creates a tree of pending lock request in which > > > siblings > > > don't conflict and each lock request does conflict with its > > > parent. > > > When a lock is released, only requests which don't conflict with > > > each > > > other a woken. > > > > > > Testing shows that lock-acquisitions-per-second is now fairly > > > stable even > > > as number of contending process goes to 1000. Without this > > > patch, > > > locks-per-second drops off steeply after a few 10s of processes. > > > > > > There is a small cost to this extra complexity. > > > At 20 processes running a particular test on 72 cores, the lock > > > acquisitions per second drops from 1.8 million to 1.4 million > > > with > > > this patch. For 100 processes, this patch still provides 1.4 > > > million > > > while without this patch there are about 700,000. > > > > > > NeilBrown > > > > > > --- > > > > > > NeilBrown (4): > > > fs/locks: rename some lists and pointers. > > > fs/locks: allow a lock request to block other requests. > > > fs/locks: change all *_conflict() functions to return bool. > > > fs/locks: create a tree of dependent requests. > > > > > > > > > fs/cifs/file.c | 2 - > > > fs/locks.c | 142 > > > +++++++++++++++++++++++++-------------- > > > include/linux/fs.h | 5 + > > > include/trace/events/filelock.h | 16 ++-- > > > 4 files changed, 103 insertions(+), 62 deletions(-) > > > > > > > Nice work! I looked over this and I think it looks good. > > > > I made an attempt to fix this issue several years ago, but my > > method > > sucked as it ended up penalizing the unlocking task too much. This > > is > > much cleaner and should scale well overall, I think. > > I think I also took a crack at this at one point while I was at > UM/CITI > and never got anything I was happy with. Looks like good work! > > I remember one main obstacle that I felt like I never had a good > benchmark.... > > How did you choose this workload and hardware? Was it in fact udev > (booting a large machine?), or was there some other motivation? Some details can be found here: https://github.com/systemd/systemd/pull/9551 https://github.com/systemd/systemd/pull/8667#issuecomment-385520335 and comments further down. 8667 has been superseded by 9551. The original problem was that the symlink "/dev/disk/by- partlabel/primary" may be claimed by _many_ devices on big systems under certain distributions, which use older versions of parted for partition creation on GPT disk labels. I've seen systems with literally thousands of contenders for this symlink. 
We found that with current systemd, this can cause a boot-time race where the wrong device is eventually assigned the "best" contender for the symlink (e.g. a partition on a multipath member rather than a partition on the multipath map itself). I extended the udev test suite, creating a test that makes this race easily reproducible, at least on systems with many CPUs (the test host I used most had 72 cores). I created a udev patch that would use systemd's built-in fcntl-based locking to avoid this race, but I found that it would massively slow down the system, and found the contention to be in the spin locks in posix_lock_common(). (I therefore added more systemd patches to make the locking scale better, but that's irrelevant for the kernel-side discussion). I further created an artificial test just for the scaling of fcntl(F_OFD_SETLKW) and flock(), with which I could reproduce the scaling problem easily, and do some quantitative experiments. My tests didn't use any byte ranges, only "full" locking of 0-byte files. > Not that I'm likely to do it any time soon, but could you share > sufficient details for someone else to reproduce your results? > > --b. The udev test code can be found in the above links. It adds a new script "test/sd-script.py" that would be run after "test/sys-script.py" using a numeric argument indicating the number of contenders for the test link to be created, such as "python test/sd-script.py test 1000". Next step would be running "test/udev-test.pl 152" e.g. under perf (152 is the test ID of the scaling test). Of course I can also share my other test program if you desire so. Regards, Martin
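Martin's actual test programs are not included in the mails, but a minimal reproducer along the lines he describes, with many processes repeatedly taking and dropping a whole-file lock on one empty file and reporting their acquisition counts, could look like this. The file name, duration and the choice of flock() rather than fcntl(F_OFD_SETLKW) are arbitrary here:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int nproc = argc > 1 ? atoi(argv[1]) : 10;

    for (int i = 0; i < nproc; i++) {
        if (fork() == 0) {
            /* Each child opens the file itself so it gets its own
             * open file description, i.e. its own lock owner;
             * otherwise there would be no contention at all. */
            int fd = open("/tmp/herd-test", O_RDWR | O_CREAT, 0600);
            long acquired = 0;
            time_t end = time(NULL) + 10;   /* run ~10 seconds */

            if (fd < 0)
                _exit(1);
            while (time(NULL) < end) {
                flock(fd, LOCK_EX);     /* briefly hold the lock */
                flock(fd, LOCK_UN);     /* and release it again  */
                acquired++;
            }
            printf("pid %d: %ld acquisitions\n", (int)getpid(), acquired);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;
    return 0;
}
```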
On Mon, Aug 20, 2018 at 01:02:21PM +0200, Martin Wilck wrote: > On Wed, 2018-08-08 at 14:29 -0400, J. Bruce Fields wrote: > > On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote: > > > On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote: > > > > If you have a many-core machine, and have many threads all > > > > wanting to > > > > briefly lock a give file (udev is known to do this), you can get > > > > quite > > > > poor performance. > > > > > > > > When one thread releases a lock, it wakes up all other threads > > > > that > > > > are waiting (classic thundering-herd) - one will get the lock and > > > > the > > > > others go to sleep. > > > > When you have few cores, this is not very noticeable: by the time > > > > the > > > > 4th or 5th thread gets enough CPU time to try to claim the lock, > > > > the > > > > earlier threads have claimed it, done what was needed, and > > > > released. > > > > With 50+ cores, the contention can easily be measured. > > > > > > > > This patchset creates a tree of pending lock request in which > > > > siblings > > > > don't conflict and each lock request does conflict with its > > > > parent. > > > > When a lock is released, only requests which don't conflict with > > > > each > > > > other a woken. > > > > > > > > Testing shows that lock-acquisitions-per-second is now fairly > > > > stable even > > > > as number of contending process goes to 1000. Without this > > > > patch, > > > > locks-per-second drops off steeply after a few 10s of processes. > > > > > > > > There is a small cost to this extra complexity. > > > > At 20 processes running a particular test on 72 cores, the lock > > > > acquisitions per second drops from 1.8 million to 1.4 million > > > > with > > > > this patch. For 100 processes, this patch still provides 1.4 > > > > million > > > > while without this patch there are about 700,000. > > > > > > > > NeilBrown > > > > > > > > --- > > > > > > > > NeilBrown (4): > > > > fs/locks: rename some lists and pointers. > > > > fs/locks: allow a lock request to block other requests. > > > > fs/locks: change all *_conflict() functions to return bool. > > > > fs/locks: create a tree of dependent requests. > > > > > > > > > > > > fs/cifs/file.c | 2 - > > > > fs/locks.c | 142 > > > > +++++++++++++++++++++++++-------------- > > > > include/linux/fs.h | 5 + > > > > include/trace/events/filelock.h | 16 ++-- > > > > 4 files changed, 103 insertions(+), 62 deletions(-) > > > > > > > > > > Nice work! I looked over this and I think it looks good. > > > > > > I made an attempt to fix this issue several years ago, but my > > > method > > > sucked as it ended up penalizing the unlocking task too much. This > > > is > > > much cleaner and should scale well overall, I think. > > > > I think I also took a crack at this at one point while I was at > > UM/CITI > > and never got anything I was happy with. Looks like good work! > > > > I remember one main obstacle that I felt like I never had a good > > benchmark.... > > > > How did you choose this workload and hardware? Was it in fact udev > > (booting a large machine?), or was there some other motivation? > > Some details can be found here: > > https://github.com/systemd/systemd/pull/9551 > > https://github.com/systemd/systemd/pull/8667#issuecomment-385520335 > and comments further down. 8667 has been superseded by 9551. 
> > The original problem was that the symlink "/dev/disk/by- > partlabel/primary" may be claimed by _many_ devices on big systems > under certain distributions, which use older versions of parted for > partition creation on GPT disk labels. I've seen systems with literally > thousands of contenders for this symlink. > > We found that with current systemd, this can cause a boot-time race > where the wrong device is eventually assigned the "best" contender for > the symlink (e.g. a partition on multipath member rather than a > partition on the multipath map itself). I extended the udev test suite, > creating a test that makes this race easily reproducible, at least on > systems with many CPUs (the test host I used most had 72 cores). > > I created an udev patch that would use systemd's built in fcntl-based > locking to avoid this race, but I found that it would massively slow > down the system, and found the contention to be in the spin locks in > posix_lock_common(). (I therefore added more the systemd patches to > make the locking scale better, but that's irrelevant for the kernel- > side discussion). > > I further created an artificial test just for the scaling of > fcntl(F_OFD_SETLKW) and flock(), with which I could reproduce the > scaling problem easily, and do some quantitive experiments. My tests > didn't use any byte ranges, only "full" locking of 0-byte files. Thanks for the explanation! I wonder whether there's also anything we could do to keep every waiter from having to take the same spinlock. --b. > > > Not that I'm likely to do it any time soon, but could you share > > sufficient details for someone else to reproduce your results? > > > > --b. > > The udev test code can be found in the above links. It adds a new > script "test/sd-script.py" that would be run after "test/sys-script.py" > using a numeric argument indicating the number of contenders for the > test link to be created, such as "python test/sd-script.py test 1000". > Next step would be running "test/udev-test.pl 152" e.g. under perf (152 > is the test ID of the scaling test). > > Of course I can also share my other test program if you desire so.
On Mon, 2018-08-20 at 16:02 -0400, J. Bruce Fields wrote: > On Mon, Aug 20, 2018 at 01:02:21PM +0200, Martin Wilck wrote: > > On Wed, 2018-08-08 at 14:29 -0400, J. Bruce Fields wrote: > > > On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote: > > > > On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote: > > > > > If you have a many-core machine, and have many threads all > > > > > wanting to > > > > > briefly lock a give file (udev is known to do this), you can > > > > > get > > > > > quite > > > > > poor performance. > > > > > > > > > > When one thread releases a lock, it wakes up all other > > > > > threads > > > > > that > > > > > are waiting (classic thundering-herd) - one will get the lock > > > > > and > > > > > the > > > > > others go to sleep. > > > > > When you have few cores, this is not very noticeable: by the > > > > > time > > > > > the > > > > > 4th or 5th thread gets enough CPU time to try to claim the > > > > > lock, > > > > > the > > > > > earlier threads have claimed it, done what was needed, and > > > > > released. > > > > > With 50+ cores, the contention can easily be measured. > > > > > > > > > > This patchset creates a tree of pending lock request in which > > > > > siblings > > > > > don't conflict and each lock request does conflict with its > > > > > parent. > > > > > When a lock is released, only requests which don't conflict > > > > > with > > > > > each > > > > > other a woken. > > > > > > > > > > Testing shows that lock-acquisitions-per-second is now fairly > > > > > stable even > > > > > as number of contending process goes to 1000. Without this > > > > > patch, > > > > > locks-per-second drops off steeply after a few 10s of > > > > > processes. > > > > > > > > > > There is a small cost to this extra complexity. > > > > > At 20 processes running a particular test on 72 cores, the > > > > > lock > > > > > acquisitions per second drops from 1.8 million to 1.4 million > > > > > with > > > > > this patch. For 100 processes, this patch still provides 1.4 > > > > > million > > > > > while without this patch there are about 700,000. > > > > > > > > > > NeilBrown > > > > > > > > > > --- > > > > > > > > > > NeilBrown (4): > > > > > fs/locks: rename some lists and pointers. > > > > > fs/locks: allow a lock request to block other requests. > > > > > fs/locks: change all *_conflict() functions to return > > > > > bool. > > > > > fs/locks: create a tree of dependent requests. > > > > > > > > > > > > > > > fs/cifs/file.c | 2 - > > > > > fs/locks.c | 142 > > > > > +++++++++++++++++++++++++-------------- > > > > > include/linux/fs.h | 5 + > > > > > include/trace/events/filelock.h | 16 ++-- > > > > > 4 files changed, 103 insertions(+), 62 deletions(-) > > > > > > > > > > > > > Nice work! I looked over this and I think it looks good. > > > > > > > > I made an attempt to fix this issue several years ago, but my > > > > method > > > > sucked as it ended up penalizing the unlocking task too much. > > > > This > > > > is > > > > much cleaner and should scale well overall, I think. > > > > > > I think I also took a crack at this at one point while I was at > > > UM/CITI > > > and never got anything I was happy with. Looks like good work! > > > > > > I remember one main obstacle that I felt like I never had a good > > > benchmark.... > > > > > > How did you choose this workload and hardware? Was it in fact > > > udev > > > (booting a large machine?), or was there some other motivation? 
> > > > Some details can be found here: > > > > https://github.com/systemd/systemd/pull/9551 > > > > https://github.com/systemd/systemd/pull/8667#issuecomment-385520335 > > and comments further down. 8667 has been superseded by 9551. > > > > The original problem was that the symlink "/dev/disk/by- > > partlabel/primary" may be claimed by _many_ devices on big systems > > under certain distributions, which use older versions of parted for > > partition creation on GPT disk labels. I've seen systems with > > literally > > thousands of contenders for this symlink. > > > > We found that with current systemd, this can cause a boot-time race > > where the wrong device is eventually assigned the "best" contender > > for > > the symlink (e.g. a partition on multipath member rather than a > > partition on the multipath map itself). I extended the udev test > > suite, > > creating a test that makes this race easily reproducible, at least > > on > > systems with many CPUs (the test host I used most had 72 cores). > > > > I created an udev patch that would use systemd's built in fcntl- > > based > > locking to avoid this race, but I found that it would massively > > slow > > down the system, and found the contention to be in the spin locks > > in > > posix_lock_common(). (I therefore added more the systemd patches to > > make the locking scale better, but that's irrelevant for the > > kernel- > > side discussion). > > > > I further created an artificial test just for the scaling of > > fcntl(F_OFD_SETLKW) and flock(), with which I could reproduce the > > scaling problem easily, and do some quantitive experiments. My > > tests > > didn't use any byte ranges, only "full" locking of 0-byte files. > > Thanks for the explanation! > > I wonder whether there's also anything we could do to keep every > waiter > from having to take the same spinlock. That was my line of thinking initially, but Neil's patches that just avoid the thundering herd wakeup made both tests scale quite nicely, so at least for my use case, no further optimizations are required. Martin