Message ID: 1541709268-3766-1-git-send-email-longman@redhat.com (mailing list archive)
Series: locking/lockdep: Add a new class of terminal locks
* Waiman Long <longman@redhat.com> wrote:

> The purpose of this patchset is to add a new class of locks called
> terminal locks and convert some of the low-level raw or regular
> spinlocks to terminal locks. A terminal lock does not have forward
> dependency and it won't allow a lock or unlock operation on another
> lock. Two-level nesting of terminal locks is allowed, though.
>
> Only spinlocks that are acquired with the _irq/_irqsave variants or
> acquired in an IRQ-disabled context should be classified as terminal
> locks.
>
> Because of the restrictions on terminal locks, we can do simple checks
> on them without using the lockdep lock validation machinery. The
> advantages of making these changes are as follows:
>
>  1) The lockdep check will be faster for terminal locks without using
>     the lock validation code.
>  2) It saves table entries used by the validation code and hence makes
>     it harder to overflow those tables.
>
> In fact, it is possible to overflow some of the tables by running
> a variety of different workloads on a debug kernel. I have seen bug
> reports about exhausting MAX_LOCKDEP_KEYS, MAX_LOCKDEP_ENTRIES and
> MAX_STACK_TRACE_ENTRIES. This patch will help to reduce the chance
> of overflowing some of the tables.
>
> Performance-wise, there was no statistically significant difference in
> performance when doing a parallel kernel build on a debug kernel.

Could you please measure a locking-intense workload instead, such as:

  $ perf stat --null --sync --repeat 10 perf bench sched messaging

and profile which locks used there could be marked terminal, and measure
the before/after performance impact?

> Below were selected output lines from the lockdep_stats files of the
> patched and unpatched kernels after bootup and running parallel kernel
> builds.
>
> Item                     Unpatched kernel  Patched kernel  % Change
> ----                     ----------------  --------------  --------
> direct dependencies                  9732            8994     -7.6%
> dependency chains                   18776           17033     -9.3%
> dependency chain hlocks             76044           68419    -10.0%
> stack-trace entries                110403          104341     -5.5%

That's pretty impressive!

> There were some reductions in the size of the lockdep tables. They were
> not significant, but it is still a good start to rein in the number of
> entries in those tables to make it harder to overflow them.

Agreed.

BTW., if you are interested in more radical approaches to optimize
lockdep, we could also add a static checker via objtool-driven call
graph analysis, and mark those locks terminal that we can prove are
terminal.

This would require the unified call graph of the kernel image and of all
modules to be examined in a final pass, but that's within the principal
scope of objtool. (This 'final pass' could also be done during bootup,
at least in initial versions.)

Note that beyond marking a lock 'terminal', such a static analysis pass
would also allow the detection of obvious locking bugs at the build (or
boot) stage already - plus it would allow the disabling of lockdep for
self-contained locks that don't interact with anything else.

I.e. the static analysis pass would 'augment' lockdep and leave only
those locks active for runtime lockdep tracking whose dependencies it
cannot prove to be correct yet.

Thanks,

	Ingo
On 11/09/2018 03:04 AM, Ingo Molnar wrote:
> * Waiman Long <longman@redhat.com> wrote:
>
>> The purpose of this patchset is to add a new class of locks called
>> terminal locks [...]
>>
>> Performance-wise, there was no statistically significant difference in
>> performance when doing a parallel kernel build on a debug kernel.
>
> Could you please measure a locking-intense workload instead, such as:
>
>   $ perf stat --null --sync --repeat 10 perf bench sched messaging
>
> and profile which locks used there could be marked terminal, and
> measure the before/after performance impact?

I will run the test. It will probably be done after the LPC next week.

> BTW., if you are interested in more radical approaches to optimize
> lockdep, we could also add a static checker via objtool-driven call
> graph analysis, and mark those locks terminal that we can prove are
> terminal.
>
> [...]
>
> I.e. the static analysis pass would 'augment' lockdep and leave only
> those locks active for runtime lockdep tracking whose dependencies it
> cannot prove to be correct yet.

It is a pretty interesting idea to use objtool to scan for locks. The
list of locks that I marked as terminal in this patch was found by
looking at /proc/lockdep for those that only have backward dependencies,
but no forward dependency. I focused on those with a large number of BDs
and checked the code to see if they could be marked as terminal. This is
a rather labor-intensive process and is subject to error. It would be
nice if it could be done by an automated tool.

So I am going to look into that, but it won't be part of this initial
patchset. I sent this patchset out to see if anyone has any objection to
it. It seems you don't have any objection, so I am going to move ahead
to do more testing and performance analysis.

Thanks,
Longman
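The manual scan Waiman describes - picking lock classes out of /proc/lockdep that have backward dependencies (BD) but no forward dependency (FD) - is easy to automate. Below is a minimal sketch; note that the exact /proc/lockdep line layout varies by kernel version and config, so the regular expression here is an assumption, not a guaranteed format:

```python
import re

# Matches lines of the rough form seen in /proc/lockdep, e.g.:
#   "ffffffff9a2b3c40 OPS: 1234 FD:    0 BD:   57 ..-.: &rq->lock"
# (the exact layout is an assumption and varies across kernel versions).
ENTRY = re.compile(r'FD:\s*(\d+)\s+BD:\s*(\d+)\s+\S+:\s+(.+)')

def terminal_candidates(lines):
    """Return [(backward_deps, class_name)] for lock classes that have
    backward dependencies but no forward dependency, largest BD first -
    the same candidates Waiman picked out by hand."""
    found = []
    for line in lines:
        m = ENTRY.search(line)
        if not m:
            continue
        fd, bd, name = int(m.group(1)), int(m.group(2)), m.group(3).strip()
        if fd == 0 and bd > 0:
            found.append((bd, name))
    return sorted(found, reverse=True)

# Usage on a lockdep-enabled kernel:
#   with open('/proc/lockdep') as f:
#       for bd, name in terminal_candidates(f):
#           print(f'BD {bd:6d}  {name}')
```

The candidates still need the manual code inspection step - a class with FD == 0 in one run may simply not have exercised its forward dependencies yet.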
On Fri, Nov 09, 2018 at 09:04:12AM +0100, Ingo Molnar wrote:
> BTW., if you are interested in more radical approaches to optimize
> lockdep, we could also add a static checker via objtool-driven call
> graph analysis, and mark those locks terminal that we can prove are
> terminal.
>
> This would require the unified call graph of the kernel image and of
> all modules to be examined in a final pass, but that's within the
> principal scope of objtool. (This 'final pass' could also be done
> during bootup, at least in initial versions.)

Something like this is needed for objtool LTO support as well. I just
dread the build time 'regressions' this will introduce :/

The final link pass is already by far the most expensive part (as
measured in wall-time) of building a kernel; adding more work there
would really suck :/
On 11/10/2018 09:10 AM, Peter Zijlstra wrote:
> On Fri, Nov 09, 2018 at 09:04:12AM +0100, Ingo Molnar wrote:
>> BTW., if you are interested in more radical approaches to optimize
>> lockdep, we could also add a static checker via objtool-driven call
>> graph analysis, and mark those locks terminal that we can prove are
>> terminal.
>> [...]
>
> Something like this is needed for objtool LTO support as well. I just
> dread the build time 'regressions' this will introduce :/
>
> The final link pass is already by far the most expensive part (as
> measured in wall-time) of building a kernel, adding more work there
> would really suck :/

I think the idea is to make objtool have the capability to do that. It
doesn't mean we need to turn it on by default in every build.

Cheers,
Longman
* Waiman Long <longman@redhat.com> wrote:

> On 11/10/2018 09:10 AM, Peter Zijlstra wrote:
> > [...]
> > The final link pass is already by far the most expensive part (as
> > measured in wall-time) of building a kernel, adding more work there
> > would really suck :/
>
> I think the idea is to make objtool have the capability to do that. It
> doesn't mean we need to turn it on by default in every build.

Yeah.

Also note that much of the objtool legwork would be on a per-file basis,
which is reasonably parallelized already. On x86 it's also already done
for every ORC build, i.e. every distro build, and the incremental
overhead from also extracting locking dependencies should be reasonably
small.

The final search of the global graph would be serialized but still
reasonably fast, as these are all 'class'-level dependencies which are
much less numerous than runtime dependencies.

I.e. I think we are talking about tens of thousands of dependencies, not
tens of millions.

At least in theory. ;-)

Thanks,

	Ingo
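The class-level "final search" Ingo describes boils down to a small graph pass: given a set of (holder, acquiree) class pairs, a class with no outgoing edge has no forward dependency and is a terminal candidate. A minimal sketch, with made-up class names for illustration:

```python
from collections import defaultdict

def find_terminal_classes(edges):
    """edges: iterable of (holder, acquiree) lock-class pairs, meaning
    an instance of the holder class was held while an instance of the
    acquiree class was taken.  A class that never appears as a holder
    has no forward dependency, i.e. it is a terminal candidate."""
    outgoing = defaultdict(set)
    classes = set()
    for holder, acquiree in edges:
        outgoing[holder].add(acquiree)
        classes.update((holder, acquiree))
    return {c for c in classes if not outgoing[c]}

# Hypothetical class-level edges:
edges = [
    ("inode->i_lock",  "dentry->d_lock"),
    ("dentry->d_lock", "wq->lock"),
    ("rq->lock",       "wq->lock"),
]
print(sorted(find_terminal_classes(edges)))  # only wq->lock is a sink
```

At tens of thousands of class-level edges this pass is linear in the edge count, which is why the serialized final search should stay cheap.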
* Waiman Long <longman@redhat.com> wrote:

> > Could you please measure a locking-intense workload instead, such as:
> >
> >   $ perf stat --null --sync --repeat 10 perf bench sched messaging
> >
> > and profile which locks used there could be marked terminal, and
> > measure the before/after performance impact?
>
> I will run the test. It will probably be done after the LPC next week.

Thanks!

> > BTW., if you are interested in more radical approaches to optimize
> > lockdep, we could also add a static checker via objtool-driven call
> > graph analysis, and mark those locks terminal that we can prove are
> > terminal.
> > [...]
>
> It is a pretty interesting idea to use objtool to scan for locks. The
> list of locks that I marked as terminal in this patch was found by
> looking at /proc/lockdep for those that only have backward
> dependencies, but no forward dependency. I focused on those with a
> large number of BDs and checked the code to see if they could be
> marked as terminal. This is a rather labor-intensive process and is
> subject to error.

Yeah.

> [...] It would be nice if it could be done by an automated tool. So I
> am going to look into that, but it won't be part of this initial
> patchset.

Of course!

> I sent this patchset out to see if anyone has any objection to it. It
> seems you don't have any objection to that. So I am going to move
> ahead to do more testing and performance analysis.

The one worry I have is that this interim solution removes the benefit
of a proper static analysis method. But if you promise to make a serious
effort on the static analysis tooling as well (which should have awesome
performance results and automate the manual markup), then I have no
fundamental objections to the interim approach either.

If static analysis works as well as I expect it to, then in principle we
might even be able to have lockdep enabled in production kernels: it
would only add overhead to locks that are overly complex - which would
create incentives to improve those dependencies.

Thanks,

	Ingo
On Mon, Nov 12, 2018 at 06:10:33AM +0100, Ingo Molnar wrote:
> [...]
> Also note that much of the objtool legwork would be on a per-file
> basis which is reasonably parallelized already. On x86 it's also
> already done for every ORC build i.e. every distro build and the
> incremental overhead from also extracting locking dependencies should
> be reasonably small.
>
> The final search of the global graph would be serialized but still
> reasonably fast as these are all 'class' level dependencies which are
> much less numerous than runtime dependencies.
>
> I.e. I think we are talking about tens of thousands of dependencies,
> not tens of millions.
>
> At least in theory. ;-)

Generating a unified call graph sounds very expensive (and very far
beyond what objtool can do today).

Also, what about function pointers?

BTW, there's another kernel static analysis tool which attempts to
create such a call graph already: smatch.
* Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> [...]
> Generating a unified call graph sounds very expensive (and very far
> beyond what objtool can do today).

Well, objtool already goes through the instruction stream and recognizes
function calls - so it can in effect generate a stream of "function x
called by function y" data, correct?

> Also, what about function pointers?

So maybe it's possible to enumerate all potential values for function
pointers with a reasonably simple compiler plugin and work from there?

One complication would be function pointers encoded as opaque data
types...

> BTW there's another kernel static analysis tool which attempts to
> create such a call graph already: smatch.

It's not included in the kernel tree though, and I'd expect tight
coupling (or at least lock-step improvements) between tooling and
lockdep here.

Thanks,

	Ingo
On Mon, Nov 12, 2018 at 07:30:50AM +0100, Ingo Molnar wrote:
> * Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>
> > [...]
> > Generating a unified call graph sounds very expensive (and very far
> > beyond what objtool can do today).
>
> Well, objtool already goes through the instruction stream and
> recognizes function calls - so it can in effect generate a stream of
> "function x called by function y" data, correct?

Yeah, though it would be quite simple to get the same data with a simple
awk script at link time.

> > Also, what about function pointers?
>
> So maybe it's possible to enumerate all potential values for function
> pointers with a reasonably simple compiler plugin and work from there?

I think this would be somewhere between very difficult and impossible to
do properly. I can't even imagine how this would be implemented in a
compiler plugin. But I'd love to be proven wrong on that.

> One complication would be function pointers encoded as opaque data
> types...
>
> > BTW there's another kernel static analysis tool which attempts to
> > create such a call graph already: smatch.
>
> It's not included in the kernel tree though and I'd expect tight
> coupling (or at least lock-step improvements) between tooling and
> lockdep here.

Fair enough. Smatch's call tree isn't perfect anyway, but I don't think
perfect is attainable.
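The link-time edge extraction Josh mentions is indeed simple for direct calls. A stand-in sketch (Python here rather than awk, same idea) that scans `objdump -d`-style disassembly text for "x calls y" pairs; the line formats it matches are assumptions about objdump's output, and indirect calls are silently skipped, which is exactly the gap the function-pointer discussion is about:

```python
import re

# Function header lines, e.g. "ffffffff81000000 <start_kernel>:"
FUNC = re.compile(r'^[0-9a-f]+ <([^>]+)>:$')
# Direct call lines, e.g. "... callq  ffffffff81001000 <setup_arch>";
# "callq *%rax" style indirect calls deliberately do not match.
CALL = re.compile(r'\bcall[ql]?\s+[0-9a-f]+\s+<([^>+]+)')

def call_edges(disasm_lines):
    """Return the set of (caller, callee) pairs for direct calls found
    in objdump-style disassembly text."""
    edges = set()
    caller = None
    for line in disasm_lines:
        m = FUNC.match(line.strip())
        if m:
            caller = m.group(1)
            continue
        m = CALL.search(line)
        if m and caller:
            edges.add((caller, m.group(1)))
    return edges

# Usage:
#   objdump -d vmlinux | python3 this_script.py
```

Anything built this way inherits objdump's blind spots: no indirect calls, no calls patched in at runtime (alternatives, static calls), which is why the result can only ever be a conservative starting point.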
On 11/12/2018 02:22 PM, Josh Poimboeuf wrote:
> On Mon, Nov 12, 2018 at 07:30:50AM +0100, Ingo Molnar wrote:
>> [...]
>>> Also, what about function pointers?
>>
>> So maybe it's possible to enumerate all potential values for function
>> pointers with a reasonably simple compiler plugin and work from
>> there?
>
> I think this would be somewhere between very difficult and impossible
> to do properly. I can't even imagine how this would be implemented in
> a compiler plugin. But I'd love to be proven wrong on that.

I would say we have to assume the worst when a function pointer is
called while holding a lock, unless we are able to find out all its
possible targets.

Cheers,
Longman
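Waiman's "assume the worst" rule drops straight out of the checker's decision logic. A sketch of how a static pass might apply it; the per-function summary format and all names here are made up for illustration:

```python
def provably_terminal(holders, summaries):
    """True only if every function that holds the lock has a fully
    known callee set (no indirect calls) and no callee acquires
    another lock.  Any unknown function or unresolved function pointer
    forces the worst-case answer: not provable."""
    for func in holders:
        s = summaries.get(func)
        if s is None or s['indirect']:
            return False                 # unknown targets: assume the worst
        for callee in s['calls']:
            # Unknown callees default to "acquires a lock".
            if summaries.get(callee, {}).get('acquires', True):
                return False
    return True

summaries = {
    'leaf_handler': {'calls': set(),            'indirect': False, 'acquires': False},
    'plain_caller': {'calls': {'leaf_handler'}, 'indirect': False, 'acquires': False},
    'fp_dispatch':  {'calls': set(),            'indirect': True,  'acquires': False},
}
print(provably_terminal({'plain_caller'}, summaries))  # True
print(provably_terminal({'fp_dispatch'}, summaries))   # False
```

The conservative defaults mean false negatives (provable-but-unproven terminal locks) are the cost of never producing a false positive, which matches the thread's framing: locks the tool cannot prove simply stay under full runtime lockdep tracking.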