mbox series

[0/2] locking: Add new lock contention tracepoints (v3)

Message ID 20220316224548.500123-1-namhyung@kernel.org (mailing list archive)
Headers show
Series locking: Add new lock contention tracepoints (v3) | expand

Message

Namhyung Kim March 16, 2022, 10:45 p.m. UTC
Hello,

There have been some requests for low-overhead kernel lock contention
monitoring.  The kernel has CONFIG_LOCK_STAT to provide such an infra
either via /proc/lock_stat or tracepoints directly.

However it's not light-weight and hard to be used in production.  So
I'm trying to add new tracepoints for lock contention and using them
as a base to build a new monitoring system.

* Changes in v3
 - move the tracepoints deeper in the slow path
 - remove the caller ip
 - don't use task state in the flags

* Changes in v2
 - do not use lockdep infrastructure
 - add flags argument to lock:contention_begin tracepoint

I added a flags argument in the contention_begin to classify locks in
question.  It can tell whether it's a spinlock, reader-writer lock or
a mutex.  With stacktrace, users can identify which lock is contended.

The patch 01 added the tracepoints and move the definition to the
mutex.c file so that it can see the tracepoints without lockdep.

The patch 02 actually installs the tracepoints in the locking code.
To minimize the overhead, they were added in the slow path of the code
separately.  As spinlocks are defined in the arch headers, I couldn't
handle them all.  I've just added it to generic queued spinlock and
rwlocks only.  Each arch can add the tracepoints later.

This series base on the current tip/locking/core and you get it from
'locking/tracepoint-v3' branch in my tree at:

  git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git


Thanks,
Namhyung


Namhyung Kim (2):
  locking: Add lock contention tracepoints
  locking: Apply contention tracepoints in the slow path

 include/trace/events/lock.h   | 54 +++++++++++++++++++++++++++++++++--
 kernel/locking/lockdep.c      |  1 -
 kernel/locking/mutex.c        |  6 ++++
 kernel/locking/percpu-rwsem.c |  3 ++
 kernel/locking/qrwlock.c      |  9 ++++++
 kernel/locking/qspinlock.c    |  5 ++++
 kernel/locking/rtmutex.c      | 11 +++++++
 kernel/locking/rwbase_rt.c    |  3 ++
 kernel/locking/rwsem.c        |  9 ++++++
 kernel/locking/semaphore.c    | 14 ++++++++-
 10 files changed, 110 insertions(+), 5 deletions(-)


base-commit: cd27ccfc727e99352321c0c75012ab9c5a90321e

Comments

Hyeonggon Yoo March 17, 2022, 5:32 p.m. UTC | #1
On Wed, Mar 16, 2022 at 03:45:46PM -0700, Namhyung Kim wrote:
> Hello,
> 
> There have been some requests for low-overhead kernel lock contention
> monitoring.  The kernel has CONFIG_LOCK_STAT to provide such an infra
> either via /proc/lock_stat or tracepoints directly.
> 
> However it's not light-weight and hard to be used in production.  So
> I'm trying to add new tracepoints for lock contention and using them
> as a base to build a new monitoring system.

Hello Namhyung,
I like this series so much.
Lock contentions became much more visible without serious overhead.

For the series:
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

How would you use these tracepoints, is there a script you use?
For testing, I just wrote a simple bpftrace script:

$ sudo bpftrace -e 'BEGIN
{
	printf("Collecting lockstats... Hit Ctrl-C to end.\n");
}

tracepoint:lock:contention_begin
{
	@start_us[tid] = nsecs / 1000;
}

tracepoint:lock:contention_end
{
	if (args->ret == 0) {
		@stats[kstack] = hist(nsecs / 1000 - @start_us[tid]);
	}
}

END
{
	clear(@start_us);
}'

And it shows its distribution of slowpath wait time like below. Great.

@stats[
    __traceiter_contention_end+76
    __traceiter_contention_end+76
    queued_spin_lock_slowpath+556
    _raw_spin_lock+108
    rmqueue_bulk+80
    rmqueue+1060
    get_page_from_freelist+372
    __alloc_pages+208
    alloc_pages_vma+160
    alloc_zeroed_user_highpage_movable+72
    do_anonymous_page+92
    handle_pte_fault+320
    __handle_mm_fault+252
    handle_mm_fault+244
    do_page_fault+340
    do_translation_fault+100
    do_mem_abort+76
    el0_da+60
    el0t_64_sync_handler+232
    el0t_64_sync+420
]:
[2, 4)                 1 |@                                                   |
[4, 8)                30 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
[8, 16)               25 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
[16, 32)              33 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32, 64)              29 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[64, 128)             13 |@@@@@@@@@@@@@@@@@@@@                                |
[128, 256)             2 |@@@                                                 |


@stats[
    __traceiter_contention_end+76
    __traceiter_contention_end+76
    rwsem_down_write_slowpath+1216
    down_write_killable+100
    do_mprotect_pkey.constprop.0+176
    __arm64_sys_mprotect+40
    invoke_syscall+80
    el0_svc_common.constprop.0+76
    do_el0_svc+52
    el0_svc+48
    el0t_64_sync_handler+164
    el0t_64_sync+420
]: 
[1]                   13 |@@@@@@@@@@@                                         |
[2, 4)                42 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                |
[4, 8)                 5 |@@@@                                                |
[8, 16)               10 |@@@@@@@@                                            |
[16, 32)              60 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32, 64)              41 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 |
[64, 128)             40 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[128, 256)            14 |@@@@@@@@@@@@                                        |
[256, 512)             7 |@@@@@@                                              |
[512, 1K)              6 |@@@@@                                               |
[1K, 2K)               2 |@                                                   |
[2K, 4K)               1 |                                                    |

@stats[
    __traceiter_contention_end+76
    __traceiter_contention_end+76
    queued_spin_lock_slowpath+556
    _raw_spin_lock+108
    futex_wake+168
    do_futex+200
    __arm64_sys_futex+112
    invoke_syscall+80
    el0_svc_common.constprop.0+76
    do_el0_svc+52
    el0_svc+48
    el0t_64_sync_handler+164
    el0t_64_sync+420
]:
[0]                    3 |                                                    |
[1]                 2515 |@                                                   |
[2, 4)              8747 |@@@@@                                               |
[4, 8)             17052 |@@@@@@@@@@                                          |
[8, 16)            46706 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[16, 32)           82105 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32, 64)           46918 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[64, 128)             99 |                                                    |
[128, 256)             1 |                                                    |
[256, 512)             8 |                                                    |
[512, 1K)              0 |                                                    |
[1K, 2K)               0 |                                                    |
[2K, 4K)               0 |                                                    |
[4K, 8K)               0 |                                                    |
[8K, 16K)              0 |                                                    |
[16K, 32K)             5 |                                                    |
[32K, 64K)            12 |                                                    |
[64K, 128K)           34 |                                                    |
[128K, 256K)          68 |                                                    |
[256K, 512K)           7 |                                                    |


> * Changes in v3
>  - move the tracepoints deeper in the slow path
>  - remove the caller ip
>  - don't use task state in the flags
> 
> * Changes in v2
>  - do not use lockdep infrastructure
>  - add flags argument to lock:contention_begin tracepoint
> 
> I added a flags argument in the contention_begin to classify locks in
> question.  It can tell whether it's a spinlock, reader-writer lock or
> a mutex.  With stacktrace, users can identify which lock is contended.
>
> The patch 01 added the tracepoints and move the definition to the
> mutex.c file so that it can see the tracepoints without lockdep.
> 
> The patch 02 actually installs the tracepoints in the locking code.
> To minimize the overhead, they were added in the slow path of the code
> separately.  As spinlocks are defined in the arch headers, I couldn't
> handle them all.  I've just added it to generic queued spinlock and
> rwlocks only.  Each arch can add the tracepoints later.
>
> This series base on the current tip/locking/core and you get it from
> 'locking/tracepoint-v3' branch in my tree at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
> 
> 
> Thanks,
> Namhyung
> 
> 
> Namhyung Kim (2):
>   locking: Add lock contention tracepoints
>   locking: Apply contention tracepoints in the slow path
> 
>  include/trace/events/lock.h   | 54 +++++++++++++++++++++++++++++++++--
>  kernel/locking/lockdep.c      |  1 -
>  kernel/locking/mutex.c        |  6 ++++
>  kernel/locking/percpu-rwsem.c |  3 ++
>  kernel/locking/qrwlock.c      |  9 ++++++
>  kernel/locking/qspinlock.c    |  5 ++++
>  kernel/locking/rtmutex.c      | 11 +++++++
>  kernel/locking/rwbase_rt.c    |  3 ++
>  kernel/locking/rwsem.c        |  9 ++++++
>  kernel/locking/semaphore.c    | 14 ++++++++-
>  10 files changed, 110 insertions(+), 5 deletions(-)
> 
> 
> base-commit: cd27ccfc727e99352321c0c75012ab9c5a90321e
> -- 
> 2.35.1.894.gb6a874cedc-goog
>
Namhyung Kim March 18, 2022, 9:12 p.m. UTC | #2
Hello Hyeonggon,

On Thu, Mar 17, 2022 at 10:32 AM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote:
>
> On Wed, Mar 16, 2022 at 03:45:46PM -0700, Namhyung Kim wrote:
> > Hello,
> >
> > There have been some requests for low-overhead kernel lock contention
> > monitoring.  The kernel has CONFIG_LOCK_STAT to provide such an infra
> > either via /proc/lock_stat or tracepoints directly.
> >
> > However it's not light-weight and hard to be used in production.  So
> > I'm trying to add new tracepoints for lock contention and using them
> > as a base to build a new monitoring system.
>
> Hello Namhyung,
> I like this series so much.
> Lock contentions became much more visible without serious overhead.
>
> For the series:
> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

Thanks!

>
> How would you use these tracepoints, is there a script you use?

Not yet.  But I'm thinking something similar to your script.
Probably I'll extend 'perf lock' command to have this kind of output
but it doesn't have lock name and use avg/min/max time instead of
histogram.

Thanks,
Namhyung


> For testing, I just wrote a simple bpftrace script:
>
> $ sudo bpftrace -e 'BEGIN
> {
>         printf("Collecting lockstats... Hit Ctrl-C to end.\n");
> }
>
> tracepoint:lock:contention_begin
> {
>         @start_us[tid] = nsecs / 1000;
> }
>
> tracepoint:lock:contention_end
> {
>         if (args->ret == 0) {
>                 @stats[kstack] = hist(nsecs / 1000 - @start_us[tid]);
>         }
> }
>
> END
> {
>         clear(@start_us);
> }'
>
> And it shows its distribution of slowpath wait time like below. Great.
>
> @stats[
>     __traceiter_contention_end+76
>     __traceiter_contention_end+76
>     queued_spin_lock_slowpath+556
>     _raw_spin_lock+108
>     rmqueue_bulk+80
>     rmqueue+1060
>     get_page_from_freelist+372
>     __alloc_pages+208
>     alloc_pages_vma+160
>     alloc_zeroed_user_highpage_movable+72
>     do_anonymous_page+92
>     handle_pte_fault+320
>     __handle_mm_fault+252
>     handle_mm_fault+244
>     do_page_fault+340
>     do_translation_fault+100
>     do_mem_abort+76
>     el0_da+60
>     el0t_64_sync_handler+232
>     el0t_64_sync+420
> ]:
> [2, 4)                 1 |@                                                   |
> [4, 8)                30 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
> [8, 16)               25 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
> [16, 32)              33 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32, 64)              29 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
> [64, 128)             13 |@@@@@@@@@@@@@@@@@@@@                                |
> [128, 256)             2 |@@@                                                 |
>
>
> @stats[
>     __traceiter_contention_end+76
>     __traceiter_contention_end+76
>     rwsem_down_write_slowpath+1216
>     down_write_killable+100
>     do_mprotect_pkey.constprop.0+176
>     __arm64_sys_mprotect+40
>     invoke_syscall+80
>     el0_svc_common.constprop.0+76
>     do_el0_svc+52
>     el0_svc+48
>     el0t_64_sync_handler+164
>     el0t_64_sync+420
> ]:
> [1]                   13 |@@@@@@@@@@@                                         |
> [2, 4)                42 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                |
> [4, 8)                 5 |@@@@                                                |
> [8, 16)               10 |@@@@@@@@                                            |
> [16, 32)              60 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32, 64)              41 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 |
> [64, 128)             40 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
> [128, 256)            14 |@@@@@@@@@@@@                                        |
> [256, 512)             7 |@@@@@@                                              |
> [512, 1K)              6 |@@@@@                                               |
> [1K, 2K)               2 |@                                                   |
> [2K, 4K)               1 |                                                    |
>
> @stats[
>     __traceiter_contention_end+76
>     __traceiter_contention_end+76
>     queued_spin_lock_slowpath+556
>     _raw_spin_lock+108
>     futex_wake+168
>     do_futex+200
>     __arm64_sys_futex+112
>     invoke_syscall+80
>     el0_svc_common.constprop.0+76
>     do_el0_svc+52
>     el0_svc+48
>     el0t_64_sync_handler+164
>     el0t_64_sync+420
> ]:
> [0]                    3 |                                                    |
> [1]                 2515 |@                                                   |
> [2, 4)              8747 |@@@@@                                               |
> [4, 8)             17052 |@@@@@@@@@@                                          |
> [8, 16)            46706 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
> [16, 32)           82105 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32, 64)           46918 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
> [64, 128)             99 |                                                    |
> [128, 256)             1 |                                                    |
> [256, 512)             8 |                                                    |
> [512, 1K)              0 |                                                    |
> [1K, 2K)               0 |                                                    |
> [2K, 4K)               0 |                                                    |
> [4K, 8K)               0 |                                                    |
> [8K, 16K)              0 |                                                    |
> [16K, 32K)             5 |                                                    |
> [32K, 64K)            12 |                                                    |
> [64K, 128K)           34 |                                                    |
> [128K, 256K)          68 |                                                    |
> [256K, 512K)           7 |                                                    |
>