[v2,0/7] Add histogram measuring mmap_lock contention latency

Message ID 20200528235238.74233-1-axelrasmussen@google.com

Message

Axel Rasmussen May 28, 2020, 11:52 p.m. UTC
The overall goal of this patchset is to add a latency histogram which measures
`mmap_lock` acquisition time. This is useful to measure the impact of ongoing
work like maple trees and range locks (https://lwn.net/Articles/787629/), and
it is also useful to debug userspace processes which experience long waits due
to lock contention.

This patchset is built upon walken@google.com's new `mmap_lock` API
(https://lkml.org/lkml/2020/4/21/1307). In its current form, it should apply
cleanly to a 5.7-rc7 tree to which Michel's patchset has already been applied.

To summarize the changes being made at a high level:

- Add a histogram library: a `struct histogram` is effectively an array of
  thresholds (i.e., buckets), and an array of per-cpu `u64` counts of the
  number of samples in each bucket.

- Modify Michel's mmap_lock API to record samples in a histogram, owned by the
  `mm_struct`, on each lock acquisition. For contended lock acquisitions, we
  compute the amount of time spent waiting, which determines the bucket.

- For uncontended cases, we still record a sample, but with "0" latency. The
  reasoning for this is: (a) we don't want to incur the overhead of actually
  measuring the time, but (b) we still want to end up with an accurate count
  of acquisition attempts, since this lets us compute latency percentiles
  (e.g., "x% of lock acquisitions completed in <= y ns").

Changes since v1 (sent to a few folks within Google for initial review):

- Added a tracepoint to the contended case.
- Modified `mmap_write_lock_nested` to split the {un,}contended cases.
- Removed support for having more than one histogram in `mm_struct`.
- Removed any histogram code not explicitly used in this patchset.
- Whitespace cleanups.

Axel Rasmussen (7):
  histogram: add struct histogram
  histogram: add helper function to expose histograms to userspace
  mmap_lock: add a histogram structure to struct mm_struct
  mmap_lock: allocate histogram (if enabled) in mm_init
  mmap_lock: add /proc/<pid>/mmap_lock_contention interface
  mmap_lock: increment histogram whenever mmap_lock is acquired
  mmap_lock: add a tracepoint to contended acquisitions

 fs/proc/base.c                   |  25 +++
 include/linux/histogram.h        | 293 +++++++++++++++++++++++++++++++
 include/linux/mm_types.h         |  11 ++
 include/linux/mmap_lock.h        |  92 +++++++++-
 include/trace/events/mmap_lock.h |  34 ++++
 kernel/fork.c                    |  55 ++++++
 kernel/locking/rwsem.c           |   4 +-
 lib/Kconfig                      |   3 +
 lib/Makefile                     |   2 +
 lib/histogram.c                  | 212 ++++++++++++++++++++++
 mm/Kconfig                       |  13 ++
 mm/Makefile                      |   1 +
 mm/mmap_lock.c                   |  46 +++++
 13 files changed, 782 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/histogram.h
 create mode 100644 include/trace/events/mmap_lock.h
 create mode 100644 lib/histogram.c
 create mode 100644 mm/mmap_lock.c

--
2.27.0.rc0.183.gde8f92d652-goog

Comments

Steven Rostedt May 29, 2020, 12:24 a.m. UTC | #1
On Thu, 28 May 2020 16:52:38 -0700
Axel Rasmussen <axelrasmussen@google.com> wrote:

Hi Axel,

First, your patch threading is messed up. All the patches should be a
reply to this cover page, and not individual emails which get lost
among other patches.

Next, we already have histogram logic with trace events. Why not build
off of that? Or perhaps update lockdep, which can record contention on
all locks. Why create yet another histogram infrastructure that is used
for just one specific purpose?

-- Steve


Axel Rasmussen May 29, 2020, 1:39 a.m. UTC | #2
On Thu, May 28, 2020 at 5:24 PM Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 28 May 2020 16:52:38 -0700
> Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> Hi Axel,
>
> First, your patch threading is messed up. All the patches should be a
> reply to this cover page, and not individual emails which get lost
> among other patches.
>

Sorry - I'll fix this in the future. (I mistakenly ran git send-email on
each patch individually, rather than globbing them into a single command.)


>
> Next, we already have histogram logic with trace events. Why not build
> off of that. Or perhaps update lockdep which can record contention with
> all locks. Why create yet another histogram infrastructure that is used
> for just a specific purpose?
>

The use case we have in mind for this is to enable this instrumentation
widely in Google's production fleet. Internally, we have a userspace thing
which scrapes these metrics and publishes them such that we can look at
aggregate metrics across our fleet. The thinking is that mechanisms like
lockdep or getting histograms with e.g. BPF attached to the tracepoint
introduce too much overhead for this to be viable. (Although, granted, I
don't have benchmarks to prove this - if there's skepticism, I can produce
such a thing - or prove myself wrong and rethink my approach. :) )


Thomas Gleixner May 29, 2020, 5:32 a.m. UTC | #3
Axel,

Axel Rasmussen <axelrasmussen@google.com> writes:
> The overall goal of this patchset is to add a latency histogram which measures
> `mmap_lock` acquisition time. This is useful to measure the impact of ongoing
> work like maple trees and range locks (https://lwn.net/Articles/787629/), and
> it is also useful to debug userspace processes which experience long waits due
> to lock contention.
>
> This patchset is built upon walken@google.com's new `mmap_lock` API
> (https://lkml.org/lkml/2020/4/21/1307). In its current form, it should apply
> cleanly to a 5.7-rc7 tree to which Michel's patchset has already been applied.
>
> To summarize the changes being made at a high level:
>
> - Add a histogram library: a `struct histogram` is effectively an array of
>   thresholds (i.e., buckets), and an array of per-cpu `u64` counts of the
>   number of samples in each bucket.

this is maybe a redundant question as I did not follow the V1
submission. 

Why do we need yet another histogram mechanism for instrumentation
purposes?

ftrace has histogram support already. Why can't this be reused?

Thanks,

        tglx
Peter Zijlstra May 29, 2020, 8:09 a.m. UTC | #4
On Thu, May 28, 2020 at 06:39:08PM -0700, Axel Rasmussen wrote:

> The use case we have in mind for this is to enable this instrumentation
> widely in Google's production fleet. Internally, we have a userspace thing
> which scrapes these metrics and publishes them such that we can look at
> aggregate metrics across our fleet. The thinking is that mechanisms like
> lockdep or getting histograms with e.g. BPF attached to the tracepoint
> introduce too much overhead for this to be viable. (Although, granted, I
> don't have benchmarks to prove this - if there's skepticism, I can produce
> such a thing - or prove myself wrong and rethink my approach. :) )

Whichever way around; I don't believe in special instrumentation like
this. We'll grow a thousand separate pieces of crap if we go this route.

Next on, someone will come and instrument yet another lock, with yet
another 1000 lines of gunk.

Why can't you kprobe the mmap_lock things and use ftrace histograms?
Masami Hiramatsu May 29, 2020, 3:03 p.m. UTC | #5
On Fri, 29 May 2020 10:09:57 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> Whichever way around; I don't believe in special instrumentation like
> this. We'll grow a thousand separate pieces of crap if we go this route.
> 
> Next on, someone will come and instrument yet another lock, with yet
> another 1000 lines of gunk.
> 
> Why can't you kprobe the mmap_lock things and use ftrace histograms?

+1.
As far as I can see the series, if you want to make a histogram
of the duration of acquiring locks, you might only need 7/7 (but this
is a minimum subset.) I recommend you to introduce a set of tracepoints
 -- start-locking, locked, and released so that we can investigate
which process is waiting for which one. Then you can use either bpf
or ftrace to make a histogram easily.

Thank you,
Axel Rasmussen May 29, 2020, 10:38 p.m. UTC | #6
On Fri, May 29, 2020 at 8:04 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Fri, 29 May 2020 10:09:57 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> >
> > Whichever way around; I don't believe in special instrumentation like
> > this. We'll grow a thousand separate pieces of crap if we go this route.
> >
> > Next on, someone will come and instrument yet another lock, with yet
> > another 1000 lines of gunk.
> >
> > Why can't you kprobe the mmap_lock things and use ftrace histograms?
>
> +1.
> As far as I can see the series, if you want to make a histogram
> of the duration of acquiring locks, you might only need 7/7 (but this
> is a minimum subset.) I recommend you to introduce a set of tracepoints
>  -- start-locking, locked, and released so that we can investigate
> which process is waiting for which one. Then you can use either bpf
> or ftrace to make a histogram easily.
>
> Thank you,
>
> --
> Masami Hiramatsu <mhiramat@kernel.org>

The reasoning against using BPF or ftrace basically comes down to
overhead. My intuition is that BPF/ftrace are great for testing /
debugging on a small number of machines, but are less suitable for
leaving them enabled in production across many servers. This may not
be generally true, but due to how "hot" this lock is, I think this may
be sort of a pathological case.

Consider maple trees and range locks: if we're running Linux on many
servers, with many different workloads, it's useful to see the impact
of these changes in production, and in aggregate, over a "long" period
of time, instead of just under test conditions on a small number of
machines.

I'll circle back next week with some benchmarks to confirm/deny my
intuition on this. If I can confirm the overhead of BPF / ftrace is
low enough, I'll pursue that route instead.



The point about special instrumentation is well taken. I completely
agree we don't want a file in /proc for each lock in the kernel. :) I
think there's some argument to be made that mmap_lock in particular is
"special", considering the amount of investment going into optimizing
it vs. other locks.