[v31,01/13] mm: Introduce Data Access MONitor (DAMON)

Message ID	20210621083108.17589-2-sj38.park@gmail.com (mailing list archive)
State	New
Headers	show Return-Path: <SRS0=YCYJ=LP=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 0C23861164 From: SeongJae Park <sj38.park@gmail.com> To: akpm@linux-foundation.org Cc: SeongJae Park <sjpark@amazon.de>, Jonathan.Cameron@Huawei.com, acme@kernel.org, alexander.shishkin@linux.intel.com, amit@kernel.org, benh@kernel.crashing.org, brendanhiggins@google.com, corbet@lwn.net, david@redhat.com, dwmw@amazon.com, elver@google.com, fan.du@intel.com, foersleo@amazon.de, greg@kroah.com, gthelen@google.com, guoju.fgj@alibaba-inc.com, jgowans@amazon.com, mgorman@suse.de, mheyne@amazon.de, minchan@kernel.org, mingo@redhat.com, namhyung@kernel.org, peterz@infradead.org, riel@surriel.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, shakeelb@google.com, shuah@kernel.org, sieberf@amazon.com, sj38.park@gmail.com, snu@zelle79.org, vbabka@suse.cz, vdavydov.dev@gmail.com, zgf574564920@gmail.com, linux-damon@amazon.com, linux-mm@kvack.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH v31 01/13] mm: Introduce Data Access MONitor (DAMON) Date: Mon, 21 Jun 2021 08:30:56 +0000 Message-Id: <20210621083108.17589-2-sj38.park@gmail.com> In-Reply-To: <20210621083108.17589-1-sj38.park@gmail.com> References: <20210621083108.17589-1-sj38.park@gmail.com> Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	Introduce Data Access MONitor (DAMON) \| expand [v31,00/13] Introduce Data Access MONitor (DAMON) [v31,01/13] mm: Introduce Data Access MONitor (DAMON) [v31,02/13] mm/damon/core: Implement region-based sampling [v31,03/13] mm/damon: Adaptively adjust regions [v31,04/13] mm/idle_page_tracking: Make PG_idle reusable [v31,05/13] mm/damon: Implement primitives for the virtual memory address spaces [v31,06/13] mm/damon: Add a tracepoint [v31,07/13] mm/damon: Implement a debugfs-based user space interface [v31,08/13] mm/damon/dbgfs: Export kdamond pid to the user space [v31,09/13] mm/damon/dbgfs: Support multiple contexts [v31,10/13] Documentation: Add documents for DAMON [v31,11/13] mm/damon: Add kunit tests [v31,12/13] mm/damon: Add user space selftests [v31,13/13] MAINTAINERS: Update for DAMON

SeongJae Park June 21, 2021, 8:30 a.m. UTC

From: SeongJae Park <sjpark@amazon.de>

DAMON is a data access monitoring framework for the Linux kernel.  The
core mechanisms of DAMON make it

 - accurate (the monitoring output is useful enough for DRAM level
   performance-centric memory management; It might be inappropriate for
   CPU cache levels, though),
 - light-weight (the monitoring overhead is normally low enough to be
   applied online), and
 - scalable (the upper-bound of the overhead is in constant range
   regardless of the size of target workloads).

Using this framework, hence, we can easily write efficient kernel space
data access monitoring applications.  For example, the kernel's memory
management mechanisms can make advanced decisions using this.
Experimental data access aware optimization works that incurring high
access monitoring overhead could again be implemented on top of this.

Due to its simple and flexible interface, providing user space interface
would be also easy.  Then, user space users who have some special
workloads can write personalized applications for better understanding
and optimizations of their workloads and systems.

===

Nevertheless, this commit is defining and implementing only basic access
check part without the overhead-accuracy handling core logic.  The basic
access check is as below.

The output of DAMON says what memory regions are how frequently accessed
for a given duration.  The resolution of the access frequency is
controlled by setting ``sampling interval`` and ``aggregation
interval``.  In detail, DAMON checks access to each page per ``sampling
interval`` and aggregates the results.  In other words, counts the
number of the accesses to each region.  After each ``aggregation
interval`` passes, DAMON calls callback functions that previously
registered by users so that users can read the aggregated results and
then clears the results.  This can be described in below simple
pseudo-code::

    init()
    while monitoring_on:
        for page in monitoring_target:
            if accessed(page):
                nr_accesses[page] += 1
        if time() % aggregation_interval == 0:
            for callback in user_registered_callbacks:
                callback(monitoring_target, nr_accesses)
            for page in monitoring_target:
                nr_accesses[page] = 0
        if time() % update_interval == 0:
            update()
        sleep(sampling interval)

The target regions constructed at the beginning of the monitoring and
updated after each ``regions_update_interval``, because the target
regions could be dynamically changed (e.g., mmap() or memory hotplug).
The monitoring overhead of this mechanism will arbitrarily increase as
the size of the target workload grows.

The basic monitoring primitives for actual access check and dynamic
target regions construction aren't in the core part of DAMON.  Instead,
it allows users to implement their own primitives that are optimized for
their use case and configure DAMON to use those.  In other words, users
cannot use current version of DAMON without some additional works.

Following commits will implement the core mechanisms for the
overhead-accuracy control and default primitives implementations.

Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Leonard Foerster <foersleo@amazon.de>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
---
 include/linux/damon.h | 167 ++++++++++++++++++++++
 mm/Kconfig            |   2 +
 mm/Makefile           |   1 +
 mm/damon/Kconfig      |  15 ++
 mm/damon/Makefile     |   3 +
 mm/damon/core.c       | 318 ++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 506 insertions(+)
 create mode 100644 include/linux/damon.h
 create mode 100644 mm/damon/Kconfig
 create mode 100644 mm/damon/Makefile
 create mode 100644 mm/damon/core.c

Shakeel Butt June 22, 2021, 2:59 p.m. UTC | #1

On Mon, Jun 21, 2021 at 1:31 AM SeongJae Park <sj38.park@gmail.com> wrote:
>
> From: SeongJae Park <sjpark@amazon.de>
>
> DAMON is a data access monitoring framework for the Linux kernel.  The
> core mechanisms of DAMON make it
>
>  - accurate (the monitoring output is useful enough for DRAM level
>    performance-centric memory management; It might be inappropriate for
>    CPU cache levels, though),
>  - light-weight (the monitoring overhead is normally low enough to be
>    applied online), and
>  - scalable (the upper-bound of the overhead is in constant range
>    regardless of the size of target workloads).
>
> Using this framework, hence, we can easily write efficient kernel space
> data access monitoring applications.  For example, the kernel's memory
> management mechanisms can make advanced decisions using this.
> Experimental data access aware optimization works that incurring high
> access monitoring overhead could again be implemented on top of this.
>
> Due to its simple and flexible interface, providing user space interface
> would be also easy.  Then, user space users who have some special
> workloads can write personalized applications for better understanding
> and optimizations of their workloads and systems.
>
> ===
>
> Nevertheless, this commit is defining and implementing only basic access
> check part without the overhead-accuracy handling core logic.  The basic
> access check is as below.
>
> The output of DAMON says what memory regions are how frequently accessed
> for a given duration.  The resolution of the access frequency is
> controlled by setting ``sampling interval`` and ``aggregation
> interval``.  In detail, DAMON checks access to each page per ``sampling
> interval`` and aggregates the results.  In other words, counts the
> number of the accesses to each region.  After each ``aggregation
> interval`` passes, DAMON calls callback functions that previously
> registered by users so that users can read the aggregated results and
> then clears the results.  This can be described in below simple
> pseudo-code::
>
>     init()
>     while monitoring_on:
>         for page in monitoring_target:
>             if accessed(page):
>                 nr_accesses[page] += 1
>         if time() % aggregation_interval == 0:
>             for callback in user_registered_callbacks:
>                 callback(monitoring_target, nr_accesses)
>             for page in monitoring_target:
>                 nr_accesses[page] = 0
>         if time() % update_interval == 0:

regions_update_interval?

>             update()
>         sleep(sampling interval)
>
> The target regions constructed at the beginning of the monitoring and
> updated after each ``regions_update_interval``, because the target
> regions could be dynamically changed (e.g., mmap() or memory hotplug).
> The monitoring overhead of this mechanism will arbitrarily increase as
> the size of the target workload grows.
>
> The basic monitoring primitives for actual access check and dynamic
> target regions construction aren't in the core part of DAMON.  Instead,
> it allows users to implement their own primitives that are optimized for
> their use case and configure DAMON to use those.  In other words, users
> cannot use current version of DAMON without some additional works.
>
> Following commits will implement the core mechanisms for the
> overhead-accuracy control and default primitives implementations.
>
> Signed-off-by: SeongJae Park <sjpark@amazon.de>
> Reviewed-by: Leonard Foerster <foersleo@amazon.de>
> Reviewed-by: Fernand Sieber <sieberf@amazon.com>

Few nits below otherwise look good to me. You can add:

Acked-by: Shakeel Butt <shakeelb@google.com>

[...]
> +/*
> + * __damon_start() - Starts monitoring with given context.
> + * @ctx:       monitoring context
> + *
> + * This function should be called while damon_lock is hold.
> + *
> + * Return: 0 on success, negative error code otherwise.
> + */
> +static int __damon_start(struct damon_ctx *ctx)
> +{
> +       int err = -EBUSY;
> +
> +       mutex_lock(&ctx->kdamond_lock);
> +       if (!ctx->kdamond) {
> +               err = 0;
> +               ctx->kdamond_stop = false;
> +               ctx->kdamond = kthread_create(kdamond_fn, ctx, "kdamond.%d",
> +                               nr_running_ctxs);
> +               if (IS_ERR(ctx->kdamond))
> +                       err = PTR_ERR(ctx->kdamond);
> +               else
> +                       wake_up_process(ctx->kdamond);

Nit: You can use kthread_run() here.

> +       }
> +       mutex_unlock(&ctx->kdamond_lock);
> +
> +       return err;
> +}
> +
[...]
> +static int __damon_stop(struct damon_ctx *ctx)
> +{
> +       mutex_lock(&ctx->kdamond_lock);
> +       if (ctx->kdamond) {
> +               ctx->kdamond_stop = true;
> +               mutex_unlock(&ctx->kdamond_lock);
> +               while (damon_kdamond_running(ctx))
> +                       usleep_range(ctx->sample_interval,
> +                                       ctx->sample_interval * 2);

Any reason to not use kthread_stop() here?

SeongJae Park June 24, 2021, 10:26 a.m. UTC | #2

From: SeongJae Park <sjpark@amazon.de>

On Tue, 22 Jun 2021 07:59:11 -0700 Shakeel Butt <shakeelb@google.com> wrote:

> On Mon, Jun 21, 2021 at 1:31 AM SeongJae Park <sj38.park@gmail.com> wrote:
> >
> > From: SeongJae Park <sjpark@amazon.de>
> >
> > DAMON is a data access monitoring framework for the Linux kernel.  The
> > core mechanisms of DAMON make it
> >
> >  - accurate (the monitoring output is useful enough for DRAM level
> >    performance-centric memory management; It might be inappropriate for
> >    CPU cache levels, though),
> >  - light-weight (the monitoring overhead is normally low enough to be
> >    applied online), and
> >  - scalable (the upper-bound of the overhead is in constant range
> >    regardless of the size of target workloads).
> >
> > Using this framework, hence, we can easily write efficient kernel space
> > data access monitoring applications.  For example, the kernel's memory
> > management mechanisms can make advanced decisions using this.
> > Experimental data access aware optimization works that incurring high
> > access monitoring overhead could again be implemented on top of this.
> >
> > Due to its simple and flexible interface, providing user space interface
> > would be also easy.  Then, user space users who have some special
> > workloads can write personalized applications for better understanding
> > and optimizations of their workloads and systems.
> >
> > ===
> >
> > Nevertheless, this commit is defining and implementing only basic access
> > check part without the overhead-accuracy handling core logic.  The basic
> > access check is as below.
> >
> > The output of DAMON says what memory regions are how frequently accessed
> > for a given duration.  The resolution of the access frequency is
> > controlled by setting ``sampling interval`` and ``aggregation
> > interval``.  In detail, DAMON checks access to each page per ``sampling
> > interval`` and aggregates the results.  In other words, counts the
> > number of the accesses to each region.  After each ``aggregation
> > interval`` passes, DAMON calls callback functions that previously
> > registered by users so that users can read the aggregated results and
> > then clears the results.  This can be described in below simple
> > pseudo-code::
> >
> >     init()
> >     while monitoring_on:
> >         for page in monitoring_target:
> >             if accessed(page):
> >                 nr_accesses[page] += 1
> >         if time() % aggregation_interval == 0:
> >             for callback in user_registered_callbacks:
> >                 callback(monitoring_target, nr_accesses)
> >             for page in monitoring_target:
> >                 nr_accesses[page] = 0
> >         if time() % update_interval == 0:
> 
> regions_update_interval?

It used the name before.  But, I changed the name in this way to use it as a
general periodic updates of monitoring primitives.  Of course we can use the
specific name only in this specific example, but also want to make this as
similar to the actual code as possible.

If you strongly want to rename this, please feel free to let me know.

> 
> >             update()
> >         sleep(sampling interval)
> >
> > The target regions constructed at the beginning of the monitoring and
> > updated after each ``regions_update_interval``, because the target
> > regions could be dynamically changed (e.g., mmap() or memory hotplug).
> > The monitoring overhead of this mechanism will arbitrarily increase as
> > the size of the target workload grows.
> >
> > The basic monitoring primitives for actual access check and dynamic
> > target regions construction aren't in the core part of DAMON.  Instead,
> > it allows users to implement their own primitives that are optimized for
> > their use case and configure DAMON to use those.  In other words, users
> > cannot use current version of DAMON without some additional works.
> >
> > Following commits will implement the core mechanisms for the
> > overhead-accuracy control and default primitives implementations.
> >
> > Signed-off-by: SeongJae Park <sjpark@amazon.de>
> > Reviewed-by: Leonard Foerster <foersleo@amazon.de>
> > Reviewed-by: Fernand Sieber <sieberf@amazon.com>
> 
> Few nits below otherwise look good to me. You can add:
> 
> Acked-by: Shakeel Butt <shakeelb@google.com>

Thank you!

> 
> [...]
> > +/*
> > + * __damon_start() - Starts monitoring with given context.
> > + * @ctx:       monitoring context
> > + *
> > + * This function should be called while damon_lock is hold.
> > + *
> > + * Return: 0 on success, negative error code otherwise.
> > + */
> > +static int __damon_start(struct damon_ctx *ctx)
> > +{
> > +       int err = -EBUSY;
> > +
> > +       mutex_lock(&ctx->kdamond_lock);
> > +       if (!ctx->kdamond) {
> > +               err = 0;
> > +               ctx->kdamond_stop = false;
> > +               ctx->kdamond = kthread_create(kdamond_fn, ctx, "kdamond.%d",
> > +                               nr_running_ctxs);
> > +               if (IS_ERR(ctx->kdamond))
> > +                       err = PTR_ERR(ctx->kdamond);
> > +               else
> > +                       wake_up_process(ctx->kdamond);
> 
> Nit: You can use kthread_run() here.

Ok, I will use that from the next spin.

> 
> > +       }
> > +       mutex_unlock(&ctx->kdamond_lock);
> > +
> > +       return err;
> > +}
> > +
> [...]
> > +static int __damon_stop(struct damon_ctx *ctx)
> > +{
> > +       mutex_lock(&ctx->kdamond_lock);
> > +       if (ctx->kdamond) {
> > +               ctx->kdamond_stop = true;
> > +               mutex_unlock(&ctx->kdamond_lock);
> > +               while (damon_kdamond_running(ctx))
> > +                       usleep_range(ctx->sample_interval,
> > +                                       ctx->sample_interval * 2);
> 
> Any reason to not use kthread_stop() here?

Using 'kthread_stop()' here will make the code much simpler.  But, 'kdamond'
also stops itself when all monitoring targets became invalid (e.g., all
monitoring target processes terminated).  However, 'kthread_stop()' is not easy
to be used for the use case (self stopping).  It's of course possible, but it
would make the code longer.  That's why I use 'kdamond_stop' flag here.  So,
I'd like leave this as is.  If you think 'kthread_stop()' should be used,
please feel free to let me know.


Thanks,
SeongJae Park

SeongJae Park June 24, 2021, 10:26 a.m. UTC | #3

From: SeongJae Park <sjpark@amazon.de>

On Tue, 22 Jun 2021 07:59:35 -0700 Shakeel Butt <shakeelb@google.com> wrote:

> On Mon, Jun 21, 2021 at 1:31 AM SeongJae Park <sj38.park@gmail.com> wrote:
> >
> > From: SeongJae Park <sjpark@amazon.de>
> >
> > Even somehow the initial monitoring target regions are well constructed
> > to fulfill the assumption (pages in same region have similar access
> > frequencies), the data access pattern can be dynamically changed.  This
> > will result in low monitoring quality.  To keep the assumption as much
> > as possible, DAMON adaptively merges and splits each region based on
> > their access frequency.
> >
> > For each ``aggregation interval``, it compares the access frequencies of
> > adjacent regions and merges those if the frequency difference is small.
> > Then, after it reports and clears the aggregated access frequency of
> > each region, it splits each region into two or three regions if the
> > total number of regions will not exceed the user-specified maximum
> > number of regions after the split.
> >
> > In this way, DAMON provides its best-effort quality and minimal overhead
> > while keeping the upper-bound overhead that users set.
> >
> > Signed-off-by: SeongJae Park <sjpark@amazon.de>
> > Reviewed-by: Leonard Foerster <foersleo@amazon.de>
> > Reviewed-by: Fernand Sieber <sieberf@amazon.com>
> [...]
> >
> > +unsigned int damon_nr_regions(struct damon_target *t)
> > +{
> > +       struct damon_region *r;
> > +       unsigned int nr_regions = 0;
> > +
> > +       damon_for_each_region(r, t)
> > +               nr_regions++;
> 
> This bugs me everytime. Please just have nr_regions field in the
> damon_target instead of traversing the list to count the number of
> regions.

Ok, I will make the change in next spin.

> 
> Other than that, it looks good to me.

Thanks,
SeongJae Park

SeongJae Park June 24, 2021, 10:26 a.m. UTC | #4

From: SeongJae Park <sjpark@amazon.de>

On Tue, 22 Jun 2021 08:00:58 -0700 Shakeel Butt <shakeelb@google.com> wrote:

> On Mon, Jun 21, 2021 at 1:31 AM SeongJae Park <sj38.park@gmail.com> wrote:
> >
> > From: SeongJae Park <sjpark@amazon.de>
> >
> > This commit introduces a reference implementation of the address space
> > specific low level primitives for the virtual address space, so that
> > users of DAMON can easily monitor the data accesses on virtual address
> > spaces of specific processes by simply configuring the implementation to
> > be used by DAMON.
> >
> > The low level primitives for the fundamental access monitoring are
> > defined in two parts:
> >
> > 1. Identification of the monitoring target address range for the address
> >    space.
> > 2. Access check of specific address range in the target space.
> >
> > The reference implementation for the virtual address space does the
> > works as below.
> >
> > PTE Accessed-bit Based Access Check
> > -----------------------------------
> >
> > The implementation uses PTE Accessed-bit for basic access checks.  That
> > is, it clears the bit for the next sampling target page and checks
> > whether it is set again after one sampling period.  This could disturb
> > the reclaim logic.  DAMON uses ``PG_idle`` and ``PG_young`` page flags
> > to solve the conflict, as Idle page tracking does.
> >
> > VMA-based Target Address Range Construction
> > -------------------------------------------
> >
> > Only small parts in the super-huge virtual address space of the
> > processes are mapped to physical memory and accessed.  Thus, tracking
> > the unmapped address regions is just wasteful.  However, because DAMON
> > can deal with some level of noise using the adaptive regions adjustment
> > mechanism, tracking every mapping is not strictly required but could
> > even incur a high overhead in some cases.  That said, too huge unmapped
> > areas inside the monitoring target should be removed to not take the
> > time for the adaptive mechanism.
> >
> > For the reason, this implementation converts the complex mappings to
> > three distinct regions that cover every mapped area of the address
> > space.  Also, the two gaps between the three regions are the two biggest
> > unmapped areas in the given address space.  The two biggest unmapped
> > areas would be the gap between the heap and the uppermost mmap()-ed
> > region, and the gap between the lowermost mmap()-ed region and the stack
> > in most of the cases.  Because these gaps are exceptionally huge in
> > usual address spaces, excluding these will be sufficient to make a
> > reasonable trade-off.  Below shows this in detail::
> >
> >     <heap>
> >     <BIG UNMAPPED REGION 1>
> >     <uppermost mmap()-ed region>
> >     (small mmap()-ed regions and munmap()-ed regions)
> >     <lowermost mmap()-ed region>
> >     <BIG UNMAPPED REGION 2>
> >     <stack>
> >
> > Signed-off-by: SeongJae Park <sjpark@amazon.de>
> > Reviewed-by: Leonard Foerster <foersleo@amazon.de>
> > Reviewed-by: Fernand Sieber <sieberf@amazon.com>
> 
> Couple of nits below and one concern on the default value of
> primitive_update_interval of virtual address space primitive.
> Otherwise looks good to me.

Thank you!

> 
> [...]
> 
> > +
> > +/*
> > + * Size-evenly split a region into 'nr_pieces' small regions
> > + *
> > + * Returns 0 on success, or negative error code otherwise.
> > + */
> > +static int damon_va_evenly_split_region(struct damon_ctx *ctx,
> 
> I don't see ctx being used in this function.

Good point, will remove that from the next spin.

> 
> > +               struct damon_region *r, unsigned int nr_pieces)
> > +{
> > +       unsigned long sz_orig, sz_piece, orig_end;
> > +       struct damon_region *n = NULL, *next;
> > +       unsigned long start;
> > +
> > +       if (!r || !nr_pieces)
> > +               return -EINVAL;
> > +
> > +       orig_end = r->ar.end;
> > +       sz_orig = r->ar.end - r->ar.start;
> > +       sz_piece = ALIGN_DOWN(sz_orig / nr_pieces, DAMON_MIN_REGION);
> > +
> > +       if (!sz_piece)
> > +               return -EINVAL;
> > +
> > +       r->ar.end = r->ar.start + sz_piece;
> > +       next = damon_next_region(r);
> > +       for (start = r->ar.end; start + sz_piece <= orig_end;
> > +                       start += sz_piece) {
> > +               n = damon_new_region(start, start + sz_piece);
> > +               if (!n)
> > +                       return -ENOMEM;
> > +               damon_insert_region(n, r, next);
> > +               r = n;
> > +       }
> > +       /* complement last region for possible rounding error */
> > +       if (n)
> > +               n->ar.end = orig_end;
> > +
> > +       return 0;
> > +}
> 
> [...]
> 
> > +/*
> > + * Get the three regions in the given target (task)
> > + *
> > + * Returns 0 on success, negative error code otherwise.
> > + */
> > +static int damon_va_three_regions(struct damon_target *t,
> > +                               struct damon_addr_range regions[3])
> > +{
> > +       struct mm_struct *mm;
> > +       int rc;
> > +
> > +       mm = damon_get_mm(t);
> > +       if (!mm)
> > +               return -EINVAL;
> > +
> > +       mmap_read_lock(mm);
> > +       rc = __damon_va_three_regions(mm->mmap, regions);
> > +       mmap_read_unlock(mm);
> 
> This is being called for each target every second by default. Seems
> too aggressive. Applications don't change their address space every
> second. I would recommend to default ctx->primitive_update_interval to
> a higher default value.

Good point.  If there are many targets and each target has a huge number of
VMAs, the overhead could be high.  Nevertheless, I couldn't find the overhead
in my test setup.  Also, it seems someone are already started exploring DAMON
patchset with the default value. and usages from others.  Silently changing the
default value could distract such people.  So, if you think it's ok, I'd like
to change the default value only after someone finds the overhead from their
usages and asks a change.

If you disagree or you found the overhead from your usage, please feel free to
let me know.

> 
> > +
> > +       mmput(mm);
> > +       return rc;
> > +}
> > +
> 
> [...]
> 
> > +static void __damon_va_init_regions(struct damon_ctx *c,
> 
> Keep the convention of naming damon_ctx ctx.

Ok, I will do so from the next spin.

> 
> > +                                    struct damon_target *t)
> > +{
> > +       struct damon_region *r;
> > +       struct damon_addr_range regions[3];
> > +       unsigned long sz = 0, nr_pieces;
> > +       int i;
> > +
> > +       if (damon_va_three_regions(t, regions)) {
> > +               pr_err("Failed to get three regions of target %lu\n", t->id);
> > +               return;
> > +       }
> > +
> > +       for (i = 0; i < 3; i++)
> > +               sz += regions[i].end - regions[i].start;
> > +       if (c->min_nr_regions)
> > +               sz /= c->min_nr_regions;
> > +       if (sz < DAMON_MIN_REGION)
> > +               sz = DAMON_MIN_REGION;
> > +
> > +       /* Set the initial three regions of the target */
> > +       for (i = 0; i < 3; i++) {
> > +               r = damon_new_region(regions[i].start, regions[i].end);
> > +               if (!r) {
> > +                       pr_err("%d'th init region creation failed\n", i);
> > +                       return;
> > +               }
> > +               damon_add_region(r, t);
> > +
> > +               nr_pieces = (regions[i].end - regions[i].start) / sz;
> > +               damon_va_evenly_split_region(c, r, nr_pieces);
> > +       }
> > +}
> 
> [...]
> 
> > +/*
> > + * Update damon regions for the three big regions of the given target
> > + *
> > + * t           the given target
> > + * bregions    the three big regions of the target
> > + */
> > +static void damon_va_apply_three_regions(struct damon_ctx *ctx,
> 
> ctx not used in this function.

Good eye, will remove that from the next version.

> 
> 
> > +               struct damon_target *t, struct damon_addr_range bregions[3])
> > +{
> > +       struct damon_region *r, *next;
> > +       unsigned int i = 0;
> > +
> > +       /* Remove regions which are not in the three big regions now */
> > +       damon_for_each_region_safe(r, next, t) {
> > +               for (i = 0; i < 3; i++) {
> > +                       if (damon_intersect(r, &bregions[i]))
> > +                               break;
> > +               }
> > +               if (i == 3)
> > +                       damon_destroy_region(r);
> > +       }
> > +
> > +       /* Adjust intersecting regions to fit with the three big regions */
> > +       for (i = 0; i < 3; i++) {
> > +               struct damon_region *first = NULL, *last;
> > +               struct damon_region *newr;
> > +               struct damon_addr_range *br;
> > +
> > +               br = &bregions[i];
> > +               /* Get the first and last regions which intersects with br */
> > +               damon_for_each_region(r, t) {
> > +                       if (damon_intersect(r, br)) {
> > +                               if (!first)
> > +                                       first = r;
> > +                               last = r;
> > +                       }
> > +                       if (r->ar.start >= br->end)
> > +                               break;
> > +               }
> > +               if (!first) {
> > +                       /* no damon_region intersects with this big region */
> > +                       newr = damon_new_region(
> > +                                       ALIGN_DOWN(br->start,
> > +                                               DAMON_MIN_REGION),
> > +                                       ALIGN(br->end, DAMON_MIN_REGION));
> > +                       if (!newr)
> > +                               continue;
> > +                       damon_insert_region(newr, damon_prev_region(r), r);
> > +               } else {
> > +                       first->ar.start = ALIGN_DOWN(br->start,
> > +                                       DAMON_MIN_REGION);
> > +                       last->ar.end = ALIGN(br->end, DAMON_MIN_REGION);
> > +               }
> > +       }
> > +}


Thanks,
SeongJae Park

SeongJae Park June 24, 2021, 10:26 a.m. UTC | #5

From: SeongJae Park <sjpark@amazon.de>

On Tue, 22 Jun 2021 11:12:36 -0700 Shakeel Butt <shakeelb@google.com> wrote:

> On Mon, Jun 21, 2021 at 1:31 AM SeongJae Park <sj38.park@gmail.com> wrote:
> >
> > From: SeongJae Park <sjpark@amazon.de>
> >
> > DAMON is designed to be used by kernel space code such as the memory
> > management subsystems, and therefore it provides only kernel space API.
> > That said, letting the user space control DAMON could provide some
> > benefits to them.  For example, it will allow user space to analyze
> > their specific workloads and make their own special optimizations.
> >
> > For such cases, this commit implements a simple DAMON application kernel
> > module, namely 'damon-dbgfs', which merely wraps the DAMON api and
> > exports those to the user space via the debugfs.
> >
> > 'damon-dbgfs' exports three files, ``attrs``, ``target_ids``, and
> > ``monitor_on`` under its debugfs directory, ``<debugfs>/damon/``.
> >
> > Attributes
> > ----------
> >
> > Users can read and write the ``sampling interval``, ``aggregation
> > interval``, ``regions update interval``, and min/max number of
> > monitoring target regions by reading from and writing to the ``attrs``
> > file.  For example, below commands set those values to 5 ms, 100 ms,
> > 1,000 ms, 10, 1000 and check it again::
> >
> >     # cd <debugfs>/damon
> >     # echo 5000 100000 1000000 10 1000 > attrs
> >     # cat attrs
> >     5000 100000 1000000 10 1000
> >
> > Target IDs
> > ----------
> >
> > Some types of address spaces supports multiple monitoring target.  For
> > example, the virtual memory address spaces monitoring can have multiple
> > processes as the monitoring targets.  Users can set the targets by
> > writing relevant id values of the targets to, and get the ids of the
> > current targets by reading from the ``target_ids`` file.  In case of the
> > virtual address spaces monitoring, the values should be pids of the
> > monitoring target processes.  For example, below commands set processes
> > having pids 42 and 4242 as the monitoring targets and check it again::
> >
> >     # cd <debugfs>/damon
> >     # echo 42 4242 > target_ids
> >     # cat target_ids
> >     42 4242
> >
> > Note that setting the target ids doesn't start the monitoring.
> >
> > Turning On/Off
> > --------------
> >
> > Setting the files as described above doesn't incur effect unless you
> > explicitly start the monitoring.  You can start, stop, and check the
> > current status of the monitoring by writing to and reading from the
> > ``monitor_on`` file.  Writing ``on`` to the file starts the monitoring
> > of the targets with the attributes.  Writing ``off`` to the file stops
> > those.  DAMON also stops if every targets are invalidated (in case of
> > the virtual memory monitoring, target processes are invalidated when
> > terminated).  Below example commands turn on, off, and check the status
> > of DAMON::
> >
> >     # cd <debugfs>/damon
> >     # echo on > monitor_on
> >     # echo off > monitor_on
> >     # cat monitor_on
> >     off
> >
> > Please note that you cannot write to the above-mentioned debugfs files
> > while the monitoring is turned on.  If you write to the files while
> > DAMON is running, an error code such as ``-EBUSY`` will be returned.
> >
> > Signed-off-by: SeongJae Park <sjpark@amazon.de>
> > Reviewed-by: Leonard Foerster <foersleo@amazon.de>
> > Reviewed-by: Fernand Sieber <sieberf@amazon.com>
> 
> 
> The high level comment I have for this patch is the layering of pid
> reference counting. The dbgfs should treat the targets as abstract
> objects and vaddr should handle the reference counting of pids. More
> specifically move find_get_pid from dbgfs to vaddr and to add an
> interface to the primitive for set_targets.
> 
> At the moment, the pid reference is taken in dbgfs and put in vaddr.
> This will be the source of bugs in future.

Good point, and agreed on the problem.  But, I'd like to move 'put_pid()' to
dbgfs, because I think that would let extending the dbgfs user interface to
pidfd a little bit simpler.  Also, I think that would be easier to use for
in-kernel programming interface usages.  If you disagree, please feel free to
let me know.


Thanks,
SeongJae Park

SeongJae Park June 24, 2021, 10:26 a.m. UTC | #6

From: SeongJae Park <sjpark@amazon.de>

On Tue, 22 Jun 2021 11:23:12 -0700 Shakeel Butt <shakeelb@google.com> wrote:

> On Mon, Jun 21, 2021 at 1:31 AM SeongJae Park <sj38.park@gmail.com> wrote:
> >
> > From: SeongJae Park <sjpark@amazon.de>
> >
> > For CPU usage accounting, knowing pid of the monitoring thread could be
> > helpful.  For example, users could use cpuaccount cgroups with the pid.
> >
> > This commit therefore exports the pid of currently running monitoring
> > thread to the user space via 'kdamond_pid' file in the debugfs
> > directory.
> >
> > Signed-off-by: SeongJae Park <sjpark@amazon.de>
> > Reviewed-by: Fernand Sieber <sieberf@amazon.com>
> > ---
> 
> [...]
> 
> >
> > +static const struct file_operations kdamond_pid_fops = {
> > +       .owner = THIS_MODULE,
> 
> I don't think you need to set the owner (and for other fops) as these
> files are built into modules. Otherwise it looks good.

Good point.  Will remove those from the next spin.


Thanks,
SeongJae Park

Shakeel Butt June 24, 2021, 2:34 p.m. UTC | #7

On Thu, Jun 24, 2021 at 3:26 AM SeongJae Park <sj38.park@gmail.com> wrote:
>
[...]
> > >         if time() % update_interval == 0:
> >
> > regions_update_interval?
>
> It used the name before.  But, I changed the name in this way to use it as a
> general periodic updates of monitoring primitives.  Of course we can use the
> specific name only in this specific example, but also want to make this as
> similar to the actual code as possible.
>
> If you strongly want to rename this, please feel free to let me know.
>

Nah, it is ok.

[...]

> >
> > Any reason to not use kthread_stop() here?
>
> Using 'kthread_stop()' here will make the code much simpler.  But, 'kdamond'
> also stops itself when all monitoring targets became invalid (e.g., all
> monitoring target processes terminated).  However, 'kthread_stop()' is not easy
> to be used for the use case (self stopping).  It's of course possible, but it
> would make the code longer.  That's why I use 'kdamond_stop' flag here.  So,
> I'd like leave this as is.  If you think 'kthread_stop()' should be used,
> please feel free to let me know.
>

Fine as it is.

Shakeel Butt June 24, 2021, 2:42 p.m. UTC | #8

On Thu, Jun 24, 2021 at 3:26 AM SeongJae Park <sj38.park@gmail.com> wrote:
>
[...]
> > > +/*
> > > + * Get the three regions in the given target (task)
> > > + *
> > > + * Returns 0 on success, negative error code otherwise.
> > > + */
> > > +static int damon_va_three_regions(struct damon_target *t,
> > > +                               struct damon_addr_range regions[3])
> > > +{
> > > +       struct mm_struct *mm;
> > > +       int rc;
> > > +
> > > +       mm = damon_get_mm(t);
> > > +       if (!mm)
> > > +               return -EINVAL;
> > > +
> > > +       mmap_read_lock(mm);
> > > +       rc = __damon_va_three_regions(mm->mmap, regions);
> > > +       mmap_read_unlock(mm);
> >
> > This is being called for each target every second by default. Seems
> > too aggressive. Applications don't change their address space every
> > second. I would recommend to default ctx->primitive_update_interval to
> > a higher default value.
>
> Good point.  If there are many targets and each target has a huge number of
> VMAs, the overhead could be high.  Nevertheless, I couldn't find the overhead
> in my test setup.  Also, it seems someone are already started exploring DAMON
> patchset with the default value. and usages from others.  Silently changing the
> default value could distract such people.  So, if you think it's ok, I'd like
> to change the default value only after someone finds the overhead from their
> usages and asks a change.
>
> If you disagree or you found the overhead from your usage, please feel free to
> let me know.
>

mmap lock is a source contention in the real world workloads. We do
observe in our fleet and many others (like Facebook) do complain on
this issue. This is the whole motivation behind SFP, maple tree and
many other mmap lock scalability work. I would be really careful to
add another source of contention on mmap lock. Yes, the user can
change this interval themselves but we should not burden them with
this internal knowledge like "oh if you observe high mmap contention
you may want to increase this specific interval". We should set a good
default value to avoid such situations (most of the time).

Shakeel Butt June 24, 2021, 2:52 p.m. UTC | #9

On Thu, Jun 24, 2021 at 3:26 AM SeongJae Park <sj38.park@gmail.com> wrote:
>
[...]
> >
> > The high level comment I have for this patch is the layering of pid
> > reference counting. The dbgfs should treat the targets as abstract
> > objects and vaddr should handle the reference counting of pids. More
> > specifically move find_get_pid from dbgfs to vaddr and to add an
> > interface to the primitive for set_targets.
> >
> > At the moment, the pid reference is taken in dbgfs and put in vaddr.
> > This will be the source of bugs in future.
>
> Good point, and agreed on the problem.  But, I'd like to move 'put_pid()' to
> dbgfs, because I think that would let extending the dbgfs user interface to
> pidfd a little bit simpler.  Also, I think that would be easier to use for
> in-kernel programming interface usages.  If you disagree, please feel free to
> let me know.
>

I was thinking of removing targetid_is_pid() checks. Anyways this is
not something we can not change later, so I will let you decide which
direction you want to take.

SeongJae Park June 24, 2021, 3:21 p.m. UTC | #10

From: SeongJae Park <sjpark@amazon.de>

On Thu, 24 Jun 2021 07:42:44 -0700 Shakeel Butt <shakeelb@google.com> wrote:

> On Thu, Jun 24, 2021 at 3:26 AM SeongJae Park <sj38.park@gmail.com> wrote:
> >
> [...]
> > > > +/*
> > > > + * Get the three regions in the given target (task)
> > > > + *
> > > > + * Returns 0 on success, negative error code otherwise.
> > > > + */
> > > > +static int damon_va_three_regions(struct damon_target *t,
> > > > +                               struct damon_addr_range regions[3])
> > > > +{
> > > > +       struct mm_struct *mm;
> > > > +       int rc;
> > > > +
> > > > +       mm = damon_get_mm(t);
> > > > +       if (!mm)
> > > > +               return -EINVAL;
> > > > +
> > > > +       mmap_read_lock(mm);
> > > > +       rc = __damon_va_three_regions(mm->mmap, regions);
> > > > +       mmap_read_unlock(mm);
> > >
> > > This is being called for each target every second by default. Seems
> > > too aggressive. Applications don't change their address space every
> > > second. I would recommend to default ctx->primitive_update_interval to
> > > a higher default value.
> >
> > Good point.  If there are many targets and each target has a huge number of
> > VMAs, the overhead could be high.  Nevertheless, I couldn't find the overhead
> > in my test setup.  Also, it seems someone are already started exploring DAMON
> > patchset with the default value. and usages from others.  Silently changing the
> > default value could distract such people.  So, if you think it's ok, I'd like
> > to change the default value only after someone finds the overhead from their
> > usages and asks a change.
> >
> > If you disagree or you found the overhead from your usage, please feel free to
> > let me know.
> >
> 
> mmap lock is a source contention in the real world workloads. We do
> observe in our fleet and many others (like Facebook) do complain on
> this issue. This is the whole motivation behind SFP, maple tree and
> many other mmap lock scalability work. I would be really careful to
> add another source of contention on mmap lock. Yes, the user can
> change this interval themselves but we should not burden them with
> this internal knowledge like "oh if you observe high mmap contention
> you may want to increase this specific interval". We should set a good
> default value to avoid such situations (most of the time).

Thank you for this nice clarification.  I can understand your concern because I
also worked for an HTM-based solution of the scalability issue before.

However, I have neither strong preference nor confidence for the new default
value at the moment.  Could you please recommend one if you have?


Thanks,
SeongJae Park

Shakeel Butt June 24, 2021, 4:33 p.m. UTC | #11

On Thu, Jun 24, 2021 at 8:21 AM SeongJae Park <sj38.park@gmail.com> wrote:
>
> From: SeongJae Park <sjpark@amazon.de>
>
> On Thu, 24 Jun 2021 07:42:44 -0700 Shakeel Butt <shakeelb@google.com> wrote:
>
> > On Thu, Jun 24, 2021 at 3:26 AM SeongJae Park <sj38.park@gmail.com> wrote:
> > >
> > [...]
> > > > > +/*
> > > > > + * Get the three regions in the given target (task)
> > > > > + *
> > > > > + * Returns 0 on success, negative error code otherwise.
> > > > > + */
> > > > > +static int damon_va_three_regions(struct damon_target *t,
> > > > > +                               struct damon_addr_range regions[3])
> > > > > +{
> > > > > +       struct mm_struct *mm;
> > > > > +       int rc;
> > > > > +
> > > > > +       mm = damon_get_mm(t);
> > > > > +       if (!mm)
> > > > > +               return -EINVAL;
> > > > > +
> > > > > +       mmap_read_lock(mm);
> > > > > +       rc = __damon_va_three_regions(mm->mmap, regions);
> > > > > +       mmap_read_unlock(mm);
> > > >
> > > > This is being called for each target every second by default. Seems
> > > > too aggressive. Applications don't change their address space every
> > > > second. I would recommend to default ctx->primitive_update_interval to
> > > > a higher default value.
> > >
> > > Good point.  If there are many targets and each target has a huge number of
> > > VMAs, the overhead could be high.  Nevertheless, I couldn't find the overhead
> > > in my test setup.  Also, it seems someone are already started exploring DAMON
> > > patchset with the default value. and usages from others.  Silently changing the
> > > default value could distract such people.  So, if you think it's ok, I'd like
> > > to change the default value only after someone finds the overhead from their
> > > usages and asks a change.
> > >
> > > If you disagree or you found the overhead from your usage, please feel free to
> > > let me know.
> > >
> >
> > mmap lock is a source contention in the real world workloads. We do
> > observe in our fleet and many others (like Facebook) do complain on
> > this issue. This is the whole motivation behind SFP, maple tree and
> > many other mmap lock scalability work. I would be really careful to
> > add another source of contention on mmap lock. Yes, the user can
> > change this interval themselves but we should not burden them with
> > this internal knowledge like "oh if you observe high mmap contention
> > you may want to increase this specific interval". We should set a good
> > default value to avoid such situations (most of the time).
>
> Thank you for this nice clarification.  I can understand your concern because I
> also worked for an HTM-based solution of the scalability issue before.
>
> However, I have neither strong preference nor confidence for the new default
> value at the moment.  Could you please recommend one if you have?
>

I would say go with a conservative value like 60 seconds. Though there
is no scientific reason behind this specific number, I think it would
be a good compromise. Applications usually don't change their address
space layout that often.

SeongJae Park June 24, 2021, 5:38 p.m. UTC | #12

From: SeongJae Park <sjpark@amazon.de>

On Thu, 24 Jun 2021 09:33:07 -0700 Shakeel Butt <shakeelb@google.com> wrote:

> On Thu, Jun 24, 2021 at 8:21 AM SeongJae Park <sj38.park@gmail.com> wrote:
> >
> > From: SeongJae Park <sjpark@amazon.de>
> >
> > On Thu, 24 Jun 2021 07:42:44 -0700 Shakeel Butt <shakeelb@google.com> wrote:
> >
> > > On Thu, Jun 24, 2021 at 3:26 AM SeongJae Park <sj38.park@gmail.com> wrote:
> > > >
> > > [...]
> > > > > > +/*
> > > > > > + * Get the three regions in the given target (task)
> > > > > > + *
> > > > > > + * Returns 0 on success, negative error code otherwise.
> > > > > > + */
> > > > > > +static int damon_va_three_regions(struct damon_target *t,
> > > > > > +                               struct damon_addr_range regions[3])
> > > > > > +{
> > > > > > +       struct mm_struct *mm;
> > > > > > +       int rc;
> > > > > > +
> > > > > > +       mm = damon_get_mm(t);
> > > > > > +       if (!mm)
> > > > > > +               return -EINVAL;
> > > > > > +
> > > > > > +       mmap_read_lock(mm);
> > > > > > +       rc = __damon_va_three_regions(mm->mmap, regions);
> > > > > > +       mmap_read_unlock(mm);
> > > > >
> > > > > This is being called for each target every second by default. Seems
> > > > > too aggressive. Applications don't change their address space every
> > > > > second. I would recommend to default ctx->primitive_update_interval to
> > > > > a higher default value.
> > > >
> > > > Good point.  If there are many targets and each target has a huge number of
> > > > VMAs, the overhead could be high.  Nevertheless, I couldn't find the overhead
> > > > in my test setup.  Also, it seems someone are already started exploring DAMON
> > > > patchset with the default value. and usages from others.  Silently changing the
> > > > default value could distract such people.  So, if you think it's ok, I'd like
> > > > to change the default value only after someone finds the overhead from their
> > > > usages and asks a change.
> > > >
> > > > If you disagree or you found the overhead from your usage, please feel free to
> > > > let me know.
> > > >
> > >
> > > mmap lock is a source contention in the real world workloads. We do
> > > observe in our fleet and many others (like Facebook) do complain on
> > > this issue. This is the whole motivation behind SFP, maple tree and
> > > many other mmap lock scalability work. I would be really careful to
> > > add another source of contention on mmap lock. Yes, the user can
> > > change this interval themselves but we should not burden them with
> > > this internal knowledge like "oh if you observe high mmap contention
> > > you may want to increase this specific interval". We should set a good
> > > default value to avoid such situations (most of the time).
> >
> > Thank you for this nice clarification.  I can understand your concern because I
> > also worked for an HTM-based solution of the scalability issue before.
> >
> > However, I have neither strong preference nor confidence for the new default
> > value at the moment.  Could you please recommend one if you have?
> >
> 
> I would say go with a conservative value like 60 seconds. Though there
> is no scientific reason behind this specific number, I think it would
> be a good compromise. Applications usually don't change their address
> space layout that often.

Ok, I will use that from the next spin.  Thank you for this nice suggestion.


Thanks,
SeongJae Park

[v31,01/13] mm: Introduce Data Access MONitor (DAMON)

Commit Message

Comments

Patch