diff mbox series

[v23,01/15] mm: Introduce Data Access MONitor (DAMON)

Message ID 20201215115448.25633-2-sjpark@amazon.com (mailing list archive)
State New, archived
Headers show
Series Introduce Data Access MONitor (DAMON) | expand

Commit Message

SeongJae Park Dec. 15, 2020, 11:54 a.m. UTC
From: SeongJae Park <sjpark@amazon.de>

DAMON is a data access monitoring framework for the Linux kernel.  The
core mechanisms of DAMON make it

 - accurate (the monitoring output is useful enough for DRAM level
   performance-centric memory management; It might be inappropriate for
   CPU cache levels, though),
 - light-weight (the monitoring overhead is normally low enough to be
   applied online), and
 - scalable (the upper-bound of the overhead is in constant range
   regardless of the size of target workloads).

Using this framework, hence, we can easily write efficient kernel space
data access monitoring applications.  For example, the kernel's memory
management mechanisms can make advanced decisions using this.
Experimental data access aware optimization works that incurring high
access monitoring overhead could implemented again on top of this.

Due to its simple and flexible interface, providing user space interface
would be also easy.  Then, user space users who have some special
workloads can write personalized applications for better understanding
and optimizations of their workloads and systems.

---

Nevertheless, this commit is defining and implementing only basic access
check part without the overhead-accuracy handling core logic.  The basic
access check is as below.

The output of DAMON says what memory regions are how frequently accessed
for a given duration.  The resolution of the access frequency is
controlled by setting ``sampling interval`` and ``aggregation
interval``.  In detail, DAMON checks access to each page per ``sampling
interval`` and aggregates the results.  In other words, counts the
number of the accesses to each region.  After each ``aggregation
interval`` passes, DAMON calls callback functions that previously
registered by users so that users can read the aggregated results and
then clears the results.  This can be described in below simple
pseudo-code::

    while monitoring_on:
        for page in monitoring_target:
            if accessed(page):
                nr_accesses[page] += 1
        if time() % aggregation_interval == 0:
            for callback in user_registered_callbacks:
                callback(monitoring_target, nr_accesses)
            for page in monitoring_target:
                nr_accesses[page] = 0
        sleep(sampling interval)

The target regions constructed at the beginning of the monitoring and
updated after each ``regions_update_interval``, because the target
regions could be dynamically changed (e.g., mmap() or memory hotplug).
The monitoring overhead of this mechanism will arbitrarily increase as
the size of the target workload grows.

The basic monitoring primitives for actual access check and dynamic
target regions construction aren't in the core part of DAMON.  Instead,
it allows users to implement their own primitives that optimized for
their use case and configure DAMON to use those.  In other words, users
cannot use current version of DAMON without some additional works.

Following commits will implement the core mechanisms for the
overhead-accuracy control and default primitives implementations.

Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Leonard Foerster <foersleo@amazon.de>
---
 include/linux/damon.h | 167 ++++++++++++++++++++++
 mm/Kconfig            |   2 +
 mm/Makefile           |   1 +
 mm/damon/Kconfig      |  15 ++
 mm/damon/Makefile     |   3 +
 mm/damon/core.c       | 316 ++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 504 insertions(+)
 create mode 100644 include/linux/damon.h
 create mode 100644 mm/damon/Kconfig
 create mode 100644 mm/damon/Makefile
 create mode 100644 mm/damon/core.c

Comments

Shakeel Butt Dec. 23, 2020, 3:11 p.m. UTC | #1
First I would like you to prune your To/CC list.

On Tue, Dec 15, 2020 at 3:56 AM SeongJae Park <sjpark@amazon.com> wrote:
>
> From: SeongJae Park <sjpark@amazon.de>
>
> DAMON is a data access monitoring framework for the Linux kernel.  The
> core mechanisms of DAMON make it
>
>  - accurate (the monitoring output is useful enough for DRAM level
>    performance-centric memory management; It might be inappropriate for
>    CPU cache levels, though),
>  - light-weight (the monitoring overhead is normally low enough to be
>    applied online), and
>  - scalable (the upper-bound of the overhead is in constant range
>    regardless of the size of target workloads).
>
> Using this framework, hence, we can easily write efficient kernel space
> data access monitoring applications.  For example, the kernel's memory
> management mechanisms can make advanced decisions using this.
> Experimental data access aware optimization works that incurring high
> access monitoring overhead could implemented again on top of this.

*could again be implemented on*

>
> Due to its simple and flexible interface, providing user space interface
> would be also easy.  Then, user space users who have some special
> workloads can write personalized applications for better understanding
> and optimizations of their workloads and systems.
>
> ---
>
> Nevertheless, this commit is defining and implementing only basic access
> check part without the overhead-accuracy handling core logic.  The basic
> access check is as below.
>
> The output of DAMON says what memory regions are how frequently accessed
> for a given duration.  The resolution of the access frequency is
> controlled by setting ``sampling interval`` and ``aggregation
> interval``.  In detail, DAMON checks access to each page per ``sampling
> interval`` and aggregates the results.  In other words, counts the
> number of the accesses to each region.  After each ``aggregation
> interval`` passes, DAMON calls callback functions that previously
> registered by users so that users can read the aggregated results and
> then clears the results.  This can be described in below simple
> pseudo-code::
>
>     while monitoring_on:
>         for page in monitoring_target:
>             if accessed(page):
>                 nr_accesses[page] += 1
>         if time() % aggregation_interval == 0:
>             for callback in user_registered_callbacks:
>                 callback(monitoring_target, nr_accesses)
>             for page in monitoring_target:
>                 nr_accesses[page] = 0
>         sleep(sampling interval)
>

The above is a good example and I was hoping to see the almost same actual code.

> The target regions constructed at the beginning of the monitoring and
> updated after each ``regions_update_interval``, because the target
> regions could be dynamically changed (e.g., mmap() or memory hotplug).
> The monitoring overhead of this mechanism will arbitrarily increase as
> the size of the target workload grows.
>
> The basic monitoring primitives for actual access check and dynamic
> target regions construction aren't in the core part of DAMON.  Instead,
> it allows users to implement their own primitives that optimized for

'that are optimized for'

> their use case and configure DAMON to use those.  In other words, users
> cannot use current version of DAMON without some additional works.
>
> Following commits will implement the core mechanisms for the
> overhead-accuracy control and default primitives implementations.
>
> Signed-off-by: SeongJae Park <sjpark@amazon.de>
> Reviewed-by: Leonard Foerster <foersleo@amazon.de>
> ---
>  include/linux/damon.h | 167 ++++++++++++++++++++++
>  mm/Kconfig            |   2 +
>  mm/Makefile           |   1 +
>  mm/damon/Kconfig      |  15 ++
>  mm/damon/Makefile     |   3 +
>  mm/damon/core.c       | 316 ++++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 504 insertions(+)
>  create mode 100644 include/linux/damon.h
>  create mode 100644 mm/damon/Kconfig
>  create mode 100644 mm/damon/Makefile
>  create mode 100644 mm/damon/core.c
>
> diff --git a/include/linux/damon.h b/include/linux/damon.h
> new file mode 100644
> index 000000000000..387fa4399fc8
> --- /dev/null
> +++ b/include/linux/damon.h
> @@ -0,0 +1,167 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * DAMON api
> + *
> + * Author: SeongJae Park <sjpark@amazon.de>
> + */
> +
> +#ifndef _DAMON_H_
> +#define _DAMON_H_
> +
> +#include <linux/mutex.h>
> +#include <linux/time64.h>
> +#include <linux/types.h>
> +
> +struct damon_ctx;
> +
> +/**
> + * struct damon_primitive      Monitoring primitives for given use cases.
> + *
> + * @init_target_regions:       Constructs initial monitoring target regions.
> + * @update_target_regions:     Updates monitoring target regions.

The 'region' here still does not feel right. I prefer to rename the
above two to just init() and update().

> + * @prepare_access_checks:     Prepares next access check of target regions.
> + * @check_accesses:            Checks the access of target regions.
> + * @reset_aggregated:          Resets aggregated accesses monitoring results.
> + * @target_valid:              Determine if the target is valid.
> + * @cleanup:                   Cleans up the context.
> + *
> + * DAMON can be extended for various address spaces and usages.  For this,
> + * users should register the low level primitives for their target address
> + * space and usecase via the &damon_ctx.primitive.  Then, the monitoring thread
> + * calls @init_target_regions and @prepare_access_checks before starting the
> + * monitoring, @update_target_regions after each
> + * &damon_ctx.regions_update_interval, and @check_accesses, @target_valid and
> + * @prepare_access_checks after each &damon_ctx.sample_interval.  Finally,
> + * @reset_aggregated is called after each &damon_ctx.aggr_interval.
> + *
> + * @init_target_regions should construct proper monitoring target regions and
> + * link those to the DAMON context struct.  The regions should be defined by
> + * user and saved in @damon_ctx.target.
> + * @update_target_regions should update the monitoring target regions for
> + * current status.
> + * @prepare_access_checks should manipulate the monitoring regions to be
> + * prepared for the next access check.
> + * @check_accesses should check the accesses to each region that made after the
> + * last preparation and update the number of observed accesses of each region.
> + * @reset_aggregated should reset the access monitoring results that aggregated
> + * by @check_accesses.
> + * @target_valid should check whether the target is still valid for the
> + * monitoring.
> + * @cleanup is called from @kdamond just before its termination.  After this
> + * call, only @kdamond_lock and @kdamond will be touched.
> + */
> +struct damon_primitive {
> +       void (*init_target_regions)(struct damon_ctx *context);
> +       void (*update_target_regions)(struct damon_ctx *context);
> +       void (*prepare_access_checks)(struct damon_ctx *context);
> +       void (*check_accesses)(struct damon_ctx *context);
> +       void (*reset_aggregated)(struct damon_ctx *context);
> +       bool (*target_valid)(void *target);
> +       void (*cleanup)(struct damon_ctx *context);
> +};
> +
> +/*
> + * struct damon_callback       Monitoring events notification callbacks.
> + *
> + * @before_start:      Called before starting the monitoring.
> + * @after_sampling:    Called after each sampling.
> + * @after_aggregation: Called after each aggregation.
> + * @before_terminate:  Called before terminating the monitoring.
> + * @private:           User private data.
> + *
> + * The monitoring thread (&damon_ctx->kdamond) calls @before_start and
> + * @before_terminate just before starting and finishing the monitoring,
> + * respectively.  Therefore, those are good places for installing and cleaning
> + * @private.
> + *
> + * The monitoring thread calls @after_sampling and @after_aggregation for each
> + * of the sampling intervals and aggregation intervals, respectively.
> + * Therefore, users can safely access the monitoring results without additional
> + * protection.  For the reason, users are recommended to use these callback for
> + * the accesses to the results.
> + *
> + * If any callback returns non-zero, monitoring stops.

I am not sure if this is the right patch to add this struct. Either we
need to add text why this is needed or add this patch when the real
user i.e. debugfs interface is added.

> + */
> +struct damon_callback {
> +       void *private;
> +
> +       int (*before_start)(struct damon_ctx *context);
> +       int (*after_sampling)(struct damon_ctx *context);
> +       int (*after_aggregation)(struct damon_ctx *context);
> +       int (*before_terminate)(struct damon_ctx *context);
> +};
> +
> +/**
> + * struct damon_ctx - Represents a context for each monitoring.  This is the
> + * main interface that allows users to set the attributes and get the results
> + * of the monitoring.
> + *
> + * @sample_interval:           The time between access samplings.
> + * @aggr_interval:             The time between monitor results aggregations.
> + * @regions_update_interval:   The time between monitor regions updates.

regions_update_internal should be part of the primitive abstraction
and the update() callback internally can check this field.

> + *
> + * For each @sample_interval, DAMON checks whether each region is accessed or
> + * not.  It aggregates and keeps the access information (number of accesses to
> + * each region) for @aggr_interval time.  DAMON also checks whether the target
> + * memory regions need update (e.g., by ``mmap()`` calls from the application,
> + * in case of virtual memory monitoring) and applies the changes for each
> + * @regions_update_interval.  All time intervals are in micro-seconds.  Please
> + * refer to &struct damon_primitive and &struct damon_callback for more detail.
> + *
> + * @kdamond:           Kernel thread who does the monitoring.
> + * @kdamond_stop:      Notifies whether kdamond should stop.
> + * @kdamond_lock:      Mutex for the synchronizations with @kdamond.
> + *
> + * For each monitoring context, one kernel thread for the monitoring is
> + * created.  The pointer to the thread is stored in @kdamond.
> + *
> + * Once started, the monitoring thread runs until explicitly required to be
> + * terminated or every monitoring target is invalid.  The validity of the
> + * targets is checked via the &damon_primitive.target_valid of @primitive.  The
> + * termination can also be explicitly requested by writing non-zero to
> + * @kdamond_stop.  The thread sets @kdamond to NULL when it terminates.
> + * Therefore, users can know whether the monitoring is ongoing or terminated by
> + * reading @kdamond.  Reads and writes to @kdamond and @kdamond_stop from
> + * outside of the monitoring thread must be protected by @kdamond_lock.
> + *
> + * Note that the monitoring thread protects only @kdamond and @kdamond_stop via
> + * @kdamond_lock.  Accesses to other fields must be protected by themselves.
> + *
> + * @primitive: Set of monitoring primitives for given use cases.
> + * @callback:  Set of callbacks for monitoring events notifications.
> + *
> + * @target:    Pointer to the user-defined monitoring target.
> + */
> +struct damon_ctx {
> +       unsigned long sample_interval;
> +       unsigned long aggr_interval;
> +       unsigned long regions_update_interval;
> +
> +/* private */
> +       struct timespec64 last_aggregation;
> +       struct timespec64 last_regions_update;
> +
> +/* public */
> +       struct task_struct *kdamond;
> +       bool kdamond_stop;
> +       struct mutex kdamond_lock;
> +
> +       struct damon_primitive primitive;
> +       struct damon_callback callback;
> +
> +       void *target;
> +};
> +
> +#ifdef CONFIG_DAMON
> +
> +struct damon_ctx *damon_new_ctx(void);
> +void damon_destroy_ctx(struct damon_ctx *ctx);
> +int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
> +               unsigned long aggr_int, unsigned long regions_update_int);
> +
> +int damon_start(struct damon_ctx **ctxs, int nr_ctxs);
> +int damon_stop(struct damon_ctx **ctxs, int nr_ctxs);
> +
> +#endif /* CONFIG_DAMON */
> +
> +#endif /* _DAMON_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 390165ffbb0f..b97f2e8ab83f 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -859,4 +859,6 @@ config ARCH_HAS_HUGEPD
>  config MAPPING_DIRTY_HELPERS
>          bool
>
> +source "mm/damon/Kconfig"
> +
>  endmenu
> diff --git a/mm/Makefile b/mm/Makefile
> index d73aed0fc99c..8022b8f04096 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -120,3 +120,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
>  obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
>  obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
>  obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
> +obj-$(CONFIG_DAMON) += damon/
> diff --git a/mm/damon/Kconfig b/mm/damon/Kconfig
> new file mode 100644
> index 000000000000..d00e99ac1a15
> --- /dev/null
> +++ b/mm/damon/Kconfig
> @@ -0,0 +1,15 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +
> +menu "Data Access Monitoring"
> +
> +config DAMON
> +       bool "DAMON: Data Access Monitoring Framework"
> +       help
> +         This builds a framework that allows kernel subsystems to monitor
> +         access frequency of each memory region. The information can be useful
> +         for performance-centric DRAM level memory management.
> +
> +         See https://damonitor.github.io/doc/html/latest-damon/index.html for
> +         more information.
> +
> +endmenu
> diff --git a/mm/damon/Makefile b/mm/damon/Makefile
> new file mode 100644
> index 000000000000..4fd2edb4becf
> --- /dev/null
> +++ b/mm/damon/Makefile
> @@ -0,0 +1,3 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +obj-$(CONFIG_DAMON)            := core.o
> diff --git a/mm/damon/core.c b/mm/damon/core.c
> new file mode 100644
> index 000000000000..8963804efdf9
> --- /dev/null
> +++ b/mm/damon/core.c
> @@ -0,0 +1,316 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Data Access Monitor
> + *
> + * Author: SeongJae Park <sjpark@amazon.de>
> + */
> +
> +#define pr_fmt(fmt) "damon: " fmt
> +
> +#include <linux/damon.h>
> +#include <linux/delay.h>
> +#include <linux/kthread.h>
> +#include <linux/slab.h>
> +
> +static DEFINE_MUTEX(damon_lock);
> +static int nr_running_ctxs;
> +
> +struct damon_ctx *damon_new_ctx(void)
> +{
> +       struct damon_ctx *ctx;
> +
> +       ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> +       if (!ctx)
> +               return NULL;
> +
> +       ctx->sample_interval = 5 * 1000;
> +       ctx->aggr_interval = 100 * 1000;
> +       ctx->regions_update_interval = 1000 * 1000;
> +
> +       ktime_get_coarse_ts64(&ctx->last_aggregation);
> +       ctx->last_regions_update = ctx->last_aggregation;
> +
> +       mutex_init(&ctx->kdamond_lock);
> +
> +       ctx->target = NULL;
> +
> +       return ctx;
> +}
> +
> +void damon_destroy_ctx(struct damon_ctx *ctx)
> +{
> +       kfree(ctx);
> +}
> +
> +/**
> + * damon_set_attrs() - Set attributes for the monitoring.
> + * @ctx:               monitoring context
> + * @sample_int:                time interval between samplings
> + * @aggr_int:          time interval between aggregations
> + * @regions_update_int:        time interval between target regions update
> + *
> + * This function should not be called while the kdamond is running.
> + * Every time interval is in micro-seconds.
> + *
> + * Return: 0 on success, negative error code otherwise.
> + */
> +int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
> +                   unsigned long aggr_int, unsigned long regions_update_int)
> +{
> +       ctx->sample_interval = sample_int;
> +       ctx->aggr_interval = aggr_int;
> +       ctx->regions_update_interval = regions_update_int;
> +
> +       return 0;
> +}
> +
> +static bool damon_kdamond_running(struct damon_ctx *ctx)
> +{
> +       bool running;
> +
> +       mutex_lock(&ctx->kdamond_lock);
> +       running = ctx->kdamond != NULL;
> +       mutex_unlock(&ctx->kdamond_lock);
> +
> +       return running;
> +}
> +
> +static int kdamond_fn(void *data);
> +
> +/*
> + * __damon_start() - Starts monitoring with given context.
> + * @ctx:       monitoring context
> + *
> + * This function should be called while damon_lock is hold.
> + *
> + * Return: 0 on success, negative error code otherwise.
> + */
> +static int __damon_start(struct damon_ctx *ctx)
> +{
> +       int err = -EBUSY;
> +
> +       mutex_lock(&ctx->kdamond_lock);
> +       if (!ctx->kdamond) {
> +               err = 0;
> +               ctx->kdamond_stop = false;
> +               ctx->kdamond = kthread_create(kdamond_fn, ctx, "kdamond.%d",
> +                               nr_running_ctxs);
> +               if (IS_ERR(ctx->kdamond))
> +                       err = PTR_ERR(ctx->kdamond);
> +               else
> +                       wake_up_process(ctx->kdamond);
> +       }
> +       mutex_unlock(&ctx->kdamond_lock);
> +
> +       return err;
> +}
> +
> +/**
> + * damon_start() - Starts the monitorings for a given group of contexts.
> + * @ctxs:      an array of the pointers for contexts to start monitoring
> + * @nr_ctxs:   size of @ctxs
> + *
> + * This function starts a group of monitoring threads for a group of monitoring
> + * contexts.  One thread per each context is created and run in parallel.  The
> + * caller should handle synchronization between the threads by itself.  If a
> + * group of threads that created by other 'damon_start()' call is currently
> + * running, this function does nothing but returns -EBUSY.
> + *
> + * Return: 0 on success, negative error code otherwise.
> + */
> +int damon_start(struct damon_ctx **ctxs, int nr_ctxs)
> +{
> +       int i;
> +       int err = 0;
> +
> +       mutex_lock(&damon_lock);
> +       if (nr_running_ctxs) {
> +               mutex_unlock(&damon_lock);
> +               return -EBUSY;
> +       }
> +
> +       for (i = 0; i < nr_ctxs; i++) {
> +               err = __damon_start(ctxs[i]);
> +               if (err)
> +                       break;
> +               nr_running_ctxs++;
> +       }
> +       mutex_unlock(&damon_lock);
> +
> +       return err;
> +}
> +
> +/*
> + * __damon_stop() - Stops monitoring of given context.
> + * @ctx:       monitoring context
> + *
> + * Return: 0 on success, negative error code otherwise.
> + */
> +static int __damon_stop(struct damon_ctx *ctx)
> +{
> +       mutex_lock(&ctx->kdamond_lock);
> +       if (ctx->kdamond) {
> +               ctx->kdamond_stop = true;
> +               mutex_unlock(&ctx->kdamond_lock);
> +               while (damon_kdamond_running(ctx))
> +                       usleep_range(ctx->sample_interval,
> +                                       ctx->sample_interval * 2);
> +               return 0;
> +       }
> +       mutex_unlock(&ctx->kdamond_lock);
> +
> +       return -EPERM;
> +}
> +
> +/**
> + * damon_stop() - Stops the monitorings for a given group of contexts.
> + * @ctxs:      an array of the pointers for contexts to stop monitoring
> + * @nr_ctxs:   size of @ctxs
> + *
> + * Return: 0 on success, negative error code otherwise.
> + */
> +int damon_stop(struct damon_ctx **ctxs, int nr_ctxs)
> +{
> +       int i, err = 0;
> +
> +       for (i = 0; i < nr_ctxs; i++) {
> +               /* nr_running_ctxs is decremented in kdamond_fn */
> +               err = __damon_stop(ctxs[i]);
> +               if (err)
> +                       return err;
> +       }
> +
> +       return err;
> +}
> +
> +/*
> + * damon_check_reset_time_interval() - Check if a time interval is elapsed.
> + * @baseline:  the time to check whether the interval has elapsed since
> + * @interval:  the time interval (microseconds)
> + *
> + * See whether the given time interval has passed since the given baseline
> + * time.  If so, it also updates the baseline to current time for next check.
> + *
> + * Return:     true if the time interval has passed, or false otherwise.
> + */
> +static bool damon_check_reset_time_interval(struct timespec64 *baseline,
> +               unsigned long interval)
> +{
> +       struct timespec64 now;
> +
> +       ktime_get_coarse_ts64(&now);
> +       if ((timespec64_to_ns(&now) - timespec64_to_ns(baseline)) <
> +                       interval * 1000)
> +               return false;
> +       *baseline = now;
> +       return true;
> +}
> +
> +/*
> + * Check whether it is time to flush the aggregated information
> + */
> +static bool kdamond_aggregate_interval_passed(struct damon_ctx *ctx)
> +{
> +       return damon_check_reset_time_interval(&ctx->last_aggregation,
> +                       ctx->aggr_interval);
> +}
> +
> +/*
> + * Check whether it is time to check and apply the target monitoring regions
> + *
> + * Returns true if it is.
> + */
> +static bool kdamond_need_update_regions(struct damon_ctx *ctx)
> +{
> +       return damon_check_reset_time_interval(&ctx->last_regions_update,
> +                       ctx->regions_update_interval);
> +}
> +
> +/*
> + * Check whether current monitoring should be stopped
> + *
> + * The monitoring is stopped when either the user requested to stop, or all
> + * monitoring targets are invalid.
> + *
> + * Returns true if need to stop current monitoring.
> + */
> +static bool kdamond_need_stop(struct damon_ctx *ctx)
> +{
> +       bool stop;
> +
> +       mutex_lock(&ctx->kdamond_lock);
> +       stop = ctx->kdamond_stop;
> +       mutex_unlock(&ctx->kdamond_lock);
> +       if (stop)
> +               return true;
> +
> +       if (!ctx->primitive.target_valid)
> +               return false;
> +
> +       return !ctx->primitive.target_valid(ctx->target);
> +}
> +
> +static void set_kdamond_stop(struct damon_ctx *ctx)
> +{
> +       mutex_lock(&ctx->kdamond_lock);
> +       ctx->kdamond_stop = true;
> +       mutex_unlock(&ctx->kdamond_lock);
> +}
> +
> +/*
> + * The monitoring daemon that runs as a kernel thread
> + */
> +static int kdamond_fn(void *data)
> +{
> +       struct damon_ctx *ctx = (struct damon_ctx *)data;
> +
> +       pr_info("kdamond (%d) starts\n", ctx->kdamond->pid);
> +
> +       if (ctx->primitive.init_target_regions)
> +               ctx->primitive.init_target_regions(ctx);
> +       if (ctx->callback.before_start && ctx->callback.before_start(ctx))
> +               set_kdamond_stop(ctx);
> +
> +       while (!kdamond_need_stop(ctx)) {
> +               if (ctx->primitive.prepare_access_checks)
> +                       ctx->primitive.prepare_access_checks(ctx);
> +               if (ctx->callback.after_sampling &&
> +                               ctx->callback.after_sampling(ctx))
> +                       set_kdamond_stop(ctx);

Is 'break' needed here? Or do we want to complete this iteration (same
for the set_kdamond_stop() below)?

> +
> +               usleep_range(ctx->sample_interval, ctx->sample_interval + 1);
> +
> +               if (ctx->primitive.check_accesses)
> +                       ctx->primitive.check_accesses(ctx);
> +
> +               if (kdamond_aggregate_interval_passed(ctx)) {
> +                       if (ctx->callback.after_aggregation &&
> +                                       ctx->callback.after_aggregation(ctx))
> +                               set_kdamond_stop(ctx);
> +                       if (ctx->primitive.reset_aggregated)
> +                               ctx->primitive.reset_aggregated(ctx);
> +               }
> +

The following 'if' can be done in ctx->primitive.update() callback.

> +               if (kdamond_need_update_regions(ctx)) {
> +                       if (ctx->primitive.update_target_regions)
> +                               ctx->primitive.update_target_regions(ctx);
> +               }
> +       }
> +
> +       if (ctx->callback.before_terminate &&
> +                       ctx->callback.before_terminate(ctx))
> +               set_kdamond_stop(ctx);
> +       if (ctx->primitive.cleanup)
> +               ctx->primitive.cleanup(ctx);
> +
> +       pr_debug("kdamond (%d) finishes\n", ctx->kdamond->pid);
> +       mutex_lock(&ctx->kdamond_lock);
> +       ctx->kdamond = NULL;
> +       mutex_unlock(&ctx->kdamond_lock);
> +
> +       mutex_lock(&damon_lock);
> +       nr_running_ctxs--;
> +       mutex_unlock(&damon_lock);
> +
> +       do_exit(0);
> +}
> --
> 2.17.1
>

Overall the patch looks good to me. Two concerns I have are if we
should damon_callback here or with the real user and the regions part
of primitive abstraction. For the first one, I don't have any strong
opinion but for the second one I do.

More specifically the question is if sampling and adaptive region
adjustment are general enough to be part of core monitoring context?
Can you give an example of a different primitive/use-case where these
would be beneficial.
SeongJae Park Dec. 23, 2020, 4:33 p.m. UTC | #2
Thanks for the valuable comments, Shakeel!

On Wed, 23 Dec 2020 07:11:12 -0800 Shakeel Butt <shakeelb@google.com> wrote:

> First I would like you to prune your To/CC list.

I will remove people not directly related with this work and didn't comment to
this series yet.

> 
> On Tue, Dec 15, 2020 at 3:56 AM SeongJae Park <sjpark@amazon.com> wrote:
> >
> > From: SeongJae Park <sjpark@amazon.de>
> >
> > DAMON is a data access monitoring framework for the Linux kernel.  The
> > core mechanisms of DAMON make it
> >
> >  - accurate (the monitoring output is useful enough for DRAM level
> >    performance-centric memory management; It might be inappropriate for
> >    CPU cache levels, though),
> >  - light-weight (the monitoring overhead is normally low enough to be
> >    applied online), and
> >  - scalable (the upper-bound of the overhead is in constant range
> >    regardless of the size of target workloads).
> >
> > Using this framework, hence, we can easily write efficient kernel space
> > data access monitoring applications.  For example, the kernel's memory
> > management mechanisms can make advanced decisions using this.
> > Experimental data access aware optimization works that incurring high
> > access monitoring overhead could implemented again on top of this.
> 
> *could again be implemented on*

Good eye!  I will fix this in the next version.

> 
> >
> > Due to its simple and flexible interface, providing user space interface
> > would be also easy.  Then, user space users who have some special
> > workloads can write personalized applications for better understanding
> > and optimizations of their workloads and systems.
> >
> > ---
> >
> > Nevertheless, this commit is defining and implementing only basic access
> > check part without the overhead-accuracy handling core logic.  The basic
> > access check is as below.
> >
> > The output of DAMON says what memory regions are how frequently accessed
> > for a given duration.  The resolution of the access frequency is
> > controlled by setting ``sampling interval`` and ``aggregation
> > interval``.  In detail, DAMON checks access to each page per ``sampling
> > interval`` and aggregates the results.  In other words, counts the
> > number of the accesses to each region.  After each ``aggregation
> > interval`` passes, DAMON calls callback functions that previously
> > registered by users so that users can read the aggregated results and
> > then clears the results.  This can be described in below simple
> > pseudo-code::
> >
> >     while monitoring_on:
> >         for page in monitoring_target:
> >             if accessed(page):
> >                 nr_accesses[page] += 1
> >         if time() % aggregation_interval == 0:
> >             for callback in user_registered_callbacks:
> >                 callback(monitoring_target, nr_accesses)
> >             for page in monitoring_target:
> >                 nr_accesses[page] = 0
> >         sleep(sampling interval)
> >
> 
> The above is a good example and I was hoping to see the almost same actual code.

Oh, this doesn't explain about the 'update'.  I will update this in the next
version.

> 
> > The target regions constructed at the beginning of the monitoring and
> > updated after each ``regions_update_interval``, because the target
> > regions could be dynamically changed (e.g., mmap() or memory hotplug).
> > The monitoring overhead of this mechanism will arbitrarily increase as
> > the size of the target workload grows.
> >
> > The basic monitoring primitives for actual access check and dynamic
> > target regions construction aren't in the core part of DAMON.  Instead,
> > it allows users to implement their own primitives that optimized for
> 
> 'that are optimized for'

Good catch!  I will fix this in the next version.

> 
> > their use case and configure DAMON to use those.  In other words, users
> > cannot use current version of DAMON without some additional works.
> >
> > Following commits will implement the core mechanisms for the
> > overhead-accuracy control and default primitives implementations.
> >
> > Signed-off-by: SeongJae Park <sjpark@amazon.de>
> > Reviewed-by: Leonard Foerster <foersleo@amazon.de>
> > ---
> >  include/linux/damon.h | 167 ++++++++++++++++++++++
> >  mm/Kconfig            |   2 +
> >  mm/Makefile           |   1 +
> >  mm/damon/Kconfig      |  15 ++
> >  mm/damon/Makefile     |   3 +
> >  mm/damon/core.c       | 316 ++++++++++++++++++++++++++++++++++++++++++
> >  6 files changed, 504 insertions(+)
> >  create mode 100644 include/linux/damon.h
> >  create mode 100644 mm/damon/Kconfig
> >  create mode 100644 mm/damon/Makefile
> >  create mode 100644 mm/damon/core.c
> >
> > diff --git a/include/linux/damon.h b/include/linux/damon.h
> > new file mode 100644
> > index 000000000000..387fa4399fc8
> > --- /dev/null
> > +++ b/include/linux/damon.h
> > @@ -0,0 +1,167 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * DAMON api
> > + *
> > + * Author: SeongJae Park <sjpark@amazon.de>
> > + */
> > +
> > +#ifndef _DAMON_H_
> > +#define _DAMON_H_
> > +
> > +#include <linux/mutex.h>
> > +#include <linux/time64.h>
> > +#include <linux/types.h>
> > +
> > +struct damon_ctx;
> > +
> > +/**
> > + * struct damon_primitive      Monitoring primitives for given use cases.
> > + *
> > + * @init_target_regions:       Constructs initial monitoring target regions.
> > + * @update_target_regions:     Updates monitoring target regions.
> 
> The 'region' here still does not feel right. I prefer to rename the
> above two to just init() and update().

Agreed.  I will rename so in the next version.

> 
> > + * @prepare_access_checks:     Prepares next access check of target regions.
> > + * @check_accesses:            Checks the access of target regions.
> > + * @reset_aggregated:          Resets aggregated accesses monitoring results.
> > + * @target_valid:              Determine if the target is valid.
> > + * @cleanup:                   Cleans up the context.
> > + *
> > + * DAMON can be extended for various address spaces and usages.  For this,
> > + * users should register the low level primitives for their target address
> > + * space and usecase via the &damon_ctx.primitive.  Then, the monitoring thread
> > + * calls @init_target_regions and @prepare_access_checks before starting the
> > + * monitoring, @update_target_regions after each
> > + * &damon_ctx.regions_update_interval, and @check_accesses, @target_valid and
> > + * @prepare_access_checks after each &damon_ctx.sample_interval.  Finally,
> > + * @reset_aggregated is called after each &damon_ctx.aggr_interval.
> > + *
> > + * @init_target_regions should construct proper monitoring target regions and
> > + * link those to the DAMON context struct.  The regions should be defined by
> > + * user and saved in @damon_ctx.target.
> > + * @update_target_regions should update the monitoring target regions for
> > + * current status.
> > + * @prepare_access_checks should manipulate the monitoring regions to be
> > + * prepared for the next access check.
> > + * @check_accesses should check the accesses to each region that made after the
> > + * last preparation and update the number of observed accesses of each region.
> > + * @reset_aggregated should reset the access monitoring results that aggregated
> > + * by @check_accesses.
> > + * @target_valid should check whether the target is still valid for the
> > + * monitoring.
> > + * @cleanup is called from @kdamond just before its termination.  After this
> > + * call, only @kdamond_lock and @kdamond will be touched.
> > + */
> > +struct damon_primitive {
> > +       void (*init_target_regions)(struct damon_ctx *context);
> > +       void (*update_target_regions)(struct damon_ctx *context);
> > +       void (*prepare_access_checks)(struct damon_ctx *context);
> > +       void (*check_accesses)(struct damon_ctx *context);
> > +       void (*reset_aggregated)(struct damon_ctx *context);
> > +       bool (*target_valid)(void *target);
> > +       void (*cleanup)(struct damon_ctx *context);
> > +};
> > +
> > +/*
> > + * struct damon_callback       Monitoring events notification callbacks.
> > + *
> > + * @before_start:      Called before starting the monitoring.
> > + * @after_sampling:    Called after each sampling.
> > + * @after_aggregation: Called after each aggregation.
> > + * @before_terminate:  Called before terminating the monitoring.
> > + * @private:           User private data.
> > + *
> > + * The monitoring thread (&damon_ctx->kdamond) calls @before_start and
> > + * @before_terminate just before starting and finishing the monitoring,
> > + * respectively.  Therefore, those are good places for installing and cleaning
> > + * @private.
> > + *
> > + * The monitoring thread calls @after_sampling and @after_aggregation for each
> > + * of the sampling intervals and aggregation intervals, respectively.
> > + * Therefore, users can safely access the monitoring results without additional
> > + * protection.  For the reason, users are recommended to use these callback for
> > + * the accesses to the results.
> > + *
> > + * If any callback returns non-zero, monitoring stops.
> 
> I am not sure if this is the right patch to add this struct. Either we
> need to add text why this is needed or add this patch when the real
> user i.e. debugfs interface is added.

I think this is better to be here to let people know how DAMON API users can
access to the monitoring result.

> 
> > + */
> > +struct damon_callback {
> > +       void *private;
> > +
> > +       int (*before_start)(struct damon_ctx *context);
> > +       int (*after_sampling)(struct damon_ctx *context);
> > +       int (*after_aggregation)(struct damon_ctx *context);
> > +       int (*before_terminate)(struct damon_ctx *context);
> > +};
> > +
> > +/**
> > + * struct damon_ctx - Represents a context for each monitoring.  This is the
> > + * main interface that allows users to set the attributes and get the results
> > + * of the monitoring.
> > + *
> > + * @sample_interval:           The time between access samplings.
> > + * @aggr_interval:             The time between monitor results aggregations.
> > + * @regions_update_interval:   The time between monitor regions updates.
> 
> regions_update_internal should be part of the primitive abstraction
> and the update() callback internally can check this field.

I think the field should be renamed to be independent from the 'region'
concept.

However, if we agree that the 'update()' is general concept (I assume
so because you didn't object to adding 'update()' callback), I believe this
should be here, because the interval is the part of the monitoring request.
Also, I unsure how the monitoring thread could know when to call 'update()'
callback if this field is not here.

[...]
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 390165ffbb0f..b97f2e8ab83f 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -859,4 +859,6 @@ config ARCH_HAS_HUGEPD
> >  config MAPPING_DIRTY_HELPERS
> >          bool
> >
> > +source "mm/damon/Kconfig"
> > +
> >  endmenu
> > diff --git a/mm/Makefile b/mm/Makefile
> > index d73aed0fc99c..8022b8f04096 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -120,3 +120,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
> >  obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
> >  obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
> >  obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
> > +obj-$(CONFIG_DAMON) += damon/
> > diff --git a/mm/damon/Kconfig b/mm/damon/Kconfig
> > new file mode 100644
> > index 000000000000..d00e99ac1a15
> > --- /dev/null
> > +++ b/mm/damon/Kconfig
> > @@ -0,0 +1,15 @@
> > +# SPDX-License-Identifier: GPL-2.0-only
> > +
> > +menu "Data Access Monitoring"
> > +
> > +config DAMON
> > +       bool "DAMON: Data Access Monitoring Framework"
> > +       help
> > +         This builds a framework that allows kernel subsystems to monitor
> > +         access frequency of each memory region. The information can be useful
> > +         for performance-centric DRAM level memory management.
> > +
> > +         See https://damonitor.github.io/doc/html/latest-damon/index.html for
> > +         more information.
> > +
> > +endmenu
> > diff --git a/mm/damon/Makefile b/mm/damon/Makefile
> > new file mode 100644
> > index 000000000000..4fd2edb4becf
> > --- /dev/null
> > +++ b/mm/damon/Makefile
> > @@ -0,0 +1,3 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +
> > +obj-$(CONFIG_DAMON)            := core.o
> > diff --git a/mm/damon/core.c b/mm/damon/core.c
> > new file mode 100644
> > index 000000000000..8963804efdf9
> > --- /dev/null
> > +++ b/mm/damon/core.c
[...]
> > +/*
> > + * The monitoring daemon that runs as a kernel thread
> > + */
> > +static int kdamond_fn(void *data)
> > +{
> > +       struct damon_ctx *ctx = (struct damon_ctx *)data;
> > +
> > +       pr_info("kdamond (%d) starts\n", ctx->kdamond->pid);
> > +
> > +       if (ctx->primitive.init_target_regions)
> > +               ctx->primitive.init_target_regions(ctx);
> > +       if (ctx->callback.before_start && ctx->callback.before_start(ctx))
> > +               set_kdamond_stop(ctx);
> > +
> > +       while (!kdamond_need_stop(ctx)) {
> > +               if (ctx->primitive.prepare_access_checks)
> > +                       ctx->primitive.prepare_access_checks(ctx);
> > +               if (ctx->callback.after_sampling &&
> > +                               ctx->callback.after_sampling(ctx))
> > +                       set_kdamond_stop(ctx);
> 
> Is 'break' needed here? Or do we want to complete this iteration (same
> for the set_kdamond_stop() below)?

My intention was to complete this iteration.  I think it could give slightly
simpler view to primitives and callbacks implementing people, because they can
assume callbacks in each iteration will be called.

> 
> > +
> > +               usleep_range(ctx->sample_interval, ctx->sample_interval + 1);
> > +
> > +               if (ctx->primitive.check_accesses)
> > +                       ctx->primitive.check_accesses(ctx);
> > +
> > +               if (kdamond_aggregate_interval_passed(ctx)) {
> > +                       if (ctx->callback.after_aggregation &&
> > +                                       ctx->callback.after_aggregation(ctx))
> > +                               set_kdamond_stop(ctx);
> > +                       if (ctx->primitive.reset_aggregated)
> > +                               ctx->primitive.reset_aggregated(ctx);
> > +               }
> > +
> 
> The following 'if' can be done in ctx->primitive.update() callback.

I think this could be common to multiple primitives, so I'd like to let it be
here to minimize code duplication.

> 
> > +               if (kdamond_need_update_regions(ctx)) {
> > +                       if (ctx->primitive.update_target_regions)
> > +                               ctx->primitive.update_target_regions(ctx);
> > +               }
> > +       }
> > +
> > +       if (ctx->callback.before_terminate &&
> > +                       ctx->callback.before_terminate(ctx))
> > +               set_kdamond_stop(ctx);
> > +       if (ctx->primitive.cleanup)
> > +               ctx->primitive.cleanup(ctx);
> > +
> > +       pr_debug("kdamond (%d) finishes\n", ctx->kdamond->pid);
> > +       mutex_lock(&ctx->kdamond_lock);
> > +       ctx->kdamond = NULL;
> > +       mutex_unlock(&ctx->kdamond_lock);
> > +
> > +       mutex_lock(&damon_lock);
> > +       nr_running_ctxs--;
> > +       mutex_unlock(&damon_lock);
> > +
> > +       do_exit(0);
> > +}
> > --
> > 2.17.1
> >
> 
> Overall the patch looks good to me. Two concerns I have are if we
> should damon_callback here or with the real user and the regions part
> of primitive abstraction. For the first one, I don't have any strong
> opinion but for the second one I do.

I'd like to keep 'damon_callback' part here, to let API users know how the
monitoring result will be available to them.

For the 'regions' part, I will rename relevant things as below in the next
version, to reduce any confusion.

init_target_regions() -> init()
update_target_regions() -> update()
regions_update_interval -> update_interval
last_regions_update -> last_update

> 
> More specifically the question is if sampling and adaptive region
> adjustment are general enough to be part of core monitoring context?
> Can you give an example of a different primitive/use-case where these
> would be beneficial.

I think all adress spaces having some spatial locality and monitoring requests
that need to have upper-bound overhead and best-effort accuracy could get
benefit from it.  The primitives targetting 'virtual address spaces' and the
'physical address space' clearly showed the benefit.
Shakeel Butt Dec. 23, 2020, 10:49 p.m. UTC | #3
On Wed, Dec 23, 2020 at 8:34 AM SeongJae Park <sjpark@amazon.com> wrote:
[snip]
> > Overall the patch looks good to me. Two concerns I have are if we
> > should damon_callback here or with the real user and the regions part
> > of primitive abstraction. For the first one, I don't have any strong
> > opinion but for the second one I do.
>
> I'd like to keep 'damon_callback' part here, to let API users know how the
> monitoring result will be available to them.
>
> For the 'regions' part, I will rename relevant things as below in the next
> version, to reduce any confusion.
>
> init_target_regions() -> init()
> update_target_regions() -> update()
> regions_update_interval -> update_interval
> last_regions_update -> last_update
>
> >
> > More specifically the question is if sampling and adaptive region
> > adjustment are general enough to be part of core monitoring context?
> > Can you give an example of a different primitive/use-case where these
> > would be beneficial.
>
> I think all adress spaces having some spatial locality and monitoring requests
> that need to have upper-bound overhead and best-effort accuracy could get
> benefit from it.  The primitives targetting 'virtual address spaces' and the
> 'physical address space' clearly showed the benefit.

I am still not much convinced on the 'physical address space' use-case
or the way you are presenting it. Anyways I think we start with what
you have and if in future there is a use-case where regions adjustment
does not make sense, we can change it then.
SeongJae Park Dec. 24, 2020, 7:02 a.m. UTC | #4
On Wed, 23 Dec 2020 14:49:57 -0800 Shakeel Butt <shakeelb@google.com> wrote:

> On Wed, Dec 23, 2020 at 8:34 AM SeongJae Park <sjpark@amazon.com> wrote:
> [snip]
> > > Overall the patch looks good to me. Two concerns I have are if we
> > > should damon_callback here or with the real user and the regions part
> > > of primitive abstraction. For the first one, I don't have any strong
> > > opinion but for the second one I do.
> >
> > I'd like to keep 'damon_callback' part here, to let API users know how the
> > monitoring result will be available to them.
> >
> > For the 'regions' part, I will rename relevant things as below in the next
> > version, to reduce any confusion.
> >
> > init_target_regions() -> init()
> > update_target_regions() -> update()
> > regions_update_interval -> update_interval
> > last_regions_update -> last_update
> >
> > >
> > > More specifically the question is if sampling and adaptive region
> > > adjustment are general enough to be part of core monitoring context?
> > > Can you give an example of a different primitive/use-case where these
> > > would be beneficial.
> >
> > I think all adress spaces having some spatial locality and monitoring requests
> > that need to have upper-bound overhead and best-effort accuracy could get
> > benefit from it.  The primitives targetting 'virtual address spaces' and the
> > 'physical address space' clearly showed the benefit.
> 
> I am still not much convinced on the 'physical address space' use-case
> or the way you are presenting it.

I understand the concern.  I also once thought the mechanism might not work
well for the physical address space because we cannot expect much spatial
locality in the space.  However, it turned out that there is some (temporal)
spatial locality that enough to make DAMON work reasonably well.  The word,
'reasonably well' might be controversial.  With the mechanism, DAMON provides
only 'best-effort' accuracy, rather than 100% accuracy.  Our goal is to make
the information accurate enough only for DRAM-centric optimizations.  I'd like
to also note that there are knobs that you can use to make minimum quality
higher (nr_min_regions) while setting the upperbound of the monitoring overhead
(nr_max_regions).   What I can say for now is that we ran DAMON for physical
address space of our production systems (shared detail in the 'Real-workd User
Story' section of coverletter[1]) and the result was reasonable enough to
convince the owner of the systems.

[1] https://lore.kernel.org/linux-mm/20201215115448.25633-1-sjpark@amazon.com/

> Anyways I think we start with what you have and if in future there is a
> use-case where regions adjustment does not make sense, we can change it then.

100% agreed, and thank you for understanding my argument.


Thanks,
SeongJae Park
diff mbox series

Patch

diff --git a/include/linux/damon.h b/include/linux/damon.h
new file mode 100644
index 000000000000..387fa4399fc8
--- /dev/null
+++ b/include/linux/damon.h
@@ -0,0 +1,167 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * DAMON api
+ *
+ * Author: SeongJae Park <sjpark@amazon.de>
+ */
+
+#ifndef _DAMON_H_
+#define _DAMON_H_
+
+#include <linux/mutex.h>
+#include <linux/time64.h>
+#include <linux/types.h>
+
+struct damon_ctx;
+
+/**
+ * struct damon_primitive	Monitoring primitives for given use cases.
+ *
+ * @init_target_regions:	Constructs initial monitoring target regions.
+ * @update_target_regions:	Updates monitoring target regions.
+ * @prepare_access_checks:	Prepares next access check of target regions.
+ * @check_accesses:		Checks the access of target regions.
+ * @reset_aggregated:		Resets aggregated accesses monitoring results.
+ * @target_valid:		Determine if the target is valid.
+ * @cleanup:			Cleans up the context.
+ *
+ * DAMON can be extended for various address spaces and usages.  For this,
+ * users should register the low level primitives for their target address
+ * space and usecase via the &damon_ctx.primitive.  Then, the monitoring thread
+ * calls @init_target_regions and @prepare_access_checks before starting the
+ * monitoring, @update_target_regions after each
+ * &damon_ctx.regions_update_interval, and @check_accesses, @target_valid and
+ * @prepare_access_checks after each &damon_ctx.sample_interval.  Finally,
+ * @reset_aggregated is called after each &damon_ctx.aggr_interval.
+ *
+ * @init_target_regions should construct proper monitoring target regions and
+ * link those to the DAMON context struct.  The regions should be defined by
+ * user and saved in @damon_ctx.target.
+ * @update_target_regions should update the monitoring target regions for
+ * current status.
+ * @prepare_access_checks should manipulate the monitoring regions to be
+ * prepared for the next access check.
+ * @check_accesses should check the accesses to each region that made after the
+ * last preparation and update the number of observed accesses of each region.
+ * @reset_aggregated should reset the access monitoring results that aggregated
+ * by @check_accesses.
+ * @target_valid should check whether the target is still valid for the
+ * monitoring.
+ * @cleanup is called from @kdamond just before its termination.  After this
+ * call, only @kdamond_lock and @kdamond will be touched.
+ */
+struct damon_primitive {
+	void (*init_target_regions)(struct damon_ctx *context);
+	void (*update_target_regions)(struct damon_ctx *context);
+	void (*prepare_access_checks)(struct damon_ctx *context);
+	void (*check_accesses)(struct damon_ctx *context);
+	void (*reset_aggregated)(struct damon_ctx *context);
+	bool (*target_valid)(void *target);
+	void (*cleanup)(struct damon_ctx *context);
+};
+
+/*
+ * struct damon_callback	Monitoring events notification callbacks.
+ *
+ * @before_start:	Called before starting the monitoring.
+ * @after_sampling:	Called after each sampling.
+ * @after_aggregation:	Called after each aggregation.
+ * @before_terminate:	Called before terminating the monitoring.
+ * @private:		User private data.
+ *
+ * The monitoring thread (&damon_ctx->kdamond) calls @before_start and
+ * @before_terminate just before starting and finishing the monitoring,
+ * respectively.  Therefore, those are good places for installing and cleaning
+ * @private.
+ *
+ * The monitoring thread calls @after_sampling and @after_aggregation for each
+ * of the sampling intervals and aggregation intervals, respectively.
+ * Therefore, users can safely access the monitoring results without additional
+ * protection.  For the reason, users are recommended to use these callback for
+ * the accesses to the results.
+ *
+ * If any callback returns non-zero, monitoring stops.
+ */
+struct damon_callback {
+	void *private;
+
+	int (*before_start)(struct damon_ctx *context);
+	int (*after_sampling)(struct damon_ctx *context);
+	int (*after_aggregation)(struct damon_ctx *context);
+	int (*before_terminate)(struct damon_ctx *context);
+};
+
+/**
+ * struct damon_ctx - Represents a context for each monitoring.  This is the
+ * main interface that allows users to set the attributes and get the results
+ * of the monitoring.
+ *
+ * @sample_interval:		The time between access samplings.
+ * @aggr_interval:		The time between monitor results aggregations.
+ * @regions_update_interval:	The time between monitor regions updates.
+ *
+ * For each @sample_interval, DAMON checks whether each region is accessed or
+ * not.  It aggregates and keeps the access information (number of accesses to
+ * each region) for @aggr_interval time.  DAMON also checks whether the target
+ * memory regions need update (e.g., by ``mmap()`` calls from the application,
+ * in case of virtual memory monitoring) and applies the changes for each
+ * @regions_update_interval.  All time intervals are in micro-seconds.  Please
+ * refer to &struct damon_primitive and &struct damon_callback for more detail.
+ *
+ * @kdamond:		Kernel thread who does the monitoring.
+ * @kdamond_stop:	Notifies whether kdamond should stop.
+ * @kdamond_lock:	Mutex for the synchronizations with @kdamond.
+ *
+ * For each monitoring context, one kernel thread for the monitoring is
+ * created.  The pointer to the thread is stored in @kdamond.
+ *
+ * Once started, the monitoring thread runs until explicitly required to be
+ * terminated or every monitoring target is invalid.  The validity of the
+ * targets is checked via the &damon_primitive.target_valid of @primitive.  The
+ * termination can also be explicitly requested by writing non-zero to
+ * @kdamond_stop.  The thread sets @kdamond to NULL when it terminates.
+ * Therefore, users can know whether the monitoring is ongoing or terminated by
+ * reading @kdamond.  Reads and writes to @kdamond and @kdamond_stop from
+ * outside of the monitoring thread must be protected by @kdamond_lock.
+ *
+ * Note that the monitoring thread protects only @kdamond and @kdamond_stop via
+ * @kdamond_lock.  Accesses to other fields must be protected by themselves.
+ *
+ * @primitive:	Set of monitoring primitives for given use cases.
+ * @callback:	Set of callbacks for monitoring events notifications.
+ *
+ * @target:	Pointer to the user-defined monitoring target.
+ */
+struct damon_ctx {
+	unsigned long sample_interval;
+	unsigned long aggr_interval;
+	unsigned long regions_update_interval;
+
+/* private */
+	struct timespec64 last_aggregation;
+	struct timespec64 last_regions_update;
+
+/* public */
+	struct task_struct *kdamond;
+	bool kdamond_stop;
+	struct mutex kdamond_lock;
+
+	struct damon_primitive primitive;
+	struct damon_callback callback;
+
+	void *target;
+};
+
+#ifdef CONFIG_DAMON
+
+struct damon_ctx *damon_new_ctx(void);
+void damon_destroy_ctx(struct damon_ctx *ctx);
+int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
+		unsigned long aggr_int, unsigned long regions_update_int);
+
+int damon_start(struct damon_ctx **ctxs, int nr_ctxs);
+int damon_stop(struct damon_ctx **ctxs, int nr_ctxs);
+
+#endif	/* CONFIG_DAMON */
+
+#endif	/* _DAMON_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 390165ffbb0f..b97f2e8ab83f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -859,4 +859,6 @@  config ARCH_HAS_HUGEPD
 config MAPPING_DIRTY_HELPERS
         bool
 
+source "mm/damon/Kconfig"
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index d73aed0fc99c..8022b8f04096 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -120,3 +120,4 @@  obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_DAMON) += damon/
diff --git a/mm/damon/Kconfig b/mm/damon/Kconfig
new file mode 100644
index 000000000000..d00e99ac1a15
--- /dev/null
+++ b/mm/damon/Kconfig
@@ -0,0 +1,15 @@ 
+# SPDX-License-Identifier: GPL-2.0-only
+
+menu "Data Access Monitoring"
+
+config DAMON
+	bool "DAMON: Data Access Monitoring Framework"
+	help
+	  This builds a framework that allows kernel subsystems to monitor
+	  access frequency of each memory region. The information can be useful
+	  for performance-centric DRAM level memory management.
+
+	  See https://damonitor.github.io/doc/html/latest-damon/index.html for
+	  more information.
+
+endmenu
diff --git a/mm/damon/Makefile b/mm/damon/Makefile
new file mode 100644
index 000000000000..4fd2edb4becf
--- /dev/null
+++ b/mm/damon/Makefile
@@ -0,0 +1,3 @@ 
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_DAMON)		:= core.o
diff --git a/mm/damon/core.c b/mm/damon/core.c
new file mode 100644
index 000000000000..8963804efdf9
--- /dev/null
+++ b/mm/damon/core.c
@@ -0,0 +1,316 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Data Access Monitor
+ *
+ * Author: SeongJae Park <sjpark@amazon.de>
+ */
+
+#define pr_fmt(fmt) "damon: " fmt
+
+#include <linux/damon.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <linux/slab.h>
+
+static DEFINE_MUTEX(damon_lock);
+static int nr_running_ctxs;
+
+struct damon_ctx *damon_new_ctx(void)
+{
+	struct damon_ctx *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return NULL;
+
+	ctx->sample_interval = 5 * 1000;
+	ctx->aggr_interval = 100 * 1000;
+	ctx->regions_update_interval = 1000 * 1000;
+
+	ktime_get_coarse_ts64(&ctx->last_aggregation);
+	ctx->last_regions_update = ctx->last_aggregation;
+
+	mutex_init(&ctx->kdamond_lock);
+
+	ctx->target = NULL;
+
+	return ctx;
+}
+
+void damon_destroy_ctx(struct damon_ctx *ctx)
+{
+	kfree(ctx);
+}
+
+/**
+ * damon_set_attrs() - Set attributes for the monitoring.
+ * @ctx:		monitoring context
+ * @sample_int:		time interval between samplings
+ * @aggr_int:		time interval between aggregations
+ * @regions_update_int:	time interval between target regions update
+ *
+ * This function should not be called while the kdamond is running.
+ * Every time interval is in micro-seconds.
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
+		    unsigned long aggr_int, unsigned long regions_update_int)
+{
+	ctx->sample_interval = sample_int;
+	ctx->aggr_interval = aggr_int;
+	ctx->regions_update_interval = regions_update_int;
+
+	return 0;
+}
+
+static bool damon_kdamond_running(struct damon_ctx *ctx)
+{
+	bool running;
+
+	mutex_lock(&ctx->kdamond_lock);
+	running = ctx->kdamond != NULL;
+	mutex_unlock(&ctx->kdamond_lock);
+
+	return running;
+}
+
+static int kdamond_fn(void *data);
+
+/*
+ * __damon_start() - Starts monitoring with given context.
+ * @ctx:	monitoring context
+ *
+ * This function should be called while damon_lock is hold.
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+static int __damon_start(struct damon_ctx *ctx)
+{
+	int err = -EBUSY;
+
+	mutex_lock(&ctx->kdamond_lock);
+	if (!ctx->kdamond) {
+		err = 0;
+		ctx->kdamond_stop = false;
+		ctx->kdamond = kthread_create(kdamond_fn, ctx, "kdamond.%d",
+				nr_running_ctxs);
+		if (IS_ERR(ctx->kdamond))
+			err = PTR_ERR(ctx->kdamond);
+		else
+			wake_up_process(ctx->kdamond);
+	}
+	mutex_unlock(&ctx->kdamond_lock);
+
+	return err;
+}
+
+/**
+ * damon_start() - Starts the monitorings for a given group of contexts.
+ * @ctxs:	an array of the pointers for contexts to start monitoring
+ * @nr_ctxs:	size of @ctxs
+ *
+ * This function starts a group of monitoring threads for a group of monitoring
+ * contexts.  One thread per each context is created and run in parallel.  The
+ * caller should handle synchronization between the threads by itself.  If a
+ * group of threads that created by other 'damon_start()' call is currently
+ * running, this function does nothing but returns -EBUSY.
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int damon_start(struct damon_ctx **ctxs, int nr_ctxs)
+{
+	int i;
+	int err = 0;
+
+	mutex_lock(&damon_lock);
+	if (nr_running_ctxs) {
+		mutex_unlock(&damon_lock);
+		return -EBUSY;
+	}
+
+	for (i = 0; i < nr_ctxs; i++) {
+		err = __damon_start(ctxs[i]);
+		if (err)
+			break;
+		nr_running_ctxs++;
+	}
+	mutex_unlock(&damon_lock);
+
+	return err;
+}
+
+/*
+ * __damon_stop() - Stops monitoring of given context.
+ * @ctx:	monitoring context
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+static int __damon_stop(struct damon_ctx *ctx)
+{
+	mutex_lock(&ctx->kdamond_lock);
+	if (ctx->kdamond) {
+		ctx->kdamond_stop = true;
+		mutex_unlock(&ctx->kdamond_lock);
+		while (damon_kdamond_running(ctx))
+			usleep_range(ctx->sample_interval,
+					ctx->sample_interval * 2);
+		return 0;
+	}
+	mutex_unlock(&ctx->kdamond_lock);
+
+	return -EPERM;
+}
+
+/**
+ * damon_stop() - Stops the monitorings for a given group of contexts.
+ * @ctxs:	an array of the pointers for contexts to stop monitoring
+ * @nr_ctxs:	size of @ctxs
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int damon_stop(struct damon_ctx **ctxs, int nr_ctxs)
+{
+	int i, err = 0;
+
+	for (i = 0; i < nr_ctxs; i++) {
+		/* nr_running_ctxs is decremented in kdamond_fn */
+		err = __damon_stop(ctxs[i]);
+		if (err)
+			return err;
+	}
+
+	return err;
+}
+
+/*
+ * damon_check_reset_time_interval() - Check if a time interval is elapsed.
+ * @baseline:	the time to check whether the interval has elapsed since
+ * @interval:	the time interval (microseconds)
+ *
+ * See whether the given time interval has passed since the given baseline
+ * time.  If so, it also updates the baseline to current time for next check.
+ *
+ * Return:	true if the time interval has passed, or false otherwise.
+ */
+static bool damon_check_reset_time_interval(struct timespec64 *baseline,
+		unsigned long interval)
+{
+	struct timespec64 now;
+
+	ktime_get_coarse_ts64(&now);
+	if ((timespec64_to_ns(&now) - timespec64_to_ns(baseline)) <
+			interval * 1000)
+		return false;
+	*baseline = now;
+	return true;
+}
+
+/*
+ * Check whether it is time to flush the aggregated information
+ */
+static bool kdamond_aggregate_interval_passed(struct damon_ctx *ctx)
+{
+	return damon_check_reset_time_interval(&ctx->last_aggregation,
+			ctx->aggr_interval);
+}
+
+/*
+ * Check whether it is time to check and apply the target monitoring regions
+ *
+ * Returns true if it is.
+ */
+static bool kdamond_need_update_regions(struct damon_ctx *ctx)
+{
+	return damon_check_reset_time_interval(&ctx->last_regions_update,
+			ctx->regions_update_interval);
+}
+
+/*
+ * Check whether current monitoring should be stopped
+ *
+ * The monitoring is stopped when either the user requested to stop, or all
+ * monitoring targets are invalid.
+ *
+ * Returns true if need to stop current monitoring.
+ */
+static bool kdamond_need_stop(struct damon_ctx *ctx)
+{
+	bool stop;
+
+	mutex_lock(&ctx->kdamond_lock);
+	stop = ctx->kdamond_stop;
+	mutex_unlock(&ctx->kdamond_lock);
+	if (stop)
+		return true;
+
+	if (!ctx->primitive.target_valid)
+		return false;
+
+	return !ctx->primitive.target_valid(ctx->target);
+}
+
+static void set_kdamond_stop(struct damon_ctx *ctx)
+{
+	mutex_lock(&ctx->kdamond_lock);
+	ctx->kdamond_stop = true;
+	mutex_unlock(&ctx->kdamond_lock);
+}
+
+/*
+ * The monitoring daemon that runs as a kernel thread
+ */
+static int kdamond_fn(void *data)
+{
+	struct damon_ctx *ctx = (struct damon_ctx *)data;
+
+	pr_info("kdamond (%d) starts\n", ctx->kdamond->pid);
+
+	if (ctx->primitive.init_target_regions)
+		ctx->primitive.init_target_regions(ctx);
+	if (ctx->callback.before_start && ctx->callback.before_start(ctx))
+		set_kdamond_stop(ctx);
+
+	while (!kdamond_need_stop(ctx)) {
+		if (ctx->primitive.prepare_access_checks)
+			ctx->primitive.prepare_access_checks(ctx);
+		if (ctx->callback.after_sampling &&
+				ctx->callback.after_sampling(ctx))
+			set_kdamond_stop(ctx);
+
+		usleep_range(ctx->sample_interval, ctx->sample_interval + 1);
+
+		if (ctx->primitive.check_accesses)
+			ctx->primitive.check_accesses(ctx);
+
+		if (kdamond_aggregate_interval_passed(ctx)) {
+			if (ctx->callback.after_aggregation &&
+					ctx->callback.after_aggregation(ctx))
+				set_kdamond_stop(ctx);
+			if (ctx->primitive.reset_aggregated)
+				ctx->primitive.reset_aggregated(ctx);
+		}
+
+		if (kdamond_need_update_regions(ctx)) {
+			if (ctx->primitive.update_target_regions)
+				ctx->primitive.update_target_regions(ctx);
+		}
+	}
+
+	if (ctx->callback.before_terminate &&
+			ctx->callback.before_terminate(ctx))
+		set_kdamond_stop(ctx);
+	if (ctx->primitive.cleanup)
+		ctx->primitive.cleanup(ctx);
+
+	pr_debug("kdamond (%d) finishes\n", ctx->kdamond->pid);
+	mutex_lock(&ctx->kdamond_lock);
+	ctx->kdamond = NULL;
+	mutex_unlock(&ctx->kdamond_lock);
+
+	mutex_lock(&damon_lock);
+	nr_running_ctxs--;
+	mutex_unlock(&damon_lock);
+
+	do_exit(0);
+}