Message ID: 20210621083108.17589-2-sj38.park@gmail.com (mailing list archive)
State: New
Series: Introduce Data Access MONitor (DAMON)
On Mon, Jun 21, 2021 at 1:31 AM SeongJae Park <sj38.park@gmail.com> wrote:
>
> From: SeongJae Park <sjpark@amazon.de>
>
> DAMON is a data access monitoring framework for the Linux kernel. The
> core mechanisms of DAMON make it
>
>  - accurate (the monitoring output is useful enough for DRAM level
>    performance-centric memory management; it might be inappropriate for
>    CPU cache levels, though),
>  - light-weight (the monitoring overhead is normally low enough to be
>    applied online), and
>  - scalable (the upper-bound of the overhead is in constant range
>    regardless of the size of target workloads).
>
> Using this framework, hence, we can easily write efficient kernel space
> data access monitoring applications. For example, the kernel's memory
> management mechanisms can make advanced decisions using this.
> Experimental data access-aware optimization works that incur high
> access monitoring overhead could again be implemented on top of this.
>
> Due to its simple and flexible interface, providing a user space
> interface would also be easy. Then, user space users who have some
> special workloads can write personalized applications for better
> understanding and optimizations of their workloads and systems.
>
> ===
>
> Nevertheless, this commit defines and implements only the basic access
> check part, without the overhead-accuracy handling core logic. The
> basic access check works as below.
>
> The output of DAMON shows which memory regions are accessed how
> frequently for a given duration. The resolution of the access frequency
> is controlled by setting the ``sampling interval`` and ``aggregation
> interval``. In detail, DAMON checks access to each page per ``sampling
> interval`` and aggregates the results. In other words, it counts the
> number of the accesses to each region. After each ``aggregation
> interval`` passes, DAMON calls callback functions that were previously
> registered by users so that users can read the aggregated results, and
> then clears the results. This can be described in the below simple
> pseudo-code::
>
>     init()
>     while monitoring_on:
>         for page in monitoring_target:
>             if accessed(page):
>                 nr_accesses[page] += 1
>         if time() % aggregation_interval == 0:
>             for callback in user_registered_callbacks:
>                 callback(monitoring_target, nr_accesses)
>             for page in monitoring_target:
>                 nr_accesses[page] = 0
>         if time() % update_interval == 0:

regions_update_interval?

>             update()
>         sleep(sampling interval)
>
> The target regions are constructed at the beginning of the monitoring
> and updated after each ``regions_update_interval``, because the target
> regions could be dynamically changed (e.g., mmap() or memory hotplug).
> The monitoring overhead of this mechanism will arbitrarily increase as
> the size of the target workload grows.
>
> The basic monitoring primitives for actual access check and dynamic
> target regions construction aren't in the core part of DAMON. Instead,
> it allows users to implement their own primitives that are optimized
> for their use case and configure DAMON to use those. In other words,
> users cannot use the current version of DAMON without some additional
> work.
>
> Following commits will implement the core mechanisms for the
> overhead-accuracy control and default primitives implementations.
>
> Signed-off-by: SeongJae Park <sjpark@amazon.de>
> Reviewed-by: Leonard Foerster <foersleo@amazon.de>
> Reviewed-by: Fernand Sieber <sieberf@amazon.com>

A few nits below, otherwise looks good to me. You can add:

Acked-by: Shakeel Butt <shakeelb@google.com>

[...]

> +/*
> + * __damon_start() - Starts monitoring with given context.
> + * @ctx:	monitoring context
> + *
> + * This function should be called while damon_lock is held.
> + *
> + * Return: 0 on success, negative error code otherwise.
> + */
> +static int __damon_start(struct damon_ctx *ctx)
> +{
> +	int err = -EBUSY;
> +
> +	mutex_lock(&ctx->kdamond_lock);
> +	if (!ctx->kdamond) {
> +		err = 0;
> +		ctx->kdamond_stop = false;
> +		ctx->kdamond = kthread_create(kdamond_fn, ctx, "kdamond.%d",
> +				nr_running_ctxs);
> +		if (IS_ERR(ctx->kdamond))
> +			err = PTR_ERR(ctx->kdamond);
> +		else
> +			wake_up_process(ctx->kdamond);

Nit: You can use kthread_run() here.

> +	}
> +	mutex_unlock(&ctx->kdamond_lock);
> +
> +	return err;
> +}
> +

[...]

> +static int __damon_stop(struct damon_ctx *ctx)
> +{
> +	mutex_lock(&ctx->kdamond_lock);
> +	if (ctx->kdamond) {
> +		ctx->kdamond_stop = true;
> +		mutex_unlock(&ctx->kdamond_lock);
> +		while (damon_kdamond_running(ctx))
> +			usleep_range(ctx->sample_interval,
> +					ctx->sample_interval * 2);

Any reason to not use kthread_stop() here?
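For reference, kthread_run() is simply kthread_create() followed by an
immediate wake_up_process(), so the nit collapses the create-then-wake pair
into one call. A minimal sketch of the suggested change, assuming the
surrounding locking stays as posted (resetting ctx->kdamond to NULL on
failure is an extra detail of this sketch, not part of the posted patch)::

    mutex_lock(&ctx->kdamond_lock);
    if (!ctx->kdamond) {
            err = 0;
            ctx->kdamond_stop = false;
            /* kthread_run() == kthread_create() + wake_up_process() */
            ctx->kdamond = kthread_run(kdamond_fn, ctx, "kdamond.%d",
                            nr_running_ctxs);
            if (IS_ERR(ctx->kdamond)) {
                    err = PTR_ERR(ctx->kdamond);
                    ctx->kdamond = NULL;
            }
    }
    mutex_unlock(&ctx->kdamond_lock);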
From: SeongJae Park <sjpark@amazon.de>

On Tue, 22 Jun 2021 07:59:11 -0700 Shakeel Butt <shakeelb@google.com> wrote:

> On Mon, Jun 21, 2021 at 1:31 AM SeongJae Park <sj38.park@gmail.com> wrote:
> >
> > From: SeongJae Park <sjpark@amazon.de>
> >
> > DAMON is a data access monitoring framework for the Linux kernel. The
> > core mechanisms of DAMON make it
> >
> >  - accurate (the monitoring output is useful enough for DRAM level
> >    performance-centric memory management; it might be inappropriate for
> >    CPU cache levels, though),
> >  - light-weight (the monitoring overhead is normally low enough to be
> >    applied online), and
> >  - scalable (the upper-bound of the overhead is in constant range
> >    regardless of the size of target workloads).
> >
> > Using this framework, hence, we can easily write efficient kernel space
> > data access monitoring applications. For example, the kernel's memory
> > management mechanisms can make advanced decisions using this.
> > Experimental data access-aware optimization works that incur high
> > access monitoring overhead could again be implemented on top of this.
> >
> > Due to its simple and flexible interface, providing a user space
> > interface would also be easy. Then, user space users who have some
> > special workloads can write personalized applications for better
> > understanding and optimizations of their workloads and systems.
> >
> > ===
> >
> > Nevertheless, this commit defines and implements only the basic access
> > check part, without the overhead-accuracy handling core logic. The
> > basic access check works as below.
> >
> > The output of DAMON shows which memory regions are accessed how
> > frequently for a given duration. The resolution of the access frequency
> > is controlled by setting the ``sampling interval`` and ``aggregation
> > interval``. In detail, DAMON checks access to each page per ``sampling
> > interval`` and aggregates the results. In other words, it counts the
> > number of the accesses to each region. After each ``aggregation
> > interval`` passes, DAMON calls callback functions that were previously
> > registered by users so that users can read the aggregated results, and
> > then clears the results. This can be described in the below simple
> > pseudo-code::
> >
> >     init()
> >     while monitoring_on:
> >         for page in monitoring_target:
> >             if accessed(page):
> >                 nr_accesses[page] += 1
> >         if time() % aggregation_interval == 0:
> >             for callback in user_registered_callbacks:
> >                 callback(monitoring_target, nr_accesses)
> >             for page in monitoring_target:
> >                 nr_accesses[page] = 0
> >         if time() % update_interval == 0:
>
> regions_update_interval?

It used that name before. But I changed the name in this way to use it
for general periodic updates of the monitoring primitives. Of course,
we could use the specific name only in this specific example, but I
also want to keep this as similar to the actual code as possible.

If you strongly want to rename this, please feel free to let me know.

> >             update()
> >         sleep(sampling interval)
> >
> > The target regions are constructed at the beginning of the monitoring
> > and updated after each ``regions_update_interval``, because the target
> > regions could be dynamically changed (e.g., mmap() or memory hotplug).
> > The monitoring overhead of this mechanism will arbitrarily increase as
> > the size of the target workload grows.
> >
> > The basic monitoring primitives for actual access check and dynamic
> > target regions construction aren't in the core part of DAMON. Instead,
> > it allows users to implement their own primitives that are optimized
> > for their use case and configure DAMON to use those. In other words,
> > users cannot use the current version of DAMON without some additional
> > work.
> >
> > Following commits will implement the core mechanisms for the
> > overhead-accuracy control and default primitives implementations.
> >
> > Signed-off-by: SeongJae Park <sjpark@amazon.de>
> > Reviewed-by: Leonard Foerster <foersleo@amazon.de>
> > Reviewed-by: Fernand Sieber <sieberf@amazon.com>
>
> A few nits below, otherwise looks good to me. You can add:
>
> Acked-by: Shakeel Butt <shakeelb@google.com>

Thank you!

> [...]
> > +/*
> > + * __damon_start() - Starts monitoring with given context.
> > + * @ctx:	monitoring context
> > + *
> > + * This function should be called while damon_lock is held.
> > + *
> > + * Return: 0 on success, negative error code otherwise.
> > + */
> > +static int __damon_start(struct damon_ctx *ctx)
> > +{
> > +	int err = -EBUSY;
> > +
> > +	mutex_lock(&ctx->kdamond_lock);
> > +	if (!ctx->kdamond) {
> > +		err = 0;
> > +		ctx->kdamond_stop = false;
> > +		ctx->kdamond = kthread_create(kdamond_fn, ctx, "kdamond.%d",
> > +				nr_running_ctxs);
> > +		if (IS_ERR(ctx->kdamond))
> > +			err = PTR_ERR(ctx->kdamond);
> > +		else
> > +			wake_up_process(ctx->kdamond);
>
> Nit: You can use kthread_run() here.

Ok, I will use that from the next spin.

> > +	}
> > +	mutex_unlock(&ctx->kdamond_lock);
> > +
> > +	return err;
> > +}
> > +
> [...]
> > +static int __damon_stop(struct damon_ctx *ctx)
> > +{
> > +	mutex_lock(&ctx->kdamond_lock);
> > +	if (ctx->kdamond) {
> > +		ctx->kdamond_stop = true;
> > +		mutex_unlock(&ctx->kdamond_lock);
> > +		while (damon_kdamond_running(ctx))
> > +			usleep_range(ctx->sample_interval,
> > +					ctx->sample_interval * 2);
>
> Any reason to not use kthread_stop() here?

Using 'kthread_stop()' here would make the code much simpler. But
'kdamond' also stops itself when all monitoring targets become invalid
(e.g., all monitoring target processes terminated). However,
'kthread_stop()' is not easy to use for that case (self-stopping). It's
of course possible, but it would make the code longer. That's why I use
the 'kdamond_stop' flag here. So, I'd like to leave this as is. If you
think 'kthread_stop()' should be used, please feel free to let me know.

Thanks,
SeongJae Park
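For context, the kthread_stop() alternative being hinted at would look
roughly like the sketch below (a hypothetical rework, not what the patch
does); kthread_should_stop() would replace the private flag. The catch
SeongJae describes is the self-stop path: kthread_stop() is called from
outside the thread, so a kdamond that notices its targets are gone cannot
just exit without coordinating with the stopper side::

    static int kdamond_fn(void *data)
    {
            struct damon_ctx *ctx = data;

            /* kthread_should_stop() replaces the kdamond_stop flag */
            while (!kthread_should_stop() &&
                            ctx->primitive.target_valid(ctx->target)) {
                    /* ... sample, aggregate, update ... */
            }
            /*
             * Self-stop case: the exit paths would have to be
             * coordinated so the external stopper knows whether
             * kthread_stop() is still appropriate -- the extra code
             * the flag-based approach avoids.
             */
            return 0;
    }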
From: SeongJae Park <sjpark@amazon.de>

On Tue, 22 Jun 2021 07:59:35 -0700 Shakeel Butt <shakeelb@google.com> wrote:

> On Mon, Jun 21, 2021 at 1:31 AM SeongJae Park <sj38.park@gmail.com> wrote:
> >
> > From: SeongJae Park <sjpark@amazon.de>
> >
> > Even if the initial monitoring target regions are somehow well
> > constructed to fulfill the assumption (pages in the same region have
> > similar access frequencies), the data access pattern can be dynamically
> > changed. This will result in low monitoring quality. To keep the
> > assumption as much as possible, DAMON adaptively merges and splits each
> > region based on their access frequency.
> >
> > For each ``aggregation interval``, it compares the access frequencies
> > of adjacent regions and merges those if the frequency difference is
> > small. Then, after it reports and clears the aggregated access
> > frequency of each region, it splits each region into two or three
> > regions if the total number of regions will not exceed the
> > user-specified maximum number of regions after the split.
> >
> > In this way, DAMON provides its best-effort quality and minimal
> > overhead while keeping the upper-bound overhead that users set.
> >
> > Signed-off-by: SeongJae Park <sjpark@amazon.de>
> > Reviewed-by: Leonard Foerster <foersleo@amazon.de>
> > Reviewed-by: Fernand Sieber <sieberf@amazon.com>
> [...]
>
> > +unsigned int damon_nr_regions(struct damon_target *t)
> > +{
> > +	struct damon_region *r;
> > +	unsigned int nr_regions = 0;
> > +
> > +	damon_for_each_region(r, t)
> > +		nr_regions++;
>
> This bugs me every time. Please just have an nr_regions field in the
> damon_target instead of traversing the list to count the number of
> regions.

Ok, I will make the change in the next spin.

> Other than that, it looks good to me.

Thanks,
SeongJae Park
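Shakeel's suggestion amounts to caching the count and keeping it in sync in
the add/remove helpers, turning damon_nr_regions() into O(1). A minimal
sketch, assuming the damon_target layout of this series and hypothetical
helper names alongside the existing regions_list::

    struct damon_target {
            unsigned long id;
            unsigned int nr_regions;        /* cached region count */
            struct list_head regions_list;
            struct list_head list;
    };

    static void damon_add_region(struct damon_region *r,
                    struct damon_target *t)
    {
            list_add_tail(&r->list, &t->regions_list);
            t->nr_regions++;
    }

    static void damon_del_region(struct damon_region *r,
                    struct damon_target *t)
    {
            list_del(&r->list);
            t->nr_regions--;
    }

    unsigned int damon_nr_regions(struct damon_target *t)
    {
            return t->nr_regions;   /* O(1) instead of a list walk */
    }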
From: SeongJae Park <sjpark@amazon.de>

On Tue, 22 Jun 2021 08:00:58 -0700 Shakeel Butt <shakeelb@google.com> wrote:

> On Mon, Jun 21, 2021 at 1:31 AM SeongJae Park <sj38.park@gmail.com> wrote:
> >
> > From: SeongJae Park <sjpark@amazon.de>
> >
> > This commit introduces a reference implementation of the address space
> > specific low level primitives for the virtual address space, so that
> > users of DAMON can easily monitor the data accesses on virtual address
> > spaces of specific processes by simply configuring the implementation
> > to be used by DAMON.
> >
> > The low level primitives for the fundamental access monitoring are
> > defined in two parts:
> >
> > 1. Identification of the monitoring target address range for the
> >    address space.
> > 2. Access check of specific address range in the target space.
> >
> > The reference implementation for the virtual address space works as
> > below.
> >
> > PTE Accessed-bit Based Access Check
> > -----------------------------------
> >
> > The implementation uses the PTE Accessed-bit for basic access checks.
> > That is, it clears the bit for the next sampling target page and checks
> > whether it is set again after one sampling period. This could disturb
> > the reclaim logic. DAMON uses the ``PG_idle`` and ``PG_young`` page
> > flags to solve the conflict, as Idle page tracking does.
> >
> > VMA-based Target Address Range Construction
> > -------------------------------------------
> >
> > Only small parts in the super-huge virtual address space of the
> > processes are mapped to physical memory and accessed. Thus, tracking
> > the unmapped address regions is just wasteful. However, because DAMON
> > can deal with some level of noise using the adaptive regions adjustment
> > mechanism, tracking every mapping is not strictly required but could
> > even incur a high overhead in some cases. That said, too-huge unmapped
> > areas inside the monitoring target should be removed so that the
> > adaptive mechanism does not spend time on them.
> >
> > For that reason, this implementation converts the complex mappings to
> > three distinct regions that cover every mapped area of the address
> > space. Also, the two gaps between the three regions are the two biggest
> > unmapped areas in the given address space. The two biggest unmapped
> > areas would be the gap between the heap and the uppermost mmap()-ed
> > region, and the gap between the lowermost mmap()-ed region and the
> > stack in most cases. Because these gaps are exceptionally huge in usual
> > address spaces, excluding these will be sufficient to make a reasonable
> > trade-off. Below shows this in detail::
> >
> >     <heap>
> >     <BIG UNMAPPED REGION 1>
> >     <uppermost mmap()-ed region>
> >     (small mmap()-ed regions and munmap()-ed regions)
> >     <lowermost mmap()-ed region>
> >     <BIG UNMAPPED REGION 2>
> >     <stack>
> >
> > Signed-off-by: SeongJae Park <sjpark@amazon.de>
> > Reviewed-by: Leonard Foerster <foersleo@amazon.de>
> > Reviewed-by: Fernand Sieber <sieberf@amazon.com>
>
> A couple of nits below and one concern about the default value of
> primitive_update_interval of the virtual address space primitive.
> Otherwise looks good to me.

Thank you!

> [...]
> > +
> > +/*
> > + * Size-evenly split a region into 'nr_pieces' small regions
> > + *
> > + * Returns 0 on success, or negative error code otherwise.
> > + */
> > +static int damon_va_evenly_split_region(struct damon_ctx *ctx,
>
> I don't see ctx being used in this function.

Good point, will remove that from the next spin.

> > +		struct damon_region *r, unsigned int nr_pieces)
> > +{
> > +	unsigned long sz_orig, sz_piece, orig_end;
> > +	struct damon_region *n = NULL, *next;
> > +	unsigned long start;
> > +
> > +	if (!r || !nr_pieces)
> > +		return -EINVAL;
> > +
> > +	orig_end = r->ar.end;
> > +	sz_orig = r->ar.end - r->ar.start;
> > +	sz_piece = ALIGN_DOWN(sz_orig / nr_pieces, DAMON_MIN_REGION);
> > +
> > +	if (!sz_piece)
> > +		return -EINVAL;
> > +
> > +	r->ar.end = r->ar.start + sz_piece;
> > +	next = damon_next_region(r);
> > +	for (start = r->ar.end; start + sz_piece <= orig_end;
> > +			start += sz_piece) {
> > +		n = damon_new_region(start, start + sz_piece);
> > +		if (!n)
> > +			return -ENOMEM;
> > +		damon_insert_region(n, r, next);
> > +		r = n;
> > +	}
> > +	/* complement last region for possible rounding error */
> > +	if (n)
> > +		n->ar.end = orig_end;
> > +
> > +	return 0;
> > +}
> [...]
> > +/*
> > + * Get the three regions in the given target (task)
> > + *
> > + * Returns 0 on success, negative error code otherwise.
> > + */
> > +static int damon_va_three_regions(struct damon_target *t,
> > +				struct damon_addr_range regions[3])
> > +{
> > +	struct mm_struct *mm;
> > +	int rc;
> > +
> > +	mm = damon_get_mm(t);
> > +	if (!mm)
> > +		return -EINVAL;
> > +
> > +	mmap_read_lock(mm);
> > +	rc = __damon_va_three_regions(mm->mmap, regions);
> > +	mmap_read_unlock(mm);
>
> This is being called for each target every second by default. Seems
> too aggressive. Applications don't change their address space every
> second. I would recommend defaulting ctx->primitive_update_interval to
> a higher value.

Good point. If there are many targets and each target has a huge number
of VMAs, the overhead could be high. Nevertheless, I couldn't find the
overhead in my test setup. Also, it seems some people have already
started exploring the DAMON patchset with the default value, and there
could be more usages from others. Silently changing the default value
could distract such people. So, if you think it's ok, I'd like to change
the default value only after someone finds the overhead from their usage
and asks for a change.

If you disagree, or you have found the overhead in your usage, please
feel free to let me know.

> > +
> > +	mmput(mm);
> > +	return rc;
> > +}
> > +
> [...]
> > +static void __damon_va_init_regions(struct damon_ctx *c,
>
> Keep the convention of naming damon_ctx ctx.

Ok, I will do so from the next spin.

> > +		struct damon_target *t)
> > +{
> > +	struct damon_region *r;
> > +	struct damon_addr_range regions[3];
> > +	unsigned long sz = 0, nr_pieces;
> > +	int i;
> > +
> > +	if (damon_va_three_regions(t, regions)) {
> > +		pr_err("Failed to get three regions of target %lu\n", t->id);
> > +		return;
> > +	}
> > +
> > +	for (i = 0; i < 3; i++)
> > +		sz += regions[i].end - regions[i].start;
> > +	if (c->min_nr_regions)
> > +		sz /= c->min_nr_regions;
> > +	if (sz < DAMON_MIN_REGION)
> > +		sz = DAMON_MIN_REGION;
> > +
> > +	/* Set the initial three regions of the target */
> > +	for (i = 0; i < 3; i++) {
> > +		r = damon_new_region(regions[i].start, regions[i].end);
> > +		if (!r) {
> > +			pr_err("%d'th init region creation failed\n", i);
> > +			return;
> > +		}
> > +		damon_add_region(r, t);
> > +
> > +		nr_pieces = (regions[i].end - regions[i].start) / sz;
> > +		damon_va_evenly_split_region(c, r, nr_pieces);
> > +	}
> > +}
> [...]
> > +/*
> > + * Update damon regions for the three big regions of the given target
> > + *
> > + * t		the given target
> > + * bregions	the three big regions of the target
> > + */
> > +static void damon_va_apply_three_regions(struct damon_ctx *ctx,
>
> ctx not used in this function.

Good eye, will remove that from the next version.

> > +		struct damon_target *t, struct damon_addr_range bregions[3])
> > +{
> > +	struct damon_region *r, *next;
> > +	unsigned int i = 0;
> > +
> > +	/* Remove regions which are not in the three big regions now */
> > +	damon_for_each_region_safe(r, next, t) {
> > +		for (i = 0; i < 3; i++) {
> > +			if (damon_intersect(r, &bregions[i]))
> > +				break;
> > +		}
> > +		if (i == 3)
> > +			damon_destroy_region(r);
> > +	}
> > +
> > +	/* Adjust intersecting regions to fit with the three big regions */
> > +	for (i = 0; i < 3; i++) {
> > +		struct damon_region *first = NULL, *last;
> > +		struct damon_region *newr;
> > +		struct damon_addr_range *br;
> > +
> > +		br = &bregions[i];
> > +		/* Get the first and last regions which intersect with br */
> > +		damon_for_each_region(r, t) {
> > +			if (damon_intersect(r, br)) {
> > +				if (!first)
> > +					first = r;
> > +				last = r;
> > +			}
> > +			if (r->ar.start >= br->end)
> > +				break;
> > +		}
> > +		if (!first) {
> > +			/* no damon_region intersects with this big region */
> > +			newr = damon_new_region(
> > +					ALIGN_DOWN(br->start,
> > +						DAMON_MIN_REGION),
> > +					ALIGN(br->end, DAMON_MIN_REGION));
> > +			if (!newr)
> > +				continue;
> > +			damon_insert_region(newr, damon_prev_region(r), r);
> > +		} else {
> > +			first->ar.start = ALIGN_DOWN(br->start,
> > +					DAMON_MIN_REGION);
> > +			last->ar.end = ALIGN(br->end, DAMON_MIN_REGION);
> > +		}
> > +	}
> > +}

Thanks,
SeongJae Park
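The quoted code leans on a small range-overlap helper, damon_intersect(),
that the excerpt does not show. A plausible definition, given the
damon_region and damon_addr_range types used above (half-open
[start, end) ranges)::

    static bool damon_intersect(struct damon_region *r,
                    struct damon_addr_range *re)
    {
            /* two half-open ranges overlap unless they are disjoint */
            return !(r->ar.end <= re->start || re->end <= r->ar.start);
    }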
From: SeongJae Park <sjpark@amazon.de>

On Tue, 22 Jun 2021 11:12:36 -0700 Shakeel Butt <shakeelb@google.com> wrote:

> On Mon, Jun 21, 2021 at 1:31 AM SeongJae Park <sj38.park@gmail.com> wrote:
> >
> > From: SeongJae Park <sjpark@amazon.de>
> >
> > DAMON is designed to be used by kernel space code such as the memory
> > management subsystems, and therefore it provides only a kernel space
> > API. That said, letting the user space control DAMON could provide
> > some benefits to them. For example, it will allow user space to
> > analyze their specific workloads and make their own special
> > optimizations.
> >
> > For such cases, this commit implements a simple DAMON application
> > kernel module, namely 'damon-dbgfs', which merely wraps the DAMON api
> > and exports it to the user space via the debugfs.
> >
> > 'damon-dbgfs' exports three files, ``attrs``, ``target_ids``, and
> > ``monitor_on`` under its debugfs directory, ``<debugfs>/damon/``.
> >
> > Attributes
> > ----------
> >
> > Users can read and write the ``sampling interval``, ``aggregation
> > interval``, ``regions update interval``, and min/max number of
> > monitoring target regions by reading from and writing to the ``attrs``
> > file. For example, below commands set those values to 5 ms, 100 ms,
> > 1,000 ms, 10, 1000 and check it again::
> >
> >     # cd <debugfs>/damon
> >     # echo 5000 100000 1000000 10 1000 > attrs
> >     # cat attrs
> >     5000 100000 1000000 10 1000
> >
> > Target IDs
> > ----------
> >
> > Some types of address spaces support multiple monitoring targets. For
> > example, the virtual memory address spaces monitoring can have
> > multiple processes as the monitoring targets. Users can set the
> > targets by writing relevant id values of the targets to, and get the
> > ids of the current targets by reading from, the ``target_ids`` file.
> > In case of the virtual address spaces monitoring, the values should be
> > pids of the monitoring target processes. For example, below commands
> > set processes having pids 42 and 4242 as the monitoring targets and
> > check it again::
> >
> >     # cd <debugfs>/damon
> >     # echo 42 4242 > target_ids
> >     # cat target_ids
> >     42 4242
> >
> > Note that setting the target ids doesn't start the monitoring.
> >
> > Turning On/Off
> > --------------
> >
> > Setting the files as described above doesn't take effect unless you
> > explicitly start the monitoring. You can start, stop, and check the
> > current status of the monitoring by writing to and reading from the
> > ``monitor_on`` file. Writing ``on`` to the file starts the monitoring
> > of the targets with the attributes. Writing ``off`` to the file stops
> > those. DAMON also stops if every target is invalidated (in case of the
> > virtual memory monitoring, target processes are invalidated when
> > terminated). Below example commands turn on, off, and check the status
> > of DAMON::
> >
> >     # cd <debugfs>/damon
> >     # echo on > monitor_on
> >     # echo off > monitor_on
> >     # cat monitor_on
> >     off
> >
> > Please note that you cannot write to the above-mentioned debugfs files
> > while the monitoring is turned on. If you write to the files while
> > DAMON is running, an error code such as ``-EBUSY`` will be returned.
> >
> > Signed-off-by: SeongJae Park <sjpark@amazon.de>
> > Reviewed-by: Leonard Foerster <foersleo@amazon.de>
> > Reviewed-by: Fernand Sieber <sieberf@amazon.com>
>
> The high level comment I have for this patch is the layering of pid
> reference counting. The dbgfs should treat the targets as abstract
> objects and vaddr should handle the reference counting of pids. More
> specifically, move find_get_pid from dbgfs to vaddr and add an
> interface to the primitive for set_targets.
>
> At the moment, the pid reference is taken in dbgfs and put in vaddr.
> This will be a source of bugs in the future.

Good point, and agreed on the problem. But, I'd like to move 'put_pid()'
to dbgfs, because I think that would make extending the dbgfs user
interface to pidfd a little bit simpler. Also, I think that would be
easier to use for in-kernel programming interface usages. If you
disagree, please feel free to let me know.

Thanks,
SeongJae Park
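To make the ownership rule SeongJae prefers concrete: both the
find_get_pid() and the matching put_pid() would live in dbgfs, and the
target id would just carry the refcounted struct pid. A rough sketch
under that assumption; the helper names are hypothetical, not from the
posted patch::

    /* dbgfs write path: resolve a pid number and take a reference */
    static unsigned long dbgfs_pid_to_targetid(int pidnr)
    {
            struct pid *pid = find_get_pid(pidnr);

            return (unsigned long)pid;      /* NULL if already gone */
    }

    /* dbgfs teardown: drop the references dbgfs itself took */
    static void dbgfs_put_targetids(unsigned long *ids, int nr_ids)
    {
            int i;

            for (i = 0; i < nr_ids; i++)
                    put_pid((struct pid *)ids[i]);
    }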
From: SeongJae Park <sjpark@amazon.de>

On Tue, 22 Jun 2021 11:23:12 -0700 Shakeel Butt <shakeelb@google.com> wrote:

> On Mon, Jun 21, 2021 at 1:31 AM SeongJae Park <sj38.park@gmail.com> wrote:
> >
> > From: SeongJae Park <sjpark@amazon.de>
> >
> > For CPU usage accounting, knowing the pid of the monitoring thread
> > could be helpful. For example, users could use cpuaccount cgroups with
> > the pid.
> >
> > This commit therefore exports the pid of the currently running
> > monitoring thread to the user space via the 'kdamond_pid' file in the
> > debugfs directory.
> >
> > Signed-off-by: SeongJae Park <sjpark@amazon.de>
> > Reviewed-by: Fernand Sieber <sieberf@amazon.com>
> > ---
> [...]
>
> > +static const struct file_operations kdamond_pid_fops = {
> > +	.owner = THIS_MODULE,
>
> I don't think you need to set the owner (and for other fops) as these
> files are built-in, not modules. Otherwise it looks good.

Good point. Will remove those from the next spin.

Thanks,
SeongJae Park
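Background for the nit: the .owner field lets the VFS pin a module while
one of its files is open, which only matters for loadable modules, and
CONFIG_DAMON is a bool, so this code is always built-in. A sketch of the
trimmed definition, with a hypothetical read handler name::

    static const struct file_operations kdamond_pid_fops = {
            /* no .owner: built-in code cannot be unloaded */
            .read = dbgfs_kdamond_pid_read,
    };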
On Thu, Jun 24, 2021 at 3:26 AM SeongJae Park <sj38.park@gmail.com> wrote:
>
> [...]
> > >         if time() % update_interval == 0:
> >
> > regions_update_interval?
>
> It used that name before. But I changed the name in this way to use it
> for general periodic updates of the monitoring primitives. Of course,
> we could use the specific name only in this specific example, but I
> also want to keep this as similar to the actual code as possible.
>
> If you strongly want to rename this, please feel free to let me know.

Nah, it is ok.

[...]

> > Any reason to not use kthread_stop() here?
>
> Using 'kthread_stop()' here would make the code much simpler. But
> 'kdamond' also stops itself when all monitoring targets become invalid
> (e.g., all monitoring target processes terminated). However,
> 'kthread_stop()' is not easy to use for that case (self-stopping). It's
> of course possible, but it would make the code longer. That's why I use
> the 'kdamond_stop' flag here. So, I'd like to leave this as is. If you
> think 'kthread_stop()' should be used, please feel free to let me know.

Fine as it is.
On Thu, Jun 24, 2021 at 3:26 AM SeongJae Park <sj38.park@gmail.com> wrote:
>
> [...]
> > > +/*
> > > + * Get the three regions in the given target (task)
> > > + *
> > > + * Returns 0 on success, negative error code otherwise.
> > > + */
> > > +static int damon_va_three_regions(struct damon_target *t,
> > > +				struct damon_addr_range regions[3])
> > > +{
> > > +	struct mm_struct *mm;
> > > +	int rc;
> > > +
> > > +	mm = damon_get_mm(t);
> > > +	if (!mm)
> > > +		return -EINVAL;
> > > +
> > > +	mmap_read_lock(mm);
> > > +	rc = __damon_va_three_regions(mm->mmap, regions);
> > > +	mmap_read_unlock(mm);
> >
> > This is being called for each target every second by default. Seems
> > too aggressive. Applications don't change their address space every
> > second. I would recommend defaulting ctx->primitive_update_interval to
> > a higher value.
>
> Good point. If there are many targets and each target has a huge number
> of VMAs, the overhead could be high. Nevertheless, I couldn't find the
> overhead in my test setup. Also, it seems some people have already
> started exploring the DAMON patchset with the default value, and there
> could be more usages from others. Silently changing the default value
> could distract such people. So, if you think it's ok, I'd like to change
> the default value only after someone finds the overhead from their usage
> and asks for a change.
>
> If you disagree, or you have found the overhead in your usage, please
> feel free to let me know.

mmap lock is a source of contention in real world workloads. We do
observe it in our fleet, and many others (like Facebook) complain about
this issue. This is the whole motivation behind SPF, maple tree, and
many other mmap lock scalability works. I would be really careful about
adding another source of contention on mmap lock. Yes, the user can
change this interval themselves, but we should not burden them with
internal knowledge like "oh, if you observe high mmap contention you may
want to increase this specific interval". We should set a good default
value to avoid such situations (most of the time).
On Thu, Jun 24, 2021 at 3:26 AM SeongJae Park <sj38.park@gmail.com> wrote:
>
> [...]
> >
> > At the moment, the pid reference is taken in dbgfs and put in vaddr.
> > This will be a source of bugs in the future.
>
> Good point, and agreed on the problem. But, I'd like to move 'put_pid()'
> to dbgfs, because I think that would make extending the dbgfs user
> interface to pidfd a little bit simpler. Also, I think that would be
> easier to use for in-kernel programming interface usages. If you
> disagree, please feel free to let me know.

I was thinking of removing the targetid_is_pid() checks. Anyway, this is
something we can change later, so I will let you decide which direction
you want to take.
From: SeongJae Park <sjpark@amazon.de>

On Thu, 24 Jun 2021 07:42:44 -0700 Shakeel Butt <shakeelb@google.com> wrote:

> On Thu, Jun 24, 2021 at 3:26 AM SeongJae Park <sj38.park@gmail.com> wrote:
> >
> > [...]
> > > > +/*
> > > > + * Get the three regions in the given target (task)
> > > > + *
> > > > + * Returns 0 on success, negative error code otherwise.
> > > > + */
> > > > +static int damon_va_three_regions(struct damon_target *t,
> > > > +				struct damon_addr_range regions[3])
> > > > +{
> > > > +	struct mm_struct *mm;
> > > > +	int rc;
> > > > +
> > > > +	mm = damon_get_mm(t);
> > > > +	if (!mm)
> > > > +		return -EINVAL;
> > > > +
> > > > +	mmap_read_lock(mm);
> > > > +	rc = __damon_va_three_regions(mm->mmap, regions);
> > > > +	mmap_read_unlock(mm);
> > >
> > > This is being called for each target every second by default. Seems
> > > too aggressive. Applications don't change their address space every
> > > second. I would recommend defaulting ctx->primitive_update_interval
> > > to a higher value.
> >
> > Good point. If there are many targets and each target has a huge
> > number of VMAs, the overhead could be high. Nevertheless, I couldn't
> > find the overhead in my test setup. Also, it seems some people have
> > already started exploring the DAMON patchset with the default value,
> > and there could be more usages from others. Silently changing the
> > default value could distract such people. So, if you think it's ok,
> > I'd like to change the default value only after someone finds the
> > overhead from their usage and asks for a change.
> >
> > If you disagree, or you have found the overhead in your usage, please
> > feel free to let me know.
>
> mmap lock is a source of contention in real world workloads. We do
> observe it in our fleet, and many others (like Facebook) complain about
> this issue. This is the whole motivation behind SPF, maple tree, and
> many other mmap lock scalability works. I would be really careful about
> adding another source of contention on mmap lock. Yes, the user can
> change this interval themselves, but we should not burden them with
> internal knowledge like "oh, if you observe high mmap contention you
> may want to increase this specific interval". We should set a good
> default value to avoid such situations (most of the time).

Thank you for this nice clarification. I can understand your concern
because I also worked on an HTM-based solution to the scalability issue
before.

However, I have neither a strong preference nor confidence for a new
default value at the moment. Could you please recommend one if you have?

Thanks,
SeongJae Park
On Thu, Jun 24, 2021 at 8:21 AM SeongJae Park <sj38.park@gmail.com> wrote:
>
> From: SeongJae Park <sjpark@amazon.de>
>
> On Thu, 24 Jun 2021 07:42:44 -0700 Shakeel Butt <shakeelb@google.com> wrote:
>
> > On Thu, Jun 24, 2021 at 3:26 AM SeongJae Park <sj38.park@gmail.com> wrote:
> > >
> > > [...]
> > > > > +/*
> > > > > + * Get the three regions in the given target (task)
> > > > > + *
> > > > > + * Returns 0 on success, negative error code otherwise.
> > > > > + */
> > > > > +static int damon_va_three_regions(struct damon_target *t,
> > > > > +				struct damon_addr_range regions[3])
> > > > > +{
> > > > > +	struct mm_struct *mm;
> > > > > +	int rc;
> > > > > +
> > > > > +	mm = damon_get_mm(t);
> > > > > +	if (!mm)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	mmap_read_lock(mm);
> > > > > +	rc = __damon_va_three_regions(mm->mmap, regions);
> > > > > +	mmap_read_unlock(mm);
> > > >
> > > > This is being called for each target every second by default.
> > > > Seems too aggressive. Applications don't change their address
> > > > space every second. I would recommend defaulting
> > > > ctx->primitive_update_interval to a higher value.
> > >
> > > Good point. If there are many targets and each target has a huge
> > > number of VMAs, the overhead could be high. Nevertheless, I couldn't
> > > find the overhead in my test setup. Also, it seems some people have
> > > already started exploring the DAMON patchset with the default value,
> > > and there could be more usages from others. Silently changing the
> > > default value could distract such people. So, if you think it's ok,
> > > I'd like to change the default value only after someone finds the
> > > overhead from their usage and asks for a change.
> > >
> > > If you disagree, or you have found the overhead in your usage,
> > > please feel free to let me know.
> >
> > mmap lock is a source of contention in real world workloads. We do
> > observe it in our fleet, and many others (like Facebook) complain
> > about this issue. This is the whole motivation behind SPF, maple tree,
> > and many other mmap lock scalability works. I would be really careful
> > about adding another source of contention on mmap lock. Yes, the user
> > can change this interval themselves, but we should not burden them
> > with internal knowledge like "oh, if you observe high mmap contention
> > you may want to increase this specific interval". We should set a good
> > default value to avoid such situations (most of the time).
>
> Thank you for this nice clarification. I can understand your concern
> because I also worked on an HTM-based solution to the scalability issue
> before.
>
> However, I have neither a strong preference nor confidence for a new
> default value at the moment. Could you please recommend one if you
> have?

I would say go with a conservative value like 60 seconds. Though there
is no scientific reason behind this specific number, I think it would be
a good compromise. Applications usually don't change their address space
layout that often.
From: SeongJae Park <sjpark@amazon.de>

On Thu, 24 Jun 2021 09:33:07 -0700 Shakeel Butt <shakeelb@google.com> wrote:

> On Thu, Jun 24, 2021 at 8:21 AM SeongJae Park <sj38.park@gmail.com> wrote:
> >
> > From: SeongJae Park <sjpark@amazon.de>
> >
> > On Thu, 24 Jun 2021 07:42:44 -0700 Shakeel Butt <shakeelb@google.com> wrote:
> >
> > > On Thu, Jun 24, 2021 at 3:26 AM SeongJae Park <sj38.park@gmail.com> wrote:
> > > >
> > > > [...]
> > > > > > +/*
> > > > > > + * Get the three regions in the given target (task)
> > > > > > + *
> > > > > > + * Returns 0 on success, negative error code otherwise.
> > > > > > + */
> > > > > > +static int damon_va_three_regions(struct damon_target *t,
> > > > > > +				struct damon_addr_range regions[3])
> > > > > > +{
> > > > > > +	struct mm_struct *mm;
> > > > > > +	int rc;
> > > > > > +
> > > > > > +	mm = damon_get_mm(t);
> > > > > > +	if (!mm)
> > > > > > +		return -EINVAL;
> > > > > > +
> > > > > > +	mmap_read_lock(mm);
> > > > > > +	rc = __damon_va_three_regions(mm->mmap, regions);
> > > > > > +	mmap_read_unlock(mm);
> > > > >
> > > > > This is being called for each target every second by default.
> > > > > Seems too aggressive. Applications don't change their address
> > > > > space every second. I would recommend defaulting
> > > > > ctx->primitive_update_interval to a higher value.
> > > >
> > > > Good point. If there are many targets and each target has a huge
> > > > number of VMAs, the overhead could be high. Nevertheless, I
> > > > couldn't find the overhead in my test setup. Also, it seems some
> > > > people have already started exploring the DAMON patchset with the
> > > > default value, and there could be more usages from others.
> > > > Silently changing the default value could distract such people.
> > > > So, if you think it's ok, I'd like to change the default value
> > > > only after someone finds the overhead from their usage and asks
> > > > for a change.
> > > >
> > > > If you disagree, or you have found the overhead in your usage,
> > > > please feel free to let me know.
> > >
> > > mmap lock is a source of contention in real world workloads. We do
> > > observe it in our fleet, and many others (like Facebook) complain
> > > about this issue. This is the whole motivation behind SPF, maple
> > > tree, and many other mmap lock scalability works. I would be really
> > > careful about adding another source of contention on mmap lock.
> > > Yes, the user can change this interval themselves, but we should not
> > > burden them with internal knowledge like "oh, if you observe high
> > > mmap contention you may want to increase this specific interval".
> > > We should set a good default value to avoid such situations (most of
> > > the time).
> >
> > Thank you for this nice clarification. I can understand your concern
> > because I also worked on an HTM-based solution to the scalability
> > issue before.
> >
> > However, I have neither a strong preference nor confidence for a new
> > default value at the moment. Could you please recommend one if you
> > have?
>
> I would say go with a conservative value like 60 seconds. Though there
> is no scientific reason behind this specific number, I think it would
> be a good compromise. Applications usually don't change their address
> space layout that often.

Ok, I will use that from the next spin. Thank you for this nice
suggestion.

Thanks,
SeongJae Park
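For concreteness, the agreed change amounts to bumping the default set in
damon_new_ctx() (shown in the patch below) from one second to sixty. A
sketch of what the next spin would presumably carry, with all intervals
in microseconds, per damon_set_attrs()::

    ctx->sample_interval = 5 * 1000;                    /* 5 ms */
    ctx->aggr_interval = 100 * 1000;                    /* 100 ms */
    ctx->primitive_update_interval = 60 * 1000 * 1000;  /* 60 s */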
diff --git a/include/linux/damon.h b/include/linux/damon.h
new file mode 100644
index 000000000000..2f652602b1ea
--- /dev/null
+++ b/include/linux/damon.h
@@ -0,0 +1,167 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * DAMON api
+ *
+ * Author: SeongJae Park <sjpark@amazon.de>
+ */
+
+#ifndef _DAMON_H_
+#define _DAMON_H_
+
+#include <linux/mutex.h>
+#include <linux/time64.h>
+#include <linux/types.h>
+
+struct damon_ctx;
+
+/**
+ * struct damon_primitive	Monitoring primitives for given use cases.
+ *
+ * @init:			Initialize primitive-internal data structures.
+ * @update:			Update primitive-internal data structures.
+ * @prepare_access_checks:	Prepare next access check of target regions.
+ * @check_accesses:		Check the accesses to target regions.
+ * @reset_aggregated:		Reset aggregated accesses monitoring results.
+ * @target_valid:		Determine if the target is valid.
+ * @cleanup:			Clean up the context.
+ *
+ * DAMON can be extended for various address spaces and usages. For this,
+ * users should register the low level primitives for their target address
+ * space and usecase via the &damon_ctx.primitive. Then, the monitoring thread
+ * (&damon_ctx.kdamond) calls @init and @prepare_access_checks before starting
+ * the monitoring, @update after each &damon_ctx.primitive_update_interval, and
+ * @check_accesses, @target_valid and @prepare_access_checks after each
+ * &damon_ctx.sample_interval. Finally, @reset_aggregated is called after each
+ * &damon_ctx.aggr_interval.
+ *
+ * @init should initialize primitive-internal data structures. For example,
+ * this could be used to construct proper monitoring target regions and link
+ * those to @damon_ctx.target.
+ * @update should update the primitive-internal data structures. For example,
+ * this could be used to update monitoring target regions for current status.
+ * @prepare_access_checks should manipulate the monitoring regions to be
+ * prepared for the next access check.
+ * @check_accesses should check the accesses to each region that were made
+ * after the last preparation and update the number of observed accesses of
+ * each region.
+ * @reset_aggregated should reset the access monitoring results aggregated by
+ * @check_accesses.
+ * @target_valid should check whether the target is still valid for the
+ * monitoring.
+ * @cleanup is called from @kdamond just before its termination.
+ */
+struct damon_primitive {
+	void (*init)(struct damon_ctx *context);
+	void (*update)(struct damon_ctx *context);
+	void (*prepare_access_checks)(struct damon_ctx *context);
+	void (*check_accesses)(struct damon_ctx *context);
+	void (*reset_aggregated)(struct damon_ctx *context);
+	bool (*target_valid)(void *target);
+	void (*cleanup)(struct damon_ctx *context);
+};
+
+/*
+ * struct damon_callback	Monitoring events notification callbacks.
+ *
+ * @before_start:	Called before starting the monitoring.
+ * @after_sampling:	Called after each sampling.
+ * @after_aggregation:	Called after each aggregation.
+ * @before_terminate:	Called before terminating the monitoring.
+ * @private:		User private data.
+ *
+ * The monitoring thread (&damon_ctx.kdamond) calls @before_start and
+ * @before_terminate just before starting and finishing the monitoring,
+ * respectively. Therefore, those are good places for installing and cleaning
+ * @private.
+ *
+ * The monitoring thread calls @after_sampling and @after_aggregation for each
+ * of the sampling intervals and aggregation intervals, respectively.
+ * Therefore, users can safely access the monitoring results without additional
+ * protection. For the reason, users are recommended to use these callbacks for
+ * the accesses to the results.
+ *
+ * If any callback returns non-zero, monitoring stops.
+ */
+struct damon_callback {
+	void *private;
+
+	int (*before_start)(struct damon_ctx *context);
+	int (*after_sampling)(struct damon_ctx *context);
+	int (*after_aggregation)(struct damon_ctx *context);
+	int (*before_terminate)(struct damon_ctx *context);
+};
+
+/**
+ * struct damon_ctx - Represents a context for each monitoring. This is the
+ * main interface that allows users to set the attributes and get the results
+ * of the monitoring.
+ *
+ * @sample_interval:		The time between access samplings.
+ * @aggr_interval:		The time between monitor results aggregations.
+ * @primitive_update_interval:	The time between monitoring primitive updates.
+ *
+ * For each @sample_interval, DAMON checks whether each region is accessed or
+ * not. It aggregates and keeps the access information (number of accesses to
+ * each region) for @aggr_interval time. DAMON also checks whether the target
+ * memory regions need update (e.g., by ``mmap()`` calls from the application,
+ * in case of virtual memory monitoring) and applies the changes for each
+ * @primitive_update_interval. All time intervals are in micro-seconds.
+ * Please refer to &struct damon_primitive and &struct damon_callback for more
+ * detail.
+ *
+ * @kdamond:		Kernel thread that does the monitoring.
+ * @kdamond_stop:	Notifies whether kdamond should stop.
+ * @kdamond_lock:	Mutex for the synchronizations with @kdamond.
+ *
+ * For each monitoring context, one kernel thread for the monitoring is
+ * created. The pointer to the thread is stored in @kdamond.
+ *
+ * Once started, the monitoring thread runs until explicitly required to be
+ * terminated or every monitoring target is invalid. The validity of the
+ * targets is checked via the &damon_primitive.target_valid of @primitive. The
+ * termination can also be explicitly requested by writing non-zero to
+ * @kdamond_stop. The thread sets @kdamond to NULL when it terminates.
+ * Therefore, users can know whether the monitoring is ongoing or terminated by
+ * reading @kdamond. Reads and writes to @kdamond and @kdamond_stop from
+ * outside of the monitoring thread must be protected by @kdamond_lock.
+ *
+ * Note that the monitoring thread protects only @kdamond and @kdamond_stop via
+ * @kdamond_lock. Accesses to other fields must be protected by themselves.
+ *
+ * @primitive:	Set of monitoring primitives for given use cases.
+ * @callback:	Set of callbacks for monitoring events notifications.
+ *
+ * @target:	Pointer to the user-defined monitoring target.
+ */
+struct damon_ctx {
+	unsigned long sample_interval;
+	unsigned long aggr_interval;
+	unsigned long primitive_update_interval;
+
+/* private: internal use only */
+	struct timespec64 last_aggregation;
+	struct timespec64 last_primitive_update;
+
+/* public: */
+	struct task_struct *kdamond;
+	bool kdamond_stop;
+	struct mutex kdamond_lock;
+
+	struct damon_primitive primitive;
+	struct damon_callback callback;
+
+	void *target;
+};
+
+#ifdef CONFIG_DAMON
+
+struct damon_ctx *damon_new_ctx(void);
+void damon_destroy_ctx(struct damon_ctx *ctx);
+int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
+		unsigned long aggr_int, unsigned long primitive_upd_int);
+
+int damon_start(struct damon_ctx **ctxs, int nr_ctxs);
+int damon_stop(struct damon_ctx **ctxs, int nr_ctxs);
+
+#endif	/* CONFIG_DAMON */
+
+#endif	/* _DAMON_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 8f748010f7ea..44776162ae0d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -888,4 +888,6 @@ config IO_MAPPING
 config SECRETMEM
 	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
 
+source "mm/damon/Kconfig"
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index e3436741d539..709674b13497 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -128,3 +128,4 @@ obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
 obj-$(CONFIG_IO_MAPPING) += io-mapping.o
 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
+obj-$(CONFIG_DAMON) += damon/
diff --git a/mm/damon/Kconfig b/mm/damon/Kconfig
new file mode 100644
index 000000000000..d00e99ac1a15
--- /dev/null
+++ b/mm/damon/Kconfig
@@ -0,0 +1,15 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+menu "Data Access Monitoring"
+
+config DAMON
+	bool "DAMON: Data Access Monitoring Framework"
+	help
+	  This builds a framework that allows kernel subsystems to monitor
+	  access frequency of each memory region. The information can be useful
+	  for performance-centric DRAM level memory management.
+
+	  See https://damonitor.github.io/doc/html/latest-damon/index.html for
+	  more information.
+
+endmenu
diff --git a/mm/damon/Makefile b/mm/damon/Makefile
new file mode 100644
index 000000000000..4fd2edb4becf
--- /dev/null
+++ b/mm/damon/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_DAMON)	:= core.o
diff --git a/mm/damon/core.c b/mm/damon/core.c
new file mode 100644
index 000000000000..693e51ebc05a
--- /dev/null
+++ b/mm/damon/core.c
@@ -0,0 +1,318 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Data Access Monitor
+ *
+ * Author: SeongJae Park <sjpark@amazon.de>
+ */
+
+#define pr_fmt(fmt) "damon: " fmt
+
+#include <linux/damon.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <linux/slab.h>
+
+static DEFINE_MUTEX(damon_lock);
+static int nr_running_ctxs;
+
+struct damon_ctx *damon_new_ctx(void)
+{
+	struct damon_ctx *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return NULL;
+
+	ctx->sample_interval = 5 * 1000;
+	ctx->aggr_interval = 100 * 1000;
+	ctx->primitive_update_interval = 1000 * 1000;
+
+	ktime_get_coarse_ts64(&ctx->last_aggregation);
+	ctx->last_primitive_update = ctx->last_aggregation;
+
+	mutex_init(&ctx->kdamond_lock);
+
+	ctx->target = NULL;
+
+	return ctx;
+}
+
+void damon_destroy_ctx(struct damon_ctx *ctx)
+{
+	if (ctx->primitive.cleanup)
+		ctx->primitive.cleanup(ctx);
+	kfree(ctx);
+}
+
+/**
+ * damon_set_attrs() - Set attributes for the monitoring.
+ * @ctx:		monitoring context
+ * @sample_int:		time interval between samplings
+ * @aggr_int:		time interval between aggregations
+ * @primitive_upd_int:	time interval between monitoring primitive updates
+ *
+ * This function should not be called while the kdamond is running.
+ * Every time interval is in micro-seconds.
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
+		unsigned long aggr_int, unsigned long primitive_upd_int)
+{
+	ctx->sample_interval = sample_int;
+	ctx->aggr_interval = aggr_int;
+	ctx->primitive_update_interval = primitive_upd_int;
+
+	return 0;
+}
+
+static bool damon_kdamond_running(struct damon_ctx *ctx)
+{
+	bool running;
+
+	mutex_lock(&ctx->kdamond_lock);
+	running = ctx->kdamond != NULL;
+	mutex_unlock(&ctx->kdamond_lock);
+
+	return running;
+}
+
+static int kdamond_fn(void *data);
+
+/*
+ * __damon_start() - Starts monitoring with given context.
+ * @ctx:	monitoring context
+ *
+ * This function should be called while damon_lock is held.
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+static int __damon_start(struct damon_ctx *ctx)
+{
+	int err = -EBUSY;
+
+	mutex_lock(&ctx->kdamond_lock);
+	if (!ctx->kdamond) {
+		err = 0;
+		ctx->kdamond_stop = false;
+		ctx->kdamond = kthread_create(kdamond_fn, ctx, "kdamond.%d",
+				nr_running_ctxs);
+		if (IS_ERR(ctx->kdamond))
+			err = PTR_ERR(ctx->kdamond);
+		else
+			wake_up_process(ctx->kdamond);
+	}
+	mutex_unlock(&ctx->kdamond_lock);
+
+	return err;
+}
+
+/**
+ * damon_start() - Starts the monitorings for a given group of contexts.
+ * @ctxs:	an array of the pointers for contexts to start monitoring
+ * @nr_ctxs:	size of @ctxs
+ *
+ * This function starts a group of monitoring threads for a group of monitoring
+ * contexts. One thread per context is created and run in parallel. The caller
+ * should handle synchronization between the threads by itself. If a group of
+ * threads created by another 'damon_start()' call is currently running, this
+ * function does nothing but returns -EBUSY.
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int damon_start(struct damon_ctx **ctxs, int nr_ctxs)
+{
+	int i;
+	int err = 0;
+
+	mutex_lock(&damon_lock);
+	if (nr_running_ctxs) {
+		mutex_unlock(&damon_lock);
+		return -EBUSY;
+	}
+
+	for (i = 0; i < nr_ctxs; i++) {
+		err = __damon_start(ctxs[i]);
+		if (err)
+			break;
+		nr_running_ctxs++;
+	}
+	mutex_unlock(&damon_lock);
+
+	return err;
+}
+
+/*
+ * __damon_stop() - Stops monitoring of given context.
+ * @ctx:	monitoring context
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+static int __damon_stop(struct damon_ctx *ctx)
+{
+	mutex_lock(&ctx->kdamond_lock);
+	if (ctx->kdamond) {
+		ctx->kdamond_stop = true;
+		mutex_unlock(&ctx->kdamond_lock);
+		while (damon_kdamond_running(ctx))
+			usleep_range(ctx->sample_interval,
+					ctx->sample_interval * 2);
+		return 0;
+	}
+	mutex_unlock(&ctx->kdamond_lock);
+
+	return -EPERM;
+}
+
+/**
+ * damon_stop() - Stops the monitorings for a given group of contexts.
+ * @ctxs:	an array of the pointers for contexts to stop monitoring
+ * @nr_ctxs:	size of @ctxs
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int damon_stop(struct damon_ctx **ctxs, int nr_ctxs)
+{
+	int i, err = 0;
+
+	for (i = 0; i < nr_ctxs; i++) {
+		/* nr_running_ctxs is decremented in kdamond_fn */
+		err = __damon_stop(ctxs[i]);
+		if (err)
+			return err;
+	}
+
+	return err;
+}
+
+/*
+ * damon_check_reset_time_interval() - Check if a time interval is elapsed.
+ * @baseline:	the time to check whether the interval has elapsed since
+ * @interval:	the time interval (microseconds)
+ *
+ * See whether the given time interval has passed since the given baseline
+ * time. If so, it also updates the baseline to current time for next check.
+ *
+ * Return:	true if the time interval has passed, or false otherwise.
+ */
+static bool damon_check_reset_time_interval(struct timespec64 *baseline,
+		unsigned long interval)
+{
+	struct timespec64 now;
+
+	ktime_get_coarse_ts64(&now);
+	if ((timespec64_to_ns(&now) - timespec64_to_ns(baseline)) <
+			interval * 1000)
+		return false;
+	*baseline = now;
+	return true;
+}
+
+/*
+ * Check whether it is time to flush the aggregated information
+ */
+static bool kdamond_aggregate_interval_passed(struct damon_ctx *ctx)
+{
+	return damon_check_reset_time_interval(&ctx->last_aggregation,
+			ctx->aggr_interval);
+}
+
+/*
+ * Check whether it is time to check and apply the target monitoring regions
+ *
+ * Returns true if it is.
+ */
+static bool kdamond_need_update_primitive(struct damon_ctx *ctx)
+{
+	return damon_check_reset_time_interval(&ctx->last_primitive_update,
+			ctx->primitive_update_interval);
+}
+
+/*
+ * Check whether current monitoring should be stopped
+ *
+ * The monitoring is stopped when either the user requested to stop, or all
+ * monitoring targets are invalid.
+ *
+ * Returns true if need to stop current monitoring.
+ */
+static bool kdamond_need_stop(struct damon_ctx *ctx)
+{
+	bool stop;
+
+	mutex_lock(&ctx->kdamond_lock);
+	stop = ctx->kdamond_stop;
+	mutex_unlock(&ctx->kdamond_lock);
+	if (stop)
+		return true;
+
+	if (!ctx->primitive.target_valid)
+		return false;
+
+	return !ctx->primitive.target_valid(ctx->target);
+}
+
+static void set_kdamond_stop(struct damon_ctx *ctx)
+{
+	mutex_lock(&ctx->kdamond_lock);
+	ctx->kdamond_stop = true;
+	mutex_unlock(&ctx->kdamond_lock);
+}
+
+/*
+ * The monitoring daemon that runs as a kernel thread
+ */
+static int kdamond_fn(void *data)
+{
+	struct damon_ctx *ctx = (struct damon_ctx *)data;
+
+	pr_info("kdamond (%d) starts\n", ctx->kdamond->pid);
+
+	if (ctx->primitive.init)
+		ctx->primitive.init(ctx);
+	if (ctx->callback.before_start && ctx->callback.before_start(ctx))
+		set_kdamond_stop(ctx);
+
+	while (!kdamond_need_stop(ctx)) {
+		if (ctx->primitive.prepare_access_checks)
+			ctx->primitive.prepare_access_checks(ctx);
+		if (ctx->callback.after_sampling &&
+				ctx->callback.after_sampling(ctx))
+			set_kdamond_stop(ctx);
+
+		usleep_range(ctx->sample_interval, ctx->sample_interval + 1);
+
+		if (ctx->primitive.check_accesses)
+			ctx->primitive.check_accesses(ctx);
+
+		if (kdamond_aggregate_interval_passed(ctx)) {
+			if (ctx->callback.after_aggregation &&
+					ctx->callback.after_aggregation(ctx))
+				set_kdamond_stop(ctx);
+			if (ctx->primitive.reset_aggregated)
+				ctx->primitive.reset_aggregated(ctx);
+		}
+
+		if (kdamond_need_update_primitive(ctx)) {
+			if (ctx->primitive.update)
+				ctx->primitive.update(ctx);
+		}
+	}
+
+	if (ctx->callback.before_terminate &&
+			ctx->callback.before_terminate(ctx))
+		set_kdamond_stop(ctx);
+	if (ctx->primitive.cleanup)
+		ctx->primitive.cleanup(ctx);
+
+	pr_debug("kdamond (%d) finishes\n", ctx->kdamond->pid);
+
+	mutex_lock(&ctx->kdamond_lock);
+	ctx->kdamond = NULL;
+	mutex_unlock(&ctx->kdamond_lock);
+
+	mutex_lock(&damon_lock);
+	nr_running_ctxs--;
+	mutex_unlock(&damon_lock);
+
+	do_exit(0);
+}
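To tie the API above together, below is a minimal sketch of an in-kernel
client: it allocates a context, fills in a few primitives and a callback,
and starts one monitoring thread. The primitive bodies are stubs (a real
user would plug in something like the vaddr primitives of a later patch),
and the 'demo' names are hypothetical; since CONFIG_DAMON is a bool, this
would be built-in rather than a loadable module::

    #include <linux/damon.h>
    #include <linux/module.h>

    static struct damon_ctx *demo_ctx;

    static bool demo_target_valid(void *target)
    {
            return true;    /* run until damon_stop() is called */
    }

    static void demo_prepare(struct damon_ctx *ctx)
    {
            /* e.g., clear the Accessed bits of the target regions */
    }

    static void demo_check(struct damon_ctx *ctx)
    {
            /* e.g., re-read the Accessed bits and count the hits */
    }

    static int demo_aggregated(struct damon_ctx *ctx)
    {
            /* read the aggregated results here; non-zero would stop */
            return 0;
    }

    static int __init demo_init(void)
    {
            int err;

            demo_ctx = damon_new_ctx();
            if (!demo_ctx)
                    return -ENOMEM;

            /* 5 ms sampling, 100 ms aggregation, 1 s primitive update */
            err = damon_set_attrs(demo_ctx, 5000, 100000, 1000000);
            if (err)
                    goto out;

            demo_ctx->primitive.prepare_access_checks = demo_prepare;
            demo_ctx->primitive.check_accesses = demo_check;
            demo_ctx->primitive.target_valid = demo_target_valid;
            demo_ctx->callback.after_aggregation = demo_aggregated;

            err = damon_start(&demo_ctx, 1);    /* one context */
    out:
            if (err)
                    damon_destroy_ctx(demo_ctx);
            return err;
    }

    static void __exit demo_exit(void)
    {
            damon_stop(&demo_ctx, 1);
            damon_destroy_ctx(demo_ctx);
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");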