[v2,0/3] mm/damon: Profiling enhancements for DAMON

Message ID	20240318132848.82686-1-aravinda.prasad@intel.com (mailing list archive)
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Aravinda Prasad <aravinda.prasad@intel.com> To: damon@lists.linux.dev, linux-mm@kvack.org, sj@kernel.org, linux-kernel@vger.kernel.org Cc: aravinda.prasad@intel.com, s2322819@ed.ac.uk, sandeep4.kumar@intel.com, ying.huang@intel.com, dave.hansen@intel.com, dan.j.williams@intel.com, sreenivas.subramoney@intel.com, antti.kervinen@intel.com, alexander.kanevskiy@intel.com Subject: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON Date: Mon, 18 Mar 2024 18:58:45 +0530 Message-Id: <20240318132848.82686-1-aravinda.prasad@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm/damon: Profiling enhancements for DAMON \| expand [v2,0/3] mm/damon: Profiling enhancements for DAMON [v2,1/3] mm/damon: mm infrastructure support [v2,2/3] mm/damon: profiling enhancement [v2,3/3] mm/damon: documentation updates

Prasad, Aravinda March 18, 2024, 1:28 p.m. UTC

DAMON randomly samples one or more pages in every region and tracks
accesses to them using the ACCESSED bit in PTE (or PMD for 2MB pages).
When the region size is large (e.g., several GBs), which is common
for large footprint applications, detecting whether the region is
accessed or not completely depends on whether the pages that are
actively accessed in the region are picked during random sampling.
If such pages are not picked for sampling, DAMON fails to identify
the region as accessed. However, increasing the sampling rate or
increasing the number of regions increases CPU overheads of kdamond.

This patch proposes profiling different levels of the application’s
page table tree to detect whether a region is accessed or not. This
patch set is based on the observation that, when the accessed bit for a
page is set, the accessed bits at the higher levels of the page table
tree (PMD/PUD/PGD) corresponding to the path of the page table walk
are also set. Hence, it is efficient to check the accessed bits at
the higher levels of the page table tree to detect whether a region
is accessed or not. For example, if the access bit for a PUD entry
is set, then one or more pages in the 1GB PUD subtree is accessed as
each PUD entry covers 1GB mapping. Hence, instead of sampling
thousands of 4K/2M pages to detect accesses in a large region, 
sampling at the higher level of page table tree is faster and efficient.

This patch set is based on 6.8-rc5 kernel (commit: f48159f8, mm-unstable
tree)

Changes since v1 [1]
====================

 - Added support for 5-level page table tree
 - Split the patch to mm infrastructure changes and DAMON enhancements
 - Code changes as per comments on v1
 - Added kerneldoc comments

[1] https://lkml.org/lkml/2023/12/15/272
 
Evaluation:

- MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
  and 5TB with 10GB hot data.
- DAMON: 5ms sampling, 200ms aggregation interval. Rest all
  parameters set to default value.
- DAMON+PTP: Page table profiling applied to DAMON with the above
  parameters.

Profiling efficiency in detecting hot data:

Footprint	1GB	10GB	100GB	5TB
---------------------------------------------
DAMON		>90%	<50%	 ~0%	  0%
DAMON+PTP	>90%	>90%	>90%	>90%

CPU overheads (in billion cycles) for kdamond:

Footprint	1GB	10GB	100GB	5TB
---------------------------------------------
DAMON		1.15	19.53	3.52	9.55
DAMON+PTP	0.83	 3.20	1.27	2.55

A detailed explanation and evaluation can be found in the arXiv paper:
https://arxiv.org/pdf/2311.10275.pdf


Aravinda Prasad (3):
  mm/damon: mm infrastructure support
  mm/damon: profiling enhancement
  mm/damon: documentation updates

 Documentation/mm/damon/design.rst |  42 ++++++
 arch/x86/include/asm/pgtable.h    |  20 +++
 arch/x86/mm/pgtable.c             |  28 +++-
 include/linux/mmu_notifier.h      |  36 +++++
 include/linux/pgtable.h           |  79 ++++++++++
 mm/damon/vaddr.c                  | 233 ++++++++++++++++++++++++++++--
 6 files changed, 424 insertions(+), 14 deletions(-)

Yu Zhao March 19, 2024, 12:51 a.m. UTC | #1

On Mon, Mar 18, 2024 at 9:24 AM Aravinda Prasad
<aravinda.prasad@intel.com> wrote:
>
> DAMON randomly samples one or more pages in every region and tracks
> accesses to them using the ACCESSED bit in PTE (or PMD for 2MB pages).
> When the region size is large (e.g., several GBs), which is common
> for large footprint applications, detecting whether the region is
> accessed or not completely depends on whether the pages that are
> actively accessed in the region are picked during random sampling.
> If such pages are not picked for sampling, DAMON fails to identify
> the region as accessed. However, increasing the sampling rate or
> increasing the number of regions increases CPU overheads of kdamond.
>
> This patch proposes profiling different levels of the application’s
> page table tree to detect whether a region is accessed or not. This
> patch set is based on the observation that, when the accessed bit for a
> page is set, the accessed bits at the higher levels of the page table
> tree (PMD/PUD/PGD) corresponding to the path of the page table walk
> are also set. Hence, it is efficient to check the accessed bits at
> the higher levels of the page table tree to detect whether a region
> is accessed or not. For example, if the access bit for a PUD entry
> is set, then one or more pages in the 1GB PUD subtree is accessed as
> each PUD entry covers 1GB mapping. Hence, instead of sampling
> thousands of 4K/2M pages to detect accesses in a large region,
> sampling at the higher level of page table tree is faster and efficient.
>
> This patch set is based on 6.8-rc5 kernel (commit: f48159f8, mm-unstable
> tree)
>
> Changes since v1 [1]
> ====================
>
>  - Added support for 5-level page table tree
>  - Split the patch to mm infrastructure changes and DAMON enhancements
>  - Code changes as per comments on v1
>  - Added kerneldoc comments
>
> [1] https://lkml.org/lkml/2023/12/15/272
>
> Evaluation:
>
> - MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
>   and 5TB with 10GB hot data.
> - DAMON: 5ms sampling, 200ms aggregation interval. Rest all
>   parameters set to default value.
> - DAMON+PTP: Page table profiling applied to DAMON with the above
>   parameters.
>
> Profiling efficiency in detecting hot data:
>
> Footprint       1GB     10GB    100GB   5TB
> ---------------------------------------------
> DAMON           >90%    <50%     ~0%      0%
> DAMON+PTP       >90%    >90%    >90%    >90%
>
> CPU overheads (in billion cycles) for kdamond:
>
> Footprint       1GB     10GB    100GB   5TB
> ---------------------------------------------
> DAMON           1.15    19.53   3.52    9.55
> DAMON+PTP       0.83     3.20   1.27    2.55
>
> A detailed explanation and evaluation can be found in the arXiv paper:
> https://arxiv.org/pdf/2311.10275.pdf

NAK, on the ground of citing the nonfactual source above and
misrespenting the existing idea as your own invention [1].

The existing idea was purposely not patented so that all CPU vendors
are free to use it. Not sure what kind of peer review that source had,
but it's not getting around the reviewers here easily. Please do feel
free to ask any 3rd party that has no conflict of interest to override
my NAK though.

[1] https://lore.kernel.org/CAOUHufbDzy5dMcLR9ex25VdB_QBmSrW_We-2+KftZVYKNn4s9g@mail.gmail.com/
[2] https://lore.kernel.org/YE6yrQC1Ps195wPw@google.com/

SeongJae Park March 19, 2024, 5:20 a.m. UTC | #2

Hi Aravinda,

Thank you for posting this new revision!

I remember I told you that I don't see a high level significant problems on on
the reply to the previous revision of this patch[1], but I show a concern now.
Sorry for not raising this earlier, but let me explain my humble concerns
before being even more late.

On Mon, 18 Mar 2024 18:58:45 +0530 Aravinda Prasad <aravinda.prasad@intel.com> wrote:

> DAMON randomly samples one or more pages in every region and tracks
> accesses to them using the ACCESSED bit in PTE (or PMD for 2MB pages).
> When the region size is large (e.g., several GBs), which is common
> for large footprint applications, detecting whether the region is
> accessed or not completely depends on whether the pages that are
> actively accessed in the region are picked during random sampling.
> If such pages are not picked for sampling, DAMON fails to identify
> the region as accessed. However, increasing the sampling rate or
> increasing the number of regions increases CPU overheads of kdamond.

DAMON uses sampling because it considers a region as accessed if a portion of
the region that big enough to be detected via sampling is all accessed.  If a
region is having some pages that really accessed but the proportion is too
small to be found via sampling, I think DAMON could say the overall access to
the region is only modest and could even be ignored.  In my humble opinion,
this fits with the definition of DAMON region: A memory address range that
constructed with pages having similar access frequency.

> 
> This patch proposes profiling different levels of the application\u2019s
> page table tree to detect whether a region is accessed or not. This
> patch set is based on the observation that, when the accessed bit for a
> page is set, the accessed bits at the higher levels of the page table
> tree (PMD/PUD/PGD) corresponding to the path of the page table walk
> are also set. Hence, it is efficient to check the accessed bits at
> the higher levels of the page table tree to detect whether a region
> is accessed or not. For example, if the access bit for a PUD entry
> is set, then one or more pages in the 1GB PUD subtree is accessed as
> each PUD entry covers 1GB mapping. Hence, instead of sampling
> thousands of 4K/2M pages to detect accesses in a large region, 
> sampling at the higher level of page table tree is faster and efficient.

Due to the above reason, I concern this could result in making DAMON monitoring
results be inaccurately biased to report more than real accesses.

> 
> This patch set is based on 6.8-rc5 kernel (commit: f48159f8, mm-unstable
> tree)
> 
> Changes since v1 [1]
> ====================
> 
>  - Added support for 5-level page table tree
>  - Split the patch to mm infrastructure changes and DAMON enhancements
>  - Code changes as per comments on v1
>  - Added kerneldoc comments
> 
> [1] https://lkml.org/lkml/2023/12/15/272
>  
> Evaluation:
> 
> - MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
>   and 5TB with 10GB hot data.
> - DAMON: 5ms sampling, 200ms aggregation interval. Rest all
>   parameters set to default value.
> - DAMON+PTP: Page table profiling applied to DAMON with the above
>   parameters.
> 
> Profiling efficiency in detecting hot data:
> 
> Footprint	1GB	10GB	100GB	5TB
> ---------------------------------------------
> DAMON		>90%	<50%	 ~0%	  0%
> DAMON+PTP	>90%	>90%	>90%	>90%

Sampling interval is the time interval that assumed to be large enough for the
workload to make meaningful amount of accesses within the interval.  Hence,
meaningful amount of sampling interval depends on the workload's characteristic
and system's memory bandwidth.

Here, the size of the hot memory region is about 100MB, 1GB, 10GB, and 10GB for
the four cases, respectively.  And you set the sampling interval as 5ms.  Let's
assume the system can access, say, 50 GB per second, and hence it could be able
to access only up to 250 MB per 5ms.  So, in case of 1GB and footprint, all hot
memory region would be accessed while DAMON is waiting for next sampling
interval.  Hence, DAMON would be able to see most accesses via sampling.  But
for 100GB footprint case, only 250MB / 10GB = about 2.5% of the hot memory
region would be accessed between the sampling interval.  DAMON cannot see whole
accesses, and hence the precision could be low.

I don't know exact memory bandwith of the system, but to detect the 10 GB hot
region with 5ms sampling interval, the system should be able to access 2GB
memory per millisecond, or about 2TB memory per second.  I think systems of
such memory bandwidth is not that common.

I show you also explored a configuration setting the aggregation interval
higher.  But because each sampling checks only access between the sampling
interval, that might not help in this setup.  I'm wondering if you also
explored increasing sampling interval.

Sorry again for finding this concern not early enough.  But I think we may need
to discuss about this first.

[1] https://lkml.kernel.org/r/20231215201159.73845-1-sj@kernel.org

Thanks,
SJ

> 
> CPU overheads (in billion cycles) for kdamond:
> 
> Footprint	1GB	10GB	100GB	5TB
> ---------------------------------------------
> DAMON		1.15	19.53	3.52	9.55
> DAMON+PTP	0.83	 3.20	1.27	2.55
> 
> A detailed explanation and evaluation can be found in the arXiv paper:
> https://arxiv.org/pdf/2311.10275.pdf
> 
> 
> Aravinda Prasad (3):
>   mm/damon: mm infrastructure support
>   mm/damon: profiling enhancement
>   mm/damon: documentation updates
> 
>  Documentation/mm/damon/design.rst |  42 ++++++
>  arch/x86/include/asm/pgtable.h    |  20 +++
>  arch/x86/mm/pgtable.c             |  28 +++-
>  include/linux/mmu_notifier.h      |  36 +++++
>  include/linux/pgtable.h           |  79 ++++++++++
>  mm/damon/vaddr.c                  | 233 ++++++++++++++++++++++++++++--
>  6 files changed, 424 insertions(+), 14 deletions(-)
> 
> -- 
> 2.21.3

Prasad, Aravinda March 19, 2024, 10:56 a.m. UTC | #3

> -----Original Message-----
> From: SeongJae Park <sj@kernel.org>
> Sent: Tuesday, March 19, 2024 10:51 AM
> To: Prasad, Aravinda <aravinda.prasad@intel.com>
> Cc: damon@lists.linux.dev; linux-mm@kvack.org; sj@kernel.org; linux-
> kernel@vger.kernel.org; s2322819@ed.ac.uk; Kumar, Sandeep4
> <sandeep4.kumar@intel.com>; Huang, Ying <ying.huang@intel.com>;
> Hansen, Dave <dave.hansen@intel.com>; Williams, Dan J
> <dan.j.williams@intel.com>; Subramoney, Sreenivas
> <sreenivas.subramoney@intel.com>; Kervinen, Antti
> <antti.kervinen@intel.com>; Kanevskiy, Alexander
> <alexander.kanevskiy@intel.com>
> Subject: Re: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON
> 
> Hi Aravinda,
> 
> 
> Thank you for posting this new revision!
> 
> I remember I told you that I don't see a high level significant problems on on
> the reply to the previous revision of this patch[1], but I show a concern now.
> Sorry for not raising this earlier, but let me explain my humble concerns before
> being even more late.

Sure, no problem. We can discuss. I will get back to you with a detailed note.

Regards,
Aravinda

> 
> On Mon, 18 Mar 2024 18:58:45 +0530 Aravinda Prasad
> <aravinda.prasad@intel.com> wrote:
> 
> > DAMON randomly samples one or more pages in every region and tracks
> > accesses to them using the ACCESSED bit in PTE (or PMD for 2MB pages).
> > When the region size is large (e.g., several GBs), which is common for
> > large footprint applications, detecting whether the region is accessed
> > or not completely depends on whether the pages that are actively
> > accessed in the region are picked during random sampling.
> > If such pages are not picked for sampling, DAMON fails to identify the
> > region as accessed. However, increasing the sampling rate or
> > increasing the number of regions increases CPU overheads of kdamond.
> 
> DAMON uses sampling because it considers a region as accessed if a portion of
> the region that big enough to be detected via sampling is all accessed.  If a
> region is having some pages that really accessed but the proportion is too
> small to be found via sampling, I think DAMON could say the overall access to
> the region is only modest and could even be ignored.  In my humble opinion,
> this fits with the definition of DAMON region: A memory address range that
> constructed with pages having similar access frequency.


> 
> >
> > This patch proposes profiling different levels of the
> > application\u2019s page table tree to detect whether a region is
> > accessed or not. This patch set is based on the observation that, when
> > the accessed bit for a page is set, the accessed bits at the higher
> > levels of the page table tree (PMD/PUD/PGD) corresponding to the path
> > of the page table walk are also set. Hence, it is efficient to check
> > the accessed bits at the higher levels of the page table tree to
> > detect whether a region is accessed or not. For example, if the access
> > bit for a PUD entry is set, then one or more pages in the 1GB PUD
> > subtree is accessed as each PUD entry covers 1GB mapping. Hence,
> > instead of sampling thousands of 4K/2M pages to detect accesses in a
> > large region, sampling at the higher level of page table tree is faster and
> efficient.
> 
> Due to the above reason, I concern this could result in making DAMON
> monitoring results be inaccurately biased to report more than real accesses.
> 
> >
> > This patch set is based on 6.8-rc5 kernel (commit: f48159f8,
> > mm-unstable
> > tree)
> >
> > Changes since v1 [1]
> > ====================
> >
> >  - Added support for 5-level page table tree
> >  - Split the patch to mm infrastructure changes and DAMON enhancements
> >  - Code changes as per comments on v1
> >  - Added kerneldoc comments
> >
> > [1] https://lkml.org/lkml/2023/12/15/272
> >
> > Evaluation:
> >
> > - MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
> >   and 5TB with 10GB hot data.
> > - DAMON: 5ms sampling, 200ms aggregation interval. Rest all
> >   parameters set to default value.
> > - DAMON+PTP: Page table profiling applied to DAMON with the above
> >   parameters.
> >
> > Profiling efficiency in detecting hot data:
> >
> > Footprint	1GB	10GB	100GB	5TB
> > ---------------------------------------------
> > DAMON		>90%	<50%	 ~0%	  0%
> > DAMON+PTP	>90%	>90%	>90%	>90%
> 
> Sampling interval is the time interval that assumed to be large enough for the
> workload to make meaningful amount of accesses within the interval.  Hence,
> meaningful amount of sampling interval depends on the workload's
> characteristic and system's memory bandwidth.
> 
> Here, the size of the hot memory region is about 100MB, 1GB, 10GB, and
> 10GB for the four cases, respectively.  And you set the sampling interval as
> 5ms.  Let's assume the system can access, say, 50 GB per second, and hence it
> could be able to access only up to 250 MB per 5ms.  So, in case of 1GB and
> footprint, all hot memory region would be accessed while DAMON is waiting
> for next sampling interval.  Hence, DAMON would be able to see most
> accesses via sampling.  But for 100GB footprint case, only 250MB / 10GB =
> about 2.5% of the hot memory region would be accessed between the
> sampling interval.  DAMON cannot see whole accesses, and hence the
> precision could be low.
> 
> I don't know exact memory bandwith of the system, but to detect the 10 GB
> hot region with 5ms sampling interval, the system should be able to access
> 2GB memory per millisecond, or about 2TB memory per second.  I think
> systems of such memory bandwidth is not that common.
> 
> I show you also explored a configuration setting the aggregation interval
> higher.  But because each sampling checks only access between the sampling
> interval, that might not help in this setup.  I'm wondering if you also explored
> increasing sampling interval.
> 
> Sorry again for finding this concern not early enough.  But I think we may need
> to discuss about this first.
> 
> [1] https://lkml.kernel.org/r/20231215201159.73845-1-sj@kernel.org
> 
> 
> Thanks,
> SJ
> 
> 
> >
> > CPU overheads (in billion cycles) for kdamond:
> >
> > Footprint	1GB	10GB	100GB	5TB
> > ---------------------------------------------
> > DAMON		1.15	19.53	3.52	9.55
> > DAMON+PTP	0.83	 3.20	1.27	2.55
> >
> > A detailed explanation and evaluation can be found in the arXiv paper:
> > https://arxiv.org/pdf/2311.10275.pdf
> >
> >
> > Aravinda Prasad (3):
> >   mm/damon: mm infrastructure support
> >   mm/damon: profiling enhancement
> >   mm/damon: documentation updates
> >
> >  Documentation/mm/damon/design.rst |  42 ++++++
> >  arch/x86/include/asm/pgtable.h    |  20 +++
> >  arch/x86/mm/pgtable.c             |  28 +++-
> >  include/linux/mmu_notifier.h      |  36 +++++
> >  include/linux/pgtable.h           |  79 ++++++++++
> >  mm/damon/vaddr.c                  | 233 ++++++++++++++++++++++++++++--
> >  6 files changed, 424 insertions(+), 14 deletions(-)
> >
> > --
> > 2.21.3

Prasad, Aravinda March 20, 2024, 12:31 p.m. UTC | #4

> -----Original Message-----
> From: SeongJae Park <sj@kernel.org>
> Sent: Tuesday, March 19, 2024 10:51 AM
> To: Prasad, Aravinda <aravinda.prasad@intel.com>
> Cc: damon@lists.linux.dev; linux-mm@kvack.org; sj@kernel.org; linux-
> kernel@vger.kernel.org; s2322819@ed.ac.uk; Kumar, Sandeep4
> <sandeep4.kumar@intel.com>; Huang, Ying <ying.huang@intel.com>; Hansen,
> Dave <dave.hansen@intel.com>; Williams, Dan J <dan.j.williams@intel.com>;
> Subramoney, Sreenivas <sreenivas.subramoney@intel.com>; Kervinen, Antti
> <antti.kervinen@intel.com>; Kanevskiy, Alexander
> <alexander.kanevskiy@intel.com>
> Subject: Re: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON
> 
> Hi Aravinda,
> 
> 
> Thank you for posting this new revision!
> 
> I remember I told you that I don't see a high level significant problems on on the
> reply to the previous revision of this patch[1], but I show a concern now.
> Sorry for not raising this earlier, but let me explain my humble concerns before
> being even more late.

Please find my comments below:

> 
> On Mon, 18 Mar 2024 18:58:45 +0530 Aravinda Prasad
> <aravinda.prasad@intel.com> wrote:
> 
> > DAMON randomly samples one or more pages in every region and tracks
> > accesses to them using the ACCESSED bit in PTE (or PMD for 2MB pages).
> > When the region size is large (e.g., several GBs), which is common for
> > large footprint applications, detecting whether the region is accessed
> > or not completely depends on whether the pages that are actively
> > accessed in the region are picked during random sampling.
> > If such pages are not picked for sampling, DAMON fails to identify the
> > region as accessed. However, increasing the sampling rate or
> > increasing the number of regions increases CPU overheads of kdamond.
> 
> DAMON uses sampling because it considers a region as accessed if a portion of
> the region that big enough to be detected via sampling is all accessed.  If a region
> is having some pages that really accessed but the proportion is too small to be
> found via sampling, I think DAMON could say the overall access to the region is
> only modest and could even be ignored.  In my humble opinion, this fits with the
> definition of DAMON region: A memory address range that constructed with
> pages having similar access frequency.

Agree that DAMON considers a region as accessed if a good portion of the region
is accessed. But few points I would like to discuss:

For large regions (say 10GB, that has 2,621,440 4K pages), sampling at PTE level
will not cover a good portion of the region. For example, default 5ms sampling
and 100ms aggregation samples only 20 4K pages in an aggregation interval. 
Increasing sampling to 1 ms and aggregation to 1 second can only cover 
1000 4K pages, but results in higher CPU overheads due to frequent sampling. 
Even increasing the aggregation interval to 60 seconds but sampling at 5ms can
only cover 12000 samples, but region splitting and merging happens once
in 60 seconds.

In addition, this worsens when region sizes are say 100GB+. We observe that
sampling at PTE level does not help for large regions as more samples are
are required. So, decreasing/increasing the sampling or aggressions intervals
proportional to the region size is not practical as all regions are of not equal
size, we can have 100GB regions as well as many small regions (e.g., 16KB
to 1MB). So tuning sampling rate and aggregation interval did not help
for large regions.

It can also be observed that large regions cannot be avoided. Large regions
are created by merging adjacent smaller regions or at the beginning of
profiling (depending on min regions parameter which defaults to 10). 
Increasing min region reduces the size of regions but increases kdamond
overheads, hence, not preferable.

So, sampling at PTE level cannot precisely detect accesses to large regions
resulting in inaccuracies, even though it works for small regions. 
From our experiments, we found that with 10% hot data in a large region
(80+GB regions in a 1TB+ footprint application), DAMON was not able to
detect a single access to that region in 95+% cases with different sample
and aggregation interval combinations. But DAMON works good for
applications with footprint <50GB where regions are typically small.

Now consider the scenario with the proposed enhancement. With a
100GB region, if we sample a PUD entry that covers 1GB address
space, then the default 5ms sampling and 100ms aggregation samples
20 PUD entries that is 20 GB portion of the region. This gives a good
estimate of the portion of the region that is accessed. But the downside
is that as PUD accessed bit is set even if a small set of pages are accessed
under its subtree this can report more access as real as you noted.

But such large regions are split into two in the next aggregation interval. 
As the splitting of the regions continues, in next few aggregation intervals
many smaller regions are formed. Once smaller regions are formed,
the proposed enhancement cannot pick higher levels of the page table
tree and behaves as good as default DAMON. So, with the proposed
enhancement, larger regions are quickly split into smaller regions if they
have only small set of pages accessed.

To avoid misinterpreting region access count, I feel that the "age" of the
region is of real help and should be considered by both DAMON and the
proposed enhancement. If the age of a region is small (<10) then that
region should not be considered stable and hence should not be
considered for any memory tiering decisions. For regions with age, 
say >10, can be considered as stable as they reflect the actual access
frequency.

> 
> >
> > This patch proposes profiling different levels of the
> > application\u2019s page table tree to detect whether a region is
> > accessed or not. This patch set is based on the observation that, when
> > the accessed bit for a page is set, the accessed bits at the higher
> > levels of the page table tree (PMD/PUD/PGD) corresponding to the path
> > of the page table walk are also set. Hence, it is efficient to check
> > the accessed bits at the higher levels of the page table tree to
> > detect whether a region is accessed or not. For example, if the access
> > bit for a PUD entry is set, then one or more pages in the 1GB PUD
> > subtree is accessed as each PUD entry covers 1GB mapping. Hence,
> > instead of sampling thousands of 4K/2M pages to detect accesses in a
> > large region, sampling at the higher level of page table tree is faster and
> efficient.
> 
> Due to the above reason, I concern this could result in making DAMON monitoring
> results be inaccurately biased to report more than real accesses.

DAMON, even without the proposed enhancement, can result in inaccuracies
for large regions, (see examples above).

> 
> >
> > This patch set is based on 6.8-rc5 kernel (commit: f48159f8,
> > mm-unstable
> > tree)
> >
> > Changes since v1 [1]
> > ====================
> >
> >  - Added support for 5-level page table tree
> >  - Split the patch to mm infrastructure changes and DAMON enhancements
> >  - Code changes as per comments on v1
> >  - Added kerneldoc comments
> >
> > [1] https://lkml.org/lkml/2023/12/15/272
> >
> > Evaluation:
> >
> > - MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
> >   and 5TB with 10GB hot data.
> > - DAMON: 5ms sampling, 200ms aggregation interval. Rest all
> >   parameters set to default value.
> > - DAMON+PTP: Page table profiling applied to DAMON with the above
> >   parameters.
> >
> > Profiling efficiency in detecting hot data:
> >
> > Footprint	1GB	10GB	100GB	5TB
> > ---------------------------------------------
> > DAMON		>90%	<50%	 ~0%	  0%
> > DAMON+PTP	>90%	>90%	>90%	>90%
> 
> Sampling interval is the time interval that assumed to be large enough for the
> workload to make meaningful amount of accesses within the interval.  Hence,
> meaningful amount of sampling interval depends on the workload's characteristic
> and system's memory bandwidth.
> 
> Here, the size of the hot memory region is about 100MB, 1GB, 10GB, and 10GB
> for the four cases, respectively.  And you set the sampling interval as 5ms.  Let's
> assume the system can access, say, 50 GB per second, and hence it could be able
> to access only up to 250 MB per 5ms.  So, in case of 1GB and footprint, all hot
> memory region would be accessed while DAMON is waiting for next sampling
> interval.  Hence, DAMON would be able to see most accesses via sampling.  But
> for 100GB footprint case, only 250MB / 10GB = about 2.5% of the hot memory
> region would be accessed between the sampling interval.  DAMON cannot see
> whole accesses, and hence the precision could be low.
> 
> I don't know exact memory bandwith of the system, but to detect the 10 GB hot
> region with 5ms sampling interval, the system should be able to access 2GB
> memory per millisecond, or about 2TB memory per second.  I think systems of
> such memory bandwidth is not that common.
> 
> I show you also explored a configuration setting the aggregation interval higher.
> But because each sampling checks only access between the sampling interval,
> that might not help in this setup.  I'm wondering if you also explored increasing
> sampling interval.
>

What we have observed that many real-world benchmarks we experimented
with do not saturate the memory bandwidth. We also experimented with
masim microbenchmark to understand the impact on memory access rate
(we inserted delay between memory access operations in do_rnd_ro() and
other functions). We see decrease in the precision as access intensity is
reduced. We have experimented with different sampling and aggregation
intervals, but that did not help much in improving precision. 

So, what I think is it that most of the cases the precision depends on the page
(hot or cold) that is randomly picked for sampling than the sampling rate. Most
of the time only cold 4K pages are picked in a large region as they typically
account for 90% of the pages in the region and hence DAMON does not
detect any accesses at all. By profiling higher levels of the page table tree
this can be improved.

> Sorry again for finding this concern not early enough.  But I think we may need to
> discuss about this first.

Absolutely no problem. Please let me know your thoughts.

Regards,
Aravinda

> 
> [1] https://lkml.kernel.org/r/20231215201159.73845-1-sj@kernel.org
> 
> 
> Thanks,
> SJ
> 
> 
> >
> > CPU overheads (in billion cycles) for kdamond:
> >
> > Footprint	1GB	10GB	100GB	5TB
> > ---------------------------------------------
> > DAMON		1.15	19.53	3.52	9.55
> > DAMON+PTP	0.83	 3.20	1.27	2.55
> >
> > A detailed explanation and evaluation can be found in the arXiv paper:
> > https://arxiv.org/pdf/2311.10275.pdf
> >
> >
> > Aravinda Prasad (3):
> >   mm/damon: mm infrastructure support
> >   mm/damon: profiling enhancement
> >   mm/damon: documentation updates
> >
> >  Documentation/mm/damon/design.rst |  42 ++++++
> >  arch/x86/include/asm/pgtable.h    |  20 +++
> >  arch/x86/mm/pgtable.c             |  28 +++-
> >  include/linux/mmu_notifier.h      |  36 +++++
> >  include/linux/pgtable.h           |  79 ++++++++++
> >  mm/damon/vaddr.c                  | 233 ++++++++++++++++++++++++++++--
> >  6 files changed, 424 insertions(+), 14 deletions(-)
> >
> > --
> > 2.21.3

SeongJae Park March 21, 2024, 11:10 p.m. UTC | #5

On Wed, 20 Mar 2024 12:31:17 +0000 "Prasad, Aravinda" <aravinda.prasad@intel.com> wrote:

> 
> 
> > -----Original Message-----
> > From: SeongJae Park <sj@kernel.org>
> > Sent: Tuesday, March 19, 2024 10:51 AM
> > To: Prasad, Aravinda <aravinda.prasad@intel.com>
> > Cc: damon@lists.linux.dev; linux-mm@kvack.org; sj@kernel.org; linux-
> > kernel@vger.kernel.org; s2322819@ed.ac.uk; Kumar, Sandeep4
> > <sandeep4.kumar@intel.com>; Huang, Ying <ying.huang@intel.com>; Hansen,
> > Dave <dave.hansen@intel.com>; Williams, Dan J <dan.j.williams@intel.com>;
> > Subramoney, Sreenivas <sreenivas.subramoney@intel.com>; Kervinen, Antti
> > <antti.kervinen@intel.com>; Kanevskiy, Alexander
> > <alexander.kanevskiy@intel.com>
> > Subject: Re: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON
> > 
> > Hi Aravinda,
> > 
> > 
> > Thank you for posting this new revision!
> > 
> > I remember I told you that I don't see a high level significant problems on on the
> > reply to the previous revision of this patch[1], but I show a concern now.
> > Sorry for not raising this earlier, but let me explain my humble concerns before
> > being even more late.
> 
> Please find my comments below:
> 
> > 
> > On Mon, 18 Mar 2024 18:58:45 +0530 Aravinda Prasad
> > <aravinda.prasad@intel.com> wrote:
> > 
> > > DAMON randomly samples one or more pages in every region and tracks
> > > accesses to them using the ACCESSED bit in PTE (or PMD for 2MB pages).
> > > When the region size is large (e.g., several GBs), which is common for
> > > large footprint applications, detecting whether the region is accessed
> > > or not completely depends on whether the pages that are actively
> > > accessed in the region are picked during random sampling.
> > > If such pages are not picked for sampling, DAMON fails to identify the
> > > region as accessed. However, increasing the sampling rate or
> > > increasing the number of regions increases CPU overheads of kdamond.
> > 
> > DAMON uses sampling because it considers a region as accessed if a portion of
> > the region that big enough to be detected via sampling is all accessed.  If a region
> > is having some pages that really accessed but the proportion is too small to be
> > found via sampling, I think DAMON could say the overall access to the region is
> > only modest and could even be ignored.  In my humble opinion, this fits with the
> > definition of DAMON region: A memory address range that constructed with
> > pages having similar access frequency.
> 
> Agree that DAMON considers a region as accessed if a good portion of the region
> is accessed. But few points I would like to discuss:
> 
> For large regions (say 10GB, that has 2,621,440 4K pages), sampling at PTE level
> will not cover a good portion of the region. For example, default 5ms sampling
> and 100ms aggregation samples only 20 4K pages in an aggregation interval. 

If the 20 attempts all failed at finding any single accessed 4K page, I think
it roughly means less than 5% of the region is accessed within the
user-specified time (aggregation interval).  I would translate that as only
tiny portion of the region is accessed within the user-specified time, and
hence DAMON is ok to say the region is nearly not accessed.

> Increasing sampling to 1 ms and aggregation to 1 second can only cover 
> 1000 4K pages, but results in higher CPU overheads due to frequent sampling. 
> Even increasing the aggregation interval to 60 seconds but sampling at 5ms can
> only cover 12000 samples, but region splitting and merging happens once
> in 60 seconds.

At the beginning of each sampling interval, DAMON randomly picks one page per
region, clear their accessed bits, wait until the sampling interval is
finished, and check the accessed bits again.  In other words, DAMON shows only
accesses that made in last sampling interval.

Increasing number of samples per aggregation interval can help DAMON knows the
access frequency of regions in finer granularity, but doesn't allow DAMON see
more accesses.  Rather than that, if the aggregation interval is fixed
(reducing sampling interval), DAMON can show even less amount of accesses.

What we need here is giving the workload longer sampling time so that the
workload can make access to a size of memory regions that large enough to be
found by DAMON.

> 
> In addition, this worsens when region sizes are say 100GB+. We observe that
> sampling at PTE level does not help for large regions as more samples are
> are required. So, decreasing/increasing the sampling or aggressions intervals
> proportional to the region size is not practical as all regions are of not equal
> size, we can have 100GB regions as well as many small regions (e.g., 16KB
> to 1MB).

IMO, it becomes worse because the minimum size of accessed memory regions that
can be found by DAMON via sampling has increased together, while you didn't
give more sampling time (a.k.a the time to let the workload make accesses that
DAMON can show).

> So tuning sampling rate and aggregation interval did not help for large
> regions.

Due to the mechanism of the DAMON's sampling I mentioned above, I think this is
what expected.  We need to increase sampling interval.

> 
> It can also be observed that large regions cannot be avoided. Large regions
> are created by merging adjacent smaller regions or at the beginning of
> profiling (depending on min regions parameter which defaults to 10). 
> Increasing min region reduces the size of regions but increases kdamond
> overheads, hence, not preferable.
> 
> So, sampling at PTE level cannot precisely detect accesses to large regions
> resulting in inaccuracies, even though it works for small regions. 
> From our experiments, we found that with 10% hot data in a large region
> (80+GB regions in a 1TB+ footprint application), DAMON was not able to
> detect a single access to that region in 95+% cases with different sample
> and aggregation interval combinations. But DAMON works good for
> applications with footprint <50GB where regions are typically small.
> 
> Now consider the scenario with the proposed enhancement. With a
> 100GB region, if we sample a PUD entry that covers 1GB address
> space, then the default 5ms sampling and 100ms aggregation samples
> 20 PUD entries that is 20 GB portion of the region. This gives a good
> estimate of the portion of the region that is accessed. But the downside
> is that as PUD accessed bit is set even if a small set of pages are accessed
> under its subtree this can report more access as real as you noted.
> 
> But such large regions are split into two in the next aggregation interval. 
> As the splitting of the regions continues, in next few aggregation intervals
> many smaller regions are formed. Once smaller regions are formed,
> the proposed enhancement cannot pick higher levels of the page table
> tree and behaves as good as default DAMON. So, with the proposed
> enhancement, larger regions are quickly split into smaller regions if they
> have only small set of pages accessed.

I fully agree.  This is what could be a real and important benefits.

> 
> To avoid misinterpreting region access count, I feel that the "age" of the
> region is of real help and should be considered by both DAMON and the
> proposed enhancement. If the age of a region is small (<10) then that
> region should not be considered stable and hence should not be
> considered for any memory tiering decisions. For regions with age, 
> say >10, can be considered as stable as they reflect the actual access
> frequency.

I think this is a good approach, but difficult to be used by default.  I think
we might be able to get the benefit without making problem at the
over-reporting accesses by using the high level accessed bit check results as a
hint for better quality of region split?

Also, if we can allow large enough age, the random region split will eventually
find the small hot regions even without high level accessed bit hint.  Of
course the hint could help finding it earlier.  I think that was one of my
comment on the first version of this patch.

> 
> > 
> > >
> > > This patch proposes profiling different levels of the
> > > application\u2019s page table tree to detect whether a region is
> > > accessed or not. This patch set is based on the observation that, when
> > > the accessed bit for a page is set, the accessed bits at the higher
> > > levels of the page table tree (PMD/PUD/PGD) corresponding to the path
> > > of the page table walk are also set. Hence, it is efficient to check
> > > the accessed bits at the higher levels of the page table tree to
> > > detect whether a region is accessed or not. For example, if the access
> > > bit for a PUD entry is set, then one or more pages in the 1GB PUD
> > > subtree is accessed as each PUD entry covers 1GB mapping. Hence,
> > > instead of sampling thousands of 4K/2M pages to detect accesses in a
> > > large region, sampling at the higher level of page table tree is faster and
> > efficient.
> > 
> > Due to the above reason, I concern this could result in making DAMON monitoring
> > results be inaccurately biased to report more than real accesses.
> 
> DAMON, even without the proposed enhancement, can result in inaccuracies
> for large regions, (see examples above).

I think temporarily missing such tiny portion of accesses is not a critical
problem.  If this is a problem, the user should increase the sampling interval
in my opinion.  That said, as mentioned above, DAMON would better to improve
its regions split mechanism.

> 
> > 
> > >
> > > This patch set is based on 6.8-rc5 kernel (commit: f48159f8,
> > > mm-unstable
> > > tree)
> > >
> > > Changes since v1 [1]
> > > ====================
> > >
> > >  - Added support for 5-level page table tree
> > >  - Split the patch to mm infrastructure changes and DAMON enhancements
> > >  - Code changes as per comments on v1
> > >  - Added kerneldoc comments
> > >
> > > [1] https://lkml.org/lkml/2023/12/15/272
> > >
> > > Evaluation:
> > >
> > > - MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
> > >   and 5TB with 10GB hot data.
> > > - DAMON: 5ms sampling, 200ms aggregation interval. Rest all
> > >   parameters set to default value.
> > > - DAMON+PTP: Page table profiling applied to DAMON with the above
> > >   parameters.
> > >
> > > Profiling efficiency in detecting hot data:
> > >
> > > Footprint	1GB	10GB	100GB	5TB
> > > ---------------------------------------------
> > > DAMON		>90%	<50%	 ~0%	  0%
> > > DAMON+PTP	>90%	>90%	>90%	>90%
> > 
> > Sampling interval is the time interval that assumed to be large enough for the
> > workload to make meaningful amount of accesses within the interval.  Hence,
> > meaningful amount of sampling interval depends on the workload's characteristic
> > and system's memory bandwidth.
> > 
> > Here, the size of the hot memory region is about 100MB, 1GB, 10GB, and 10GB
> > for the four cases, respectively.  And you set the sampling interval as 5ms.  Let's
> > assume the system can access, say, 50 GB per second, and hence it could be able
> > to access only up to 250 MB per 5ms.  So, in case of 1GB and footprint, all hot
> > memory region would be accessed while DAMON is waiting for next sampling
> > interval.  Hence, DAMON would be able to see most accesses via sampling.  But
> > for 100GB footprint case, only 250MB / 10GB = about 2.5% of the hot memory
> > region would be accessed between the sampling interval.  DAMON cannot see
> > whole accesses, and hence the precision could be low.
> > 
> > I don't know exact memory bandwith of the system, but to detect the 10 GB hot
> > region with 5ms sampling interval, the system should be able to access 2GB
> > memory per millisecond, or about 2TB memory per second.  I think systems of
> > such memory bandwidth is not that common.
> > 
> > I show you also explored a configuration setting the aggregation interval higher.
> > But because each sampling checks only access between the sampling interval,
> > that might not help in this setup.  I'm wondering if you also explored increasing
> > sampling interval.
> >
> 
> What we have observed that many real-world benchmarks we experimented
> with do not saturate the memory bandwidth. We also experimented with
> masim microbenchmark to understand the impact on memory access rate
> (we inserted delay between memory access operations in do_rnd_ro() and
> other functions). We see decrease in the precision as access intensity is
> reduced. We have experimented with different sampling and aggregation
> intervals, but that did not help much in improving precision. 

Again, please note that DAMON can show only accesses made between each sampling
interval at a time.  The important factor for expectation of DAMON's accuracy
is, the balance between the memory access intensity of the workload, and the
length of the sampling interval.  The workload should be access intensive
enough to make sufficient amount of accesses between sampling interval.  The
sampling interval should be long enough to allow the workload makes sufficient
amount of accesses within the time interval.

The fact that the workloads were not saturating the memory bandwidth is not
enough to know if that means the workload was memory intensive enough, and the
sampling interval was long enough.

I was mentioning the memory bandwidth as only the maximum memory intensity of
the system that could be achieved.

> 
> So, what I think is it that most of the cases the precision depends on the page
> (hot or cold) that is randomly picked for sampling than the sampling rate. Most
> of the time only cold 4K pages are picked in a large region as they typically
> account for 90% of the pages in the region and hence DAMON does not
> detect any accesses at all. By profiling higher levels of the page table tree
> this can be improved.

Again, agreed.  This is an important and grateful finding.  Thank you.  And
again as mentioned above, I don't think we can merge this patch as is, but we
could think about using the high level access bit check results as a hint to
better split the regions.

Indeed, DAMON's monitoring mechanism has many rooms for improvements.  I also
have some ideas but my time was more spent on more capabilities of DAMON/DAMOS
so far.  It was a bit intentional proiority setting since I got no real DAMON
accuracy problem report from the production usage, and improving the accuracy
will deliver the benefit to all DAMON/DAMOS features.

Since an important milestone of DAMOS, namely auto-tuning, has merged into the
mainline, I think I may better to spend more time on monitoring accuracy
improvement.  I have some immature ideas in my head.  I will try to summarize
and share the ideas in near future.

>  
> > Sorry again for finding this concern not early enough.  But I think we may need to
> > discuss about this first.
> 
> Absolutely no problem. Please let me know your thoughts.

Thank you for patiently walking with me :)


Thanks,
SJ

> 
> Regards,
> Aravinda
> 
> > 
> > [1] https://lkml.kernel.org/r/20231215201159.73845-1-sj@kernel.org
> > 
> > 
> > Thanks,
> > SJ
> > 
> > 
> > >
> > > CPU overheads (in billion cycles) for kdamond:
> > >
> > > Footprint	1GB	10GB	100GB	5TB
> > > ---------------------------------------------
> > > DAMON		1.15	19.53	3.52	9.55
> > > DAMON+PTP	0.83	 3.20	1.27	2.55
> > >
> > > A detailed explanation and evaluation can be found in the arXiv paper:
> > > https://arxiv.org/pdf/2311.10275.pdf
> > >
> > >
> > > Aravinda Prasad (3):
> > >   mm/damon: mm infrastructure support
> > >   mm/damon: profiling enhancement
> > >   mm/damon: documentation updates
> > >
> > >  Documentation/mm/damon/design.rst |  42 ++++++
> > >  arch/x86/include/asm/pgtable.h    |  20 +++
> > >  arch/x86/mm/pgtable.c             |  28 +++-
> > >  include/linux/mmu_notifier.h      |  36 +++++
> > >  include/linux/pgtable.h           |  79 ++++++++++
> > >  mm/damon/vaddr.c                  | 233 ++++++++++++++++++++++++++++--
> > >  6 files changed, 424 insertions(+), 14 deletions(-)
> > >
> > > --
> > > 2.21.3

Prasad, Aravinda March 22, 2024, 12:12 p.m. UTC | #6

> -----Original Message-----
> From: SeongJae Park <sj@kernel.org>
> Sent: Friday, March 22, 2024 4:40 AM
> To: Prasad, Aravinda <aravinda.prasad@intel.com>
> Cc: SeongJae Park <sj@kernel.org>; damon@lists.linux.dev; linux-
> mm@kvack.org; linux-kernel@vger.kernel.org; s2322819@ed.ac.uk; Kumar,
> Sandeep4 <sandeep4.kumar@intel.com>; Huang, Ying <ying.huang@intel.com>;
> Hansen, Dave <dave.hansen@intel.com>; Williams, Dan J
> <dan.j.williams@intel.com>; Subramoney, Sreenivas
> <sreenivas.subramoney@intel.com>; Kervinen, Antti <antti.kervinen@intel.com>;
> Kanevskiy, Alexander <alexander.kanevskiy@intel.com>
> Subject: RE: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON
> 
> On Wed, 20 Mar 2024 12:31:17 +0000 "Prasad, Aravinda"
> <aravinda.prasad@intel.com> wrote:
> 
> >
> >
> > > -----Original Message-----
> > > From: SeongJae Park <sj@kernel.org>
> > > Sent: Tuesday, March 19, 2024 10:51 AM
> > > To: Prasad, Aravinda <aravinda.prasad@intel.com>
> > > Cc: damon@lists.linux.dev; linux-mm@kvack.org; sj@kernel.org; linux-
> > > kernel@vger.kernel.org; s2322819@ed.ac.uk; Kumar, Sandeep4
> > > <sandeep4.kumar@intel.com>; Huang, Ying <ying.huang@intel.com>;
> > > Hansen, Dave <dave.hansen@intel.com>; Williams, Dan J
> > > <dan.j.williams@intel.com>; Subramoney, Sreenivas
> > > <sreenivas.subramoney@intel.com>; Kervinen, Antti
> > > <antti.kervinen@intel.com>; Kanevskiy, Alexander
> > > <alexander.kanevskiy@intel.com>
> > > Subject: Re: [PATCH v2 0/3] mm/damon: Profiling enhancements for
> > > DAMON
> > >
> > > Hi Aravinda,
> > >
> > >
> > > Thank you for posting this new revision!
> > >
> > > I remember I told you that I don't see a high level significant
> > > problems on on the reply to the previous revision of this patch[1], but I show a
> concern now.
> > > Sorry for not raising this earlier, but let me explain my humble
> > > concerns before being even more late.
> >
> > Please find my comments below:
> >
> > >
> > > On Mon, 18 Mar 2024 18:58:45 +0530 Aravinda Prasad
> > > <aravinda.prasad@intel.com> wrote:
> > >
> > > > DAMON randomly samples one or more pages in every region and
> > > > tracks accesses to them using the ACCESSED bit in PTE (or PMD for 2MB
> pages).
> > > > When the region size is large (e.g., several GBs), which is common
> > > > for large footprint applications, detecting whether the region is
> > > > accessed or not completely depends on whether the pages that are
> > > > actively accessed in the region are picked during random sampling.
> > > > If such pages are not picked for sampling, DAMON fails to identify
> > > > the region as accessed. However, increasing the sampling rate or
> > > > increasing the number of regions increases CPU overheads of kdamond.
> > >
> > > DAMON uses sampling because it considers a region as accessed if a
> > > portion of the region that big enough to be detected via sampling is
> > > all accessed.  If a region is having some pages that really accessed
> > > but the proportion is too small to be found via sampling, I think
> > > DAMON could say the overall access to the region is only modest and
> > > could even be ignored.  In my humble opinion, this fits with the
> > > definition of DAMON region: A memory address range that constructed with
> pages having similar access frequency.
> >
> > Agree that DAMON considers a region as accessed if a good portion of
> > the region is accessed. But few points I would like to discuss:
> >
> > For large regions (say 10GB, that has 2,621,440 4K pages), sampling at
> > PTE level will not cover a good portion of the region. For example,
> > default 5ms sampling and 100ms aggregation samples only 20 4K pages in an
> aggregation interval.
> 
> If the 20 attempts all failed at finding any single accessed 4K page, I think it
> roughly means less than 5% of the region is accessed within the user-specified
> time (aggregation interval).  I would translate that as only tiny portion of the
> region is accessed within the user-specified time, and hence DAMON is ok to say
> the region is nearly not accessed.

I am looking at it from the other way:

To detect if a region is hot or cold at least 1% of the pages in the region should
be sampled. For a 10GB region (with 2,621,440 4K pages) this requires sampling
at least 26,214 pages. For a 100GB region this will require sampling at least
262,144 pages.

If we sample at 5ms, this takes 131.072 seconds to cover 1% of 10GB and 1310.72
seconds to cover 100GB. 

DAMON shows that the selected page as accessed if that page was accessed
during the 5ms sampling window. Now if we increase the sampling to 20ms to
improve access detection, then covering 1% of the region takes even longer.

> 
> > Increasing sampling to 1 ms and aggregation to 1 second can only cover
> > 1000 4K pages, but results in higher CPU overheads due to frequent sampling.
> > Even increasing the aggregation interval to 60 seconds but sampling at
> > 5ms can only cover 12000 samples, but region splitting and merging
> > happens once in 60 seconds.
> 
> At the beginning of each sampling interval, DAMON randomly picks one page per
> region, clear their accessed bits, wait until the sampling interval is finished, and
> check the accessed bits again.  In other words, DAMON shows only accesses that
> made in last sampling interval.

Yes, I see this in the code:

while(time < aggregation_interval)
{
  clear_access_bit
  sleep(sampling_time)
  check_access_bit
}

I would suggest this logic instead.

while(time < aggregation_interval)
{
  Number_of_samples = aggregation_interval / sampling_time;

  for (i = 0, I < number_of_samples; i++) 
  {
    clear_access_bit
  } 

  sleep(aggregation_time)

  for (i = 0, I < number_of_samples; i++) 
  {
    check_access_bit
  }
}

This can help in better access detection. I am sure you would
have already explored it.   

> 
> Increasing number of samples per aggregation interval can help DAMON knows
> the access frequency of regions in finer granularity, but doesn't allow DAMON see
> more accesses.  Rather than that, if the aggregation interval is fixed (reducing
> sampling interval), DAMON can show even less amount of accesses.
> 
> What we need here is giving the workload longer sampling time so that the
> workload can make access to a size of memory regions that large enough to be
> found by DAMON.

But even with longer sampling time, we may miss the access. For example, 
consider all the pages in the region are accessed sequentially. Now if DAMON samples
a different page other than the page that is being accessed it will miss. Now even if we
have longer sampling time it is possible that none of the accesses are detected.

> 
> >
> > In addition, this worsens when region sizes are say 100GB+. We observe
> > that sampling at PTE level does not help for large regions as more
> > samples are are required. So, decreasing/increasing the sampling or
> > aggressions intervals proportional to the region size is not practical
> > as all regions are of not equal size, we can have 100GB regions as
> > well as many small regions (e.g., 16KB to 1MB).
> 
> IMO, it becomes worse because the minimum size of accessed memory regions
> that can be found by DAMON via sampling has increased together, while you
> didn't give more sampling time (a.k.a the time to let the workload make accesses
> that DAMON can show).
> 
> > So tuning sampling rate and aggregation interval did not help for
> > large regions.
> 
> Due to the mechanism of the DAMON's sampling I mentioned above, I think this
> is what expected.  We need to increase sampling interval.
> 
> >
> > It can also be observed that large regions cannot be avoided. Large
> > regions are created by merging adjacent smaller regions or at the
> > beginning of profiling (depending on min regions parameter which defaults to
> 10).
> > Increasing min region reduces the size of regions but increases
> > kdamond overheads, hence, not preferable.
> >
> > So, sampling at PTE level cannot precisely detect accesses to large
> > regions resulting in inaccuracies, even though it works for small regions.
> > From our experiments, we found that with 10% hot data in a large
> > region (80+GB regions in a 1TB+ footprint application), DAMON was not
> > able to detect a single access to that region in 95+% cases with
> > different sample and aggregation interval combinations. But DAMON
> > works good for applications with footprint <50GB where regions are typically
> small.
> >
> > Now consider the scenario with the proposed enhancement. With a 100GB
> > region, if we sample a PUD entry that covers 1GB address space, then
> > the default 5ms sampling and 100ms aggregation samples
> > 20 PUD entries that is 20 GB portion of the region. This gives a good
> > estimate of the portion of the region that is accessed. But the
> > downside is that as PUD accessed bit is set even if a small set of
> > pages are accessed under its subtree this can report more access as real as you
> noted.
> >
> > But such large regions are split into two in the next aggregation interval.
> > As the splitting of the regions continues, in next few aggregation
> > intervals many smaller regions are formed. Once smaller regions are
> > formed, the proposed enhancement cannot pick higher levels of the page
> > table tree and behaves as good as default DAMON. So, with the proposed
> > enhancement, larger regions are quickly split into smaller regions if
> > they have only small set of pages accessed.
> 
> I fully agree.  This is what could be a real and important benefits.
> 
> >
> > To avoid misinterpreting region access count, I feel that the "age" of
> > the region is of real help and should be considered by both DAMON and
> > the proposed enhancement. If the age of a region is small (<10) then
> > that region should not be considered stable and hence should not be
> > considered for any memory tiering decisions. For regions with age, say
> > >10, can be considered as stable as they reflect the actual access
> > frequency.
> 
> I think this is a good approach, but difficult to be used by default.  I think we
> might be able to get the benefit without making problem at the over-reporting
> accesses by using the high level accessed bit check results as a hint for better
> quality of region split?

I agree, high level page table profiling can give hints to split the region instead of
using it to detect accesses to the region.

> 
> Also, if we can allow large enough age, the random region split will eventually find
> the small hot regions even without high level accessed bit hint.  Of course the hint
> could help finding it earlier.  I think that was one of my comment on the first
> version of this patch.

The problem is that a large region that is split is immediately merged as the split
regions have access count zero.

We observe that large regions are never getting split at all due to this.

Regards,
Aravinda

> 
> >
> > >
> > > >
> > > > This patch proposes profiling different levels of the
> > > > application\u2019s page table tree to detect whether a region is
> > > > accessed or not. This patch set is based on the observation that,
> > > > when the accessed bit for a page is set, the accessed bits at the
> > > > higher levels of the page table tree (PMD/PUD/PGD) corresponding
> > > > to the path of the page table walk are also set. Hence, it is
> > > > efficient to check the accessed bits at the higher levels of the
> > > > page table tree to detect whether a region is accessed or not. For
> > > > example, if the access bit for a PUD entry is set, then one or
> > > > more pages in the 1GB PUD subtree is accessed as each PUD entry
> > > > covers 1GB mapping. Hence, instead of sampling thousands of 4K/2M
> > > > pages to detect accesses in a large region, sampling at the higher
> > > > level of page table tree is faster and
> > > efficient.
> > >
> > > Due to the above reason, I concern this could result in making DAMON
> > > monitoring results be inaccurately biased to report more than real accesses.
> >
> > DAMON, even without the proposed enhancement, can result in
> > inaccuracies for large regions, (see examples above).
> 
> I think temporarily missing such tiny portion of accesses is not a critical problem.
> If this is a problem, the user should increase the sampling interval in my opinion.
> That said, as mentioned above, DAMON would better to improve its regions split
> mechanism.
> 
> >
> > >
> > > >
> > > > This patch set is based on 6.8-rc5 kernel (commit: f48159f8,
> > > > mm-unstable
> > > > tree)
> > > >
> > > > Changes since v1 [1]
> > > > ====================
> > > >
> > > >  - Added support for 5-level page table tree
> > > >  - Split the patch to mm infrastructure changes and DAMON
> > > > enhancements
> > > >  - Code changes as per comments on v1
> > > >  - Added kerneldoc comments
> > > >
> > > > [1] https://lkml.org/lkml/2023/12/15/272
> > > >
> > > > Evaluation:
> > > >
> > > > - MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
> > > >   and 5TB with 10GB hot data.
> > > > - DAMON: 5ms sampling, 200ms aggregation interval. Rest all
> > > >   parameters set to default value.
> > > > - DAMON+PTP: Page table profiling applied to DAMON with the above
> > > >   parameters.
> > > >
> > > > Profiling efficiency in detecting hot data:
> > > >
> > > > Footprint	1GB	10GB	100GB	5TB
> > > > ---------------------------------------------
> > > > DAMON		>90%	<50%	 ~0%	  0%
> > > > DAMON+PTP	>90%	>90%	>90%	>90%
> > >
> > > Sampling interval is the time interval that assumed to be large
> > > enough for the workload to make meaningful amount of accesses within
> > > the interval.  Hence, meaningful amount of sampling interval depends
> > > on the workload's characteristic and system's memory bandwidth.
> > >
> > > Here, the size of the hot memory region is about 100MB, 1GB, 10GB,
> > > and 10GB for the four cases, respectively.  And you set the sampling
> > > interval as 5ms.  Let's assume the system can access, say, 50 GB per
> > > second, and hence it could be able to access only up to 250 MB per
> > > 5ms.  So, in case of 1GB and footprint, all hot memory region would
> > > be accessed while DAMON is waiting for next sampling interval.
> > > Hence, DAMON would be able to see most accesses via sampling.  But
> > > for 100GB footprint case, only 250MB / 10GB = about 2.5% of the hot
> > > memory region would be accessed between the sampling interval.  DAMON
> cannot see whole accesses, and hence the precision could be low.
> > >
> > > I don't know exact memory bandwith of the system, but to detect the
> > > 10 GB hot region with 5ms sampling interval, the system should be
> > > able to access 2GB memory per millisecond, or about 2TB memory per
> > > second.  I think systems of such memory bandwidth is not that common.
> > >
> > > I show you also explored a configuration setting the aggregation interval
> higher.
> > > But because each sampling checks only access between the sampling
> > > interval, that might not help in this setup.  I'm wondering if you
> > > also explored increasing sampling interval.
> > >
> >
> > What we have observed that many real-world benchmarks we experimented
> > with do not saturate the memory bandwidth. We also experimented with
> > masim microbenchmark to understand the impact on memory access rate
> > (we inserted delay between memory access operations in do_rnd_ro() and
> > other functions). We see decrease in the precision as access intensity
> > is reduced. We have experimented with different sampling and
> > aggregation intervals, but that did not help much in improving precision.
> 
> Again, please note that DAMON can show only accesses made between each
> sampling interval at a time.  The important factor for expectation of DAMON's
> accuracy is, the balance between the memory access intensity of the workload,
> and the length of the sampling interval.  The workload should be access intensive
> enough to make sufficient amount of accesses between sampling interval.  The
> sampling interval should be long enough to allow the workload makes sufficient
> amount of accesses within the time interval.
> 
> The fact that the workloads were not saturating the memory bandwidth is not
> enough to know if that means the workload was memory intensive enough, and
> the sampling interval was long enough.
> 
> I was mentioning the memory bandwidth as only the maximum memory intensity
> of the system that could be achieved.
> 
> >
> > So, what I think is it that most of the cases the precision depends on
> > the page (hot or cold) that is randomly picked for sampling than the
> > sampling rate. Most of the time only cold 4K pages are picked in a
> > large region as they typically account for 90% of the pages in the
> > region and hence DAMON does not detect any accesses at all. By
> > profiling higher levels of the page table tree this can be improved.
> 
> Again, agreed.  This is an important and grateful finding.  Thank you.  And again as
> mentioned above, I don't think we can merge this patch as is, but we could think
> about using the high level access bit check results as a hint to better split the
> regions.
> 
> Indeed, DAMON's monitoring mechanism has many rooms for improvements.  I
> also have some ideas but my time was more spent on more capabilities of
> DAMON/DAMOS so far.  It was a bit intentional proiority setting since I got no real
> DAMON accuracy problem report from the production usage, and improving the
> accuracy will deliver the benefit to all DAMON/DAMOS features.
> 
> Since an important milestone of DAMOS, namely auto-tuning, has merged into
> the mainline, I think I may better to spend more time on monitoring accuracy
> improvement.  I have some immature ideas in my head.  I will try to summarize
> and share the ideas in near future.
> 
> >
> > > Sorry again for finding this concern not early enough.  But I think
> > > we may need to discuss about this first.
> >
> > Absolutely no problem. Please let me know your thoughts.
> 
> Thank you for patiently walking with me :)
> 
> 
> Thanks,
> SJ
> 
> >
> > Regards,
> > Aravinda
> >
> > >
> > > [1] https://lkml.kernel.org/r/20231215201159.73845-1-sj@kernel.org
> > >
> > >
> > > Thanks,
> > > SJ
> > >
> > >
> > > >
> > > > CPU overheads (in billion cycles) for kdamond:
> > > >
> > > > Footprint	1GB	10GB	100GB	5TB
> > > > ---------------------------------------------
> > > > DAMON		1.15	19.53	3.52	9.55
> > > > DAMON+PTP	0.83	 3.20	1.27	2.55
> > > >
> > > > A detailed explanation and evaluation can be found in the arXiv paper:
> > > > https://arxiv.org/pdf/2311.10275.pdf
> > > >
> > > >
> > > > Aravinda Prasad (3):
> > > >   mm/damon: mm infrastructure support
> > > >   mm/damon: profiling enhancement
> > > >   mm/damon: documentation updates
> > > >
> > > >  Documentation/mm/damon/design.rst |  42 ++++++
> > > >  arch/x86/include/asm/pgtable.h    |  20 +++
> > > >  arch/x86/mm/pgtable.c             |  28 +++-
> > > >  include/linux/mmu_notifier.h      |  36 +++++
> > > >  include/linux/pgtable.h           |  79 ++++++++++
> > > >  mm/damon/vaddr.c                  | 233 ++++++++++++++++++++++++++++--
> > > >  6 files changed, 424 insertions(+), 14 deletions(-)
> > > >
> > > > --
> > > > 2.21.3

SeongJae Park March 22, 2024, 6:32 p.m. UTC | #7

On Fri, 22 Mar 2024 12:12:09 +0000 "Prasad, Aravinda" <aravinda.prasad@intel.com> wrote:

[...] 
> > > For large regions (say 10GB, that has 2,621,440 4K pages), sampling at
> > > PTE level will not cover a good portion of the region. For example,
> > > default 5ms sampling and 100ms aggregation samples only 20 4K pages in an
> > > aggregation interval.
> > 
> > If the 20 attempts all failed at finding any single accessed 4K page, I think it
> > roughly means less than 5% of the region is accessed within the user-specified
> > time (aggregation interval).  I would translate that as only tiny portion of the

I now find the above sentence is not correct.  Sorry, my bad.  Let me re-write.

I think it roughly means the workload is not accessing the region in a
frequency that high enough for DAMON to observe within the user-specified time
(sampling interval).

> > region is accessed within the user-specified time, and hence DAMON is ok to say
> > the region is nearly not accessed.
> 
> I am looking at it from the other way:
> 
> To detect if a region is hot or cold at least 1% of the pages in the region should
> be sampled. For a 10GB region (with 2,621,440 4K pages) this requires sampling
> at least 26,214 pages. For a 100GB region this will require sampling at least
> 262,144 pages.

Why do you think 1% of the pages should be sampled?

DAMON defines the region as an address range that contains pages having similar
access frequency.  Hence if we see a page was accessed within a given time
interval, we can assume all pages in the page is also accessed within the
interval, and vice versa.  That's why we sample only single page per region,
and how DAMON's monitoring overhead can be controlled regardless of the size of
the monitoring target memory.

To detect if the region is hot or cold, DAMON continues sampling multiple times
and use number of sampling intervals that seen the access to the region
(nr_accesses) as the relative hotness of the region.  Hence, how many sampling
is required depends on what precision of the relative hotness the user wants.
The size of the region doesn't matter here.

Am I missing something?

> 
> If we sample at 5ms, this takes 131.072 seconds to cover 1% of 10GB and 1310.72
> seconds to cover 100GB. 
> 
> DAMON shows that the selected page as accessed if that page was accessed
> during the 5ms sampling window. Now if we increase the sampling to 20ms to
> improve access detection, then covering 1% of the region takes even longer.
> 
> > 
> > > Increasing sampling to 1 ms and aggregation to 1 second can only cover
> > > 1000 4K pages, but results in higher CPU overheads due to frequent sampling.
> > > Even increasing the aggregation interval to 60 seconds but sampling at
> > > 5ms can only cover 12000 samples, but region splitting and merging
> > > happens once in 60 seconds.
> > 
> > At the beginning of each sampling interval, DAMON randomly picks one page per
> > region, clear their accessed bits, wait until the sampling interval is finished, and
> > check the accessed bits again.  In other words, DAMON shows only accesses that
> > made in last sampling interval.
> 
> Yes, I see this in the code:
> 
> while(time < aggregation_interval)
> {
>   clear_access_bit
>   sleep(sampling_time)
>   check_access_bit
> }
> 
> I would suggest this logic instead.
> 
> while(time < aggregation_interval)
> {
>   Number_of_samples = aggregation_interval / sampling_time;
> 
>   for (i = 0, I < number_of_samples; i++) 
>   {
>     clear_access_bit
>   } 
> 
>   sleep(aggregation_time)
> 
>   for (i = 0, I < number_of_samples; i++) 
>   {
>     check_access_bit
>   }
> }
> 
> This can help in better access detection. I am sure you would
> have already explored it.   

The way to detect the access in the region is implemented by each monitoring
operations set (vaddr, fvaddr, and paddr).  We could implement yet another
monitoring operations set with a new access detection method.  Nonetheless, I
think changing existing monitoring operations sets to use this suggestion while
keeping their concepts would be not easy.

> 
> > 
> > Increasing number of samples per aggregation interval can help DAMON knows
> > the access frequency of regions in finer granularity, but doesn't allow DAMON see
> > more accesses.  Rather than that, if the aggregation interval is fixed (reducing
> > sampling interval), DAMON can show even less amount of accesses.
> > 
> > What we need here is giving the workload longer sampling time so that the
> > workload can make access to a size of memory regions that large enough to be
> > found by DAMON.
> 
> But even with longer sampling time, we may miss the access. For example, 
> consider all the pages in the region are accessed sequentially. Now if DAMON samples
> a different page other than the page that is being accessed it will miss. Now even if we
> have longer sampling time it is possible that none of the accesses are detected.

If there was accesses to some pages of the region but unaccessed page has
picked as the sampling target, someone could say only a tiny portion of the
region is accessed, so we can treat the region as not accessed at all.  That's
at least what the monitoring operations set you use here ('vaddr') thinks.

[...]
> > Also, if we can allow large enough age, the random region split will eventually find
> > the small hot regions even without high level accessed bit hint.  Of course the hint
> > could help finding it earlier.  I think that was one of my comment on the first
> > version of this patch.
> 
> The problem is that a large region that is split is immediately merged as the split
> regions have access count zero.
> 
> We observe that large regions are never getting split at all due to this.

I understand this is a valid concern.  Especially because we currently split
each region into two sub-regions, finding small hot memory region in the middle
of a huge region could be challenging.  This concern has raised before DAMON
has merged into the mainline by Jonathan Cameron.  There was also a research
from my previous colleague saying incresing the sub-regions for each split
improves the accuracy.  Nonetheless, it increases overall number of regions and
hence increased the overhead.  And we didn't get real issue due to this from
the production so far, so we still keeping the old behavior.  I'm thinking
about a way to make this better.

That said, the real system would have more than the single region, and the
access pattern will be more dynamic.  It will cause the region to be merged and
split in more random and chaotic way.  Hence I think there is still a chance to
find the small hot portion eventually.  Also, the sampling regions are picked
randomly.  A page of the small hot portion will eventually picked as sampling
target, even multiple times, and at least reset the 'age' of the region.

I sometimes turn on DAMON to monitor entire physical address space (about 128
GiB) of my machine and run no active workload but just a few background
deamons.  So the system would have only small amount of accesses.  At the
beginning, the monitoring output shows all regions as not accessed (nr_accesses
0) and having same 'age'.  But as time goes by, the regions are still showing
no access (nr_accesses 0), but different ages and sizes.

Again, I'm not saying existing monitoring mechanism is perfect and optimum.  We
should continue optimizing it.  Nonetheless, the current accuracy is not
perfectly proved to be too awful to be used in real world.  I know at least a
few unnamed production usages of DAMON, and they didn't complained about
DAMON's accuracy so far.

Thanks,
SJ

> 
> Regards,
> Aravinda
[...]

Prasad, Aravinda March 25, 2024, 7:50 a.m. UTC | #8

> -----Original Message-----
> From: SeongJae Park <sj@kernel.org>
> Sent: Saturday, March 23, 2024 12:03 AM
> To: Prasad, Aravinda <aravinda.prasad@intel.com>
> Cc: SeongJae Park <sj@kernel.org>; damon@lists.linux.dev; linux-
> mm@kvack.org; linux-kernel@vger.kernel.org; s2322819@ed.ac.uk; Kumar,
> Sandeep4 <sandeep4.kumar@intel.com>; Huang, Ying <ying.huang@intel.com>;
> Hansen, Dave <dave.hansen@intel.com>; Williams, Dan J
> <dan.j.williams@intel.com>; Subramoney, Sreenivas
> <sreenivas.subramoney@intel.com>; Kervinen, Antti <antti.kervinen@intel.com>;
> Kanevskiy, Alexander <alexander.kanevskiy@intel.com>
> Subject: RE: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON
> 
> On Fri, 22 Mar 2024 12:12:09 +0000 "Prasad, Aravinda"
> <aravinda.prasad@intel.com> wrote:
> 
> [...]
> > > > For large regions (say 10GB, that has 2,621,440 4K pages),
> > > > sampling at PTE level will not cover a good portion of the region.
> > > > For example, default 5ms sampling and 100ms aggregation samples
> > > > only 20 4K pages in an aggregation interval.
> > >
> > > If the 20 attempts all failed at finding any single accessed 4K
> > > page, I think it roughly means less than 5% of the region is
> > > accessed within the user-specified time (aggregation interval).  I
> > > would translate that as only tiny portion of the
> 
> I now find the above sentence is not correct.  Sorry, my bad.  Let me re-write.
> 
> I think it roughly means the workload is not accessing the region in a frequency
> that high enough for DAMON to observe within the user-specified time (sampling
> interval).
> 
> > > region is accessed within the user-specified time, and hence DAMON
> > > is ok to say the region is nearly not accessed.
> >
> > I am looking at it from the other way:
> >
> > To detect if a region is hot or cold at least 1% of the pages in the
> > region should be sampled. For a 10GB region (with 2,621,440 4K pages)
> > this requires sampling at least 26,214 pages. For a 100GB region this
> > will require sampling at least
> > 262,144 pages.
> 
> Why do you think 1% of the pages should be sampled?

1% is just an example. 

> 
> DAMON defines the region as an address range that contains pages having similar
> access frequency.  Hence if we see a page was accessed within a given time
> interval, we can assume all pages in the page is also accessed within the interval,
> and vice versa.  That's why we sample only single page per region, and how
> DAMON's monitoring overhead can be controlled regardless of the size of the
> monitoring target memory.

Initially when DAMON creates "min" regions, it does not consider access frequency. 
They are created by diving the address space. So, at the beginning, these regions
need not have pages with similar access frequency. But eventually, as regions are
split and merged then regions are formed that have similar access frequency.

We observe that hot sets are spread across the address space of the application
and many times, only a portion of the DAMON created regions contain a hot data
as per the application's access pattern. In such cases a single sample per
region is not enough. 

For small memory footprint applications with small region size, I agree there are
absolutely no issues (also confirmed by our experiments). But for large footprint
applications (1TB+) that can have large regions (50GB+) we see these issues.

> 
> To detect if the region is hot or cold, DAMON continues sampling multiple times
> and use number of sampling intervals that seen the access to the region
> (nr_accesses) as the relative hotness of the region.  Hence, how many sampling is
> required depends on what precision of the relative hotness the user wants.
> The size of the region doesn't matter here.
> 
> Am I missing something?

As mentioned before all these are working fine for small footprint applications (<100GB).
But as we go beyond 1TB footprint we start seeing issues. I can show you a demo
on 1TB+ footprint applications.

> 
> >
> > If we sample at 5ms, this takes 131.072 seconds to cover 1% of 10GB
> > and 1310.72 seconds to cover 100GB.
> >
> > DAMON shows that the selected page as accessed if that page was
> > accessed during the 5ms sampling window. Now if we increase the
> > sampling to 20ms to improve access detection, then covering 1% of the region
> takes even longer.
> >
> > >
> > > > Increasing sampling to 1 ms and aggregation to 1 second can only
> > > > cover
> > > > 1000 4K pages, but results in higher CPU overheads due to frequent
> sampling.
> > > > Even increasing the aggregation interval to 60 seconds but
> > > > sampling at 5ms can only cover 12000 samples, but region splitting
> > > > and merging happens once in 60 seconds.
> > >
> > > At the beginning of each sampling interval, DAMON randomly picks one
> > > page per region, clear their accessed bits, wait until the sampling
> > > interval is finished, and check the accessed bits again.  In other
> > > words, DAMON shows only accesses that made in last sampling interval.
> >
> > Yes, I see this in the code:
> >
> > while(time < aggregation_interval)
> > {
> >   clear_access_bit
> >   sleep(sampling_time)
> >   check_access_bit
> > }
> >
> > I would suggest this logic instead.
> >
> > while(time < aggregation_interval)
> > {
> >   Number_of_samples = aggregation_interval / sampling_time;
> >
> >   for (i = 0, I < number_of_samples; i++)
> >   {
> >     clear_access_bit
> >   }
> >
> >   sleep(aggregation_time)
> >
> >   for (i = 0, I < number_of_samples; i++)
> >   {
> >     check_access_bit
> >   }
> > }
> >
> > This can help in better access detection. I am sure you would
> > have already explored it.
> 
> The way to detect the access in the region is implemented by each monitoring
> operations set (vaddr, fvaddr, and paddr).  We could implement yet another
> monitoring operations set with a new access detection method.  Nonetheless, I
> think changing existing monitoring operations sets to use this suggestion while
> keeping their concepts would be not easy.

Agree.

> 
> >
> > >
> > > Increasing number of samples per aggregation interval can help DAMON
> > > knows the access frequency of regions in finer granularity, but
> > > doesn't allow DAMON see more accesses.  Rather than that, if the
> > > aggregation interval is fixed (reducing sampling interval), DAMON can show
> even less amount of accesses.
> > >
> > > What we need here is giving the workload longer sampling time so
> > > that the workload can make access to a size of memory regions that
> > > large enough to be found by DAMON.
> >
> > But even with longer sampling time, we may miss the access. For
> > example, consider all the pages in the region are accessed
> > sequentially. Now if DAMON samples a different page other than the
> > page that is being accessed it will miss. Now even if we have longer sampling
> time it is possible that none of the accesses are detected.
> 
> If there was accesses to some pages of the region but unaccessed page has
> picked as the sampling target, someone could say only a tiny portion of the region
> is accessed, so we can treat the region as not accessed at all.  That's at least what
> the monitoring operations set you use here ('vaddr') thinks.
> 
> [...]
> > > Also, if we can allow large enough age, the random region split will
> > > eventually find the small hot regions even without high level
> > > accessed bit hint.  Of course the hint could help finding it
> > > earlier.  I think that was one of my comment on the first version of this patch.
> >
> > The problem is that a large region that is split is immediately merged
> > as the split regions have access count zero.
> >
> > We observe that large regions are never getting split at all due to this.
> 
> I understand this is a valid concern.  Especially because we currently split each
> region into two sub-regions, finding small hot memory region in the middle of a
> huge region could be challenging.  This concern has raised before DAMON has
> merged into the mainline by Jonathan Cameron.  There was also a research from
> my previous colleague saying incresing the sub-regions for each split improves the
> accuracy.  Nonetheless, it increases overall number of regions and hence
> increased the overhead.  And we didn't get real issue due to this from the
> production so far, so we still keeping the old behavior.  I'm thinking about a way to
> make this better.

These issues are observed only when memory footprint is large enough (1TB+).
Production systems may not be using such large footprint applications yet.  

> 
> That said, the real system would have more than the single region, and the access
> pattern will be more dynamic.  It will cause the region to be merged and split in
> more random and chaotic way.  Hence I think there is still a chance to find the
> small hot portion eventually.  Also, the sampling regions are picked randomly.  A
> page of the small hot portion will eventually picked as sampling target, even
> multiple times, and at least reset the 'age' of the region.
> 
> I sometimes turn on DAMON to monitor entire physical address space (about 128
> GiB) of my machine and run no active workload but just a few background
> deamons.  So the system would have only small amount of accesses.  At the
> beginning, the monitoring output shows all regions as not accessed (nr_accesses
> 0) and having same 'age'.  But as time goes by, the regions are still showing no
> access (nr_accesses 0), but different ages and sizes.

Have not tired with physical address space monitoring. But for "vaddr", we see DAMON
working good up to 100GB footprint. 

> 
> Again, I'm not saying existing monitoring mechanism is perfect and optimum.  We
> should continue optimizing it.  Nonetheless, the current accuracy is not perfectly
> proved to be too awful to be used in real world.  I know at least a few unnamed
> production usages of DAMON, and they didn't complained about DAMON's
> accuracy so far.

We see this problem very consistently on large footprint applications, so could be
working fine for small footprint applications in production.

Regards,
Aravinda


> 
> 
> Thanks,
> SJ
> 
> >
> > Regards,
> > Aravinda
> [...]

[v2,0/3] mm/damon: Profiling enhancements for DAMON

Message

Comments