diff mbox series

[v2] mm: add per-order mTHP alloc_success and alloc_fail counters

Message ID 20240328095139.143374-1-21cnbao@gmail.com (mailing list archive)
State New
Headers show
Series [v2] mm: add per-order mTHP alloc_success and alloc_fail counters | expand

Commit Message

Barry Song March 28, 2024, 9:51 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

Profiling a system blindly with mTHP has become challenging due
to the lack of visibility into its operations. Presenting the
success rate of mTHP allocations appears to be pressing need.

Recently, I've been experiencing significant difficulty debugging
performance improvements and regressions without these figures.
It's crucial for us to understand the true effectiveness of
mTHP in real-world scenarios, especially in systems with
fragmented memory.

This patch sets up the framework for per-order mTHP counters,
starting with the introduction of alloc_success and alloc_fail
counters.  Incorporating additional counters should now be
straightforward as well.

The initial two unsigned longs for each event are unused, given
that order-0 and order-1 are not mTHP. Nonetheless, this refinement
improves code clarity.

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 -v2:
 * move to sysfs and provide per-order counters; David, Ryan, Willy
 -v1:
 https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@gmail.com/

 include/linux/huge_mm.h | 17 +++++++++++++
 mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
 mm/memory.c             |  3 +++
 3 files changed, 74 insertions(+)

Comments

Yu Zhao April 1, 2024, 2:46 p.m. UTC | #1
On Thu, Mar 28, 2024 at 5:51 AM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> Profiling a system blindly with mTHP has become challenging due
> to the lack of visibility into its operations. Presenting the
> success rate of mTHP allocations appears to be pressing need.
>
> Recently, I've been experiencing significant difficulty debugging
> performance improvements and regressions without these figures.
> It's crucial for us to understand the true effectiveness of
> mTHP in real-world scenarios, especially in systems with
> fragmented memory.
>
> This patch sets up the framework for per-order mTHP counters,
> starting with the introduction of alloc_success and alloc_fail
> counters.  Incorporating additional counters should now be
> straightforward as well.
>
> The initial two unsigned longs for each event are unused, given
> that order-0 and order-1 are not mTHP. Nonetheless, this refinement
> improves code clarity.
>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  -v2:
>  * move to sysfs and provide per-order counters; David, Ryan, Willy
>  -v1:
>  https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@gmail.com/
>
>  include/linux/huge_mm.h | 17 +++++++++++++
>  mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
>  mm/memory.c             |  3 +++
>  3 files changed, 74 insertions(+)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index e896ca4760f6..27fa26a22a8f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>                                           enforce_sysfs, orders);
>  }
>
> +enum thp_event_item {
> +       THP_ALLOC_SUCCESS,
> +       THP_ALLOC_FAIL,
> +       NR_THP_EVENT_ITEMS
> +};
> +
> +struct thp_event_state {
> +       unsigned long event[PMD_ORDER + 1][NR_THP_EVENT_ITEMS];
> +};
> +
> +DECLARE_PER_CPU(struct thp_event_state, thp_event_states);

Do we have existing per-CPU counters that cover all possible THP
orders? I.e., foo_counter[PMD_ORDER + 1][BAR_ITEMS]. I don't think we
do but I want to double check.

This might be fine if BAR_ITEMS is global, not per memcg. Otherwise on
larger systems, e.g., 512 CPUs which is not uncommon, we'd have high
per-CPU memory overhead. For Google's datacenters, per-CPU memory
overhead has been a problem.

I'm not against this patch since NR_THP_EVENT_ITEMS is not per memcg.
Alternatively, we could make the per-CPU counters to track only one
order and flush the local counter to a global atomic counter if the
new order doesn't match the existing order stored in the local
counter. WDYT?
Barry Song April 1, 2024, 8:40 p.m. UTC | #2
On Tue, Apr 2, 2024 at 3:46 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Thu, Mar 28, 2024 at 5:51 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Profiling a system blindly with mTHP has become challenging due
> > to the lack of visibility into its operations. Presenting the
> > success rate of mTHP allocations appears to be pressing need.
> >
> > Recently, I've been experiencing significant difficulty debugging
> > performance improvements and regressions without these figures.
> > It's crucial for us to understand the true effectiveness of
> > mTHP in real-world scenarios, especially in systems with
> > fragmented memory.
> >
> > This patch sets up the framework for per-order mTHP counters,
> > starting with the introduction of alloc_success and alloc_fail
> > counters.  Incorporating additional counters should now be
> > straightforward as well.
> >
> > The initial two unsigned longs for each event are unused, given
> > that order-0 and order-1 are not mTHP. Nonetheless, this refinement
> > improves code clarity.
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  -v2:
> >  * move to sysfs and provide per-order counters; David, Ryan, Willy
> >  -v1:
> >  https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@gmail.com/
> >
> >  include/linux/huge_mm.h | 17 +++++++++++++
> >  mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
> >  mm/memory.c             |  3 +++
> >  3 files changed, 74 insertions(+)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index e896ca4760f6..27fa26a22a8f 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> >                                           enforce_sysfs, orders);
> >  }
> >
> > +enum thp_event_item {
> > +       THP_ALLOC_SUCCESS,
> > +       THP_ALLOC_FAIL,
> > +       NR_THP_EVENT_ITEMS
> > +};
> > +
> > +struct thp_event_state {
> > +       unsigned long event[PMD_ORDER + 1][NR_THP_EVENT_ITEMS];
> > +};
> > +
> > +DECLARE_PER_CPU(struct thp_event_state, thp_event_states);
>
> Do we have existing per-CPU counters that cover all possible THP
> orders? I.e., foo_counter[PMD_ORDER + 1][BAR_ITEMS]. I don't think we
> do but I want to double check.

Right.

The current counters are tailored for PMD-mapped THP within the
vm_event_state. Therefore, it appears that we lack counters specific
to each order.

>
> This might be fine if BAR_ITEMS is global, not per memcg. Otherwise on
> larger systems, e.g., 512 CPUs which is not uncommon, we'd have high
> per-CPU memory overhead. For Google's datacenters, per-CPU memory
> overhead has been a problem.

Right. I don't strongly feel the need for per-memcg counters, and the
/sys/kernel/mm/transparent_hugepage/hugepages-<size> is also global.

>
> I'm not against this patch since NR_THP_EVENT_ITEMS is not per memcg.
> Alternatively, we could make the per-CPU counters to track only one
> order and flush the local counter to a global atomic counter if the
> new order doesn't match the existing order stored in the local
> counter. WDYT?

The code assumes the worst-case scenario where users might enable
multiple orders.
Therefore, we require a lightweight approach to prevent frequent
flushing of atomic
operations.

Thanks
Barry
Ryan Roberts April 2, 2024, 8:58 a.m. UTC | #3
On 28/03/2024 09:51, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> Profiling a system blindly with mTHP has become challenging due
> to the lack of visibility into its operations. Presenting the
> success rate of mTHP allocations appears to be pressing need.
> 
> Recently, I've been experiencing significant difficulty debugging
> performance improvements and regressions without these figures.
> It's crucial for us to understand the true effectiveness of
> mTHP in real-world scenarios, especially in systems with
> fragmented memory.
> 
> This patch sets up the framework for per-order mTHP counters,
> starting with the introduction of alloc_success and alloc_fail
> counters.  Incorporating additional counters should now be
> straightforward as well.
> 
> The initial two unsigned longs for each event are unused, given
> that order-0 and order-1 are not mTHP. Nonetheless, this refinement
> improves code clarity.

Overall, this approach looks good to me - thanks for doing this!

> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  -v2:
>  * move to sysfs and provide per-order counters; David, Ryan, Willy
>  -v1:
>  https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@gmail.com/
> 
>  include/linux/huge_mm.h | 17 +++++++++++++
>  mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
>  mm/memory.c             |  3 +++
>  3 files changed, 74 insertions(+)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index e896ca4760f6..27fa26a22a8f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>  					  enforce_sysfs, orders);
>  }
>  
> +enum thp_event_item {
> +	THP_ALLOC_SUCCESS,
> +	THP_ALLOC_FAIL,
> +	NR_THP_EVENT_ITEMS
> +};
> +
> +struct thp_event_state {
> +	unsigned long event[PMD_ORDER + 1][NR_THP_EVENT_ITEMS];
> +};
> +
> +DECLARE_PER_CPU(struct thp_event_state, thp_event_states);
> +
> +static inline void count_thp_event(int order, enum thp_event_item item)

We previously concluded that we will likely want a "currently allocated THPs"
counter, which will need both increment and decrement. So perhaps
"thp_event_inc()" (and eventual "thp_event_dec()" are more appropriate? And
perhaps "event" isn't the right description either since it implies always
counting upwards (to me at least). Although I guess you have borrowed from the
vmstat pattern? That file has some instantaneous counters, so how does it decrement?

> +{
> +	this_cpu_inc(thp_event_states.event[order][item]);

Could we declare as "PMD_ORDER - 1" then do [order - 2] here? That would save 16
bytes per CPU. (8K in a system with 512 CPUs). Possibly with a order_index()
helper since the *_show() functions will need this too.

> +}
> +
>  #define transparent_hugepage_use_zero_page()				\
>  	(transparent_hugepage_flags &					\
>  	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1683de78c313..addd093d8410 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -526,6 +526,52 @@ static const struct kobj_type thpsize_ktype = {
>  	.sysfs_ops = &kobj_sysfs_ops,
>  };
>  
> +DEFINE_PER_CPU(struct thp_event_state, thp_event_states) = {{{0}}};
> +
> +static unsigned long sum_thp_events(int order, enum thp_event_item item)
> +{
> +	unsigned long sum = 0;
> +	int cpu;
> +
> +	for_each_online_cpu(cpu) {
> +		struct thp_event_state *this = &per_cpu(thp_event_states, cpu);
> +
> +		sum += this->event[order][item];
> +	}
> +
> +	return sum;
> +}
> +
> +static ssize_t alloc_success_show(struct kobject *kobj,
> +				   struct kobj_attribute *attr, char *buf)
> +{
> +	int order = to_thpsize(kobj)->order;
> +
> +	return sysfs_emit(buf, "%lu\n", sum_thp_events(order, THP_ALLOC_SUCCESS));
> +}
> +
> +static ssize_t alloc_fail_show(struct kobject *kobj,
> +				   struct kobj_attribute *attr, char *buf)
> +{
> +	int order = to_thpsize(kobj)->order;
> +
> +	return sysfs_emit(buf, "%lu\n", sum_thp_events(order, THP_ALLOC_FAIL));
> +}
> +
> +static struct kobj_attribute alloc_success_attr = __ATTR_RO(alloc_success);
> +static struct kobj_attribute alloc_fail_attr = __ATTR_RO(alloc_fail);
> +
> +static struct attribute *stats_attrs[] = {
> +	&alloc_success_attr.attr,
> +	&alloc_fail_attr.attr,
> +	NULL,
> +};
> +
> +static struct attribute_group stats_attr_group = {
> +	.name = "stats",
> +	.attrs = stats_attrs,
> +};
> +
>  static struct thpsize *thpsize_create(int order, struct kobject *parent)
>  {
>  	unsigned long size = (PAGE_SIZE << order) / SZ_1K;
> @@ -549,6 +595,12 @@ static struct thpsize *thpsize_create(int order, struct kobject *parent)
>  		return ERR_PTR(ret);
>  	}
>  
> +	ret = sysfs_create_group(&thpsize->kobj, &stats_attr_group);
> +	if (ret) {
> +		kobject_put(&thpsize->kobj);
> +		return ERR_PTR(ret);
> +	}
> +
>  	thpsize->order = order;
>  	return thpsize;
>  }
> @@ -1050,8 +1102,10 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  	folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, vma, haddr, true);
>  	if (unlikely(!folio)) {
>  		count_vm_event(THP_FAULT_FALLBACK);
> +		count_thp_event(HPAGE_PMD_ORDER, THP_ALLOC_FAIL);
>  		return VM_FAULT_FALLBACK;
>  	}
> +	count_thp_event(HPAGE_PMD_ORDER, THP_ALLOC_SUCCESS);
>  	return __do_huge_pmd_anonymous_page(vmf, &folio->page, gfp);
>  }
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 984b138f85b4..c9c1031c2ecb 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4365,7 +4365,10 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>  			}
>  			folio_throttle_swaprate(folio, gfp);
>  			clear_huge_page(&folio->page, vmf->address, 1 << order);
> +			count_thp_event(order, THP_ALLOC_SUCCESS);
>  			return folio;
> +		} else {

Do we need this else, given the last statement in the "if" clause is return?

> +			count_thp_event(order, THP_ALLOC_FAIL);
>  		}
>  next:
>  		order = next_order(&orders, order);

Thanks,
Ryan
Barry Song April 2, 2024, 9:40 a.m. UTC | #4
On Tue, Apr 2, 2024 at 9:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 28/03/2024 09:51, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Profiling a system blindly with mTHP has become challenging due
> > to the lack of visibility into its operations. Presenting the
> > success rate of mTHP allocations appears to be pressing need.
> >
> > Recently, I've been experiencing significant difficulty debugging
> > performance improvements and regressions without these figures.
> > It's crucial for us to understand the true effectiveness of
> > mTHP in real-world scenarios, especially in systems with
> > fragmented memory.
> >
> > This patch sets up the framework for per-order mTHP counters,
> > starting with the introduction of alloc_success and alloc_fail
> > counters.  Incorporating additional counters should now be
> > straightforward as well.
> >
> > The initial two unsigned longs for each event are unused, given
> > that order-0 and order-1 are not mTHP. Nonetheless, this refinement
> > improves code clarity.
>
> Overall, this approach looks good to me - thanks for doing this!

Thank you for reviewing!

>
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  -v2:
> >  * move to sysfs and provide per-order counters; David, Ryan, Willy
> >  -v1:
> >  https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@gmail.com/
> >
> >  include/linux/huge_mm.h | 17 +++++++++++++
> >  mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
> >  mm/memory.c             |  3 +++
> >  3 files changed, 74 insertions(+)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index e896ca4760f6..27fa26a22a8f 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> >                                         enforce_sysfs, orders);
> >  }
> >
> > +enum thp_event_item {
> > +     THP_ALLOC_SUCCESS,
> > +     THP_ALLOC_FAIL,
> > +     NR_THP_EVENT_ITEMS
> > +};
> > +
> > +struct thp_event_state {
> > +     unsigned long event[PMD_ORDER + 1][NR_THP_EVENT_ITEMS];
> > +};
> > +
> > +DECLARE_PER_CPU(struct thp_event_state, thp_event_states);
> > +
> > +static inline void count_thp_event(int order, enum thp_event_item item)
>
> We previously concluded that we will likely want a "currently allocated THPs"
> counter, which will need both increment and decrement. So perhaps

I intend to handle this via incremental patches, but it's improbable
to occur this
month due to my packed schedule with multiple tasks. Presently, my top priority
regarding counting is the fallback ratio for allocations. Following
that, it would
likely be the SWAP allocation and fallback ratio to determine if the swapfile is
highly fragmented.

> "thp_event_inc()" (and eventual "thp_event_dec()" are more appropriate? And
> perhaps "event" isn't the right description either since it implies always
> counting upwards (to me at least). Although I guess you have borrowed from the
> vmstat pattern? That file has some instantaneous counters, so how does it decrement?
>

Counting represents just one aspect of vmstat. In this scenario, the values are
consistently increasing. For those metrics that could fluctuate in
either direction,
an entirely different mechanism is employed. These are typically managed through
global atomic operations.

/*
 * Zone and node-based page accounting with per cpu differentials.
 */
extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
extern atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];

void __mod_zone_page_state(struct zone *, enum zone_stat_item item, long);
void __inc_zone_page_state(struct page *, enum zone_stat_item);
void __dec_zone_page_state(struct page *, enum zone_stat_item);

void __mod_node_page_state(struct pglist_data *, enum node_stat_item
item, long);
void __inc_node_page_state(struct page *, enum node_stat_item);
void __dec_node_page_state(struct page *, enum node_stat_item);

static inline void zone_page_state_add(long x, struct zone *zone,
                                 enum zone_stat_item item)
{
        atomic_long_add(x, &zone->vm_stat[item]);
        atomic_long_add(x, &vm_zone_stat[item]);
}

Counting is likely most efficiently handled with per-CPU copies and a summation
function to aggregate values from all CPUs.

> > +{
> > +     this_cpu_inc(thp_event_states.event[order][item]);
>
> Could we declare as "PMD_ORDER - 1" then do [order - 2] here? That would save 16
> bytes per CPU. (8K in a system with 512 CPUs). Possibly with a order_index()
> helper since the *_show() functions will need this too.

I actually put some words in commit messages

"The initial two unsigned longs for each event are unused, given
 that order-0 and order-1 are not mTHP. Nonetheless, this refinement
improves code clarity."

However, if conserving memory is a higher priority, I can utilize
PMD_ORDER - 1 in version 3.

>
> > +}
> > +
> >  #define transparent_hugepage_use_zero_page()                         \
> >       (transparent_hugepage_flags &                                   \
> >        (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 1683de78c313..addd093d8410 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -526,6 +526,52 @@ static const struct kobj_type thpsize_ktype = {
> >       .sysfs_ops = &kobj_sysfs_ops,
> >  };
> >
> > +DEFINE_PER_CPU(struct thp_event_state, thp_event_states) = {{{0}}};
> > +
> > +static unsigned long sum_thp_events(int order, enum thp_event_item item)
> > +{
> > +     unsigned long sum = 0;
> > +     int cpu;
> > +
> > +     for_each_online_cpu(cpu) {
> > +             struct thp_event_state *this = &per_cpu(thp_event_states, cpu);
> > +
> > +             sum += this->event[order][item];
> > +     }
> > +
> > +     return sum;
> > +}
> > +
> > +static ssize_t alloc_success_show(struct kobject *kobj,
> > +                                struct kobj_attribute *attr, char *buf)
> > +{
> > +     int order = to_thpsize(kobj)->order;
> > +
> > +     return sysfs_emit(buf, "%lu\n", sum_thp_events(order, THP_ALLOC_SUCCESS));
> > +}
> > +
> > +static ssize_t alloc_fail_show(struct kobject *kobj,
> > +                                struct kobj_attribute *attr, char *buf)
> > +{
> > +     int order = to_thpsize(kobj)->order;
> > +
> > +     return sysfs_emit(buf, "%lu\n", sum_thp_events(order, THP_ALLOC_FAIL));
> > +}
> > +
> > +static struct kobj_attribute alloc_success_attr = __ATTR_RO(alloc_success);
> > +static struct kobj_attribute alloc_fail_attr = __ATTR_RO(alloc_fail);
> > +
> > +static struct attribute *stats_attrs[] = {
> > +     &alloc_success_attr.attr,
> > +     &alloc_fail_attr.attr,
> > +     NULL,
> > +};
> > +
> > +static struct attribute_group stats_attr_group = {
> > +     .name = "stats",
> > +     .attrs = stats_attrs,
> > +};
> > +
> >  static struct thpsize *thpsize_create(int order, struct kobject *parent)
> >  {
> >       unsigned long size = (PAGE_SIZE << order) / SZ_1K;
> > @@ -549,6 +595,12 @@ static struct thpsize *thpsize_create(int order, struct kobject *parent)
> >               return ERR_PTR(ret);
> >       }
> >
> > +     ret = sysfs_create_group(&thpsize->kobj, &stats_attr_group);
> > +     if (ret) {
> > +             kobject_put(&thpsize->kobj);
> > +             return ERR_PTR(ret);
> > +     }
> > +
> >       thpsize->order = order;
> >       return thpsize;
> >  }
> > @@ -1050,8 +1102,10 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >       folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, vma, haddr, true);
> >       if (unlikely(!folio)) {
> >               count_vm_event(THP_FAULT_FALLBACK);
> > +             count_thp_event(HPAGE_PMD_ORDER, THP_ALLOC_FAIL);
> >               return VM_FAULT_FALLBACK;
> >       }
> > +     count_thp_event(HPAGE_PMD_ORDER, THP_ALLOC_SUCCESS);
> >       return __do_huge_pmd_anonymous_page(vmf, &folio->page, gfp);
> >  }
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 984b138f85b4..c9c1031c2ecb 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4365,7 +4365,10 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> >                       }
> >                       folio_throttle_swaprate(folio, gfp);
> >                       clear_huge_page(&folio->page, vmf->address, 1 << order);
> > +                     count_thp_event(order, THP_ALLOC_SUCCESS);
> >                       return folio;
> > +             } else {
>
> Do we need this else, given the last statement in the "if" clause is return?

I included this "else" statement because we encounter a scenario where
we "goto next" even if allocation succeeds.

                 folio = vma_alloc_folio(gfp, order, vma, addr, true);
                if (folio) {
                        if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
                                folio_put(folio);
                                goto next;
                        }

After revisiting the code for a second time, I noticed that the SUCCESS
counter is incremented after the "goto" statement, so I realize that I won't
need this "else" clause.

               folio = vma_alloc_folio(gfp, order, vma, addr, true);
                if (folio) {
                        if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
                                folio_put(folio);
                                goto next;
                        }
                        folio_throttle_swaprate(folio, gfp);
                        clear_huge_page(&folio->page, vmf->address, 1 << order);
                        count_thp_event(order, THP_ALLOC_SUCCESS);
                        return folio;


So, I have two options:

1. Move the count_thp_event(order, THP_ALLOC_SUCCESS); statement ahead of
mem_cgroup_charge(folio, vma->vm_mm, gfp) and retain the else clause.
2. Keep THP_ALLOC_SUCCESS unchanged and remove the else clause.

Since mem_cgroup_charge() rarely fails, option 2 should suffice.

By the way, why do we proceed to the next order after
mem_cgroup_charge() fails?
I assume mem_cgroup_charge() is still likely to fail even after we move to
the next order?


>
> > +                     count_thp_event(order, THP_ALLOC_FAIL);
> >               }
> >  next:
> >               order = next_order(&orders, order);
>
> Thanks,
> Ryan
>

Thanks
Barry
Ryan Roberts April 2, 2024, 10:44 a.m. UTC | #5
On 02/04/2024 10:40, Barry Song wrote:
> On Tue, Apr 2, 2024 at 9:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 28/03/2024 09:51, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> Profiling a system blindly with mTHP has become challenging due
>>> to the lack of visibility into its operations. Presenting the
>>> success rate of mTHP allocations appears to be pressing need.
>>>
>>> Recently, I've been experiencing significant difficulty debugging
>>> performance improvements and regressions without these figures.
>>> It's crucial for us to understand the true effectiveness of
>>> mTHP in real-world scenarios, especially in systems with
>>> fragmented memory.
>>>
>>> This patch sets up the framework for per-order mTHP counters,
>>> starting with the introduction of alloc_success and alloc_fail
>>> counters.  Incorporating additional counters should now be
>>> straightforward as well.
>>>
>>> The initial two unsigned longs for each event are unused, given
>>> that order-0 and order-1 are not mTHP. Nonetheless, this refinement
>>> improves code clarity.
>>
>> Overall, this approach looks good to me - thanks for doing this!
> 
> Thank you for reviewing!
> 
>>
>>>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>> ---
>>>  -v2:
>>>  * move to sysfs and provide per-order counters; David, Ryan, Willy
>>>  -v1:
>>>  https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@gmail.com/
>>>
>>>  include/linux/huge_mm.h | 17 +++++++++++++
>>>  mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
>>>  mm/memory.c             |  3 +++
>>>  3 files changed, 74 insertions(+)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index e896ca4760f6..27fa26a22a8f 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>                                         enforce_sysfs, orders);
>>>  }
>>>
>>> +enum thp_event_item {
>>> +     THP_ALLOC_SUCCESS,
>>> +     THP_ALLOC_FAIL,
>>> +     NR_THP_EVENT_ITEMS
>>> +};
>>> +
>>> +struct thp_event_state {
>>> +     unsigned long event[PMD_ORDER + 1][NR_THP_EVENT_ITEMS];
>>> +};
>>> +
>>> +DECLARE_PER_CPU(struct thp_event_state, thp_event_states);
>>> +
>>> +static inline void count_thp_event(int order, enum thp_event_item item)
>>
>> We previously concluded that we will likely want a "currently allocated THPs"
>> counter, which will need both increment and decrement. So perhaps
> 
> I intend to handle this via incremental patches, but it's improbable
> to occur this
> month due to my packed schedule with multiple tasks. Presently, my top priority
> regarding counting is the fallback ratio for allocations. Following
> that, it would
> likely be the SWAP allocation and fallback ratio to determine if the swapfile is
> highly fragmented.

Yes indeed. I wasn't suggesting that the extra counters should be added
immediately, just that having a view on the eventual need for "instantaneous"
counters might influence the name of this function.

> 
>> "thp_event_inc()" (and eventual "thp_event_dec()" are more appropriate? And
>> perhaps "event" isn't the right description either since it implies always
>> counting upwards (to me at least). Although I guess you have borrowed from the
>> vmstat pattern? That file has some instantaneous counters, so how does it decrement?
>>
> 
> Counting represents just one aspect of vmstat. In this scenario, the values are
> consistently increasing. For those metrics that could fluctuate in
> either direction,
> an entirely different mechanism is employed. These are typically managed through
> global atomic operations.

OK, but why do these need to be global atomic? I would have thought we could
just define the per-cpu values as signed and allow both inc and dec? Then when
you sum them, you get the correct answer. You could have:

long stat[PMD_ORDER + 1][NR_THP_EVENT_ITEMS];

thp_stat_inc(int order, enum thp_stat_item item);
thp_stat_dec(int order, enum thp_stat_item item);

// Wrapper for increment-only events
static inline void count_thp_event(int order, enum thp_stat_item item)
{
	thp_stat_inc(order, item);
}

Just a thought. Perhaps its better to follow the established pattern.

> 
> /*
>  * Zone and node-based page accounting with per cpu differentials.
>  */
> extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
> extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
> extern atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
> 
> void __mod_zone_page_state(struct zone *, enum zone_stat_item item, long);
> void __inc_zone_page_state(struct page *, enum zone_stat_item);
> void __dec_zone_page_state(struct page *, enum zone_stat_item);
> 
> void __mod_node_page_state(struct pglist_data *, enum node_stat_item
> item, long);
> void __inc_node_page_state(struct page *, enum node_stat_item);
> void __dec_node_page_state(struct page *, enum node_stat_item);
> 
> static inline void zone_page_state_add(long x, struct zone *zone,
>                                  enum zone_stat_item item)
> {
>         atomic_long_add(x, &zone->vm_stat[item]);
>         atomic_long_add(x, &vm_zone_stat[item]);
> }
> 
> Counting is likely most efficiently handled with per-CPU copies and a summation
> function to aggregate values from all CPUs.
> 
>>> +{
>>> +     this_cpu_inc(thp_event_states.event[order][item]);
>>
>> Could we declare as "PMD_ORDER - 1" then do [order - 2] here? That would save 16
>> bytes per CPU. (8K in a system with 512 CPUs). Possibly with a order_index()
>> helper since the *_show() functions will need this too.
> 
> I actually put some words in commit messages
> 
> "The initial two unsigned longs for each event are unused, given
>  that order-0 and order-1 are not mTHP. Nonetheless, this refinement
> improves code clarity."

Yes I saw that. I was just observing that we can have both memory effciency and
code clarity fairly easily. Not a strong opinion though.

> 
> However, if conserving memory is a higher priority, I can utilize
> PMD_ORDER - 1 in version 3.
> 
>>
>>> +}
>>> +
>>>  #define transparent_hugepage_use_zero_page()                         \
>>>       (transparent_hugepage_flags &                                   \
>>>        (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 1683de78c313..addd093d8410 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -526,6 +526,52 @@ static const struct kobj_type thpsize_ktype = {
>>>       .sysfs_ops = &kobj_sysfs_ops,
>>>  };
>>>
>>> +DEFINE_PER_CPU(struct thp_event_state, thp_event_states) = {{{0}}};
>>> +
>>> +static unsigned long sum_thp_events(int order, enum thp_event_item item)
>>> +{
>>> +     unsigned long sum = 0;
>>> +     int cpu;
>>> +
>>> +     for_each_online_cpu(cpu) {
>>> +             struct thp_event_state *this = &per_cpu(thp_event_states, cpu);
>>> +
>>> +             sum += this->event[order][item];
>>> +     }
>>> +
>>> +     return sum;
>>> +}
>>> +
>>> +static ssize_t alloc_success_show(struct kobject *kobj,
>>> +                                struct kobj_attribute *attr, char *buf)
>>> +{
>>> +     int order = to_thpsize(kobj)->order;
>>> +
>>> +     return sysfs_emit(buf, "%lu\n", sum_thp_events(order, THP_ALLOC_SUCCESS));
>>> +}
>>> +
>>> +static ssize_t alloc_fail_show(struct kobject *kobj,
>>> +                                struct kobj_attribute *attr, char *buf)
>>> +{
>>> +     int order = to_thpsize(kobj)->order;
>>> +
>>> +     return sysfs_emit(buf, "%lu\n", sum_thp_events(order, THP_ALLOC_FAIL));
>>> +}
>>> +
>>> +static struct kobj_attribute alloc_success_attr = __ATTR_RO(alloc_success);
>>> +static struct kobj_attribute alloc_fail_attr = __ATTR_RO(alloc_fail);
>>> +
>>> +static struct attribute *stats_attrs[] = {
>>> +     &alloc_success_attr.attr,
>>> +     &alloc_fail_attr.attr,
>>> +     NULL,
>>> +};
>>> +
>>> +static struct attribute_group stats_attr_group = {
>>> +     .name = "stats",
>>> +     .attrs = stats_attrs,
>>> +};
>>> +
>>>  static struct thpsize *thpsize_create(int order, struct kobject *parent)
>>>  {
>>>       unsigned long size = (PAGE_SIZE << order) / SZ_1K;
>>> @@ -549,6 +595,12 @@ static struct thpsize *thpsize_create(int order, struct kobject *parent)
>>>               return ERR_PTR(ret);
>>>       }
>>>
>>> +     ret = sysfs_create_group(&thpsize->kobj, &stats_attr_group);
>>> +     if (ret) {
>>> +             kobject_put(&thpsize->kobj);
>>> +             return ERR_PTR(ret);
>>> +     }
>>> +
>>>       thpsize->order = order;
>>>       return thpsize;
>>>  }
>>> @@ -1050,8 +1102,10 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>>       folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, vma, haddr, true);
>>>       if (unlikely(!folio)) {
>>>               count_vm_event(THP_FAULT_FALLBACK);
>>> +             count_thp_event(HPAGE_PMD_ORDER, THP_ALLOC_FAIL);
>>>               return VM_FAULT_FALLBACK;
>>>       }
>>> +     count_thp_event(HPAGE_PMD_ORDER, THP_ALLOC_SUCCESS);
>>>       return __do_huge_pmd_anonymous_page(vmf, &folio->page, gfp);
>>>  }
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 984b138f85b4..c9c1031c2ecb 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4365,7 +4365,10 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>>                       }
>>>                       folio_throttle_swaprate(folio, gfp);
>>>                       clear_huge_page(&folio->page, vmf->address, 1 << order);
>>> +                     count_thp_event(order, THP_ALLOC_SUCCESS);
>>>                       return folio;
>>> +             } else {
>>
>> Do we need this else, given the last statement in the "if" clause is return?
> 
> I included this "else" statement because we encounter a scenario where
> we "goto next" even if allocation succeeds.
> 
>                  folio = vma_alloc_folio(gfp, order, vma, addr, true);
>                 if (folio) {
>                         if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
>                                 folio_put(folio);
>                                 goto next;
>                         }
> 
> After revisiting the code for a second time, I noticed that the SUCCESS
> counter is incremented after the "goto" statement, so I realize that I won't
> need this "else" clause.
> 
>                folio = vma_alloc_folio(gfp, order, vma, addr, true);
>                 if (folio) {
>                         if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
>                                 folio_put(folio);
>                                 goto next;
>                         }
>                         folio_throttle_swaprate(folio, gfp);
>                         clear_huge_page(&folio->page, vmf->address, 1 << order);
>                         count_thp_event(order, THP_ALLOC_SUCCESS);
>                         return folio;
> 
> 
> So, I have two options:
> 
> 1. Move the count_thp_event(order, THP_ALLOC_SUCCESS); statement ahead of
> mem_cgroup_charge(folio, vma->vm_mm, gfp) and retain the else clause.
> 2. Keep THP_ALLOC_SUCCESS unchanged and remove the else clause.
> 
> Since mem_cgroup_charge() rarely fails, option 2 should suffice.

I guess it depends exactly on what the semantics of THP_ALLOC_SUCCESS and
THP_ALLOC_FAIL are supposed to be. For me, this makes most sense:

	while (orders) {
		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
		folio = vma_alloc_folio(gfp, order, vma, addr, true);
		if (folio) {
			if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
				folio_put(folio);
				goto next;
			}
			folio_throttle_swaprate(folio, gfp);
			clear_huge_page(&folio->page, vmf->address, 1 << order);
			count_thp_event(order, THP_ALLOC_SUCCESS);
			return folio;
		}
next:
		count_thp_event(order, THP_ALLOC_FAIL);
		order = next_order(&orders, order);
	}

But this way you will end up with more THP_ALLOC_FAIL events than page faults.
Let's say you have 1M, 512K, 256K and 128K mTHP enabled, and you fail to
allocate all except 128K; You will end up with 3 FAILs and 1 SUCCESS for a
single fault. Is that a problem?

I guess an alternative approach would be to mark SUCCESS per-size. But have a
single, global FAIL counter, which just tells you how many faults end up getting
a 4K page. (or have that counter in addition to the per-size FAIL counters).

It feels a bit odd that you get a per-size FAIL if the allocation fails, but not
if the VMA is congested; if the VMA is congested, you will have already filtered
out the larger orders and therefore never see the allocation FAIL.

What question are you usually trying to answer with these stats?

Perhaps I'm making this more complicated than it needs to be... :)


> 
> By the way, why do we proceed to the next order after
> mem_cgroup_charge() fails?
> I assume mem_cgroup_charge() is still likely to fail even after we move to
> the next order?

That bit was added by Kefeng Wang, but my understanding is that
mem_cgroup_charge() will fail if we are bumping up against the quota limit? So I
guess the thinking is we may not be able to allocate 1M but 32K may be fine.

> 
> 
>>
>>> +                     count_thp_event(order, THP_ALLOC_FAIL);
>>>               }
>>>  next:
>>>               order = next_order(&orders, order);
>>
>> Thanks,
>> Ryan
>>
> 
> Thanks
> Barry
David Hildenbrand April 2, 2024, 6:46 p.m. UTC | #6
On 28.03.24 10:51, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> Profiling a system blindly with mTHP has become challenging due
> to the lack of visibility into its operations. Presenting the
> success rate of mTHP allocations appears to be pressing need.
> 
> Recently, I've been experiencing significant difficulty debugging
> performance improvements and regressions without these figures.
> It's crucial for us to understand the true effectiveness of
> mTHP in real-world scenarios, especially in systems with
> fragmented memory.
> 
> This patch sets up the framework for per-order mTHP counters,
> starting with the introduction of alloc_success and alloc_fail
> counters.  Incorporating additional counters should now be
> straightforward as well.
> 
> The initial two unsigned longs for each event are unused, given
> that order-0 and order-1 are not mTHP. Nonetheless, this refinement
> improves code clarity.
> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>   -v2:
>   * move to sysfs and provide per-order counters; David, Ryan, Willy
>   -v1:
>   https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@gmail.com/
> 
>   include/linux/huge_mm.h | 17 +++++++++++++
>   mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
>   mm/memory.c             |  3 +++
>   3 files changed, 74 insertions(+)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index e896ca4760f6..27fa26a22a8f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>   					  enforce_sysfs, orders);
>   }
>   
> +enum thp_event_item {
> +	THP_ALLOC_SUCCESS,
> +	THP_ALLOC_FAIL,
> +	NR_THP_EVENT_ITEMS
> +};

I'm wondering if these should be ANON specific for now. We might want to 
add others (shmem, file) in the future.
Barry Song April 2, 2024, 9:29 p.m. UTC | #7
On Wed, Apr 3, 2024 at 7:46 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 28.03.24 10:51, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Profiling a system blindly with mTHP has become challenging due
> > to the lack of visibility into its operations. Presenting the
> > success rate of mTHP allocations appears to be pressing need.
> >
> > Recently, I've been experiencing significant difficulty debugging
> > performance improvements and regressions without these figures.
> > It's crucial for us to understand the true effectiveness of
> > mTHP in real-world scenarios, especially in systems with
> > fragmented memory.
> >
> > This patch sets up the framework for per-order mTHP counters,
> > starting with the introduction of alloc_success and alloc_fail
> > counters.  Incorporating additional counters should now be
> > straightforward as well.
> >
> > The initial two unsigned longs for each event are unused, given
> > that order-0 and order-1 are not mTHP. Nonetheless, this refinement
> > improves code clarity.
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >   -v2:
> >   * move to sysfs and provide per-order counters; David, Ryan, Willy
> >   -v1:
> >   https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@gmail.com/
> >
> >   include/linux/huge_mm.h | 17 +++++++++++++
> >   mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
> >   mm/memory.c             |  3 +++
> >   3 files changed, 74 insertions(+)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index e896ca4760f6..27fa26a22a8f 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> >                                         enforce_sysfs, orders);
> >   }
> >
> > +enum thp_event_item {
> > +     THP_ALLOC_SUCCESS,
> > +     THP_ALLOC_FAIL,
> > +     NR_THP_EVENT_ITEMS
> > +};
>
> I'm wondering if these should be ANON specific for now. We might want to
> add others (shmem, file) in the future.

I've two ways to do that
1. rename to ANON_THP_ALLOC, so that I can have SHMEM_THP_ALLOC, FILE_THP_ALLOC
in the future;
2. let THP_ALLOC cover all of shmem, file and anon.

following vmstat, actually 1 might be better as we have both THP_FAULT_ALLOC and
THP_FILE_ALLOC for pmd-mapped THP.

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
                THP_FAULT_ALLOC,
                THP_FAULT_FALLBACK,
                THP_FAULT_FALLBACK_CHARGE,
                THP_COLLAPSE_ALLOC,
                THP_COLLAPSE_ALLOC_FAILED,
                THP_FILE_ALLOC,
                THP_FILE_FALLBACK,
                THP_FILE_FALLBACK_CHARGE,
                THP_FILE_MAPPED,
                THP_SPLIT_PAGE,
                THP_SPLIT_PAGE_FAILED,
                THP_DEFERRED_SPLIT_PAGE,
                THP_SPLIT_PMD,
                THP_SCAN_EXCEED_NONE_PTE,
                THP_SCAN_EXCEED_SWAP_PTE,
                THP_SCAN_EXCEED_SHARED_PTE,
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
                THP_SPLIT_PUD,
#endif
                THP_ZERO_PAGE_ALLOC,
                THP_ZERO_PAGE_ALLOC_FAILED,
                THP_SWPOUT,
                THP_SWPOUT_FALLBACK,
#endif

And reading mm/shmem.c, obviously, shmem is using THP_FILE_ALLOC.

I will rename it to ANON_THP_ALLOC in v3, let me know if you disagree :-)

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
Barry Song April 2, 2024, 10:14 p.m. UTC | #8
On Tue, Apr 2, 2024 at 11:44 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 02/04/2024 10:40, Barry Song wrote:
> > On Tue, Apr 2, 2024 at 9:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 28/03/2024 09:51, Barry Song wrote:
> >>> From: Barry Song <v-songbaohua@oppo.com>
> >>>
> >>> Profiling a system blindly with mTHP has become challenging due
> >>> to the lack of visibility into its operations. Presenting the
> >>> success rate of mTHP allocations appears to be pressing need.
> >>>
> >>> Recently, I've been experiencing significant difficulty debugging
> >>> performance improvements and regressions without these figures.
> >>> It's crucial for us to understand the true effectiveness of
> >>> mTHP in real-world scenarios, especially in systems with
> >>> fragmented memory.
> >>>
> >>> This patch sets up the framework for per-order mTHP counters,
> >>> starting with the introduction of alloc_success and alloc_fail
> >>> counters.  Incorporating additional counters should now be
> >>> straightforward as well.
> >>>
> >>> The initial two unsigned longs for each event are unused, given
> >>> that order-0 and order-1 are not mTHP. Nonetheless, this refinement
> >>> improves code clarity.
> >>
> >> Overall, this approach looks good to me - thanks for doing this!
> >
> > Thank you for reviewing!
> >
> >>
> >>>
> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>> ---
> >>>  -v2:
> >>>  * move to sysfs and provide per-order counters; David, Ryan, Willy
> >>>  -v1:
> >>>  https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@gmail.com/
> >>>
> >>>  include/linux/huge_mm.h | 17 +++++++++++++
> >>>  mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
> >>>  mm/memory.c             |  3 +++
> >>>  3 files changed, 74 insertions(+)
> >>>
> >>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >>> index e896ca4760f6..27fa26a22a8f 100644
> >>> --- a/include/linux/huge_mm.h
> >>> +++ b/include/linux/huge_mm.h
> >>> @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> >>>                                         enforce_sysfs, orders);
> >>>  }
> >>>
> >>> +enum thp_event_item {
> >>> +     THP_ALLOC_SUCCESS,
> >>> +     THP_ALLOC_FAIL,
> >>> +     NR_THP_EVENT_ITEMS
> >>> +};
> >>> +
> >>> +struct thp_event_state {
> >>> +     unsigned long event[PMD_ORDER + 1][NR_THP_EVENT_ITEMS];
> >>> +};
> >>> +
> >>> +DECLARE_PER_CPU(struct thp_event_state, thp_event_states);
> >>> +
> >>> +static inline void count_thp_event(int order, enum thp_event_item item)
> >>
> >> We previously concluded that we will likely want a "currently allocated THPs"
> >> counter, which will need both increment and decrement. So perhaps
> >
> > I intend to handle this via incremental patches, but it's improbable
> > to occur this
> > month due to my packed schedule with multiple tasks. Presently, my top priority
> > regarding counting is the fallback ratio for allocations. Following
> > that, it would
> > likely be the SWAP allocation and fallback ratio to determine if the swapfile is
> > highly fragmented.
>
> Yes indeed. I wasn't suggesting that the extra counters should be added
> immediately, just that having a view on the eventual need for "instantaneous"
> counters might influence the name of this function.

make sense. "Count" remains the most suitable name for variables that
consistently
increase by one. We might introduce "inc" and "dec" later and encapsulate them
within "count," as you proposed.

thp_stat_inc(int order, enum thp_stat_item item);
thp_stat_dec(int order, enum thp_stat_item item);

// Wrapper for increment-only events
static inline void count_thp_event(int order, enum thp_stat_item item)
 {
         thp_stat_inc(order, item);
 }

>
> >
> >> "thp_event_inc()" (and eventual "thp_event_dec()" are more appropriate? And
> >> perhaps "event" isn't the right description either since it implies always
> >> counting upwards (to me at least). Although I guess you have borrowed from the
> >> vmstat pattern? That file has some instantaneous counters, so how does it decrement?
> >>
> >
> > Counting represents just one aspect of vmstat. In this scenario, the values are
> > consistently increasing. For those metrics that could fluctuate in
> > either direction,
> > an entirely different mechanism is employed. These are typically managed through
> > global atomic operations.
>
> OK, but why do these need to be global atomic? I would have thought we could
> just define the per-cpu values as signed and allow both inc and dec? Then when
> you sum them, you get the correct answer. You could have:
>
> long stat[PMD_ORDER + 1][NR_THP_EVENT_ITEMS];
>
> thp_stat_inc(int order, enum thp_stat_item item);
> thp_stat_dec(int order, enum thp_stat_item item);
>
> // Wrapper for increment-only events
> static inline void count_thp_event(int order, enum thp_stat_item item)
> {
>         thp_stat_inc(order, item);
> }
>
> Just a thought. Perhaps its better to follow the established pattern.
>

i think your proposed approach is exactly the way the existing pattern might
be using, there is a combination of atomic variable and per-cpu inc/dec, for
example:

void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
        struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
        s8 __percpu *p = pcp->vm_stat_diff + item;
        s8 v, t;

        /* See __mod_node_page_state */
        preempt_disable_nested();

        v = __this_cpu_inc_return(*p);
        t = __this_cpu_read(pcp->stat_threshold);
        if (unlikely(v > t)) {
                s8 overstep = t >> 1;

                zone_page_state_add(v + overstep, zone, item);
                __this_cpu_write(*p, -overstep);
        }

        preempt_enable_nested();
}

void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
{
        struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
        s8 __percpu *p = pcp->vm_node_stat_diff + item;
        s8 v, t;

        VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));

        /* See __mod_node_page_state */
        preempt_disable_nested();

        v = __this_cpu_inc_return(*p);
        t = __this_cpu_read(pcp->stat_threshold);
        if (unlikely(v > t)) {
                s8 overstep = t >> 1;

                node_page_state_add(v + overstep, pgdat, item);
                __this_cpu_write(*p, -overstep);
        }

        preempt_enable_nested();
}

zone_page_state_add() and node_page_state_add() are  finally
updating the atomic variable,

static inline void zone_page_state_add(long x, struct zone *zone,
                                 enum zone_stat_item item)
{
        atomic_long_add(x, &zone->vm_stat[item]);
        atomic_long_add(x, &vm_zone_stat[item]);
}

static inline void node_page_state_add(long x, struct pglist_data *pgdat,
                                 enum node_stat_item item)
{
        atomic_long_add(x, &pgdat->vm_stat[item]);
        atomic_long_add(x, &vm_node_stat[item]);
}

I'll dig into the details while bringing up dual-direction counters.

It's likely we can simplify the implementation compared to the complexity
of zone and node states.

> >
> > /*
> >  * Zone and node-based page accounting with per cpu differentials.
> >  */
> > extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
> > extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
> > extern atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
> >
> > void __mod_zone_page_state(struct zone *, enum zone_stat_item item, long);
> > void __inc_zone_page_state(struct page *, enum zone_stat_item);
> > void __dec_zone_page_state(struct page *, enum zone_stat_item);
> >
> > void __mod_node_page_state(struct pglist_data *, enum node_stat_item
> > item, long);
> > void __inc_node_page_state(struct page *, enum node_stat_item);
> > void __dec_node_page_state(struct page *, enum node_stat_item);
> >
> > static inline void zone_page_state_add(long x, struct zone *zone,
> >                                  enum zone_stat_item item)
> > {
> >         atomic_long_add(x, &zone->vm_stat[item]);
> >         atomic_long_add(x, &vm_zone_stat[item]);
> > }
> >
> > Counting is likely most efficiently handled with per-CPU copies and a summation
> > function to aggregate values from all CPUs.
> >
> >>> +{
> >>> +     this_cpu_inc(thp_event_states.event[order][item]);
> >>
> >> Could we declare as "PMD_ORDER - 1" then do [order - 2] here? That would save 16
> >> bytes per CPU. (8K in a system with 512 CPUs). Possibly with a order_index()
> >> helper since the *_show() functions will need this too.
> >
> > I actually put some words in commit messages
> >
> > "The initial two unsigned longs for each event are unused, given
> >  that order-0 and order-1 are not mTHP. Nonetheless, this refinement
> > improves code clarity."
>
> Yes I saw that. I was just observing that we can have both memory effciency and
> code clarity fairly easily. Not a strong opinion though.

Achieving this can be as simple as modifying 3 lines of code. I prefer
the below approach
over using "order_index," as order_index's name feels overly broad, on
the other hand,
THP_MIN_ORDER can potentially be reused by more scenarios.

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 27fa26a22a8f..f62a8ecf1bda 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -70,6 +70,7 @@ extern struct kobj_attribute shmem_enabled_attr;
  * (which is a limitation of the THP implementation).
  */
 #define THP_ORDERS_ALL_ANON    ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
+#define THP_MIN_ORDER          2

 /*
  * Mask of all large folio orders supported for file THP.
@@ -271,14 +272,14 @@ enum thp_event_item {
 };

 struct thp_event_state {
-       unsigned long event[PMD_ORDER + 1][NR_THP_EVENT_ITEMS];
+       unsigned long event[PMD_ORDER + 1 - THP_MIN_ORDER][NR_THP_EVENT_ITEMS];
 };

 DECLARE_PER_CPU(struct thp_event_state, thp_event_states);

 static inline void count_thp_event(int order, enum thp_event_item item)
 {
-       this_cpu_inc(thp_event_states.event[order][item]);
+       this_cpu_inc(thp_event_states.event[order - THP_MIN_ORDER][item]);
 }

 #define transparent_hugepage_use_zero_page()                           \
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e3d35c2fa563..3414ac2cf0dc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -536,7 +536,7 @@ static unsigned long sum_thp_events(int order,
enum thp_event_item item)
        for_each_online_cpu(cpu) {
                struct thp_event_state *this = &per_cpu(thp_event_states, cpu);

-               sum += this->event[order][item];
+               sum += this->event[order - THP_MIN_ORDER][item];
        }


>
> >
> > However, if conserving memory is a higher priority, I can utilize
> > PMD_ORDER - 1 in version 3.
> >
> >>
> >>> +}
> >>> +
> >>>  #define transparent_hugepage_use_zero_page()                         \
> >>>       (transparent_hugepage_flags &                                   \
> >>>        (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
> >>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>> index 1683de78c313..addd093d8410 100644
> >>> --- a/mm/huge_memory.c
> >>> +++ b/mm/huge_memory.c
> >>> @@ -526,6 +526,52 @@ static const struct kobj_type thpsize_ktype = {
> >>>       .sysfs_ops = &kobj_sysfs_ops,
> >>>  };
> >>>
> >>> +DEFINE_PER_CPU(struct thp_event_state, thp_event_states) = {{{0}}};
> >>> +
> >>> +static unsigned long sum_thp_events(int order, enum thp_event_item item)
> >>> +{
> >>> +     unsigned long sum = 0;
> >>> +     int cpu;
> >>> +
> >>> +     for_each_online_cpu(cpu) {
> >>> +             struct thp_event_state *this = &per_cpu(thp_event_states, cpu);
> >>> +
> >>> +             sum += this->event[order][item];
> >>> +     }
> >>> +
> >>> +     return sum;
> >>> +}
> >>> +
> >>> +static ssize_t alloc_success_show(struct kobject *kobj,
> >>> +                                struct kobj_attribute *attr, char *buf)
> >>> +{
> >>> +     int order = to_thpsize(kobj)->order;
> >>> +
> >>> +     return sysfs_emit(buf, "%lu\n", sum_thp_events(order, THP_ALLOC_SUCCESS));
> >>> +}
> >>> +
> >>> +static ssize_t alloc_fail_show(struct kobject *kobj,
> >>> +                                struct kobj_attribute *attr, char *buf)
> >>> +{
> >>> +     int order = to_thpsize(kobj)->order;
> >>> +
> >>> +     return sysfs_emit(buf, "%lu\n", sum_thp_events(order, THP_ALLOC_FAIL));
> >>> +}
> >>> +
> >>> +static struct kobj_attribute alloc_success_attr = __ATTR_RO(alloc_success);
> >>> +static struct kobj_attribute alloc_fail_attr = __ATTR_RO(alloc_fail);
> >>> +
> >>> +static struct attribute *stats_attrs[] = {
> >>> +     &alloc_success_attr.attr,
> >>> +     &alloc_fail_attr.attr,
> >>> +     NULL,
> >>> +};
> >>> +
> >>> +static struct attribute_group stats_attr_group = {
> >>> +     .name = "stats",
> >>> +     .attrs = stats_attrs,
> >>> +};
> >>> +
> >>>  static struct thpsize *thpsize_create(int order, struct kobject *parent)
> >>>  {
> >>>       unsigned long size = (PAGE_SIZE << order) / SZ_1K;
> >>> @@ -549,6 +595,12 @@ static struct thpsize *thpsize_create(int order, struct kobject *parent)
> >>>               return ERR_PTR(ret);
> >>>       }
> >>>
> >>> +     ret = sysfs_create_group(&thpsize->kobj, &stats_attr_group);
> >>> +     if (ret) {
> >>> +             kobject_put(&thpsize->kobj);
> >>> +             return ERR_PTR(ret);
> >>> +     }
> >>> +
> >>>       thpsize->order = order;
> >>>       return thpsize;
> >>>  }
> >>> @@ -1050,8 +1102,10 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >>>       folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, vma, haddr, true);
> >>>       if (unlikely(!folio)) {
> >>>               count_vm_event(THP_FAULT_FALLBACK);
> >>> +             count_thp_event(HPAGE_PMD_ORDER, THP_ALLOC_FAIL);
> >>>               return VM_FAULT_FALLBACK;
> >>>       }
> >>> +     count_thp_event(HPAGE_PMD_ORDER, THP_ALLOC_SUCCESS);
> >>>       return __do_huge_pmd_anonymous_page(vmf, &folio->page, gfp);
> >>>  }
> >>>
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index 984b138f85b4..c9c1031c2ecb 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -4365,7 +4365,10 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> >>>                       }
> >>>                       folio_throttle_swaprate(folio, gfp);
> >>>                       clear_huge_page(&folio->page, vmf->address, 1 << order);
> >>> +                     count_thp_event(order, THP_ALLOC_SUCCESS);
> >>>                       return folio;
> >>> +             } else {
> >>
> >> Do we need this else, given the last statement in the "if" clause is return?
> >
> > I included this "else" statement because we encounter a scenario where
> > we "goto next" even if allocation succeeds.
> >
> >                  folio = vma_alloc_folio(gfp, order, vma, addr, true);
> >                 if (folio) {
> >                         if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
> >                                 folio_put(folio);
> >                                 goto next;
> >                         }
> >
> > After revisiting the code for a second time, I noticed that the SUCCESS
> > counter is incremented after the "goto" statement, so I realize that I won't
> > need this "else" clause.
> >
> >                folio = vma_alloc_folio(gfp, order, vma, addr, true);
> >                 if (folio) {
> >                         if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
> >                                 folio_put(folio);
> >                                 goto next;
> >                         }
> >                         folio_throttle_swaprate(folio, gfp);
> >                         clear_huge_page(&folio->page, vmf->address, 1 << order);
> >                         count_thp_event(order, THP_ALLOC_SUCCESS);
> >                         return folio;
> >
> >
> > So, I have two options:
> >
> > 1. Move the count_thp_event(order, THP_ALLOC_SUCCESS); statement ahead of
> > mem_cgroup_charge(folio, vma->vm_mm, gfp) and retain the else clause.
> > 2. Keep THP_ALLOC_SUCCESS unchanged and remove the else clause.
> >
> > Since mem_cgroup_charge() rarely fails, option 2 should suffice.
>
> I guess it depends exactly on what the semantics of THP_ALLOC_SUCCESS and
> THP_ALLOC_FAIL are supposed to be. For me, this makes most sense:
>
>         while (orders) {
>                 addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>                 folio = vma_alloc_folio(gfp, order, vma, addr, true);
>                 if (folio) {
>                         if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
>                                 folio_put(folio);
>                                 goto next;
>                         }
>                         folio_throttle_swaprate(folio, gfp);
>                         clear_huge_page(&folio->page, vmf->address, 1 << order);
>                         count_thp_event(order, THP_ALLOC_SUCCESS);
>                         return folio;
>                 }
> next:
>                 count_thp_event(order, THP_ALLOC_FAIL);
>                 order = next_order(&orders, order);
>         }
>
> But this way you will end up with more THP_ALLOC_FAIL events than page faults.
> Let's say you have 1M, 512K, 256K and 128K mTHP enabled, and you fail to
> allocate all except 128K; You will end up with 3 FAILs and 1 SUCCESS for a
> single fault. Is that a problem?

I don't see this as an issue. Users are familiar with its
configuration in sysfs. The
key focus here is the extent to which we encounter difficulties in
allocating the
mTHP we prefer due to fragmented buddy memory.
But I would like to move count_thp_event(order, THP_ALLOC_FAIL) before "next:"
as this won't count mem_cgroup_charge failure as THP_ALLOC_FAIL.

>
> I guess an alternative approach would be to mark SUCCESS per-size. But have a
> single, global FAIL counter, which just tells you how many faults end up getting
> a 4K page. (or have that counter in addition to the per-size FAIL counters).
>
> It feels a bit odd that you get a per-size FAIL if the allocation fails, but not
> if the VMA is congested; if the VMA is congested, you will have already filtered
> out the larger orders and therefore never see the allocation FAIL.
>
> What question are you usually trying to answer with these stats?

The crux of the matter is the allocation success rate when setting
mTHP to, for instance,
64KiB. If it's just 5%, mTHP is merely posing as a helpful ally. At
50%, mTHP demonstrates
its reliability as a dependable companion. Attaining an 80% allocation
success rate solidifies
mTHP's status as a steadfast friend, consistently supporting us.

>
> Perhaps I'm making this more complicated than it needs to be... :)
>

Indeed, sticking to the simplest approach seems prudent for now.

/sys/kernel/mm/transparent_hugepage/hugepages-<x>kB/stats/alloc_success
/sys/kernel/mm/transparent_hugepage/hugepages-<x>kB/stats/alloc_fail

Users aren't likely to be oblivious to what's happening, and the kernel may
not be so adept at manipulating data to aid user comprehension; it could
potentially distort the data instead.

>
> >
> > By the way, why do we proceed to the next order after
> > mem_cgroup_charge() fails?
> > I assume mem_cgroup_charge() is still likely to fail even after we move to
> > the next order?
>
> That bit was added by Kefeng Wang, but my understanding is that
> mem_cgroup_charge() will fail if we are bumping up against the quota limit? So I
> guess the thinking is we may not be able to allocate 1M but 32K may be fine.
>
> >
> >
> >>
> >>> +                     count_thp_event(order, THP_ALLOC_FAIL);
> >>>               }
> >>>  next:
> >>>               order = next_order(&orders, order);
> >>
> >> Thanks,
> >> Ryan
> >>
> >
Thanks
Barry
>
Ryan Roberts April 3, 2024, 8:03 a.m. UTC | #9
On 02/04/2024 23:14, Barry Song wrote:
> On Tue, Apr 2, 2024 at 11:44 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 02/04/2024 10:40, Barry Song wrote:
>>> On Tue, Apr 2, 2024 at 9:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 28/03/2024 09:51, Barry Song wrote:
>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>
>>>>> Profiling a system blindly with mTHP has become challenging due
>>>>> to the lack of visibility into its operations. Presenting the
>>>>> success rate of mTHP allocations appears to be pressing need.
>>>>>
>>>>> Recently, I've been experiencing significant difficulty debugging
>>>>> performance improvements and regressions without these figures.
>>>>> It's crucial for us to understand the true effectiveness of
>>>>> mTHP in real-world scenarios, especially in systems with
>>>>> fragmented memory.
>>>>>
>>>>> This patch sets up the framework for per-order mTHP counters,
>>>>> starting with the introduction of alloc_success and alloc_fail
>>>>> counters.  Incorporating additional counters should now be
>>>>> straightforward as well.
>>>>>
>>>>> The initial two unsigned longs for each event are unused, given
>>>>> that order-0 and order-1 are not mTHP. Nonetheless, this refinement
>>>>> improves code clarity.
>>>>
>>>> Overall, this approach looks good to me - thanks for doing this!
>>>
>>> Thank you for reviewing!
>>>
>>>>
>>>>>
>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>> ---
>>>>>  -v2:
>>>>>  * move to sysfs and provide per-order counters; David, Ryan, Willy
>>>>>  -v1:
>>>>>  https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@gmail.com/
>>>>>
>>>>>  include/linux/huge_mm.h | 17 +++++++++++++
>>>>>  mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
>>>>>  mm/memory.c             |  3 +++
>>>>>  3 files changed, 74 insertions(+)
>>>>>
>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>> index e896ca4760f6..27fa26a22a8f 100644
>>>>> --- a/include/linux/huge_mm.h
>>>>> +++ b/include/linux/huge_mm.h
>>>>> @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>>>                                         enforce_sysfs, orders);
>>>>>  }
>>>>>
>>>>> +enum thp_event_item {
>>>>> +     THP_ALLOC_SUCCESS,
>>>>> +     THP_ALLOC_FAIL,
>>>>> +     NR_THP_EVENT_ITEMS
>>>>> +};
>>>>> +
>>>>> +struct thp_event_state {
>>>>> +     unsigned long event[PMD_ORDER + 1][NR_THP_EVENT_ITEMS];
>>>>> +};
>>>>> +
>>>>> +DECLARE_PER_CPU(struct thp_event_state, thp_event_states);
>>>>> +
>>>>> +static inline void count_thp_event(int order, enum thp_event_item item)
>>>>
>>>> We previously concluded that we will likely want a "currently allocated THPs"
>>>> counter, which will need both increment and decrement. So perhaps
>>>
>>> I intend to handle this via incremental patches, but it's improbable
>>> to occur this
>>> month due to my packed schedule with multiple tasks. Presently, my top priority
>>> regarding counting is the fallback ratio for allocations. Following
>>> that, it would
>>> likely be the SWAP allocation and fallback ratio to determine if the swapfile is
>>> highly fragmented.
>>
>> Yes indeed. I wasn't suggesting that the extra counters should be added
>> immediately, just that having a view on the eventual need for "instantaneous"
>> counters might influence the name of this function.
> 
> make sense. "Count" remains the most suitable name for variables that
> consistently
> increase by one. We might introduce "inc" and "dec" later and encapsulate them
> within "count," as you proposed.
> 
> thp_stat_inc(int order, enum thp_stat_item item);
> thp_stat_dec(int order, enum thp_stat_item item);
> 
> // Wrapper for increment-only events
> static inline void count_thp_event(int order, enum thp_stat_item item)
>  {
>          thp_stat_inc(order, item);
>  }
> 
>>
>>>
>>>> "thp_event_inc()" (and eventual "thp_event_dec()" are more appropriate? And
>>>> perhaps "event" isn't the right description either since it implies always
>>>> counting upwards (to me at least). Although I guess you have borrowed from the
>>>> vmstat pattern? That file has some instantaneous counters, so how does it decrement?
>>>>
>>>
>>> Counting represents just one aspect of vmstat. In this scenario, the values are
>>> consistently increasing. For those metrics that could fluctuate in
>>> either direction,
>>> an entirely different mechanism is employed. These are typically managed through
>>> global atomic operations.
>>
>> OK, but why do these need to be global atomic? I would have thought we could
>> just define the per-cpu values as signed and allow both inc and dec? Then when
>> you sum them, you get the correct answer. You could have:
>>
>> long stat[PMD_ORDER + 1][NR_THP_EVENT_ITEMS];
>>
>> thp_stat_inc(int order, enum thp_stat_item item);
>> thp_stat_dec(int order, enum thp_stat_item item);
>>
>> // Wrapper for increment-only events
>> static inline void count_thp_event(int order, enum thp_stat_item item)
>> {
>>         thp_stat_inc(order, item);
>> }
>>
>> Just a thought. Perhaps its better to follow the established pattern.
>>
> 
> i think your proposed approach is exactly the way the existing pattern might
> be using, there is a combination of atomic variable and per-cpu inc/dec, for
> example:
> 
> void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
> {
>         struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
>         s8 __percpu *p = pcp->vm_stat_diff + item;
>         s8 v, t;
> 
>         /* See __mod_node_page_state */
>         preempt_disable_nested();
> 
>         v = __this_cpu_inc_return(*p);
>         t = __this_cpu_read(pcp->stat_threshold);
>         if (unlikely(v > t)) {
>                 s8 overstep = t >> 1;
> 
>                 zone_page_state_add(v + overstep, zone, item);
>                 __this_cpu_write(*p, -overstep);
>         }
> 
>         preempt_enable_nested();
> }
> 
> void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
> {
>         struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
>         s8 __percpu *p = pcp->vm_node_stat_diff + item;
>         s8 v, t;
> 
>         VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
> 
>         /* See __mod_node_page_state */
>         preempt_disable_nested();
> 
>         v = __this_cpu_inc_return(*p);
>         t = __this_cpu_read(pcp->stat_threshold);
>         if (unlikely(v > t)) {
>                 s8 overstep = t >> 1;
> 
>                 node_page_state_add(v + overstep, pgdat, item);
>                 __this_cpu_write(*p, -overstep);
>         }
> 
>         preempt_enable_nested();
> }
> 
> zone_page_state_add() and node_page_state_add() are  finally
> updating the atomic variable,
> 
> static inline void zone_page_state_add(long x, struct zone *zone,
>                                  enum zone_stat_item item)
> {
>         atomic_long_add(x, &zone->vm_stat[item]);
>         atomic_long_add(x, &vm_zone_stat[item]);
> }
> 
> static inline void node_page_state_add(long x, struct pglist_data *pgdat,
>                                  enum node_stat_item item)
> {
>         atomic_long_add(x, &pgdat->vm_stat[item]);
>         atomic_long_add(x, &vm_node_stat[item]);
> }
> 
> I'll dig into the details while bringing up dual-direction counters.
> 
> It's likely we can simplify the implementation compared to the complexity
> of zone and node states.

Yes, agreed!

> 
>>>
>>> /*
>>>  * Zone and node-based page accounting with per cpu differentials.
>>>  */
>>> extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
>>> extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
>>> extern atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
>>>
>>> void __mod_zone_page_state(struct zone *, enum zone_stat_item item, long);
>>> void __inc_zone_page_state(struct page *, enum zone_stat_item);
>>> void __dec_zone_page_state(struct page *, enum zone_stat_item);
>>>
>>> void __mod_node_page_state(struct pglist_data *, enum node_stat_item
>>> item, long);
>>> void __inc_node_page_state(struct page *, enum node_stat_item);
>>> void __dec_node_page_state(struct page *, enum node_stat_item);
>>>
>>> static inline void zone_page_state_add(long x, struct zone *zone,
>>>                                  enum zone_stat_item item)
>>> {
>>>         atomic_long_add(x, &zone->vm_stat[item]);
>>>         atomic_long_add(x, &vm_zone_stat[item]);
>>> }
>>>
>>> Counting is likely most efficiently handled with per-CPU copies and a summation
>>> function to aggregate values from all CPUs.
>>>
>>>>> +{
>>>>> +     this_cpu_inc(thp_event_states.event[order][item]);
>>>>
>>>> Could we declare as "PMD_ORDER - 1" then do [order - 2] here? That would save 16
>>>> bytes per CPU. (8K in a system with 512 CPUs). Possibly with a order_index()
>>>> helper since the *_show() functions will need this too.
>>>
>>> I actually put some words in commit messages
>>>
>>> "The initial two unsigned longs for each event are unused, given
>>>  that order-0 and order-1 are not mTHP. Nonetheless, this refinement
>>> improves code clarity."
>>
>> Yes I saw that. I was just observing that we can have both memory effciency and
>> code clarity fairly easily. Not a strong opinion though.
> 
> Achieving this can be as simple as modifying 3 lines of code. I prefer
> the below approach
> over using "order_index," as order_index's name feels overly broad, on
> the other hand,
> THP_MIN_ORDER can potentially be reused by more scenarios.

Note that file THP can now be order-1. Perhaps its just simpler to waste the
memory as you initially had it. Sorry for the noise.

> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 27fa26a22a8f..f62a8ecf1bda 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -70,6 +70,7 @@ extern struct kobj_attribute shmem_enabled_attr;
>   * (which is a limitation of the THP implementation).
>   */
>  #define THP_ORDERS_ALL_ANON    ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
> +#define THP_MIN_ORDER          2
> 
>  /*
>   * Mask of all large folio orders supported for file THP.
> @@ -271,14 +272,14 @@ enum thp_event_item {
>  };
> 
>  struct thp_event_state {
> -       unsigned long event[PMD_ORDER + 1][NR_THP_EVENT_ITEMS];
> +       unsigned long event[PMD_ORDER + 1 - THP_MIN_ORDER][NR_THP_EVENT_ITEMS];
>  };
> 
>  DECLARE_PER_CPU(struct thp_event_state, thp_event_states);
> 
>  static inline void count_thp_event(int order, enum thp_event_item item)
>  {
> -       this_cpu_inc(thp_event_states.event[order][item]);
> +       this_cpu_inc(thp_event_states.event[order - THP_MIN_ORDER][item]);
>  }
> 
>  #define transparent_hugepage_use_zero_page()                           \
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e3d35c2fa563..3414ac2cf0dc 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -536,7 +536,7 @@ static unsigned long sum_thp_events(int order,
> enum thp_event_item item)
>         for_each_online_cpu(cpu) {
>                 struct thp_event_state *this = &per_cpu(thp_event_states, cpu);
> 
> -               sum += this->event[order][item];
> +               sum += this->event[order - THP_MIN_ORDER][item];
>         }
> 
> 
>>
>>>
>>> However, if conserving memory is a higher priority, I can utilize
>>> PMD_ORDER - 1 in version 3.
>>>
>>>>
>>>>> +}
>>>>> +
>>>>>  #define transparent_hugepage_use_zero_page()                         \
>>>>>       (transparent_hugepage_flags &                                   \
>>>>>        (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>> index 1683de78c313..addd093d8410 100644
>>>>> --- a/mm/huge_memory.c
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -526,6 +526,52 @@ static const struct kobj_type thpsize_ktype = {
>>>>>       .sysfs_ops = &kobj_sysfs_ops,
>>>>>  };
>>>>>
>>>>> +DEFINE_PER_CPU(struct thp_event_state, thp_event_states) = {{{0}}};
>>>>> +
>>>>> +static unsigned long sum_thp_events(int order, enum thp_event_item item)
>>>>> +{
>>>>> +     unsigned long sum = 0;
>>>>> +     int cpu;
>>>>> +
>>>>> +     for_each_online_cpu(cpu) {
>>>>> +             struct thp_event_state *this = &per_cpu(thp_event_states, cpu);
>>>>> +
>>>>> +             sum += this->event[order][item];
>>>>> +     }
>>>>> +
>>>>> +     return sum;
>>>>> +}
>>>>> +
>>>>> +static ssize_t alloc_success_show(struct kobject *kobj,
>>>>> +                                struct kobj_attribute *attr, char *buf)
>>>>> +{
>>>>> +     int order = to_thpsize(kobj)->order;
>>>>> +
>>>>> +     return sysfs_emit(buf, "%lu\n", sum_thp_events(order, THP_ALLOC_SUCCESS));
>>>>> +}
>>>>> +
>>>>> +static ssize_t alloc_fail_show(struct kobject *kobj,
>>>>> +                                struct kobj_attribute *attr, char *buf)
>>>>> +{
>>>>> +     int order = to_thpsize(kobj)->order;
>>>>> +
>>>>> +     return sysfs_emit(buf, "%lu\n", sum_thp_events(order, THP_ALLOC_FAIL));
>>>>> +}
>>>>> +
>>>>> +static struct kobj_attribute alloc_success_attr = __ATTR_RO(alloc_success);
>>>>> +static struct kobj_attribute alloc_fail_attr = __ATTR_RO(alloc_fail);
>>>>> +
>>>>> +static struct attribute *stats_attrs[] = {
>>>>> +     &alloc_success_attr.attr,
>>>>> +     &alloc_fail_attr.attr,
>>>>> +     NULL,
>>>>> +};
>>>>> +
>>>>> +static struct attribute_group stats_attr_group = {
>>>>> +     .name = "stats",
>>>>> +     .attrs = stats_attrs,
>>>>> +};
>>>>> +
>>>>>  static struct thpsize *thpsize_create(int order, struct kobject *parent)
>>>>>  {
>>>>>       unsigned long size = (PAGE_SIZE << order) / SZ_1K;
>>>>> @@ -549,6 +595,12 @@ static struct thpsize *thpsize_create(int order, struct kobject *parent)
>>>>>               return ERR_PTR(ret);
>>>>>       }
>>>>>
>>>>> +     ret = sysfs_create_group(&thpsize->kobj, &stats_attr_group);
>>>>> +     if (ret) {
>>>>> +             kobject_put(&thpsize->kobj);
>>>>> +             return ERR_PTR(ret);
>>>>> +     }
>>>>> +
>>>>>       thpsize->order = order;
>>>>>       return thpsize;
>>>>>  }
>>>>> @@ -1050,8 +1102,10 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>>>>       folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, vma, haddr, true);
>>>>>       if (unlikely(!folio)) {
>>>>>               count_vm_event(THP_FAULT_FALLBACK);
>>>>> +             count_thp_event(HPAGE_PMD_ORDER, THP_ALLOC_FAIL);
>>>>>               return VM_FAULT_FALLBACK;
>>>>>       }
>>>>> +     count_thp_event(HPAGE_PMD_ORDER, THP_ALLOC_SUCCESS);
>>>>>       return __do_huge_pmd_anonymous_page(vmf, &folio->page, gfp);
>>>>>  }
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 984b138f85b4..c9c1031c2ecb 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -4365,7 +4365,10 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>>>>                       }
>>>>>                       folio_throttle_swaprate(folio, gfp);
>>>>>                       clear_huge_page(&folio->page, vmf->address, 1 << order);
>>>>> +                     count_thp_event(order, THP_ALLOC_SUCCESS);
>>>>>                       return folio;
>>>>> +             } else {
>>>>
>>>> Do we need this else, given the last statement in the "if" clause is return?
>>>
>>> I included this "else" statement because we encounter a scenario where
>>> we "goto next" even if allocation succeeds.
>>>
>>>                  folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>>                 if (folio) {
>>>                         if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
>>>                                 folio_put(folio);
>>>                                 goto next;
>>>                         }
>>>
>>> After revisiting the code for a second time, I noticed that the SUCCESS
>>> counter is incremented after the "goto" statement, so I realize that I won't
>>> need this "else" clause.
>>>
>>>                folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>>                 if (folio) {
>>>                         if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
>>>                                 folio_put(folio);
>>>                                 goto next;
>>>                         }
>>>                         folio_throttle_swaprate(folio, gfp);
>>>                         clear_huge_page(&folio->page, vmf->address, 1 << order);
>>>                         count_thp_event(order, THP_ALLOC_SUCCESS);
>>>                         return folio;
>>>
>>>
>>> So, I have two options:
>>>
>>> 1. Move the count_thp_event(order, THP_ALLOC_SUCCESS); statement ahead of
>>> mem_cgroup_charge(folio, vma->vm_mm, gfp) and retain the else clause.
>>> 2. Keep THP_ALLOC_SUCCESS unchanged and remove the else clause.
>>>
>>> Since mem_cgroup_charge() rarely fails, option 2 should suffice.
>>
>> I guess it depends exactly on what the semantics of THP_ALLOC_SUCCESS and
>> THP_ALLOC_FAIL are supposed to be. For me, this makes most sense:
>>
>>         while (orders) {
>>                 addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>                 folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>                 if (folio) {
>>                         if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
>>                                 folio_put(folio);
>>                                 goto next;
>>                         }
>>                         folio_throttle_swaprate(folio, gfp);
>>                         clear_huge_page(&folio->page, vmf->address, 1 << order);
>>                         count_thp_event(order, THP_ALLOC_SUCCESS);
>>                         return folio;
>>                 }
>> next:
>>                 count_thp_event(order, THP_ALLOC_FAIL);
>>                 order = next_order(&orders, order);
>>         }
>>
>> But this way you will end up with more THP_ALLOC_FAIL events than page faults.
>> Let's say you have 1M, 512K, 256K and 128K mTHP enabled, and you fail to
>> allocate all except 128K; You will end up with 3 FAILs and 1 SUCCESS for a
>> single fault. Is that a problem?
> 
> I don't see this as an issue. Users are familiar with its
> configuration in sysfs. The
> key focus here is the extent to which we encounter difficulties in
> allocating the
> mTHP we prefer due to fragmented buddy memory.
> But I would like to move count_thp_event(order, THP_ALLOC_FAIL) before "next:"
> as this won't count mem_cgroup_charge failure as THP_ALLOC_FAIL.

Yes, ok agreed.

> 
>>
>> I guess an alternative approach would be to mark SUCCESS per-size. But have a
>> single, global FAIL counter, which just tells you how many faults end up getting
>> a 4K page. (or have that counter in addition to the per-size FAIL counters).
>>
>> It feels a bit odd that you get a per-size FAIL if the allocation fails, but not
>> if the VMA is congested; if the VMA is congested, you will have already filtered
>> out the larger orders and therefore never see the allocation FAIL.
>>
>> What question are you usually trying to answer with these stats?
> 
> The crux of the matter is the allocation success rate when setting
> mTHP to, for instance,
> 64KiB. If it's just 5%, mTHP is merely posing as a helpful ally. At
> 50%, mTHP demonstrates
> its reliability as a dependable companion. Attaining an 80% allocation
> success rate solidifies
> mTHP's status as a steadfast friend, consistently supporting us.

Yep, OK agreed.

To try to put my original concern succinctly: When memory is not fragmented, you
will always succeed to allocate the largest enabled mTHP size and you won't even
try to allocate a smaller size. But when memory is very fragmented, you will try
to allocate all sizes biggest to smallest, and likely fail them all. So the
small sizes will be skewed towards high number of fails and small number of
successes.

But I agree with your approach. Let's keep it simple.


> 
>>
>> Perhaps I'm making this more complicated than it needs to be... :)
>>
> 
> Indeed, sticking to the simplest approach seems prudent for now.
> 
> /sys/kernel/mm/transparent_hugepage/hugepages-<x>kB/stats/alloc_success
> /sys/kernel/mm/transparent_hugepage/hugepages-<x>kB/stats/alloc_fail
> 
> Users aren't likely to be oblivious to what's happening, and the kernel may
> not be so adept at manipulating data to aid user comprehension; it could
> potentially distort the data instead.
> 
>>
>>>
>>> By the way, why do we proceed to the next order after
>>> mem_cgroup_charge() fails?
>>> I assume mem_cgroup_charge() is still likely to fail even after we move to
>>> the next order?
>>
>> That bit was added by Kefeng Wang, but my understanding is that
>> mem_cgroup_charge() will fail if we are bumping up against the quota limit? So I
>> guess the thinking is we may not be able to allocate 1M but 32K may be fine.
>>
>>>
>>>
>>>>
>>>>> +                     count_thp_event(order, THP_ALLOC_FAIL);
>>>>>               }
>>>>>  next:
>>>>>               order = next_order(&orders, order);
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>
> Thanks
> Barry
>>
David Hildenbrand April 3, 2024, 8:13 a.m. UTC | #10
On 02.04.24 23:29, Barry Song wrote:
> On Wed, Apr 3, 2024 at 7:46 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 28.03.24 10:51, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> Profiling a system blindly with mTHP has become challenging due
>>> to the lack of visibility into its operations. Presenting the
>>> success rate of mTHP allocations appears to be pressing need.
>>>
>>> Recently, I've been experiencing significant difficulty debugging
>>> performance improvements and regressions without these figures.
>>> It's crucial for us to understand the true effectiveness of
>>> mTHP in real-world scenarios, especially in systems with
>>> fragmented memory.
>>>
>>> This patch sets up the framework for per-order mTHP counters,
>>> starting with the introduction of alloc_success and alloc_fail
>>> counters.  Incorporating additional counters should now be
>>> straightforward as well.
>>>
>>> The initial two unsigned longs for each event are unused, given
>>> that order-0 and order-1 are not mTHP. Nonetheless, this refinement
>>> improves code clarity.
>>>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>> ---
>>>    -v2:
>>>    * move to sysfs and provide per-order counters; David, Ryan, Willy
>>>    -v1:
>>>    https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@gmail.com/
>>>
>>>    include/linux/huge_mm.h | 17 +++++++++++++
>>>    mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
>>>    mm/memory.c             |  3 +++
>>>    3 files changed, 74 insertions(+)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index e896ca4760f6..27fa26a22a8f 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>                                          enforce_sysfs, orders);
>>>    }
>>>
>>> +enum thp_event_item {
>>> +     THP_ALLOC_SUCCESS,
>>> +     THP_ALLOC_FAIL,
>>> +     NR_THP_EVENT_ITEMS
>>> +};

Why not simply "enum thp_event" ... "NR_THP_EVENT".

>>
>> I'm wondering if these should be ANON specific for now. We might want to
>> add others (shmem, file) in the future.
> 
> I've two ways to do that
> 1. rename to ANON_THP_ALLOC, so that I can have SHMEM_THP_ALLOC, FILE_THP_ALLOC
> in the future;
> 2. let THP_ALLOC cover all of shmem, file and anon.
> 
> following vmstat, actually 1 might be better as we have both THP_FAULT_ALLOC and
> THP_FILE_ALLOC for pmd-mapped THP.

Yes. Because anon was first, people just named it "THP". Then, file THP 
were added later. Some of that needs a cleanup.

> 
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>                  THP_FAULT_ALLOC,
>                  THP_FAULT_FALLBACK,
>                  THP_FAULT_FALLBACK_CHARGE,
>                  THP_COLLAPSE_ALLOC,
>                  THP_COLLAPSE_ALLOC_FAILED,
>                  THP_FILE_ALLOC,
>                  THP_FILE_FALLBACK,
>                  THP_FILE_FALLBACK_CHARGE,
>                  THP_FILE_MAPPED,
>                  THP_SPLIT_PAGE,
>                  THP_SPLIT_PAGE_FAILED,
>                  THP_DEFERRED_SPLIT_PAGE,
>                  THP_SPLIT_PMD,
>                  THP_SCAN_EXCEED_NONE_PTE,
>                  THP_SCAN_EXCEED_SWAP_PTE,
>                  THP_SCAN_EXCEED_SHARED_PTE,
> #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>                  THP_SPLIT_PUD,
> #endif
>                  THP_ZERO_PAGE_ALLOC,
>                  THP_ZERO_PAGE_ALLOC_FAILED,
>                  THP_SWPOUT,
>                  THP_SWPOUT_FALLBACK,
> #endif
> 
> And reading mm/shmem.c, obviously, shmem is using THP_FILE_ALLOC.

Right.

> 
> I will rename it to ANON_THP_ALLOC in v3, let me know if you disagree :-)

You should give people more time to respond before resending. Please try 
sending new versions only after the discussion on the old version 
finished. Otherwise it's going to be a mess (because I won't repost my 
feedback to v3 :P ).

THP_EVENT_ANON_ALLOC

might be better, so "THP_EVENT" would be your common prefix for "enum 
thp_event".
Ryan Roberts April 3, 2024, 8:18 a.m. UTC | #11
On 02/04/2024 22:29, Barry Song wrote:
> On Wed, Apr 3, 2024 at 7:46 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 28.03.24 10:51, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> Profiling a system blindly with mTHP has become challenging due
>>> to the lack of visibility into its operations. Presenting the
>>> success rate of mTHP allocations appears to be pressing need.
>>>
>>> Recently, I've been experiencing significant difficulty debugging
>>> performance improvements and regressions without these figures.
>>> It's crucial for us to understand the true effectiveness of
>>> mTHP in real-world scenarios, especially in systems with
>>> fragmented memory.
>>>
>>> This patch sets up the framework for per-order mTHP counters,
>>> starting with the introduction of alloc_success and alloc_fail
>>> counters.  Incorporating additional counters should now be
>>> straightforward as well.
>>>
>>> The initial two unsigned longs for each event are unused, given
>>> that order-0 and order-1 are not mTHP. Nonetheless, this refinement
>>> improves code clarity.
>>>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>> ---
>>>   -v2:
>>>   * move to sysfs and provide per-order counters; David, Ryan, Willy
>>>   -v1:
>>>   https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@gmail.com/
>>>
>>>   include/linux/huge_mm.h | 17 +++++++++++++
>>>   mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
>>>   mm/memory.c             |  3 +++
>>>   3 files changed, 74 insertions(+)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index e896ca4760f6..27fa26a22a8f 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>                                         enforce_sysfs, orders);
>>>   }
>>>
>>> +enum thp_event_item {
>>> +     THP_ALLOC_SUCCESS,
>>> +     THP_ALLOC_FAIL,
>>> +     NR_THP_EVENT_ITEMS
>>> +};
>>
>> I'm wondering if these should be ANON specific for now. We might want to
>> add others (shmem, file) in the future.
> 
> I've two ways to do that
> 1. rename to ANON_THP_ALLOC, so that I can have SHMEM_THP_ALLOC, FILE_THP_ALLOC
> in the future;
> 2. let THP_ALLOC cover all of shmem, file and anon.
> 
> following vmstat, actually 1 might be better as we have both THP_FAULT_ALLOC and
> THP_FILE_ALLOC for pmd-mapped THP.
> 
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>                 THP_FAULT_ALLOC,
>                 THP_FAULT_FALLBACK,
>                 THP_FAULT_FALLBACK_CHARGE,
>                 THP_COLLAPSE_ALLOC,
>                 THP_COLLAPSE_ALLOC_FAILED,
>                 THP_FILE_ALLOC,
>                 THP_FILE_FALLBACK,
>                 THP_FILE_FALLBACK_CHARGE,
>                 THP_FILE_MAPPED,
>                 THP_SPLIT_PAGE,
>                 THP_SPLIT_PAGE_FAILED,
>                 THP_DEFERRED_SPLIT_PAGE,
>                 THP_SPLIT_PMD,
>                 THP_SCAN_EXCEED_NONE_PTE,
>                 THP_SCAN_EXCEED_SWAP_PTE,
>                 THP_SCAN_EXCEED_SHARED_PTE,
> #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>                 THP_SPLIT_PUD,
> #endif
>                 THP_ZERO_PAGE_ALLOC,
>                 THP_ZERO_PAGE_ALLOC_FAILED,
>                 THP_SWPOUT,
>                 THP_SWPOUT_FALLBACK,
> #endif
> 
> And reading mm/shmem.c, obviously, shmem is using THP_FILE_ALLOC.
> 
> I will rename it to ANON_THP_ALLOC in v3, let me know if you disagree :-)

I don't think the name of the enum is important - its an implementation detail
that can be changed. Its the name of the sysfs file that matters. Although of
course its nice to keep them in sync from a maintenance pov.

Currently they are called "alloc_success" and "alloc_fail" I believe? Perhaps
"anon_alloc" and "anon_alloc_fallback" are a bit more in keeping with vmstat?

I'm assuming that:

vmstat:thp_fault_alloc == hugepages-2048kB/stats/anon_alloc
vmstat:thp_fault_alloc_fallback == hugepages-2048kB/stats/anon_alloc_fallback

?

Thanks,
Ryan

> 
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
> 
> Thanks
> Barry
David Hildenbrand April 3, 2024, 8:24 a.m. UTC | #12
On 03.04.24 10:18, Ryan Roberts wrote:
> On 02/04/2024 22:29, Barry Song wrote:
>> On Wed, Apr 3, 2024 at 7:46 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 28.03.24 10:51, Barry Song wrote:
>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>
>>>> Profiling a system blindly with mTHP has become challenging due
>>>> to the lack of visibility into its operations. Presenting the
>>>> success rate of mTHP allocations appears to be pressing need.
>>>>
>>>> Recently, I've been experiencing significant difficulty debugging
>>>> performance improvements and regressions without these figures.
>>>> It's crucial for us to understand the true effectiveness of
>>>> mTHP in real-world scenarios, especially in systems with
>>>> fragmented memory.
>>>>
>>>> This patch sets up the framework for per-order mTHP counters,
>>>> starting with the introduction of alloc_success and alloc_fail
>>>> counters.  Incorporating additional counters should now be
>>>> straightforward as well.
>>>>
>>>> The initial two unsigned longs for each event are unused, given
>>>> that order-0 and order-1 are not mTHP. Nonetheless, this refinement
>>>> improves code clarity.
>>>>
>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>> ---
>>>>    -v2:
>>>>    * move to sysfs and provide per-order counters; David, Ryan, Willy
>>>>    -v1:
>>>>    https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@gmail.com/
>>>>
>>>>    include/linux/huge_mm.h | 17 +++++++++++++
>>>>    mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
>>>>    mm/memory.c             |  3 +++
>>>>    3 files changed, 74 insertions(+)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index e896ca4760f6..27fa26a22a8f 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>>                                          enforce_sysfs, orders);
>>>>    }
>>>>
>>>> +enum thp_event_item {
>>>> +     THP_ALLOC_SUCCESS,
>>>> +     THP_ALLOC_FAIL,
>>>> +     NR_THP_EVENT_ITEMS
>>>> +};
>>>
>>> I'm wondering if these should be ANON specific for now. We might want to
>>> add others (shmem, file) in the future.
>>
>> I've two ways to do that
>> 1. rename to ANON_THP_ALLOC, so that I can have SHMEM_THP_ALLOC, FILE_THP_ALLOC
>> in the future;
>> 2. let THP_ALLOC cover all of shmem, file and anon.
>>
>> following vmstat, actually 1 might be better as we have both THP_FAULT_ALLOC and
>> THP_FILE_ALLOC for pmd-mapped THP.
>>
>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>                  THP_FAULT_ALLOC,
>>                  THP_FAULT_FALLBACK,
>>                  THP_FAULT_FALLBACK_CHARGE,
>>                  THP_COLLAPSE_ALLOC,
>>                  THP_COLLAPSE_ALLOC_FAILED,
>>                  THP_FILE_ALLOC,
>>                  THP_FILE_FALLBACK,
>>                  THP_FILE_FALLBACK_CHARGE,
>>                  THP_FILE_MAPPED,
>>                  THP_SPLIT_PAGE,
>>                  THP_SPLIT_PAGE_FAILED,
>>                  THP_DEFERRED_SPLIT_PAGE,
>>                  THP_SPLIT_PMD,
>>                  THP_SCAN_EXCEED_NONE_PTE,
>>                  THP_SCAN_EXCEED_SWAP_PTE,
>>                  THP_SCAN_EXCEED_SHARED_PTE,
>> #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>                  THP_SPLIT_PUD,
>> #endif
>>                  THP_ZERO_PAGE_ALLOC,
>>                  THP_ZERO_PAGE_ALLOC_FAILED,
>>                  THP_SWPOUT,
>>                  THP_SWPOUT_FALLBACK,
>> #endif
>>
>> And reading mm/shmem.c, obviously, shmem is using THP_FILE_ALLOC.
>>
>> I will rename it to ANON_THP_ALLOC in v3, let me know if you disagree :-)
> 
> I don't think the name of the enum is important - its an implementation detail
> that can be changed. Its the name of the sysfs file that matters. Although of
> course its nice to keep them in sync from a maintenance pov.

Jup.

> 
> Currently they are called "alloc_success" and "alloc_fail" I believe? Perhaps
> "anon_alloc" and "anon_alloc_fallback" are a bit more in keeping with vmstat?
> 
> I'm assuming that:
> 
> vmstat:thp_fault_alloc == hugepages-2048kB/stats/anon_alloc
> vmstat:thp_fault_alloc_fallback == hugepages-2048kB/stats/anon_alloc_fallback

Or an "anon" subdirectory ... not sure, just a thought.
Ryan Roberts April 3, 2024, 8:46 a.m. UTC | #13
On 03/04/2024 09:24, David Hildenbrand wrote:
> On 03.04.24 10:18, Ryan Roberts wrote:
>> On 02/04/2024 22:29, Barry Song wrote:
>>> On Wed, Apr 3, 2024 at 7:46 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 28.03.24 10:51, Barry Song wrote:
>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>
>>>>> Profiling a system blindly with mTHP has become challenging due
>>>>> to the lack of visibility into its operations. Presenting the
>>>>> success rate of mTHP allocations appears to be pressing need.
>>>>>
>>>>> Recently, I've been experiencing significant difficulty debugging
>>>>> performance improvements and regressions without these figures.
>>>>> It's crucial for us to understand the true effectiveness of
>>>>> mTHP in real-world scenarios, especially in systems with
>>>>> fragmented memory.
>>>>>
>>>>> This patch sets up the framework for per-order mTHP counters,
>>>>> starting with the introduction of alloc_success and alloc_fail
>>>>> counters.  Incorporating additional counters should now be
>>>>> straightforward as well.
>>>>>
>>>>> The initial two unsigned longs for each event are unused, given
>>>>> that order-0 and order-1 are not mTHP. Nonetheless, this refinement
>>>>> improves code clarity.
>>>>>
>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>> ---
>>>>>    -v2:
>>>>>    * move to sysfs and provide per-order counters; David, Ryan, Willy
>>>>>    -v1:
>>>>>    https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@gmail.com/
>>>>>
>>>>>    include/linux/huge_mm.h | 17 +++++++++++++
>>>>>    mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
>>>>>    mm/memory.c             |  3 +++
>>>>>    3 files changed, 74 insertions(+)
>>>>>
>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>> index e896ca4760f6..27fa26a22a8f 100644
>>>>> --- a/include/linux/huge_mm.h
>>>>> +++ b/include/linux/huge_mm.h
>>>>> @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct
>>>>> vm_area_struct *vma,
>>>>>                                          enforce_sysfs, orders);
>>>>>    }
>>>>>
>>>>> +enum thp_event_item {
>>>>> +     THP_ALLOC_SUCCESS,
>>>>> +     THP_ALLOC_FAIL,
>>>>> +     NR_THP_EVENT_ITEMS
>>>>> +};
>>>>
>>>> I'm wondering if these should be ANON specific for now. We might want to
>>>> add others (shmem, file) in the future.
>>>
>>> I've two ways to do that
>>> 1. rename to ANON_THP_ALLOC, so that I can have SHMEM_THP_ALLOC, FILE_THP_ALLOC
>>> in the future;
>>> 2. let THP_ALLOC cover all of shmem, file and anon.
>>>
>>> following vmstat, actually 1 might be better as we have both THP_FAULT_ALLOC and
>>> THP_FILE_ALLOC for pmd-mapped THP.
>>>
>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>                  THP_FAULT_ALLOC,
>>>                  THP_FAULT_FALLBACK,
>>>                  THP_FAULT_FALLBACK_CHARGE,
>>>                  THP_COLLAPSE_ALLOC,
>>>                  THP_COLLAPSE_ALLOC_FAILED,
>>>                  THP_FILE_ALLOC,
>>>                  THP_FILE_FALLBACK,
>>>                  THP_FILE_FALLBACK_CHARGE,
>>>                  THP_FILE_MAPPED,
>>>                  THP_SPLIT_PAGE,
>>>                  THP_SPLIT_PAGE_FAILED,
>>>                  THP_DEFERRED_SPLIT_PAGE,
>>>                  THP_SPLIT_PMD,
>>>                  THP_SCAN_EXCEED_NONE_PTE,
>>>                  THP_SCAN_EXCEED_SWAP_PTE,
>>>                  THP_SCAN_EXCEED_SHARED_PTE,
>>> #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>                  THP_SPLIT_PUD,
>>> #endif
>>>                  THP_ZERO_PAGE_ALLOC,
>>>                  THP_ZERO_PAGE_ALLOC_FAILED,
>>>                  THP_SWPOUT,
>>>                  THP_SWPOUT_FALLBACK,
>>> #endif
>>>
>>> And reading mm/shmem.c, obviously, shmem is using THP_FILE_ALLOC.
>>>
>>> I will rename it to ANON_THP_ALLOC in v3, let me know if you disagree :-)
>>
>> I don't think the name of the enum is important - its an implementation detail
>> that can be changed. Its the name of the sysfs file that matters. Although of
>> course its nice to keep them in sync from a maintenance pov.
> 
> Jup.
> 
>>
>> Currently they are called "alloc_success" and "alloc_fail" I believe? Perhaps
>> "anon_alloc" and "anon_alloc_fallback" are a bit more in keeping with vmstat?
>>
>> I'm assuming that:
>>
>> vmstat:thp_fault_alloc == hugepages-2048kB/stats/anon_alloc
>> vmstat:thp_fault_alloc_fallback == hugepages-2048kB/stats/anon_alloc_fallback
> 
> Or an "anon" subdirectory ... not sure, just a thought.

I have no strong opinion. I'm thinking about how to easily display all the information though:

$ for f in /sys/kernel/mm/hugepages/hugepages-2048kB/*; do printf '%s: ' "$f";  cat "$f"; done
/sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages: 0
/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages: 0
/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages_mempolicy: 0
/sys/kernel/mm/hugepages/hugepages-2048kB/nr_overcommit_hugepages: 0
/sys/kernel/mm/hugepages/hugepages-2048kB/resv_hugepages: 0
/sys/kernel/mm/hugepages/hugepages-2048kB/surplus_hugepages: 0

$ for f in /sys/kernel/mm/hugepages/hugepages-2048kB/*; do printf '%s: ' `basename "$f"`;  cat "$f"; done
free_hugepages: 0
nr_hugepages: 0
nr_hugepages_mempolicy: 0
nr_overcommit_hugepages: 0
resv_hugepages: 0
surplus_hugepages: 0

It looks cleaner to me if we have all the info in the filename so we don't have to display the whole directory hierachy. 

But I'm sure someone more competant with bash will tell me exactly how to do it with fewer chars and an even nicer display...
David Hildenbrand April 3, 2024, 9:05 a.m. UTC | #14
On 03.04.24 10:46, Ryan Roberts wrote:
> On 03/04/2024 09:24, David Hildenbrand wrote:
>> On 03.04.24 10:18, Ryan Roberts wrote:
>>> On 02/04/2024 22:29, Barry Song wrote:
>>>> On Wed, Apr 3, 2024 at 7:46 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 28.03.24 10:51, Barry Song wrote:
>>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>>
>>>>>> Profiling a system blindly with mTHP has become challenging due
>>>>>> to the lack of visibility into its operations. Presenting the
>>>>>> success rate of mTHP allocations appears to be pressing need.
>>>>>>
>>>>>> Recently, I've been experiencing significant difficulty debugging
>>>>>> performance improvements and regressions without these figures.
>>>>>> It's crucial for us to understand the true effectiveness of
>>>>>> mTHP in real-world scenarios, especially in systems with
>>>>>> fragmented memory.
>>>>>>
>>>>>> This patch sets up the framework for per-order mTHP counters,
>>>>>> starting with the introduction of alloc_success and alloc_fail
>>>>>> counters.  Incorporating additional counters should now be
>>>>>> straightforward as well.
>>>>>>
>>>>>> The initial two unsigned longs for each event are unused, given
>>>>>> that order-0 and order-1 are not mTHP. Nonetheless, this refinement
>>>>>> improves code clarity.
>>>>>>
>>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>>> ---
>>>>>>     -v2:
>>>>>>     * move to sysfs and provide per-order counters; David, Ryan, Willy
>>>>>>     -v1:
>>>>>>     https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@gmail.com/
>>>>>>
>>>>>>     include/linux/huge_mm.h | 17 +++++++++++++
>>>>>>     mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
>>>>>>     mm/memory.c             |  3 +++
>>>>>>     3 files changed, 74 insertions(+)
>>>>>>
>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>> index e896ca4760f6..27fa26a22a8f 100644
>>>>>> --- a/include/linux/huge_mm.h
>>>>>> +++ b/include/linux/huge_mm.h
>>>>>> @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct
>>>>>> vm_area_struct *vma,
>>>>>>                                           enforce_sysfs, orders);
>>>>>>     }
>>>>>>
>>>>>> +enum thp_event_item {
>>>>>> +     THP_ALLOC_SUCCESS,
>>>>>> +     THP_ALLOC_FAIL,
>>>>>> +     NR_THP_EVENT_ITEMS
>>>>>> +};
>>>>>
>>>>> I'm wondering if these should be ANON specific for now. We might want to
>>>>> add others (shmem, file) in the future.
>>>>
>>>> I've two ways to do that
>>>> 1. rename to ANON_THP_ALLOC, so that I can have SHMEM_THP_ALLOC, FILE_THP_ALLOC
>>>> in the future;
>>>> 2. let THP_ALLOC cover all of shmem, file and anon.
>>>>
>>>> following vmstat, actually 1 might be better as we have both THP_FAULT_ALLOC and
>>>> THP_FILE_ALLOC for pmd-mapped THP.
>>>>
>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>                   THP_FAULT_ALLOC,
>>>>                   THP_FAULT_FALLBACK,
>>>>                   THP_FAULT_FALLBACK_CHARGE,
>>>>                   THP_COLLAPSE_ALLOC,
>>>>                   THP_COLLAPSE_ALLOC_FAILED,
>>>>                   THP_FILE_ALLOC,
>>>>                   THP_FILE_FALLBACK,
>>>>                   THP_FILE_FALLBACK_CHARGE,
>>>>                   THP_FILE_MAPPED,
>>>>                   THP_SPLIT_PAGE,
>>>>                   THP_SPLIT_PAGE_FAILED,
>>>>                   THP_DEFERRED_SPLIT_PAGE,
>>>>                   THP_SPLIT_PMD,
>>>>                   THP_SCAN_EXCEED_NONE_PTE,
>>>>                   THP_SCAN_EXCEED_SWAP_PTE,
>>>>                   THP_SCAN_EXCEED_SHARED_PTE,
>>>> #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>>                   THP_SPLIT_PUD,
>>>> #endif
>>>>                   THP_ZERO_PAGE_ALLOC,
>>>>                   THP_ZERO_PAGE_ALLOC_FAILED,
>>>>                   THP_SWPOUT,
>>>>                   THP_SWPOUT_FALLBACK,
>>>> #endif
>>>>
>>>> And reading mm/shmem.c, obviously, shmem is using THP_FILE_ALLOC.
>>>>
>>>> I will rename it to ANON_THP_ALLOC in v3, let me know if you disagree :-)
>>>
>>> I don't think the name of the enum is important - its an implementation detail
>>> that can be changed. Its the name of the sysfs file that matters. Although of
>>> course its nice to keep them in sync from a maintenance pov.
>>
>> Jup.
>>
>>>
>>> Currently they are called "alloc_success" and "alloc_fail" I believe? Perhaps
>>> "anon_alloc" and "anon_alloc_fallback" are a bit more in keeping with vmstat?
>>>
>>> I'm assuming that:
>>>
>>> vmstat:thp_fault_alloc == hugepages-2048kB/stats/anon_alloc
>>> vmstat:thp_fault_alloc_fallback == hugepages-2048kB/stats/anon_alloc_fallback
>>
>> Or an "anon" subdirectory ... not sure, just a thought.
> 
> I have no strong opinion. I'm thinking about how to easily display all the information though:
> 
> $ for f in /sys/kernel/mm/hugepages/hugepages-2048kB/*; do printf '%s: ' "$f";  cat "$f"; done
> /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages: 0
> /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages: 0
> /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages_mempolicy: 0
> /sys/kernel/mm/hugepages/hugepages-2048kB/nr_overcommit_hugepages: 0
> /sys/kernel/mm/hugepages/hugepages-2048kB/resv_hugepages: 0
> /sys/kernel/mm/hugepages/hugepages-2048kB/surplus_hugepages: 0
> 
> $ for f in /sys/kernel/mm/hugepages/hugepages-2048kB/*; do printf '%s: ' `basename "$f"`;  cat "$f"; done
> free_hugepages: 0
> nr_hugepages: 0
> nr_hugepages_mempolicy: 0
> nr_overcommit_hugepages: 0
> resv_hugepages: 0
> surplus_hugepages: 0
> 
> It looks cleaner to me if we have all the info in the filename so we don't have to display the whole directory hierachy.
> 
> But I'm sure someone more competant with bash will tell me exactly how to do it with fewer chars and an even nicer display...

Likely something involving "find" would also work. Once we have stats 
subdirectories it would get more involved.

find /sys/kernel/mm/hugepages/hugepages-2048kB/ -type f -printf '%f: ' 
-exec cat {} \;
Barry Song April 3, 2024, 8:47 p.m. UTC | #15
On Wed, Apr 3, 2024 at 9:13 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 02.04.24 23:29, Barry Song wrote:
> > On Wed, Apr 3, 2024 at 7:46 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 28.03.24 10:51, Barry Song wrote:
> >>> From: Barry Song <v-songbaohua@oppo.com>
> >>>
> >>> Profiling a system blindly with mTHP has become challenging due
> >>> to the lack of visibility into its operations. Presenting the
> >>> success rate of mTHP allocations appears to be pressing need.
> >>>
> >>> Recently, I've been experiencing significant difficulty debugging
> >>> performance improvements and regressions without these figures.
> >>> It's crucial for us to understand the true effectiveness of
> >>> mTHP in real-world scenarios, especially in systems with
> >>> fragmented memory.
> >>>
> >>> This patch sets up the framework for per-order mTHP counters,
> >>> starting with the introduction of alloc_success and alloc_fail
> >>> counters.  Incorporating additional counters should now be
> >>> straightforward as well.
> >>>
> >>> The initial two unsigned longs for each event are unused, given
> >>> that order-0 and order-1 are not mTHP. Nonetheless, this refinement
> >>> improves code clarity.
> >>>
> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>> ---
> >>>    -v2:
> >>>    * move to sysfs and provide per-order counters; David, Ryan, Willy
> >>>    -v1:
> >>>    https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@gmail.com/
> >>>
> >>>    include/linux/huge_mm.h | 17 +++++++++++++
> >>>    mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
> >>>    mm/memory.c             |  3 +++
> >>>    3 files changed, 74 insertions(+)
> >>>
> >>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >>> index e896ca4760f6..27fa26a22a8f 100644
> >>> --- a/include/linux/huge_mm.h
> >>> +++ b/include/linux/huge_mm.h
> >>> @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> >>>                                          enforce_sysfs, orders);
> >>>    }
> >>>
> >>> +enum thp_event_item {
> >>> +     THP_ALLOC_SUCCESS,
> >>> +     THP_ALLOC_FAIL,
> >>> +     NR_THP_EVENT_ITEMS
> >>> +};
>
> Why not simply "enum thp_event" ... "NR_THP_EVENT".

ok.

>
> >>
> >> I'm wondering if these should be ANON specific for now. We might want to
> >> add others (shmem, file) in the future.
> >
> > I've two ways to do that
> > 1. rename to ANON_THP_ALLOC, so that I can have SHMEM_THP_ALLOC, FILE_THP_ALLOC
> > in the future;
> > 2. let THP_ALLOC cover all of shmem, file and anon.
> >
> > following vmstat, actually 1 might be better as we have both THP_FAULT_ALLOC and
> > THP_FILE_ALLOC for pmd-mapped THP.
>
> Yes. Because anon was first, people just named it "THP". Then, file THP
> were added later. Some of that needs a cleanup.

agreed.

>
> >
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >                  THP_FAULT_ALLOC,
> >                  THP_FAULT_FALLBACK,
> >                  THP_FAULT_FALLBACK_CHARGE,
> >                  THP_COLLAPSE_ALLOC,
> >                  THP_COLLAPSE_ALLOC_FAILED,
> >                  THP_FILE_ALLOC,
> >                  THP_FILE_FALLBACK,
> >                  THP_FILE_FALLBACK_CHARGE,
> >                  THP_FILE_MAPPED,
> >                  THP_SPLIT_PAGE,
> >                  THP_SPLIT_PAGE_FAILED,
> >                  THP_DEFERRED_SPLIT_PAGE,
> >                  THP_SPLIT_PMD,
> >                  THP_SCAN_EXCEED_NONE_PTE,
> >                  THP_SCAN_EXCEED_SWAP_PTE,
> >                  THP_SCAN_EXCEED_SHARED_PTE,
> > #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >                  THP_SPLIT_PUD,
> > #endif
> >                  THP_ZERO_PAGE_ALLOC,
> >                  THP_ZERO_PAGE_ALLOC_FAILED,
> >                  THP_SWPOUT,
> >                  THP_SWPOUT_FALLBACK,
> > #endif
> >
> > And reading mm/shmem.c, obviously, shmem is using THP_FILE_ALLOC.
>
> Right.
>
> >
> > I will rename it to ANON_THP_ALLOC in v3, let me know if you disagree :-)
>
> You should give people more time to respond before resending. Please try
> sending new versions only after the discussion on the old version
> finished. Otherwise it's going to be a mess (because I won't repost my
> feedback to v3 :P ).

agreed :-)

>
> THP_EVENT_ANON_ALLOC
>
> might be better, so "THP_EVENT" would be your common prefix for "enum
> thp_event".
>

Should be ok. On the other hand, currently, there are numerous occurrences of
events or stats within mm, but we haven't incorporated "EVENT" or "STAT"
into the naming conventions.

> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
Barry Song April 3, 2024, 9:06 p.m. UTC | #16
On Wed, Apr 3, 2024 at 9:24 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 03.04.24 10:18, Ryan Roberts wrote:
> > On 02/04/2024 22:29, Barry Song wrote:
> >> On Wed, Apr 3, 2024 at 7:46 AM David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>> On 28.03.24 10:51, Barry Song wrote:
> >>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>
> >>>> Profiling a system blindly with mTHP has become challenging due
> >>>> to the lack of visibility into its operations. Presenting the
> >>>> success rate of mTHP allocations appears to be pressing need.
> >>>>
> >>>> Recently, I've been experiencing significant difficulty debugging
> >>>> performance improvements and regressions without these figures.
> >>>> It's crucial for us to understand the true effectiveness of
> >>>> mTHP in real-world scenarios, especially in systems with
> >>>> fragmented memory.
> >>>>
> >>>> This patch sets up the framework for per-order mTHP counters,
> >>>> starting with the introduction of alloc_success and alloc_fail
> >>>> counters.  Incorporating additional counters should now be
> >>>> straightforward as well.
> >>>>
> >>>> The initial two unsigned longs for each event are unused, given
> >>>> that order-0 and order-1 are not mTHP. Nonetheless, this refinement
> >>>> improves code clarity.
> >>>>
> >>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>>> ---
> >>>>    -v2:
> >>>>    * move to sysfs and provide per-order counters; David, Ryan, Willy
> >>>>    -v1:
> >>>>    https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@gmail.com/
> >>>>
> >>>>    include/linux/huge_mm.h | 17 +++++++++++++
> >>>>    mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
> >>>>    mm/memory.c             |  3 +++
> >>>>    3 files changed, 74 insertions(+)
> >>>>
> >>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >>>> index e896ca4760f6..27fa26a22a8f 100644
> >>>> --- a/include/linux/huge_mm.h
> >>>> +++ b/include/linux/huge_mm.h
> >>>> @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> >>>>                                          enforce_sysfs, orders);
> >>>>    }
> >>>>
> >>>> +enum thp_event_item {
> >>>> +     THP_ALLOC_SUCCESS,
> >>>> +     THP_ALLOC_FAIL,
> >>>> +     NR_THP_EVENT_ITEMS
> >>>> +};
> >>>
> >>> I'm wondering if these should be ANON specific for now. We might want to
> >>> add others (shmem, file) in the future.
> >>
> >> I've two ways to do that
> >> 1. rename to ANON_THP_ALLOC, so that I can have SHMEM_THP_ALLOC, FILE_THP_ALLOC
> >> in the future;
> >> 2. let THP_ALLOC cover all of shmem, file and anon.
> >>
> >> following vmstat, actually 1 might be better as we have both THP_FAULT_ALLOC and
> >> THP_FILE_ALLOC for pmd-mapped THP.
> >>
> >> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>                  THP_FAULT_ALLOC,
> >>                  THP_FAULT_FALLBACK,
> >>                  THP_FAULT_FALLBACK_CHARGE,
> >>                  THP_COLLAPSE_ALLOC,
> >>                  THP_COLLAPSE_ALLOC_FAILED,
> >>                  THP_FILE_ALLOC,
> >>                  THP_FILE_FALLBACK,
> >>                  THP_FILE_FALLBACK_CHARGE,
> >>                  THP_FILE_MAPPED,
> >>                  THP_SPLIT_PAGE,
> >>                  THP_SPLIT_PAGE_FAILED,
> >>                  THP_DEFERRED_SPLIT_PAGE,
> >>                  THP_SPLIT_PMD,
> >>                  THP_SCAN_EXCEED_NONE_PTE,
> >>                  THP_SCAN_EXCEED_SWAP_PTE,
> >>                  THP_SCAN_EXCEED_SHARED_PTE,
> >> #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >>                  THP_SPLIT_PUD,
> >> #endif
> >>                  THP_ZERO_PAGE_ALLOC,
> >>                  THP_ZERO_PAGE_ALLOC_FAILED,
> >>                  THP_SWPOUT,
> >>                  THP_SWPOUT_FALLBACK,
> >> #endif
> >>
> >> And reading mm/shmem.c, obviously, shmem is using THP_FILE_ALLOC.
> >>
> >> I will rename it to ANON_THP_ALLOC in v3, let me know if you disagree :-)
> >
> > I don't think the name of the enum is important - its an implementation detail
> > that can be changed. Its the name of the sysfs file that matters. Although of
> > course its nice to keep them in sync from a maintenance pov.
>
> Jup.
>
> >
> > Currently they are called "alloc_success" and "alloc_fail" I believe? Perhaps
> > "anon_alloc" and "anon_alloc_fallback" are a bit more in keeping with vmstat?
> >
> > I'm assuming that:
> >
> > vmstat:thp_fault_alloc == hugepages-2048kB/stats/anon_alloc
> > vmstat:thp_fault_alloc_fallback == hugepages-2048kB/stats/anon_alloc_fallback
>
> Or an "anon" subdirectory ... not sure, just a thought.

I'd rather streamline the directory structure and incorporate "anon" or
"file" details into the filenames to avoid the inconvenience of typing
lengthy paths.  we already have the "stats" directory.

>
> --
> Cheers,
>
> David / dhildenb
>
diff mbox series

Patch

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e896ca4760f6..27fa26a22a8f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -264,6 +264,23 @@  unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 					  enforce_sysfs, orders);
 }
 
+enum thp_event_item {
+	THP_ALLOC_SUCCESS,
+	THP_ALLOC_FAIL,
+	NR_THP_EVENT_ITEMS
+};
+
+struct thp_event_state {
+	unsigned long event[PMD_ORDER + 1][NR_THP_EVENT_ITEMS];
+};
+
+DECLARE_PER_CPU(struct thp_event_state, thp_event_states);
+
+static inline void count_thp_event(int order, enum thp_event_item item)
+{
+	this_cpu_inc(thp_event_states.event[order][item]);
+}
+
 #define transparent_hugepage_use_zero_page()				\
 	(transparent_hugepage_flags &					\
 	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1683de78c313..addd093d8410 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -526,6 +526,52 @@  static const struct kobj_type thpsize_ktype = {
 	.sysfs_ops = &kobj_sysfs_ops,
 };
 
+DEFINE_PER_CPU(struct thp_event_state, thp_event_states) = {{{0}}};
+
+static unsigned long sum_thp_events(int order, enum thp_event_item item)
+{
+	unsigned long sum = 0;
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct thp_event_state *this = &per_cpu(thp_event_states, cpu);
+
+		sum += this->event[order][item];
+	}
+
+	return sum;
+}
+
+static ssize_t alloc_success_show(struct kobject *kobj,
+				   struct kobj_attribute *attr, char *buf)
+{
+	int order = to_thpsize(kobj)->order;
+
+	return sysfs_emit(buf, "%lu\n", sum_thp_events(order, THP_ALLOC_SUCCESS));
+}
+
+static ssize_t alloc_fail_show(struct kobject *kobj,
+				   struct kobj_attribute *attr, char *buf)
+{
+	int order = to_thpsize(kobj)->order;
+
+	return sysfs_emit(buf, "%lu\n", sum_thp_events(order, THP_ALLOC_FAIL));
+}
+
+static struct kobj_attribute alloc_success_attr = __ATTR_RO(alloc_success);
+static struct kobj_attribute alloc_fail_attr = __ATTR_RO(alloc_fail);
+
+static struct attribute *stats_attrs[] = {
+	&alloc_success_attr.attr,
+	&alloc_fail_attr.attr,
+	NULL,
+};
+
+static struct attribute_group stats_attr_group = {
+	.name = "stats",
+	.attrs = stats_attrs,
+};
+
 static struct thpsize *thpsize_create(int order, struct kobject *parent)
 {
 	unsigned long size = (PAGE_SIZE << order) / SZ_1K;
@@ -549,6 +595,12 @@  static struct thpsize *thpsize_create(int order, struct kobject *parent)
 		return ERR_PTR(ret);
 	}
 
+	ret = sysfs_create_group(&thpsize->kobj, &stats_attr_group);
+	if (ret) {
+		kobject_put(&thpsize->kobj);
+		return ERR_PTR(ret);
+	}
+
 	thpsize->order = order;
 	return thpsize;
 }
@@ -1050,8 +1102,10 @@  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, vma, haddr, true);
 	if (unlikely(!folio)) {
 		count_vm_event(THP_FAULT_FALLBACK);
+		count_thp_event(HPAGE_PMD_ORDER, THP_ALLOC_FAIL);
 		return VM_FAULT_FALLBACK;
 	}
+	count_thp_event(HPAGE_PMD_ORDER, THP_ALLOC_SUCCESS);
 	return __do_huge_pmd_anonymous_page(vmf, &folio->page, gfp);
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 984b138f85b4..c9c1031c2ecb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4365,7 +4365,10 @@  static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 			}
 			folio_throttle_swaprate(folio, gfp);
 			clear_huge_page(&folio->page, vmf->address, 1 << order);
+			count_thp_event(order, THP_ALLOC_SUCCESS);
 			return folio;
+		} else {
+			count_thp_event(order, THP_ALLOC_FAIL);
 		}
 next:
 		order = next_order(&orders, order);