diff mbox series

[v2,2/3] mm/page_reporting: Allow driver to specify threshold

Message ID 20210622074926.333223-3-gshan@redhat.com (mailing list archive)
State New
Headers show
Series mm/page_reporting: Make page reporting work on arm64 with 64KB page size | expand

Commit Message

Gavin Shan June 22, 2021, 7:49 a.m. UTC
The page reporting threshold is currently sticky to @pageblock_order.
The page reporting can never be triggered because the freeing page
can't come up with a free area like that huge. The situation becomes
worse when the system memory becomes heavily fragmented.

For example, the following configurations are used on ARM64 when 64KB
base page size is enabled. In this specific case, the page reporting
won't be triggered until the freeing page comes up with a 512MB free
area. That's hard to be met, especially when the system memory becomes
heavily fragmented.

   PAGE_SIZE:          64KB
   HPAGE_SIZE:         512MB
   pageblock_order:    13       (512MB)
   MAX_ORDER:          14

This allows the drivers to specify the threshold when the page
reporting device is registered. The threshold falls back to
@pageblock_order if it's not specified by the driver. The existing
users (hv_balloon and virtio_balloon) don't specify the threshold
and @pageblock_order is still taken as their page reporting order.
So this shouldn't introduce functional changes.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 include/linux/page_reporting.h |  3 +++
 mm/page_reporting.c            | 14 ++++++++++----
 mm/page_reporting.h            | 10 ++--------
 3 files changed, 15 insertions(+), 12 deletions(-)

Comments

Alexander Duyck June 22, 2021, 5:39 p.m. UTC | #1
On Mon, Jun 21, 2021 at 10:48 PM Gavin Shan <gshan@redhat.com> wrote:
>
> The page reporting threshold is currently sticky to @pageblock_order.
> The page reporting can never be triggered because the freeing page
> can't come up with a free area like that huge. The situation becomes
> worse when the system memory becomes heavily fragmented.
>
> For example, the following configurations are used on ARM64 when 64KB
> base page size is enabled. In this specific case, the page reporting
> won't be triggered until the freeing page comes up with a 512MB free
> area. That's hard to be met, especially when the system memory becomes
> heavily fragmented.
>
>    PAGE_SIZE:          64KB
>    HPAGE_SIZE:         512MB
>    pageblock_order:    13       (512MB)
>    MAX_ORDER:          14
>
> This allows the drivers to specify the threshold when the page
> reporting device is registered. The threshold falls back to
> @pageblock_order if it's not specified by the driver. The existing
> users (hv_balloon and virtio_balloon) don't specify the threshold
> and @pageblock_order is still taken as their page reporting order.
> So this shouldn't introduce functional changes.
>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  include/linux/page_reporting.h |  3 +++
>  mm/page_reporting.c            | 14 ++++++++++----
>  mm/page_reporting.h            | 10 ++--------
>  3 files changed, 15 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
> index 3b99e0ec24f2..fe648dfa3a7c 100644
> --- a/include/linux/page_reporting.h
> +++ b/include/linux/page_reporting.h
> @@ -18,6 +18,9 @@ struct page_reporting_dev_info {
>
>         /* Current state of page reporting */
>         atomic_t state;
> +
> +       /* Minimal order of page reporting */
> +       unsigned int order;
>  };
>
>  /* Tear-down and bring-up for page reporting devices */
> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> index df9c5054e1b4..27670360bae6 100644
> --- a/mm/page_reporting.c
> +++ b/mm/page_reporting.c

<snip>

> @@ -324,6 +324,12 @@ int page_reporting_register(struct page_reporting_dev_info *prdev)
>                 goto err_out;
>         }
>
> +       /*
> +        * We need to choose the minimal order of page reporting if it's
> +        * not specified by the driver.
> +        */
> +       prdev->order = prdev->order ? prdev->order : pageblock_order;
> +
>         /* initialize state and work structures */
>         atomic_set(&prdev->state, PAGE_REPORTING_IDLE);
>         INIT_DELAYED_WORK(&prdev->work, &page_reporting_process);

Rather than using prdev->order directly it might be better to have a
reporting_order value you could export for use by
page_reporting_notify_free. That way you avoid the overhead of having
to make a function call per page freed.

> diff --git a/mm/page_reporting.h b/mm/page_reporting.h
> index 2c385dd4ddbd..d9f972e72649 100644
> --- a/mm/page_reporting.h
> +++ b/mm/page_reporting.h
> @@ -10,11 +10,9 @@
>  #include <linux/pgtable.h>
>  #include <linux/scatterlist.h>
>
> -#define PAGE_REPORTING_MIN_ORDER       pageblock_order
> -
>  #ifdef CONFIG_PAGE_REPORTING
>  DECLARE_STATIC_KEY_FALSE(page_reporting_enabled);
> -void __page_reporting_notify(void);
> +void __page_reporting_notify(unsigned int order);
>
>  static inline bool page_reported(struct page *page)
>  {
> @@ -37,12 +35,8 @@ static inline void page_reporting_notify_free(unsigned int order)
>         if (!static_branch_unlikely(&page_reporting_enabled))
>                 return;
>
> -       /* Determine if we have crossed reporting threshold */
> -       if (order < PAGE_REPORTING_MIN_ORDER)
> -               return;
> -
>         /* This will add a few cycles, but should be called infrequently */
> -       __page_reporting_notify();
> +       __page_reporting_notify(order);
>  }
>  #else /* CONFIG_PAGE_REPORTING */
>  #define page_reported(_page)   false

With us making the function call per page freed we are likely to have
a much more significant impact on performance with page reporting
enabled. Ideally we want to limit this impact so that we only take the
cost for the conditional check on the lower order pages.
Gavin Shan June 23, 2021, 12:43 a.m. UTC | #2
On 6/23/21 3:39 AM, Alexander Duyck wrote:
> On Mon, Jun 21, 2021 at 10:48 PM Gavin Shan <gshan@redhat.com> wrote:
>>
>> The page reporting threshold is currently sticky to @pageblock_order.
>> The page reporting can never be triggered because the freeing page
>> can't come up with a free area like that huge. The situation becomes
>> worse when the system memory becomes heavily fragmented.
>>
>> For example, the following configurations are used on ARM64 when 64KB
>> base page size is enabled. In this specific case, the page reporting
>> won't be triggered until the freeing page comes up with a 512MB free
>> area. That's hard to be met, especially when the system memory becomes
>> heavily fragmented.
>>
>>     PAGE_SIZE:          64KB
>>     HPAGE_SIZE:         512MB
>>     pageblock_order:    13       (512MB)
>>     MAX_ORDER:          14
>>
>> This allows the drivers to specify the threshold when the page
>> reporting device is registered. The threshold falls back to
>> @pageblock_order if it's not specified by the driver. The existing
>> users (hv_balloon and virtio_balloon) don't specify the threshold
>> and @pageblock_order is still taken as their page reporting order.
>> So this shouldn't introduce functional changes.
>>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   include/linux/page_reporting.h |  3 +++
>>   mm/page_reporting.c            | 14 ++++++++++----
>>   mm/page_reporting.h            | 10 ++--------
>>   3 files changed, 15 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
>> index 3b99e0ec24f2..fe648dfa3a7c 100644
>> --- a/include/linux/page_reporting.h
>> +++ b/include/linux/page_reporting.h
>> @@ -18,6 +18,9 @@ struct page_reporting_dev_info {
>>
>>          /* Current state of page reporting */
>>          atomic_t state;
>> +
>> +       /* Minimal order of page reporting */
>> +       unsigned int order;
>>   };
>>
>>   /* Tear-down and bring-up for page reporting devices */
>> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
>> index df9c5054e1b4..27670360bae6 100644
>> --- a/mm/page_reporting.c
>> +++ b/mm/page_reporting.c
> 
> <snip>
> 
>> @@ -324,6 +324,12 @@ int page_reporting_register(struct page_reporting_dev_info *prdev)
>>                  goto err_out;
>>          }
>>
>> +       /*
>> +        * We need to choose the minimal order of page reporting if it's
>> +        * not specified by the driver.
>> +        */
>> +       prdev->order = prdev->order ? prdev->order : pageblock_order;
>> +
>>          /* initialize state and work structures */
>>          atomic_set(&prdev->state, PAGE_REPORTING_IDLE);
>>          INIT_DELAYED_WORK(&prdev->work, &page_reporting_process);
> 
> Rather than using prdev->order directly it might be better to have a
> reporting_order value you could export for use by
> page_reporting_notify_free. That way you avoid the overhead of having
> to make a function call per page freed.
> 

Yes, I obviously missed the point to reduce the overhead because of
function call. In next revision, I will introduce @page_reporting_order
for this. Besides, it will be exported as a module parameter so that
it can be changed dynamically, as David suggested before.

>> diff --git a/mm/page_reporting.h b/mm/page_reporting.h
>> index 2c385dd4ddbd..d9f972e72649 100644
>> --- a/mm/page_reporting.h
>> +++ b/mm/page_reporting.h
>> @@ -10,11 +10,9 @@
>>   #include <linux/pgtable.h>
>>   #include <linux/scatterlist.h>
>>
>> -#define PAGE_REPORTING_MIN_ORDER       pageblock_order
>> -
>>   #ifdef CONFIG_PAGE_REPORTING
>>   DECLARE_STATIC_KEY_FALSE(page_reporting_enabled);
>> -void __page_reporting_notify(void);
>> +void __page_reporting_notify(unsigned int order);
>>
>>   static inline bool page_reported(struct page *page)
>>   {
>> @@ -37,12 +35,8 @@ static inline void page_reporting_notify_free(unsigned int order)
>>          if (!static_branch_unlikely(&page_reporting_enabled))
>>                  return;
>>
>> -       /* Determine if we have crossed reporting threshold */
>> -       if (order < PAGE_REPORTING_MIN_ORDER)
>> -               return;
>> -
>>          /* This will add a few cycles, but should be called infrequently */
>> -       __page_reporting_notify();
>> +       __page_reporting_notify(order);
>>   }
>>   #else /* CONFIG_PAGE_REPORTING */
>>   #define page_reported(_page)   false
> 
> With us making the function call per page freed we are likely to have
> a much more significant impact on performance with page reporting
> enabled. Ideally we want to limit this impact so that we only take the
> cost for the conditional check on the lower order pages.
> 

Yep, thanks for the explanation, Alex.

Thanks,
Gavin
diff mbox series

Patch

diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
index 3b99e0ec24f2..fe648dfa3a7c 100644
--- a/include/linux/page_reporting.h
+++ b/include/linux/page_reporting.h
@@ -18,6 +18,9 @@  struct page_reporting_dev_info {
 
 	/* Current state of page reporting */
 	atomic_t state;
+
+	/* Minimal order of page reporting */
+	unsigned int order;
 };
 
 /* Tear-down and bring-up for page reporting devices */
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index df9c5054e1b4..27670360bae6 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -47,7 +47,7 @@  __page_reporting_request(struct page_reporting_dev_info *prdev)
 }
 
 /* notify prdev of free page reporting request */
-void __page_reporting_notify(void)
+void __page_reporting_notify(unsigned int order)
 {
 	struct page_reporting_dev_info *prdev;
 
@@ -58,7 +58,7 @@  void __page_reporting_notify(void)
 	 */
 	rcu_read_lock();
 	prdev = rcu_dereference(pr_dev_info);
-	if (likely(prdev))
+	if (likely(prdev && order >= prdev->order))
 		__page_reporting_request(prdev);
 
 	rcu_read_unlock();
@@ -229,7 +229,7 @@  page_reporting_process_zone(struct page_reporting_dev_info *prdev,
 
 	/* Generate minimum watermark to be able to guarantee progress */
 	watermark = low_wmark_pages(zone) +
-		    (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER);
+		    (PAGE_REPORTING_CAPACITY << prdev->order);
 
 	/*
 	 * Cancel request if insufficient free memory or if we failed
@@ -239,7 +239,7 @@  page_reporting_process_zone(struct page_reporting_dev_info *prdev,
 		return err;
 
 	/* Process each free list starting from lowest order/mt */
-	for (order = PAGE_REPORTING_MIN_ORDER; order < MAX_ORDER; order++) {
+	for (order = prdev->order; order < MAX_ORDER; order++) {
 		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
 			/* We do not pull pages from the isolate free list */
 			if (is_migrate_isolate(mt))
@@ -324,6 +324,12 @@  int page_reporting_register(struct page_reporting_dev_info *prdev)
 		goto err_out;
 	}
 
+	/*
+	 * We need to choose the minimal order of page reporting if it's
+	 * not specified by the driver.
+	 */
+	prdev->order = prdev->order ? prdev->order : pageblock_order;
+
 	/* initialize state and work structures */
 	atomic_set(&prdev->state, PAGE_REPORTING_IDLE);
 	INIT_DELAYED_WORK(&prdev->work, &page_reporting_process);
diff --git a/mm/page_reporting.h b/mm/page_reporting.h
index 2c385dd4ddbd..d9f972e72649 100644
--- a/mm/page_reporting.h
+++ b/mm/page_reporting.h
@@ -10,11 +10,9 @@ 
 #include <linux/pgtable.h>
 #include <linux/scatterlist.h>
 
-#define PAGE_REPORTING_MIN_ORDER	pageblock_order
-
 #ifdef CONFIG_PAGE_REPORTING
 DECLARE_STATIC_KEY_FALSE(page_reporting_enabled);
-void __page_reporting_notify(void);
+void __page_reporting_notify(unsigned int order);
 
 static inline bool page_reported(struct page *page)
 {
@@ -37,12 +35,8 @@  static inline void page_reporting_notify_free(unsigned int order)
 	if (!static_branch_unlikely(&page_reporting_enabled))
 		return;
 
-	/* Determine if we have crossed reporting threshold */
-	if (order < PAGE_REPORTING_MIN_ORDER)
-		return;
-
 	/* This will add a few cycles, but should be called infrequently */
-	__page_reporting_notify();
+	__page_reporting_notify(order);
 }
 #else /* CONFIG_PAGE_REPORTING */
 #define page_reported(_page)	false