diff mbox series

mm: compaction: skip memory compaction when there are not enough migratable pages

Message ID 1735981122-2085-1-git-send-email-yangge1116@126.com (mailing list archive)
State New
Series mm: compaction: skip memory compaction when there are not enough migratable pages

Commit Message

Ge Yang Jan. 4, 2025, 8:58 a.m. UTC
From: yangge <yangge1116@126.com>

There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
of memory. I have configured 16GB of CMA memory on each NUMA node,
and starting a 32GB virtual machine with device passthrough is
extremely slow, taking almost an hour.

During start-up, the virtual machine calls
pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
Long-term GUP cannot allocate memory from the CMA area, so at most
16GB of non-CMA memory on a NUMA node can be used as virtual machine
memory. There is 16GB of free CMA memory on a NUMA node, which is
sufficient to pass the order-0 watermark check, causing
__compaction_suitable() to consistently return true.
However, if there aren't enough migratable pages available, performing
memory compaction is meaningless. Besides checking whether
the order-0 watermark is met, __compaction_suitable() also needs
to determine whether there are sufficient migratable pages available
for memory compaction.

For costly allocations, because __compaction_suitable() always
returns true, __alloc_pages_slowpath() can't exit at the appropriate
place, resulting in excessively long virtual machine startup times.
Call trace:
__alloc_pages_slowpath
    if (compact_result == COMPACT_SKIPPED ||
        compact_result == COMPACT_DEFERRED)
        goto nopage; // should exit __alloc_pages_slowpath() from here

When the 16GB of non-CMA memory on a single node is exhausted, we
fall back to allocating memory on other nodes. To fall back to
remote nodes quickly, we should skip memory compaction when
migratable pages are insufficient. After this fix, it only takes a
few tens of seconds to start a 32GB virtual machine with device
passthrough.

Signed-off-by: yangge <yangge1116@126.com>
---
 mm/compaction.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

Comments

kernel test robot Jan. 4, 2025, 11:28 a.m. UTC | #1
Hi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/yangge1116-126-com/mm-compaction-skip-memory-compaction-when-there-are-not-enough-migratable-pages/20250104-170112
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/1735981122-2085-1-git-send-email-yangge1116%40126.com
patch subject: [PATCH] mm: compaction: skip memory compaction when there are not enough migratable pages
config: i386-buildonly-randconfig-001-20250104 (https://download.01.org/0day-ci/archive/20250104/202501041908.jDpLhAgL-lkp@intel.com/config)
compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250104/202501041908.jDpLhAgL-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202501041908.jDpLhAgL-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from mm/compaction.c:15:
   include/linux/mm_inline.h:47:41: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
      47 |         __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
         |                                    ~~~~~~~~~~~ ^ ~~~
   include/linux/mm_inline.h:49:22: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
      49 |                                 NR_ZONE_LRU_BASE + lru, nr_pages);
         |                                 ~~~~~~~~~~~~~~~~ ^ ~~~
>> mm/compaction.c:2386:13: warning: unused variable 'pgdat' [-Wunused-variable]
    2386 |         pg_data_t *pgdat = zone->zone_pgdat;
         |                    ^~~~~
   3 warnings generated.


vim +/pgdat +2386 mm/compaction.c

  2381	
  2382	static bool __compaction_suitable(struct zone *zone, int order,
  2383					  int highest_zoneidx,
  2384					  unsigned long wmark_target)
  2385	{
> 2386		pg_data_t *pgdat = zone->zone_pgdat;
  2387		unsigned long sum, nr_pinned;
  2388		unsigned long watermark;
  2389	
  2390		sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
  2391			node_page_state(pgdat, NR_INACTIVE_ANON) +
  2392			node_page_state(pgdat, NR_ACTIVE_FILE) +
  2393			node_page_state(pgdat, NR_ACTIVE_ANON);
  2394	
  2395		nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
  2396			node_page_state(pgdat, NR_FOLL_PIN_RELEASED);
  2397	
  2398		/*
  2399		 * Gup-pinned pages are non-migratable. After subtracting these pages,
  2400		 * we need to check if the remaining pages are sufficient for memory
  2401		 * compaction.
  2402		 */
  2403		if ((sum - nr_pinned) < (1 << order))
  2404			return false;
  2405	
  2406		/*
  2407		 * Watermarks for order-0 must be met for compaction to be able to
  2408		 * isolate free pages for migration targets. This means that the
  2409		 * watermark and alloc_flags have to match, or be more pessimistic than
  2410		 * the check in __isolate_free_page(). We don't use the direct
  2411		 * compactor's alloc_flags, as they are not relevant for freepage
  2412		 * isolation. We however do use the direct compactor's highest_zoneidx
  2413		 * to skip over zones where lowmem reserves would prevent allocation
  2414		 * even if compaction succeeds.
  2415		 * For costly orders, we require low watermark instead of min for
  2416		 * compaction to proceed to increase its chances.
  2417		 * ALLOC_CMA is used, as pages in CMA pageblocks are considered
  2418		 * suitable migration targets
  2419		 */
  2420		watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
  2421					low_wmark_pages(zone) : min_wmark_pages(zone);
  2422		watermark += compact_gap(order);
  2423		return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
  2424					   ALLOC_CMA, wmark_target);
  2425	}
  2426
Baolin Wang Jan. 6, 2025, 8:12 a.m. UTC | #2
On 2025/1/4 16:58, yangge1116@126.com wrote:
> From: yangge <yangge1116@126.com>
> 
> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
> of memory. I have configured 16GB of CMA memory on each NUMA node,
> and starting a 32GB virtual machine with device passthrough is
> extremely slow, taking almost an hour.
> 
> During the start-up of the virtual machine, it will call
> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
> Long term GUP cannot allocate memory from CMA area, so a maximum of
> 16 GB of no-CMA memory on a NUMA node can be used as virtual machine
> memory. There is 16GB of free CMA memory on a NUMA node, which is
> sufficient to pass the order-0 watermark check, causing the
> __compaction_suitable() function to  consistently return true.
> However, if there aren't enough migratable pages available, performing
> memory compaction is also meaningless. Besides checking whether
> the order-0 watermark is met, __compaction_suitable() also needs
> to determine whether there are sufficient migratable pages available
> for memory compaction.
> 
> For costly allocations, because __compaction_suitable() always
> returns true, __alloc_pages_slowpath() can't exit at the appropriate
> place, resulting in excessively long virtual machine startup times.
> Call trace:
> __alloc_pages_slowpath
>      if (compact_result == COMPACT_SKIPPED ||
>          compact_result == COMPACT_DEFERRED)
>          goto nopage; // should exit __alloc_pages_slowpath() from here
> 
> When the 16G of non-CMA memory on a single node is exhausted, we will
> fallback to allocating memory on other nodes. In order to quickly
> fallback to remote nodes, we should skip memory compaction when
> migratable pages are insufficient. After this fix, it only takes a
> few tens of seconds to start a 32GB virtual machine with device
> passthrough functionality.
> 
> Signed-off-by: yangge <yangge1116@126.com>
> ---
>   mm/compaction.c | 19 +++++++++++++++++++
>   1 file changed, 19 insertions(+)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 07bd227..1c469b3 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2383,7 +2383,26 @@ static bool __compaction_suitable(struct zone *zone, int order,
>   				  int highest_zoneidx,
>   				  unsigned long wmark_target)
>   {
> +	pg_data_t *pgdat = zone->zone_pgdat;
> +	unsigned long sum, nr_pinned;
>   	unsigned long watermark;
> +
> +	sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
> +		node_page_state(pgdat, NR_INACTIVE_ANON) +
> +		node_page_state(pgdat, NR_ACTIVE_FILE) +
> +		node_page_state(pgdat, NR_ACTIVE_ANON);
> +
> +	nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
> +		node_page_state(pgdat, NR_FOLL_PIN_RELEASED);
> +
> +	/*
> +	 * Gup-pinned pages are non-migratable. After subtracting these pages,
> +	 * we need to check if the remaining pages are sufficient for memory
> +	 * compaction.
> +	 */
> +	if ((sum - nr_pinned) < (1 << order))
> +		return false;
> +

IMO, using the node's statistics to determine whether the zone is 
suitable for compaction doesn't make sense. It is possible that even 
though the normal zone has long-term pinned pages, the movable zone can 
still be suitable for compaction.
Ge Yang Jan. 6, 2025, 8:49 a.m. UTC | #3
On 2025/1/6 16:12, Baolin Wang wrote:
> 
> 
> On 2025/1/4 16:58, yangge1116@126.com wrote:
>> From: yangge <yangge1116@126.com>
>>
>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>> and starting a 32GB virtual machine with device passthrough is
>> extremely slow, taking almost an hour.
>>
>> During the start-up of the virtual machine, it will call
>> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
>> Long term GUP cannot allocate memory from CMA area, so a maximum of
>> 16 GB of no-CMA memory on a NUMA node can be used as virtual machine
>> memory. There is 16GB of free CMA memory on a NUMA node, which is
>> sufficient to pass the order-0 watermark check, causing the
>> __compaction_suitable() function to  consistently return true.
>> However, if there aren't enough migratable pages available, performing
>> memory compaction is also meaningless. Besides checking whether
>> the order-0 watermark is met, __compaction_suitable() also needs
>> to determine whether there are sufficient migratable pages available
>> for memory compaction.
>>
>> For costly allocations, because __compaction_suitable() always
>> returns true, __alloc_pages_slowpath() can't exit at the appropriate
>> place, resulting in excessively long virtual machine startup times.
>> Call trace:
>> __alloc_pages_slowpath
>>      if (compact_result == COMPACT_SKIPPED ||
>>          compact_result == COMPACT_DEFERRED)
>>          goto nopage; // should exit __alloc_pages_slowpath() from here
>>
>> When the 16G of non-CMA memory on a single node is exhausted, we will
>> fallback to allocating memory on other nodes. In order to quickly
>> fallback to remote nodes, we should skip memory compaction when
>> migratable pages are insufficient. After this fix, it only takes a
>> few tens of seconds to start a 32GB virtual machine with device
>> passthrough functionality.
>>
>> Signed-off-by: yangge <yangge1116@126.com>
>> ---
>>   mm/compaction.c | 19 +++++++++++++++++++
>>   1 file changed, 19 insertions(+)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 07bd227..1c469b3 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -2383,7 +2383,26 @@ static bool __compaction_suitable(struct zone 
>> *zone, int order,
>>                     int highest_zoneidx,
>>                     unsigned long wmark_target)
>>   {
>> +    pg_data_t *pgdat = zone->zone_pgdat;
>> +    unsigned long sum, nr_pinned;
>>       unsigned long watermark;
>> +
>> +    sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
>> +        node_page_state(pgdat, NR_INACTIVE_ANON) +
>> +        node_page_state(pgdat, NR_ACTIVE_FILE) +
>> +        node_page_state(pgdat, NR_ACTIVE_ANON);
>> +
>> +    nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
>> +        node_page_state(pgdat, NR_FOLL_PIN_RELEASED);
>> +
>> +    /*
>> +     * Gup-pinned pages are non-migratable. After subtracting these 
>> pages,
>> +     * we need to check if the remaining pages are sufficient for memory
>> +     * compaction.
>> +     */
>> +    if ((sum - nr_pinned) < (1 << order))
>> +        return false;
>> +
> 
> IMO, using the node's statistics to determine whether the zone is 
> suitable for compaction doesn't make sense. It is possible that even 
> though the normal zone has long-term pinned pages, the movable zone can 
> still be suitable for compaction.
If all the memory used on a node is pinned, then this memory cannot be 
migrated anymore, and memory compaction operations would not succeed.
I haven't used movable zone before, can you explain why memory 
compaction is still necessary? Thank you.
Baolin Wang Jan. 8, 2025, 2:50 a.m. UTC | #4
On 2025/1/6 16:49, Ge Yang wrote:
> 
> 
> On 2025/1/6 16:12, Baolin Wang wrote:
>>
>>
>> On 2025/1/4 16:58, yangge1116@126.com wrote:
>>> From: yangge <yangge1116@126.com>
>>>
>>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>>> and starting a 32GB virtual machine with device passthrough is
>>> extremely slow, taking almost an hour.
>>>
>>> During the start-up of the virtual machine, it will call
>>> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
>>> Long term GUP cannot allocate memory from CMA area, so a maximum of
>>> 16 GB of no-CMA memory on a NUMA node can be used as virtual machine
>>> memory. There is 16GB of free CMA memory on a NUMA node, which is
>>> sufficient to pass the order-0 watermark check, causing the
>>> __compaction_suitable() function to  consistently return true.
>>> However, if there aren't enough migratable pages available, performing
>>> memory compaction is also meaningless. Besides checking whether
>>> the order-0 watermark is met, __compaction_suitable() also needs
>>> to determine whether there are sufficient migratable pages available
>>> for memory compaction.
>>>
>>> For costly allocations, because __compaction_suitable() always
>>> returns true, __alloc_pages_slowpath() can't exit at the appropriate
>>> place, resulting in excessively long virtual machine startup times.
>>> Call trace:
>>> __alloc_pages_slowpath
>>>      if (compact_result == COMPACT_SKIPPED ||
>>>          compact_result == COMPACT_DEFERRED)
>>>          goto nopage; // should exit __alloc_pages_slowpath() from here
>>>
>>> When the 16G of non-CMA memory on a single node is exhausted, we will
>>> fallback to allocating memory on other nodes. In order to quickly
>>> fallback to remote nodes, we should skip memory compaction when
>>> migratable pages are insufficient. After this fix, it only takes a
>>> few tens of seconds to start a 32GB virtual machine with device
>>> passthrough functionality.
>>>
>>> Signed-off-by: yangge <yangge1116@126.com>
>>> ---
>>>   mm/compaction.c | 19 +++++++++++++++++++
>>>   1 file changed, 19 insertions(+)
>>>
>>> diff --git a/mm/compaction.c b/mm/compaction.c
>>> index 07bd227..1c469b3 100644
>>> --- a/mm/compaction.c
>>> +++ b/mm/compaction.c
>>> @@ -2383,7 +2383,26 @@ static bool __compaction_suitable(struct zone 
>>> *zone, int order,
>>>                     int highest_zoneidx,
>>>                     unsigned long wmark_target)
>>>   {
>>> +    pg_data_t *pgdat = zone->zone_pgdat;
>>> +    unsigned long sum, nr_pinned;
>>>       unsigned long watermark;
>>> +
>>> +    sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
>>> +        node_page_state(pgdat, NR_INACTIVE_ANON) +
>>> +        node_page_state(pgdat, NR_ACTIVE_FILE) +
>>> +        node_page_state(pgdat, NR_ACTIVE_ANON);
>>> +
>>> +    nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
>>> +        node_page_state(pgdat, NR_FOLL_PIN_RELEASED);
>>> +
>>> +    /*
>>> +     * Gup-pinned pages are non-migratable. After subtracting these 
>>> pages,
>>> +     * we need to check if the remaining pages are sufficient for 
>>> memory
>>> +     * compaction.
>>> +     */
>>> +    if ((sum - nr_pinned) < (1 << order))
>>> +        return false;
>>> +
>>
>> IMO, using the node's statistics to determine whether the zone is 
>> suitable for compaction doesn't make sense. It is possible that even 
>> though the normal zone has long-term pinned pages, the movable zone 
>> can still be suitable for compaction.
> If all the memory used on a node is pinned, then this memory cannot be 
> migrated anymore, and memory compaction operations would not succeed.
> I haven't used movable zone before, can you explain why memory 
> compaction is still necessary? Thank you.

Please consider unevictable folios that are not in the active/inactive 
file/anon LRU lists, yet can still be migrated.

Patch

diff --git a/mm/compaction.c b/mm/compaction.c
index 07bd227..1c469b3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2383,7 +2383,26 @@  static bool __compaction_suitable(struct zone *zone, int order,
 				  int highest_zoneidx,
 				  unsigned long wmark_target)
 {
+	pg_data_t *pgdat = zone->zone_pgdat;
+	unsigned long sum, nr_pinned;
 	unsigned long watermark;
+
+	sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
+		node_page_state(pgdat, NR_INACTIVE_ANON) +
+		node_page_state(pgdat, NR_ACTIVE_FILE) +
+		node_page_state(pgdat, NR_ACTIVE_ANON);
+
+	nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
+		node_page_state(pgdat, NR_FOLL_PIN_RELEASED);
+
+	/*
+	 * Gup-pinned pages are non-migratable. After subtracting these pages,
+	 * we need to check if the remaining pages are sufficient for memory
+	 * compaction.
+	 */
+	if ((sum - nr_pinned) < (1 << order))
+		return false;
+
 	/*
 	 * Watermarks for order-0 must be met for compaction to be able to
 	 * isolate free pages for migration targets. This means that the