
[RFC,0/6] migrate_pages(): batch TLB flushing

Message ID 20220921060616.73086-1-ying.huang@intel.com

Message

Huang, Ying Sept. 21, 2022, 6:06 a.m. UTC
From: "Huang, Ying" <ying.huang@intel.com>

Now, migrate_pages() migrates pages one by one, as in the following
pseudo-code,

  for each page
    unmap
    flush TLB
    copy
    restore map

If multiple pages are passed to migrate_pages(), there are
opportunities to batch the TLB flushing and copying.  That is, we can
change the code to something as follows,

  for each page
    unmap
  for each page
    flush TLB
  for each page
    copy
  for each page
    restore map

The total number of TLB flushing IPIs can be reduced considerably.  And
we may use a hardware accelerator such as DSA to accelerate the
page copying.

So in this patchset, we refactor the migrate_pages() implementation and
implement batched TLB flushing.  Based on this, hardware-accelerated
page copying can be implemented.

If too many pages are passed to migrate_pages(), a naive batched
implementation may unmap too many pages at the same time.  The
probability that a task has to wait for the migrated pages to be mapped
again increases, so latency may be hurt.  To deal with this issue, the
maximum number of pages unmapped in a batch is restricted to no more
than HPAGE_PMD_NR.  That is, the influence is at the same level as that
of THP migration.
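
To make the intended flow concrete, below is a rough, purely
illustrative C sketch of the batched path with the batch capped at
HPAGE_PMD_NR.  It is not the code in this series;
unmap_one_defer_flush() and copy_and_restore_map() are placeholder
names for the per-page unmap and copy/remap steps.

  static void migrate_pages_batched_sketch(struct list_head *from)
  {
	LIST_HEAD(unmapped);
	struct page *page, *next;
	int nr = 0;

	/*
	 * 1) Unmap up to HPAGE_PMD_NR pages, queueing the TLB flushes
	 *    instead of issuing one flush per page.
	 */
	list_for_each_entry_safe(page, next, from, lru) {
		if (nr >= HPAGE_PMD_NR)
			break;
		if (!unmap_one_defer_flush(page)) {
			list_move_tail(&page->lru, &unmapped);
			nr++;
		}
	}

	/* 2) One batched TLB flush for everything unmapped above. */
	try_to_unmap_flush();

	/* 3) + 4) Copy each page to its target and restore its mappings. */
	list_for_each_entry_safe(page, next, &unmapped, lru)
		copy_and_restore_map(page);
  }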

We use the following test to measure the performance impact of the
patchset,

On a 2-socket Intel server,

 - Run pmbench memory accessing benchmark

 - Run `migratepages` to migrate pages of pmbench between node 0 and
   node 1 back and forth.

With the patchset, the number of TLB flushing IPIs is reduced by 99.1%
during the test, and the number of pages migrated successfully per
second increases by 291.7%.

This patchset is based on v6.0-rc5 and the following patchset,

[PATCH -V3 0/8] migrate_pages(): fix several bugs in error path
https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@intel.com/

The migrate_pages() related code is being converted to folios now, so
this patchset cannot be applied to the recent akpm/mm-unstable branch.
This patchset is meant to check the basic idea.  If it is OK, I will
rebase the patchset on top of the folio changes.

Best Regards,
Huang, Ying

Comments

Zi Yan Sept. 21, 2022, 3:47 p.m. UTC | #1
On 21 Sep 2022, at 2:06, Huang Ying wrote:

> From: "Huang, Ying" <ying.huang@intel.com>
>
> Now, migrate_pages() migrate pages one by one, like the fake code as
> follows,
>
>   for each page
>     unmap
>     flush TLB
>     copy
>     restore map
>
> If multiple pages are passed to migrate_pages(), there are
> opportunities to batch the TLB flushing and copying.  That is, we can
> change the code to something as follows,
>
>   for each page
>     unmap
>   for each page
>     flush TLB
>   for each page
>     copy
>   for each page
>     restore map
>
> The total number of TLB flushing IPI can be reduced considerably.  And
> we may use some hardware accelerator such as DSA to accelerate the
> page copying.
>
> So in this patch, we refactor the migrate_pages() implementation and
> implement the TLB flushing batching.  Base on this, hardware
> accelerated page copying can be implemented.
>
> If too many pages are passed to migrate_pages(), in the naive batched
> implementation, we may unmap too many pages at the same time.  The
> possibility for a task to wait for the migrated pages to be mapped
> again increases.  So the latency may be hurt.  To deal with this
> issue, the max number of pages be unmapped in batch is restricted to
> no more than HPAGE_PMD_NR.  That is, the influence is at the same
> level of THP migration.
>
> We use the following test to measure the performance impact of the
> patchset,
>
> On a 2-socket Intel server,
>
>  - Run pmbench memory accessing benchmark
>
>  - Run `migratepages` to migrate pages of pmbench between node 0 and
>    node 1 back and forth.
>
> With the patch, the TLB flushing IPI reduces 99.1% during the test and
> the number of pages migrated successfully per second increases 291.7%.

Thank you for the patchset. Batching page migration will definitely
improve its throughput, based on my past experiments [1], and starting
with TLB flushing is a good first step.

BTW, what is the rationale behind the increased page migration
success rate per second?

>
> This patchset is based on v6.0-rc5 and the following patchset,
>
> [PATCH -V3 0/8] migrate_pages(): fix several bugs in error path
> https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@intel.com/
>
> The migrate_pages() related code is converting to folio now. So this
> patchset cannot apply recent akpm/mm-unstable branch.  This patchset
> is used to check the basic idea.  If it is OK, I will rebase the
> patchset on top of folio changes.
>
> Best Regards,
> Huang, Ying


[1] https://lwn.net/Articles/784925/

--
Best Regards,
Yan, Zi
Huang, Ying Sept. 22, 2022, 1:45 a.m. UTC | #2
Zi Yan <ziy@nvidia.com> writes:

> On 21 Sep 2022, at 2:06, Huang Ying wrote:
>
>> From: "Huang, Ying" <ying.huang@intel.com>
>>
>> Now, migrate_pages() migrate pages one by one, like the fake code as
>> follows,
>>
>>   for each page
>>     unmap
>>     flush TLB
>>     copy
>>     restore map
>>
>> If multiple pages are passed to migrate_pages(), there are
>> opportunities to batch the TLB flushing and copying.  That is, we can
>> change the code to something as follows,
>>
>>   for each page
>>     unmap
>>   for each page
>>     flush TLB
>>   for each page
>>     copy
>>   for each page
>>     restore map
>>
>> The total number of TLB flushing IPI can be reduced considerably.  And
>> we may use some hardware accelerator such as DSA to accelerate the
>> page copying.
>>
>> So in this patch, we refactor the migrate_pages() implementation and
>> implement the TLB flushing batching.  Base on this, hardware
>> accelerated page copying can be implemented.
>>
>> If too many pages are passed to migrate_pages(), in the naive batched
>> implementation, we may unmap too many pages at the same time.  The
>> possibility for a task to wait for the migrated pages to be mapped
>> again increases.  So the latency may be hurt.  To deal with this
>> issue, the max number of pages be unmapped in batch is restricted to
>> no more than HPAGE_PMD_NR.  That is, the influence is at the same
>> level of THP migration.
>>
>> We use the following test to measure the performance impact of the
>> patchset,
>>
>> On a 2-socket Intel server,
>>
>>  - Run pmbench memory accessing benchmark
>>
>>  - Run `migratepages` to migrate pages of pmbench between node 0 and
>>    node 1 back and forth.
>>
>> With the patch, the TLB flushing IPI reduces 99.1% during the test and
>> the number of pages migrated successfully per second increases 291.7%.
>
> Thank you for the patchset. Batching page migration will definitely
> improve its throughput from my past experiments[1] and starting with
> TLB flushing is a good first step.

Thanks for the pointer; the patch description already provides valuable
information for me!

> BTW, what is the rationality behind the increased page migration
> success rate per second?

From perf profiling data, in the base kernel,

  migrate_pages.migrate_to_node.do_migrate_pages.kernel_migrate_pages.__x64_sys_migrate_pages:	2.87
  ptep_clear_flush.try_to_migrate_one.rmap_walk_anon.try_to_migrate.__unmap_and_move:           2.39

Because pmbench runs in the system too, the CPU cycles of
migrate_pages() are only about 2.87% of the total, while the CPU cycles
for TLB flushing are 2.39%.  That is, 2.39/2.87 = 83.3% of the CPU
cycles of migrate_pages() are used for TLB flushing.

After batching the TLB flushing, the perf profiling data becomes,

  migrate_pages.migrate_to_node.do_migrate_pages.kernel_migrate_pages.__x64_sys_migrate_pages:	2.77
  move_to_new_folio.migrate_pages_batch.migrate_pages.migrate_to_node.do_migrate_pages:         1.68
  copy_page.folio_copy.migrate_folio.move_to_new_folio.migrate_pages_batch:                     1.21

1.21/2.77 = 43.7% of the CPU cycles of migrate_pages() are used for
page copying now.

  try_to_migrate_one:	0.23

The CPU cycles for unmapping and TLB flushing become 0.23/2.77 = 8.3% of
migrate_pages().

All in all, after the optimization, we do much less TLB flushing, which
consumed a lot of CPU cycles before the optimization.  So the throughput
of migrate_pages() increases greatly.

I will add these data in the next version of the patchset.

Best Regards,
Huang, Ying

>>
>> This patchset is based on v6.0-rc5 and the following patchset,
>>
>> [PATCH -V3 0/8] migrate_pages(): fix several bugs in error path
>> https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@intel.com/
>>
>> The migrate_pages() related code is converting to folio now. So this
>> patchset cannot apply recent akpm/mm-unstable branch.  This patchset
>> is used to check the basic idea.  If it is OK, I will rebase the
>> patchset on top of folio changes.
>>
>> Best Regards,
>> Huang, Ying
>
>
> [1] https://lwn.net/Articles/784925/
>
> --
> Best Regards,
> Yan, Zi
haoxin Sept. 22, 2022, 3:47 a.m. UTC | #3
Hi Huang,

     This is an exciting change, but on ARM64 machines TLB flushing is
not done through IPIs; it relies on the 'vale1is' instruction.  So I'm
wondering whether there is also a benefit on arm64, and I'm going to
test it on an ARM64 machine.


On 2022/9/21 at 11:47 PM, Zi Yan wrote:
> On 21 Sep 2022, at 2:06, Huang Ying wrote:
>
>> From: "Huang, Ying" <ying.huang@intel.com>
>>
>> Now, migrate_pages() migrate pages one by one, like the fake code as
>> follows,
>>
>>    for each page
>>      unmap
>>      flush TLB
>>      copy
>>      restore map
>>
>> If multiple pages are passed to migrate_pages(), there are
>> opportunities to batch the TLB flushing and copying.  That is, we can
>> change the code to something as follows,
>>
>>    for each page
>>      unmap
>>    for each page
>>      flush TLB
>>    for each page
>>      copy
>>    for each page
>>      restore map
>>
>> The total number of TLB flushing IPI can be reduced considerably.  And
>> we may use some hardware accelerator such as DSA to accelerate the
>> page copying.
>>
>> So in this patch, we refactor the migrate_pages() implementation and
>> implement the TLB flushing batching.  Base on this, hardware
>> accelerated page copying can be implemented.
>>
>> If too many pages are passed to migrate_pages(), in the naive batched
>> implementation, we may unmap too many pages at the same time.  The
>> possibility for a task to wait for the migrated pages to be mapped
>> again increases.  So the latency may be hurt.  To deal with this
>> issue, the max number of pages be unmapped in batch is restricted to
>> no more than HPAGE_PMD_NR.  That is, the influence is at the same
>> level of THP migration.
>>
>> We use the following test to measure the performance impact of the
>> patchset,
>>
>> On a 2-socket Intel server,
>>
>>   - Run pmbench memory accessing benchmark
>>
>>   - Run `migratepages` to migrate pages of pmbench between node 0 and
>>     node 1 back and forth.
>>
>> With the patch, the TLB flushing IPI reduces 99.1% during the test and
>> the number of pages migrated successfully per second increases 291.7%.
> Thank you for the patchset. Batching page migration will definitely
> improve its throughput from my past experiments[1] and starting with
> TLB flushing is a good first step.
>
> BTW, what is the rationality behind the increased page migration
> success rate per second?
>
>> This patchset is based on v6.0-rc5 and the following patchset,
>>
>> [PATCH -V3 0/8] migrate_pages(): fix several bugs in error path
>> https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@intel.com/
>>
>> The migrate_pages() related code is converting to folio now. So this
>> patchset cannot apply recent akpm/mm-unstable branch.  This patchset
>> is used to check the basic idea.  If it is OK, I will rebase the
>> patchset on top of folio changes.
>>
>> Best Regards,
>> Huang, Ying
>
> [1] https://lwn.net/Articles/784925/
>
> --
> Best Regards,
> Yan, Zi
Huang, Ying Sept. 22, 2022, 4:36 a.m. UTC | #4
haoxin <xhao@linux.alibaba.com> writes:

> Hi Huang,
>
> This is an exciting change, but on ARM64 machine the TLB
> flushing are not through IPI, it depends on 'vale1is'
>
> instructionso I'm wondering if there's also a benefit on arm64,
> and I'm going to test it on an ARM64 machine.

We have no arm64 machine to test on, and I know very little about arm64.
Thanks for the information and for testing.

Best Regards,
Huang, Ying

>
> On 2022/9/21 at 11:47 PM, Zi Yan wrote:
>> On 21 Sep 2022, at 2:06, Huang Ying wrote:
>>
>>> From: "Huang, Ying" <ying.huang@intel.com>
>>>
>>> Now, migrate_pages() migrate pages one by one, like the fake code as
>>> follows,
>>>
>>>    for each page
>>>      unmap
>>>      flush TLB
>>>      copy
>>>      restore map
>>>
>>> If multiple pages are passed to migrate_pages(), there are
>>> opportunities to batch the TLB flushing and copying.  That is, we can
>>> change the code to something as follows,
>>>
>>>    for each page
>>>      unmap
>>>    for each page
>>>      flush TLB
>>>    for each page
>>>      copy
>>>    for each page
>>>      restore map
>>>
>>> The total number of TLB flushing IPI can be reduced considerably.  And
>>> we may use some hardware accelerator such as DSA to accelerate the
>>> page copying.
>>>
>>> So in this patch, we refactor the migrate_pages() implementation and
>>> implement the TLB flushing batching.  Base on this, hardware
>>> accelerated page copying can be implemented.
>>>
>>> If too many pages are passed to migrate_pages(), in the naive batched
>>> implementation, we may unmap too many pages at the same time.  The
>>> possibility for a task to wait for the migrated pages to be mapped
>>> again increases.  So the latency may be hurt.  To deal with this
>>> issue, the max number of pages be unmapped in batch is restricted to
>>> no more than HPAGE_PMD_NR.  That is, the influence is at the same
>>> level of THP migration.
>>>
>>> We use the following test to measure the performance impact of the
>>> patchset,
>>>
>>> On a 2-socket Intel server,
>>>
>>>   - Run pmbench memory accessing benchmark
>>>
>>>   - Run `migratepages` to migrate pages of pmbench between node 0 and
>>>     node 1 back and forth.
>>>
>>> With the patch, the TLB flushing IPI reduces 99.1% during the test and
>>> the number of pages migrated successfully per second increases 291.7%.
>> Thank you for the patchset. Batching page migration will definitely
>> improve its throughput from my past experiments[1] and starting with
>> TLB flushing is a good first step.
>>
>> BTW, what is the rationality behind the increased page migration
>> success rate per second?
>>
>>> This patchset is based on v6.0-rc5 and the following patchset,
>>>
>>> [PATCH -V3 0/8] migrate_pages(): fix several bugs in error path
>>> https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@intel.com/
>>>
>>> The migrate_pages() related code is converting to folio now. So this
>>> patchset cannot apply recent akpm/mm-unstable branch.  This patchset
>>> is used to check the basic idea.  If it is OK, I will rebase the
>>> patchset on top of folio changes.
>>>
>>> Best Regards,
>>> Huang, Ying
>>
>> [1] https://lwn.net/Articles/784925/
>>
>> --
>> Best Regards,
>> Yan, Zi
Bharata B Rao Sept. 22, 2022, 12:50 p.m. UTC | #5
On 9/21/2022 11:36 AM, Huang Ying wrote:
> From: "Huang, Ying" <ying.huang@intel.com>
> 
> Now, migrate_pages() migrate pages one by one, like the fake code as
> follows,
> 
>   for each page
>     unmap
>     flush TLB
>     copy
>     restore map
> 
> If multiple pages are passed to migrate_pages(), there are
> opportunities to batch the TLB flushing and copying.  That is, we can
> change the code to something as follows,
> 
>   for each page
>     unmap
>   for each page
>     flush TLB
>   for each page
>     copy
>   for each page
>     restore map
> 
> The total number of TLB flushing IPI can be reduced considerably.  And
> we may use some hardware accelerator such as DSA to accelerate the
> page copying.
> 
> So in this patch, we refactor the migrate_pages() implementation and
> implement the TLB flushing batching.  Base on this, hardware
> accelerated page copying can be implemented.
> 
> If too many pages are passed to migrate_pages(), in the naive batched
> implementation, we may unmap too many pages at the same time.  The
> possibility for a task to wait for the migrated pages to be mapped
> again increases.  So the latency may be hurt.  To deal with this
> issue, the max number of pages be unmapped in batch is restricted to
> no more than HPAGE_PMD_NR.  That is, the influence is at the same
> level of THP migration.

Thanks for the patchset. I found it hitting the following BUG() when
running mmtests/autonumabench:

kernel BUG at mm/migrate.c:2432!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 7 PID: 7150 Comm: numa01 Not tainted 6.0.0-rc5+ #171
Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.5.6 10/06/2021
RIP: 0010:migrate_misplaced_page+0x670/0x830 
Code: 36 48 8b 3c c5 e0 7a 19 8d e8 dc 10 f7 ff 4c 89 e7 e8 f4 43 f5 ff 8b 55 bc 85 d2 75 6f 48 8b 45 c0 4c 39 e8 0f 84 b0 fb ff ff <0f> 0b 48 8b 7d 90 e9 ec fc ff ff 48 83 e8 01 e9 48 fa ff ff 48 83
RSP: 0000:ffffb1b29ec3fd38 EFLAGS: 00010202
RAX: ffffe946460f8248 RBX: 0000000000000001 RCX: ffffe946460f8248
RDX: 0000000000000000 RSI: ffffe946460f8248 RDI: ffffb1b29ec3fce0
RBP: ffffb1b29ec3fda8 R08: 0000000000000000 R09: 0000000000000005
R10: 0000000000000001 R11: 0000000000000000 R12: ffffe946460f8240
R13: ffffb1b29ec3fd68 R14: 0000000000000001 R15: ffff9698beed5000
FS:  00007fcc31fee640(0000) GS:ffff9697b0000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcc3a3a5000 CR3: 000000016e89c002 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
 <TASK>
 __handle_mm_fault+0xb87/0xff0
 handle_mm_fault+0x126/0x3c0
 do_user_addr_fault+0x1ed/0x690
 exc_page_fault+0x84/0x2c0
 asm_exc_page_fault+0x27/0x30 
RIP: 0033:0x7fccfa1a1180
Code: 81 fa 80 00 00 00 76 d2 c5 fe 7f 40 40 c5 fe 7f 40 60 48 83 c7 80 48 81 fa 00 01 00 00 76 2b 48 8d 90 80 00 00 00 48 83 e2 c0 <c5> fd 7f 02 c5 fd 7f 42 20 c5 fd 7f 42 40 c5 fd 7f 42 60 48 83 ea
RSP: 002b:00007fcc31fede38 EFLAGS: 00010283
RAX: 00007fcc39fff010 RBX: 000000000000002c RCX: 00007fccfa11ea3d
RDX: 00007fcc3a3a5000 RSI: 0000000000000000 RDI: 00007fccf9ffef90
RBP: 00007fcc39fff010 R08: 00007fcc31fee640 R09: 00007fcc31fee640
R10: 00007ffdecef614f R11: 0000000000000246 R12: 00000000c0000000
R13: 0000000000000000 R14: 00007fccfa094850 R15: 00007ffdecef6190

This is BUG_ON(!list_empty(&migratepages)) in migrate_misplaced_page().

Regards,
Bharata.
Huang, Ying Sept. 23, 2022, 7:52 a.m. UTC | #6
Bharata B Rao <bharata@amd.com> writes:

> On 9/21/2022 11:36 AM, Huang Ying wrote:
>> From: "Huang, Ying" <ying.huang@intel.com>
>> 
>> Now, migrate_pages() migrate pages one by one, like the fake code as
>> follows,
>> 
>>   for each page
>>     unmap
>>     flush TLB
>>     copy
>>     restore map
>> 
>> If multiple pages are passed to migrate_pages(), there are
>> opportunities to batch the TLB flushing and copying.  That is, we can
>> change the code to something as follows,
>> 
>>   for each page
>>     unmap
>>   for each page
>>     flush TLB
>>   for each page
>>     copy
>>   for each page
>>     restore map
>> 
>> The total number of TLB flushing IPI can be reduced considerably.  And
>> we may use some hardware accelerator such as DSA to accelerate the
>> page copying.
>> 
>> So in this patch, we refactor the migrate_pages() implementation and
>> implement the TLB flushing batching.  Base on this, hardware
>> accelerated page copying can be implemented.
>> 
>> If too many pages are passed to migrate_pages(), in the naive batched
>> implementation, we may unmap too many pages at the same time.  The
>> possibility for a task to wait for the migrated pages to be mapped
>> again increases.  So the latency may be hurt.  To deal with this
>> issue, the max number of pages be unmapped in batch is restricted to
>> no more than HPAGE_PMD_NR.  That is, the influence is at the same
>> level of THP migration.
>
> Thanks for the patchset. I find it hitting the following BUG() when
> running mmtests/autonumabench:
>
> kernel BUG at mm/migrate.c:2432!
> invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 7 PID: 7150 Comm: numa01 Not tainted 6.0.0-rc5+ #171
> Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.5.6 10/06/2021
> RIP: 0010:migrate_misplaced_page+0x670/0x830 
> Code: 36 48 8b 3c c5 e0 7a 19 8d e8 dc 10 f7 ff 4c 89 e7 e8 f4 43 f5 ff 8b 55 bc 85 d2 75 6f 48 8b 45 c0 4c 39 e8 0f 84 b0 fb ff ff <0f> 0b 48 8b 7d 90 e9 ec fc ff ff 48 83 e8 01 e9 48 fa ff ff 48 83
> RSP: 0000:ffffb1b29ec3fd38 EFLAGS: 00010202
> RAX: ffffe946460f8248 RBX: 0000000000000001 RCX: ffffe946460f8248
> RDX: 0000000000000000 RSI: ffffe946460f8248 RDI: ffffb1b29ec3fce0
> RBP: ffffb1b29ec3fda8 R08: 0000000000000000 R09: 0000000000000005
> R10: 0000000000000001 R11: 0000000000000000 R12: ffffe946460f8240
> R13: ffffb1b29ec3fd68 R14: 0000000000000001 R15: ffff9698beed5000
> FS:  00007fcc31fee640(0000) GS:ffff9697b0000000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fcc3a3a5000 CR3: 000000016e89c002 CR4: 0000000000770ee0
> PKRU: 55555554
> Call Trace:
>  <TASK>
>  __handle_mm_fault+0xb87/0xff0
>  handle_mm_fault+0x126/0x3c0
>  do_user_addr_fault+0x1ed/0x690
>  exc_page_fault+0x84/0x2c0
>  asm_exc_page_fault+0x27/0x30 
> RIP: 0033:0x7fccfa1a1180
> Code: 81 fa 80 00 00 00 76 d2 c5 fe 7f 40 40 c5 fe 7f 40 60 48 83 c7 80 48 81 fa 00 01 00 00 76 2b 48 8d 90 80 00 00 00 48 83 e2 c0 <c5> fd 7f 02 c5 fd 7f 42 20 c5 fd 7f 42 40 c5 fd 7f 42 60 48 83 ea
> RSP: 002b:00007fcc31fede38 EFLAGS: 00010283
> RAX: 00007fcc39fff010 RBX: 000000000000002c RCX: 00007fccfa11ea3d
> RDX: 00007fcc3a3a5000 RSI: 0000000000000000 RDI: 00007fccf9ffef90
> RBP: 00007fcc39fff010 R08: 00007fcc31fee640 R09: 00007fcc31fee640
> R10: 00007ffdecef614f R11: 0000000000000246 R12: 00000000c0000000
> R13: 0000000000000000 R14: 00007fccfa094850 R15: 00007ffdecef6190
>
> This is BUG_ON(!list_empty(&migratepages)) in migrate_misplaced_page().

Thank you very much for reporting!  I haven't reproduced this yet, but
I will pay special attention to this when developing the next version,
even if I cannot reproduce it in the end.

Best Regards,
Huang, Ying
Alistair Popple Sept. 26, 2022, 9:11 a.m. UTC | #7
Huang Ying <ying.huang@intel.com> writes:

> From: "Huang, Ying" <ying.huang@intel.com>
>
> Now, migrate_pages() migrate pages one by one, like the fake code as
> follows,
>
>   for each page
>     unmap
>     flush TLB
>     copy
>     restore map
>
> If multiple pages are passed to migrate_pages(), there are
> opportunities to batch the TLB flushing and copying.  That is, we can
> change the code to something as follows,
>
>   for each page
>     unmap
>   for each page
>     flush TLB
>   for each page
>     copy
>   for each page
>     restore map

We use a very similar sequence for the migrate_vma_*() set of calls. It
would be good if we could one day consolidate the two. I believe the
biggest hindrance to that is that migrate_vma_*() operates on arrays of
pfns rather than a list of pages. The reason for that is that it needs
to migrate non-LRU pages and hence can't use page->lru to create a list
of pages to migrate.

So from my perspective I think this direction is good as it would help
with that. One thing to watch out for is deadlocking if locking multiple
pages though.
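
For anyone not familiar with it, the pfn-array flow looks roughly like
this (sketched from memory and simplified; NR is just a placeholder for
the number of pages in the range, see include/linux/migrate.h for the
real definitions):

	unsigned long src_pfns[NR], dst_pfns[NR];
	struct migrate_vma args = {
		.vma	= vma,
		.start	= start,
		.end	= start + NR * PAGE_SIZE,
		.src	= src_pfns,
		.dst	= dst_pfns,
		.flags	= MIGRATE_VMA_SELECT_SYSTEM,
	};

	if (migrate_vma_setup(&args))	/* collect and unmap the source pages */
		return -EBUSY;
	/* allocate destination pages, fill dst_pfns[i] = migrate_pfn(new_pfn) */
	migrate_vma_pages(&args);	/* copy the data, switch the mappings */
	migrate_vma_finalize(&args);	/* remove migration PTEs, release pages */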

> The total number of TLB flushing IPI can be reduced considerably.  And
> we may use some hardware accelerator such as DSA to accelerate the
> page copying.
>
> So in this patch, we refactor the migrate_pages() implementation and
> implement the TLB flushing batching.  Base on this, hardware
> accelerated page copying can be implemented.
>
> If too many pages are passed to migrate_pages(), in the naive batched
> implementation, we may unmap too many pages at the same time.  The
> possibility for a task to wait for the migrated pages to be mapped
> again increases.  So the latency may be hurt.  To deal with this
> issue, the max number of pages be unmapped in batch is restricted to
> no more than HPAGE_PMD_NR.  That is, the influence is at the same
> level of THP migration.
>
> We use the following test to measure the performance impact of the
> patchset,
>
> On a 2-socket Intel server,
>
>  - Run pmbench memory accessing benchmark
>
>  - Run `migratepages` to migrate pages of pmbench between node 0 and
>    node 1 back and forth.
>
> With the patch, the TLB flushing IPI reduces 99.1% during the test and
> the number of pages migrated successfully per second increases 291.7%.
>
> This patchset is based on v6.0-rc5 and the following patchset,
>
> [PATCH -V3 0/8] migrate_pages(): fix several bugs in error path
> https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@intel.com/
>
> The migrate_pages() related code is converting to folio now. So this
> patchset cannot apply recent akpm/mm-unstable branch.  This patchset
> is used to check the basic idea.  If it is OK, I will rebase the
> patchset on top of folio changes.
>
> Best Regards,
> Huang, Ying
Bharata B Rao Sept. 27, 2022, 10:46 a.m. UTC | #8
On 9/23/2022 1:22 PM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
>>
>> Thanks for the patchset. I find it hitting the following BUG() when
>> running mmtests/autonumabench:
>>
>> kernel BUG at mm/migrate.c:2432!
>>
>> This is BUG_ON(!list_empty(&migratepages)) in migrate_misplaced_page().
> 
> Thank you very much for reporting!  I haven't reproduced this yet.  But
> I will pay special attention to this when develop the next version, even
> if I cannot reproduce this finally.

The following change fixes the above reported BUG_ON().

diff --git a/mm/migrate.c b/mm/migrate.c
index a0de0d9b4d41..c11dd82245e5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1197,7 +1197,7 @@ static int migrate_page_unmap(new_page_t get_new_page, free_page_t put_new_page,
         * references and be restored.
         */
        /* restore the page to right list. */
-       if (rc != -EAGAIN)
+       if (rc == -EAGAIN)
                 ret = NULL;
 
        migrate_page_undo_page(page, page_was_mapped, anon_vma, locked, ret);

The pages that return from the unmapping stage with -EAGAIN used to
end up on the "ret" list rather than continuing on the "from" list.
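
To spell out the intent of that branch (annotated version of the hunk
above, comments mine):

	/*
	 * rc == -EAGAIN: pass ret == NULL so migrate_page_undo_page()
	 * leaves the page on the "from" list and a later pass retries it.
	 * Any other rc: keep ret pointing at the caller's list so the
	 * page is moved there and not retried.
	 */
	if (rc == -EAGAIN)
		ret = NULL;

	migrate_page_undo_page(page, page_was_mapped, anon_vma, locked, ret);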

Regards,
Bharata.
haoxin Sept. 27, 2022, 11:21 a.m. UTC | #9
Hi,  Huang

On 2022/9/21 at 2:06 PM, Huang Ying wrote:
> From: "Huang, Ying" <ying.huang@intel.com>
>
> Now, migrate_pages() migrate pages one by one, like the fake code as
> follows,
>
>    for each page
>      unmap
>      flush TLB
>      copy
>      restore map
>
> If multiple pages are passed to migrate_pages(), there are
> opportunities to batch the TLB flushing and copying.  That is, we can
> change the code to something as follows,
>
>    for each page
>      unmap
>    for each page
>      flush TLB
>    for each page
>      copy
>    for each page
>      restore map
>
> The total number of TLB flushing IPI can be reduced considerably.  And
> we may use some hardware accelerator such as DSA to accelerate the
> page copying.
>
> So in this patch, we refactor the migrate_pages() implementation and
> implement the TLB flushing batching.  Base on this, hardware
> accelerated page copying can be implemented.
>
> If too many pages are passed to migrate_pages(), in the naive batched
> implementation, we may unmap too many pages at the same time.  The
> possibility for a task to wait for the migrated pages to be mapped
> again increases.  So the latency may be hurt.  To deal with this
> issue, the max number of pages be unmapped in batch is restricted to
> no more than HPAGE_PMD_NR.  That is, the influence is at the same
> level of THP migration.
>
> We use the following test to measure the performance impact of the
> patchset,
>
> On a 2-socket Intel server,
>
>   - Run pmbench memory accessing benchmark
>
>   - Run `migratepages` to migrate pages of pmbench between node 0 and
>     node 1 back and forth.
>
As pmbench cannot run on an arm64 machine, I used lmbench instead.
I tested it like this (I am not sure whether it is reasonable, but it seems to have worked):
./bw_mem -N10000 10000m rd &
time migratepages pid node0 node1

w/o patch      		w/ patch
real	0m0.035s  	real	0m0.024s
user	0m0.000s  	user	0m0.000s
sys	0m0.035s        sys	0m0.024s

The migratepages time is reduced by more than 32%.

But there is a problem: I see the batch flush is called via
migrate_pages_batch
	try_to_unmap_flush
		arch_tlbbatch_flush(&tlb_ubc->arch); // this is where the batch flush really happens

But on arm64, arch_tlbbatch_flush() is not supported, because arm64 does not select CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet.

So the TLB batch flush does not actually flush anything; it is an empty function.

Maybe this patch can help solve this problem.
https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/
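
For reference, what an architecture has to provide for
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH looks roughly like this on x86
(simplified from arch/x86/include/asm/tlbflush.h, from memory; arm64
would need an equivalent):

	struct arch_tlbflush_unmap_batch {
		/* CPUs that may hold stale TLB entries for the queued unmaps */
		struct cpumask cpumask;
	};

	static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
						struct mm_struct *mm)
	{
		/* record every CPU the mm has run on */
		cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
	}

	/* flushes (via IPI on x86) all CPUs recorded above in one go */
	void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);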
Huang, Ying Sept. 28, 2022, 1:46 a.m. UTC | #10
Bharata B Rao <bharata@amd.com> writes:

> On 9/23/2022 1:22 PM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>>>
>>> Thanks for the patchset. I find it hitting the following BUG() when
>>> running mmtests/autonumabench:
>>>
>>> kernel BUG at mm/migrate.c:2432!
>>>
>>> This is BUG_ON(!list_empty(&migratepages)) in migrate_misplaced_page().
>> 
>> Thank you very much for reporting!  I haven't reproduced this yet.  But
>> I will pay special attention to this when develop the next version, even
>> if I cannot reproduce this finally.
>
> The following change fixes the above reported BUG_ON().
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index a0de0d9b4d41..c11dd82245e5 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1197,7 +1197,7 @@ static int migrate_page_unmap(new_page_t get_new_page, free_page_t put_new_page,
>          * references and be restored.
>          */
>         /* restore the page to right list. */
> -       if (rc != -EAGAIN)
> +       if (rc == -EAGAIN)
>                  ret = NULL;
>  
>         migrate_page_undo_page(page, page_was_mapped, anon_vma, locked, ret);
>
> The pages that returned from unmapping stage with -EAGAIN used to
> end up on "ret" list rather than continuing on the "from" list.

Yes.  You are right.  Thank you very much!

Digging through the history, I found that the code was correct in
previous versions but became wrong by mistake during code rebasing.  I
will be more careful in the future and try to organize the patchset
better to make it easier to review the changes.

Best Regards,
Huang, Ying
Huang, Ying Sept. 28, 2022, 2:01 a.m. UTC | #11
haoxin <xhao@linux.alibaba.com> writes:

> Hi, Huang
>
> On 2022/9/21 at 2:06 PM, Huang Ying wrote:
>> From: "Huang, Ying" <ying.huang@intel.com>
>>
>> Now, migrate_pages() migrate pages one by one, like the fake code as
>> follows,
>>
>>    for each page
>>      unmap
>>      flush TLB
>>      copy
>>      restore map
>>
>> If multiple pages are passed to migrate_pages(), there are
>> opportunities to batch the TLB flushing and copying.  That is, we can
>> change the code to something as follows,
>>
>>    for each page
>>      unmap
>>    for each page
>>      flush TLB
>>    for each page
>>      copy
>>    for each page
>>      restore map
>>
>> The total number of TLB flushing IPI can be reduced considerably.  And
>> we may use some hardware accelerator such as DSA to accelerate the
>> page copying.
>>
>> So in this patch, we refactor the migrate_pages() implementation and
>> implement the TLB flushing batching.  Base on this, hardware
>> accelerated page copying can be implemented.
>>
>> If too many pages are passed to migrate_pages(), in the naive batched
>> implementation, we may unmap too many pages at the same time.  The
>> possibility for a task to wait for the migrated pages to be mapped
>> again increases.  So the latency may be hurt.  To deal with this
>> issue, the max number of pages be unmapped in batch is restricted to
>> no more than HPAGE_PMD_NR.  That is, the influence is at the same
>> level of THP migration.
>>
>> We use the following test to measure the performance impact of the
>> patchset,
>>
>> On a 2-socket Intel server,
>>
>>   - Run pmbench memory accessing benchmark
>>
>>   - Run `migratepages` to migrate pages of pmbench between node 0 and
>>     node 1 back and forth.
>>
> As the pmbench can not run on arm64 machine, so i use lmbench instead.
> I test case like this:  (i am not sure whether it is reasonable, but it seems worked)
> ./bw_mem -N10000 10000m rd &
> time migratepages pid node0 node1
>
> o/patch      		w/patch
> real	0m0.035s  	real	0m0.024s
> user	0m0.000s  	user	0m0.000s
> sys	0m0.035s        sys	0m0.024s
>
> the migratepages time is reduced above 32%.
>
> But there has a problem, i see the batch flush is called by
> migrate_pages_batch
> 	try_to_unmap_flush
> 		arch_tlbbatch_flush(&tlb_ubc->arch); // there batch flush really work.
>
> But in arm64, the arch_tlbbatch_flush are not supported, becasue it not support CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet.
>
> So, the tlb batch flush means no any flush is did, it is a empty func.

Yes.  And should_defer_flush() will always return false too.  That is,
the TLB will still be flushed, but will not be batched.
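
For reference, the gating in the unmap path looks roughly like this
(simplified from mm/rmap.c; the exact code may differ slightly):

	if (should_defer_flush(mm, flags)) {
		/*
		 * The architecture selects
		 * CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH: clear the PTE
		 * and queue the flush; it is issued later in one go by
		 * try_to_unmap_flush().
		 */
		pteval = ptep_get_and_clear(mm, address, pvmw.pte);
		set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
	} else {
		/* No batching support: flush the TLB entry immediately. */
		pteval = ptep_clear_flush(vma, address, pvmw.pte);
	}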

> Maybe this patch can help solve this problem.
> https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/

Yes.  This will bring TLB flush batching to ARM64.

Best Regards,
Huang, Ying
haoxin Sept. 28, 2022, 3:33 a.m. UTC | #12
On 2022/9/28 at 10:01 AM, Huang, Ying wrote:
> haoxin <xhao@linux.alibaba.com> writes:
>
>> Hi, Huang
>>
>> ( 2022/9/21 H2:06, Huang Ying S:
>>> From: "Huang, Ying" <ying.huang@intel.com>
>>>
>>> Now, migrate_pages() migrate pages one by one, like the fake code as
>>> follows,
>>>
>>>     for each page
>>>       unmap
>>>       flush TLB
>>>       copy
>>>       restore map
>>>
>>> If multiple pages are passed to migrate_pages(), there are
>>> opportunities to batch the TLB flushing and copying.  That is, we can
>>> change the code to something as follows,
>>>
>>>     for each page
>>>       unmap
>>>     for each page
>>>       flush TLB
>>>     for each page
>>>       copy
>>>     for each page
>>>       restore map
>>>
>>> The total number of TLB flushing IPI can be reduced considerably.  And
>>> we may use some hardware accelerator such as DSA to accelerate the
>>> page copying.
>>>
>>> So in this patch, we refactor the migrate_pages() implementation and
>>> implement the TLB flushing batching.  Base on this, hardware
>>> accelerated page copying can be implemented.
>>>
>>> If too many pages are passed to migrate_pages(), in the naive batched
>>> implementation, we may unmap too many pages at the same time.  The
>>> possibility for a task to wait for the migrated pages to be mapped
>>> again increases.  So the latency may be hurt.  To deal with this
>>> issue, the max number of pages be unmapped in batch is restricted to
>>> no more than HPAGE_PMD_NR.  That is, the influence is at the same
>>> level of THP migration.
>>>
>>> We use the following test to measure the performance impact of the
>>> patchset,
>>>
>>> On a 2-socket Intel server,
>>>
>>>    - Run pmbench memory accessing benchmark
>>>
>>>    - Run `migratepages` to migrate pages of pmbench between node 0 and
>>>      node 1 back and forth.
>>>
>> As the pmbench can not run on arm64 machine, so i use lmbench instead.
>> I test case like this:  (i am not sure whether it is reasonable, but it seems worked)
>> ./bw_mem -N10000 10000m rd &
>> time migratepages pid node0 node1
>>
>> o/patch      		w/patch
>> real	0m0.035s  	real	0m0.024s
>> user	0m0.000s  	user	0m0.000s
>> sys	0m0.035s        sys	0m0.024s
>>
>> the migratepages time is reduced above 32%.
>>
>> But there has a problem, i see the batch flush is called by
>> migrate_pages_batch
>> 	try_to_unmap_flush
>> 		arch_tlbbatch_flush(&tlb_ubc->arch); // there batch flush really work.
>>
>> But in arm64, the arch_tlbbatch_flush are not supported, becasue it not support CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet.
>>
>> So, the tlb batch flush means no any flush is did, it is a empty func.
> Yes.  And should_defer_flush() will always return false too.  That is,
> the TLB will still be flushed, but will not be batched.
Oh, yes, I overlooked this, thank you.
>
>> Maybe this patch can help solve this problem.
>> https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/
> Yes.  This will bring TLB flush batching to ARM64.
Next time, I will combine it with this patch and do some tests again.
Do you have any suggestions about benchmarks?
>
> Best Regards,
> Huang, Ying
Huang, Ying Sept. 28, 2022, 4:53 a.m. UTC | #13
haoxin <xhao@linux.alibaba.com> writes:

( On 2022/9/28 at 10:01 AM, Huang, Ying wrote:
>> haoxin <xhao@linux.alibaba.com> writes:
>>
>>> Hi, Huang
>>>
>>> On 2022/9/21 at 2:06 PM, Huang Ying wrote:
>>>> From: "Huang, Ying" <ying.huang@intel.com>
>>>>
>>>> Now, migrate_pages() migrate pages one by one, like the fake code as
>>>> follows,
>>>>
>>>>     for each page
>>>>       unmap
>>>>       flush TLB
>>>>       copy
>>>>       restore map
>>>>
>>>> If multiple pages are passed to migrate_pages(), there are
>>>> opportunities to batch the TLB flushing and copying.  That is, we can
>>>> change the code to something as follows,
>>>>
>>>>     for each page
>>>>       unmap
>>>>     for each page
>>>>       flush TLB
>>>>     for each page
>>>>       copy
>>>>     for each page
>>>>       restore map
>>>>
>>>> The total number of TLB flushing IPI can be reduced considerably.  And
>>>> we may use some hardware accelerator such as DSA to accelerate the
>>>> page copying.
>>>>
>>>> So in this patch, we refactor the migrate_pages() implementation and
>>>> implement the TLB flushing batching.  Base on this, hardware
>>>> accelerated page copying can be implemented.
>>>>
>>>> If too many pages are passed to migrate_pages(), in the naive batched
>>>> implementation, we may unmap too many pages at the same time.  The
>>>> possibility for a task to wait for the migrated pages to be mapped
>>>> again increases.  So the latency may be hurt.  To deal with this
>>>> issue, the max number of pages be unmapped in batch is restricted to
>>>> no more than HPAGE_PMD_NR.  That is, the influence is at the same
>>>> level of THP migration.
>>>>
>>>> We use the following test to measure the performance impact of the
>>>> patchset,
>>>>
>>>> On a 2-socket Intel server,
>>>>
>>>>    - Run pmbench memory accessing benchmark
>>>>
>>>>    - Run `migratepages` to migrate pages of pmbench between node 0 and
>>>>      node 1 back and forth.
>>>>
>>> As the pmbench can not run on arm64 machine, so i use lmbench instead.
>>> I test case like this:  (i am not sure whether it is reasonable, but it seems worked)
>>> ./bw_mem -N10000 10000m rd &
>>> time migratepages pid node0 node1
>>>
>>> o/patch      		w/patch
>>> real	0m0.035s  	real	0m0.024s
>>> user	0m0.000s  	user	0m0.000s
>>> sys	0m0.035s        sys	0m0.024s
>>>
>>> the migratepages time is reduced above 32%.
>>>
>>> But there has a problem, i see the batch flush is called by
>>> migrate_pages_batch
>>> 	try_to_unmap_flush
>>> 		arch_tlbbatch_flush(&tlb_ubc->arch); // there batch flush really work.
>>>
>>> But in arm64, the arch_tlbbatch_flush are not supported, becasue it not support CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet.
>>>
>>> So, the tlb batch flush means no any flush is did, it is a empty func.
>> Yes.  And should_defer_flush() will always return false too.  That is,
>> the TLB will still be flushed, but will not be batched.
> Oh, yes, i ignore this, thank you.
>>
>>> Maybe this patch can help solve this problem.
>>> https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/
>> Yes.  This will bring TLB flush batching to ARM64.
> Next time, i will combine with this patch, and do some test again,
> do you have any suggestion about benchmark ?

I think your benchmark should be OK.  If multiple threads are used, the
effect of the patchset will be better.

Best Regards,
Huang, Ying
Hesham Almatary Nov. 1, 2022, 2:49 p.m. UTC | #14
On 9/27/2022 12:21 PM, haoxin wrote:
> Hi, Huang
>
> On 2022/9/21 at 2:06 PM, Huang Ying wrote:
>> From: "Huang, Ying" <ying.huang@intel.com>
>>
>> Now, migrate_pages() migrate pages one by one, like the fake code as
>> follows,
>>
>>    for each page
>>      unmap
>>      flush TLB
>>      copy
>>      restore map
>>
>> If multiple pages are passed to migrate_pages(), there are
>> opportunities to batch the TLB flushing and copying.  That is, we can
>> change the code to something as follows,
>>
>>    for each page
>>      unmap
>>    for each page
>>      flush TLB
>>    for each page
>>      copy
>>    for each page
>>      restore map
>>
>> The total number of TLB flushing IPI can be reduced considerably.  And
>> we may use some hardware accelerator such as DSA to accelerate the
>> page copying.
>>
>> So in this patch, we refactor the migrate_pages() implementation and
>> implement the TLB flushing batching.  Base on this, hardware
>> accelerated page copying can be implemented.
>>
>> If too many pages are passed to migrate_pages(), in the naive batched
>> implementation, we may unmap too many pages at the same time. The
>> possibility for a task to wait for the migrated pages to be mapped
>> again increases.  So the latency may be hurt.  To deal with this
>> issue, the max number of pages be unmapped in batch is restricted to
>> no more than HPAGE_PMD_NR.  That is, the influence is at the same
>> level of THP migration.
>>
>> We use the following test to measure the performance impact of the
>> patchset,
>>
>> On a 2-socket Intel server,
>>
>>   - Run pmbench memory accessing benchmark
>>
>>   - Run `migratepages` to migrate pages of pmbench between node 0 and
>>     node 1 back and forth.
>>
> As the pmbench can not run on arm64 machine, so i use lmbench instead.
> I test case like this:  (i am not sure whether it is reasonable, but it seems 
> worked)
> ./bw_mem -N10000 10000m rd &
> time migratepages pid node0 node1
>
FYI, I have ported pmbench to AArch64 [1]. The project seems to be
abandoned on Bitbucket; I wonder if it makes sense to fork it elsewhere
and push the pending PRs there.


[1] https://bitbucket.org/jisooy/pmbench/pull-requests/5

> o/patch w/patch
> real    0m0.035s      real    0m0.024s
> user    0m0.000s      user    0m0.000s
> sys    0m0.035s        sys    0m0.024s
>
> the migratepages time is reduced above 32%.
>
> But there has a problem, i see the batch flush is called by
> migrate_pages_batch
>     try_to_unmap_flush
>         arch_tlbbatch_flush(&tlb_ubc->arch); // there batch flush really work.
>
> But in arm64, the arch_tlbbatch_flush are not supported, becasue it not 
> support CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet.
>
> So, the tlb batch flush means no any flush is did, it is a empty func.
>
> Maybe this patch can help solve this problem.
> https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/ 
>
>
>
>
>
>
>
Hesham Almatary Nov. 2, 2022, 2:13 p.m. UTC | #15
On 11/2/2022 3:14 AM, Huang, Ying wrote:
> Hesham Almatary <hesham.almatary@huawei.com> writes:
>
>> On 9/27/2022 12:21 PM, haoxin wrote:
>>> Hi, Huang
>>>
>>> On 2022/9/21 at 2:06 PM, Huang Ying wrote:
>>>> From: "Huang, Ying" <ying.huang@intel.com>
>>>>
>>>> Now, migrate_pages() migrate pages one by one, like the fake code as
>>>> follows,
>>>>
>>>>   for each page
>>>>     unmap
>>>>     flush TLB
>>>>     copy
>>>>     restore map
>>>>
>>>> If multiple pages are passed to migrate_pages(), there are
>>>> opportunities to batch the TLB flushing and copying. That is, we can
>>>> change the code to something as follows,
>>>>
>>>>   for each page
>>>>     unmap
>>>>   for each page
>>>>     flush TLB
>>>>   for each page
>>>>     copy
>>>>   for each page
>>>>     restore map
>>>>
>>>> The total number of TLB flushing IPI can be reduced considerably. And
>>>> we may use some hardware accelerator such as DSA to accelerate the
>>>> page copying.
>>>>
>>>> So in this patch, we refactor the migrate_pages() implementation and
>>>> implement the TLB flushing batching. Base on this, hardware
>>>> accelerated page copying can be implemented.
>>>>
>>>> If too many pages are passed to migrate_pages(), in the naive batched
>>>> implementation, we may unmap too many pages at the same time. The
>>>> possibility for a task to wait for the migrated pages to be mapped
>>>> again increases. So the latency may be hurt. To deal with this
>>>> issue, the max number of pages be unmapped in batch is restricted to
>>>> no more than HPAGE_PMD_NR. That is, the influence is at the same
>>>> level of THP migration.
>>>>
>>>> We use the following test to measure the performance impact of the
>>>> patchset,
>>>>
>>>> On a 2-socket Intel server,
>>>>
>>>>   - Run pmbench memory accessing benchmark
>>>>
>>>>   - Run `migratepages` to migrate pages of pmbench between node 0 and
>>>>     node 1 back and forth.
>>>>
>>> As the pmbench can not run on arm64 machine, so i use lmbench instead.
>>> I test case like this: (i am not sure whether it is reasonable,
>>> but it seems worked)
>>> ./bw_mem -N10000 10000m rd &
>>> time migratepages pid node0 node1
>>>
>> FYI, I have ported pmbench to AArch64 [1]. The project seems to be
>> abandoned on bitbucket,
>>
>> I wonder if it makes sense to fork it elsewhere and push the pending PRs there.
>>
>>
>> [1] https://bitbucket.org/jisooy/pmbench/pull-requests/5
> Maybe try to contact the original author with email firstly?

That's a good idea. I'm not planning to fork/maintain it myself, but if
anyone is interested in doing so, I am happy to help out and submit PRs
there.


> Best Regards,
> Huang, Ying
>
>>> o/patch w/patch
>>> real  0m0.035s   real  0m0.024s
>>> user  0m0.000s   user  0m0.000s
>>> sys   0m0.035s   sys   0m0.024s
>>>
>>> the migratepages time is reduced above 32%.
>>>
>>> But there has a problem, i see the batch flush is called by
>>> migrate_pages_batch
>>>     try_to_unmap_flush
>>>         arch_tlbbatch_flush(&tlb_ubc->arch); // there batch flush really work.
>>>
>>> But in arm64, the arch_tlbbatch_flush are not supported, becasue it
>>> not support CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet.
>>>
>>> So, the tlb batch flush means no any flush is did, it is a empty func.
>>>
>>> Maybe this patch can help solve this problem.
>>> https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/
>>>
>>>
>>>
>>>
>>>
>>>