Message ID: 20220921060616.73086-1-ying.huang@intel.com (mailing list archive)
Series: migrate_pages(): batch TLB flushing
On 21 Sep 2022, at 2:06, Huang Ying wrote: > From: "Huang, Ying" <ying.huang@intel.com> > > Now, migrate_pages() migrate pages one by one, like the fake code as > follows, > > for each page > unmap > flush TLB > copy > restore map > > If multiple pages are passed to migrate_pages(), there are > opportunities to batch the TLB flushing and copying. That is, we can > change the code to something as follows, > > for each page > unmap > for each page > flush TLB > for each page > copy > for each page > restore map > > The total number of TLB flushing IPI can be reduced considerably. And > we may use some hardware accelerator such as DSA to accelerate the > page copying. > > So in this patch, we refactor the migrate_pages() implementation and > implement the TLB flushing batching. Base on this, hardware > accelerated page copying can be implemented. > > If too many pages are passed to migrate_pages(), in the naive batched > implementation, we may unmap too many pages at the same time. The > possibility for a task to wait for the migrated pages to be mapped > again increases. So the latency may be hurt. To deal with this > issue, the max number of pages be unmapped in batch is restricted to > no more than HPAGE_PMD_NR. That is, the influence is at the same > level of THP migration. > > We use the following test to measure the performance impact of the > patchset, > > On a 2-socket Intel server, > > - Run pmbench memory accessing benchmark > > - Run `migratepages` to migrate pages of pmbench between node 0 and > node 1 back and forth. > > With the patch, the TLB flushing IPI reduces 99.1% during the test and > the number of pages migrated successfully per second increases 291.7%. Thank you for the patchset. Batching page migration will definitely improve its throughput from my past experiments[1] and starting with TLB flushing is a good first step. BTW, what is the rationality behind the increased page migration success rate per second? > > This patchset is based on v6.0-rc5 and the following patchset, > > [PATCH -V3 0/8] migrate_pages(): fix several bugs in error path > https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@intel.com/ > > The migrate_pages() related code is converting to folio now. So this > patchset cannot apply recent akpm/mm-unstable branch. This patchset > is used to check the basic idea. If it is OK, I will rebase the > patchset on top of folio changes. > > Best Regards, > Huang, Ying [1] https://lwn.net/Articles/784925/ -- Best Regards, Yan, Zi
Zi Yan <ziy@nvidia.com> writes: > On 21 Sep 2022, at 2:06, Huang Ying wrote: > >> From: "Huang, Ying" <ying.huang@intel.com> >> >> Now, migrate_pages() migrate pages one by one, like the fake code as >> follows, >> >> for each page >> unmap >> flush TLB >> copy >> restore map >> >> If multiple pages are passed to migrate_pages(), there are >> opportunities to batch the TLB flushing and copying. That is, we can >> change the code to something as follows, >> >> for each page >> unmap >> for each page >> flush TLB >> for each page >> copy >> for each page >> restore map >> >> The total number of TLB flushing IPI can be reduced considerably. And >> we may use some hardware accelerator such as DSA to accelerate the >> page copying. >> >> So in this patch, we refactor the migrate_pages() implementation and >> implement the TLB flushing batching. Base on this, hardware >> accelerated page copying can be implemented. >> >> If too many pages are passed to migrate_pages(), in the naive batched >> implementation, we may unmap too many pages at the same time. The >> possibility for a task to wait for the migrated pages to be mapped >> again increases. So the latency may be hurt. To deal with this >> issue, the max number of pages be unmapped in batch is restricted to >> no more than HPAGE_PMD_NR. That is, the influence is at the same >> level of THP migration. >> >> We use the following test to measure the performance impact of the >> patchset, >> >> On a 2-socket Intel server, >> >> - Run pmbench memory accessing benchmark >> >> - Run `migratepages` to migrate pages of pmbench between node 0 and >> node 1 back and forth. >> >> With the patch, the TLB flushing IPI reduces 99.1% during the test and >> the number of pages migrated successfully per second increases 291.7%. > > Thank you for the patchset. Batching page migration will definitely > improve its throughput from my past experiments[1] and starting with > TLB flushing is a good first step. Thanks for the pointer, the patch description provides valuable information for me already! > BTW, what is the rationality behind the increased page migration > success rate per second? From perf profiling data, in the base kernel, migrate_pages.migrate_to_node.do_migrate_pages.kernel_migrate_pages.__x64_sys_migrate_pages: 2.87 ptep_clear_flush.try_to_migrate_one.rmap_walk_anon.try_to_migrate.__unmap_and_move: 2.39 Because pmbench run in the system too, the CPU cycles of migrate_pages() is about 2.87%. While the CPU cycles for TLB flushing is 2.39%. That is, 2.39/2.87 = 83.3% CPU cycles of migrate_pages() are used for TLB flushing. After batching the TLB flushing, the perf profiling data becomes, migrate_pages.migrate_to_node.do_migrate_pages.kernel_migrate_pages.__x64_sys_migrate_pages: 2.77 move_to_new_folio.migrate_pages_batch.migrate_pages.migrate_to_node.do_migrate_pages: 1.68 copy_page.folio_copy.migrate_folio.move_to_new_folio.migrate_pages_batch: 1.21 1.21/2.77 = 43.7% CPU cycles of migrate_pages() are used for page copying now. try_to_migrate_one: 0.23 The CPU cycles of unmapping and TLB flushing becomes 0.23/2.77 = 8.3% of migrate_pages(). All in all, after the optimization, we do much less TLB flushing, which consumes a lot of CPU cycles before the optimization. So the throughput of migrate_pages() increases greatly. I will add these data in the next version of patch. 
Best Regards, Huang, Ying >> >> This patchset is based on v6.0-rc5 and the following patchset, >> >> [PATCH -V3 0/8] migrate_pages(): fix several bugs in error path >> https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@intel.com/ >> >> The migrate_pages() related code is converting to folio now. So this >> patchset cannot apply recent akpm/mm-unstable branch. This patchset >> is used to check the basic idea. If it is OK, I will rebase the >> patchset on top of folio changes. >> >> Best Regards, >> Huang, Ying > > > [1] https://lwn.net/Articles/784925/ > > -- > Best Regards, > Yan, Zi
Hi Huang, This is an exciting change, but on ARM64 machines the TLB flushing is not done through IPIs; it relies on the 'vale1is' instruction, so I'm wondering whether there is also a benefit on arm64, and I'm going to test it on an ARM64 machine. On 2022/9/21 11:47 PM, Zi Yan wrote: > On 21 Sep 2022, at 2:06, Huang Ying wrote: > >> From: "Huang, Ying" <ying.huang@intel.com> >> >> Now, migrate_pages() migrate pages one by one, like the fake code as >> follows, >> >> for each page >> unmap >> flush TLB >> copy >> restore map >> >> If multiple pages are passed to migrate_pages(), there are >> opportunities to batch the TLB flushing and copying. That is, we can >> change the code to something as follows, >> >> for each page >> unmap >> for each page >> flush TLB >> for each page >> copy >> for each page >> restore map >> >> The total number of TLB flushing IPI can be reduced considerably. And >> we may use some hardware accelerator such as DSA to accelerate the >> page copying. >> >> So in this patch, we refactor the migrate_pages() implementation and >> implement the TLB flushing batching. Base on this, hardware >> accelerated page copying can be implemented. >> >> If too many pages are passed to migrate_pages(), in the naive batched >> implementation, we may unmap too many pages at the same time. The >> possibility for a task to wait for the migrated pages to be mapped >> again increases. So the latency may be hurt. To deal with this >> issue, the max number of pages be unmapped in batch is restricted to >> no more than HPAGE_PMD_NR. That is, the influence is at the same >> level of THP migration. >> >> We use the following test to measure the performance impact of the >> patchset, >> >> On a 2-socket Intel server, >> >> - Run pmbench memory accessing benchmark >> >> - Run `migratepages` to migrate pages of pmbench between node 0 and >> node 1 back and forth. >> >> With the patch, the TLB flushing IPI reduces 99.1% during the test and >> the number of pages migrated successfully per second increases 291.7%. > Thank you for the patchset. Batching page migration will definitely > improve its throughput from my past experiments[1] and starting with > TLB flushing is a good first step. > > BTW, what is the rationality behind the increased page migration > success rate per second? > >> This patchset is based on v6.0-rc5 and the following patchset, >> >> [PATCH -V3 0/8] migrate_pages(): fix several bugs in error path >> https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@intel.com/ >> >> The migrate_pages() related code is converting to folio now. So this >> patchset cannot apply recent akpm/mm-unstable branch. This patchset >> is used to check the basic idea. If it is OK, I will rebase the >> patchset on top of folio changes. >> >> Best Regards, >> Huang, Ying > > [1] https://lwn.net/Articles/784925/ > > -- > Best Regards, > Yan, Zi
haoxin <xhao@linux.alibaba.com> writes: > Hi Huang, > > This is an exciting change, but on ARM64 machine the TLB > flushing are not through IPI, it depends on 'vale1is' > > instructionso I'm wondering if there's also a benefit on arm64, > and I'm going to test it on an ARM64 machine. We have no arm64 machine to test and I know very little about arm64. Thanks for information and testing. Best Regards, Huang, Ying > > ( 2022/9/21 H11:47, Zi Yan S: >> On 21 Sep 2022, at 2:06, Huang Ying wrote: >> >>> From: "Huang, Ying" <ying.huang@intel.com> >>> >>> Now, migrate_pages() migrate pages one by one, like the fake code as >>> follows, >>> >>> for each page >>> unmap >>> flush TLB >>> copy >>> restore map >>> >>> If multiple pages are passed to migrate_pages(), there are >>> opportunities to batch the TLB flushing and copying. That is, we can >>> change the code to something as follows, >>> >>> for each page >>> unmap >>> for each page >>> flush TLB >>> for each page >>> copy >>> for each page >>> restore map >>> >>> The total number of TLB flushing IPI can be reduced considerably. And >>> we may use some hardware accelerator such as DSA to accelerate the >>> page copying. >>> >>> So in this patch, we refactor the migrate_pages() implementation and >>> implement the TLB flushing batching. Base on this, hardware >>> accelerated page copying can be implemented. >>> >>> If too many pages are passed to migrate_pages(), in the naive batched >>> implementation, we may unmap too many pages at the same time. The >>> possibility for a task to wait for the migrated pages to be mapped >>> again increases. So the latency may be hurt. To deal with this >>> issue, the max number of pages be unmapped in batch is restricted to >>> no more than HPAGE_PMD_NR. That is, the influence is at the same >>> level of THP migration. >>> >>> We use the following test to measure the performance impact of the >>> patchset, >>> >>> On a 2-socket Intel server, >>> >>> - Run pmbench memory accessing benchmark >>> >>> - Run `migratepages` to migrate pages of pmbench between node 0 and >>> node 1 back and forth. >>> >>> With the patch, the TLB flushing IPI reduces 99.1% during the test and >>> the number of pages migrated successfully per second increases 291.7%. >> Thank you for the patchset. Batching page migration will definitely >> improve its throughput from my past experiments[1] and starting with >> TLB flushing is a good first step. >> >> BTW, what is the rationality behind the increased page migration >> success rate per second? >> >>> This patchset is based on v6.0-rc5 and the following patchset, >>> >>> [PATCH -V3 0/8] migrate_pages(): fix several bugs in error path >>> https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@intel.com/ >>> >>> The migrate_pages() related code is converting to folio now. So this >>> patchset cannot apply recent akpm/mm-unstable branch. This patchset >>> is used to check the basic idea. If it is OK, I will rebase the >>> patchset on top of folio changes. >>> >>> Best Regards, >>> Huang, Ying >> >> [1] https://lwn.net/Articles/784925/ >> >> -- >> Best Regards, >> Yan, Zi
On 9/21/2022 11:36 AM, Huang Ying wrote: > From: "Huang, Ying" <ying.huang@intel.com> > > Now, migrate_pages() migrate pages one by one, like the fake code as > follows, > > for each page > unmap > flush TLB > copy > restore map > > If multiple pages are passed to migrate_pages(), there are > opportunities to batch the TLB flushing and copying. That is, we can > change the code to something as follows, > > for each page > unmap > for each page > flush TLB > for each page > copy > for each page > restore map > > The total number of TLB flushing IPI can be reduced considerably. And > we may use some hardware accelerator such as DSA to accelerate the > page copying. > > So in this patch, we refactor the migrate_pages() implementation and > implement the TLB flushing batching. Base on this, hardware > accelerated page copying can be implemented. > > If too many pages are passed to migrate_pages(), in the naive batched > implementation, we may unmap too many pages at the same time. The > possibility for a task to wait for the migrated pages to be mapped > again increases. So the latency may be hurt. To deal with this > issue, the max number of pages be unmapped in batch is restricted to > no more than HPAGE_PMD_NR. That is, the influence is at the same > level of THP migration. Thanks for the patchset. I find it hitting the following BUG() when running mmtests/autonumabench: kernel BUG at mm/migrate.c:2432! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 7 PID: 7150 Comm: numa01 Not tainted 6.0.0-rc5+ #171 Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.5.6 10/06/2021 RIP: 0010:migrate_misplaced_page+0x670/0x830 Code: 36 48 8b 3c c5 e0 7a 19 8d e8 dc 10 f7 ff 4c 89 e7 e8 f4 43 f5 ff 8b 55 bc 85 d2 75 6f 48 8b 45 c0 4c 39 e8 0f 84 b0 fb ff ff <0f> 0b 48 8b 7d 90 e9 ec fc ff ff 48 83 e8 01 e9 48 fa ff ff 48 83 RSP: 0000:ffffb1b29ec3fd38 EFLAGS: 00010202 RAX: ffffe946460f8248 RBX: 0000000000000001 RCX: ffffe946460f8248 RDX: 0000000000000000 RSI: ffffe946460f8248 RDI: ffffb1b29ec3fce0 RBP: ffffb1b29ec3fda8 R08: 0000000000000000 R09: 0000000000000005 R10: 0000000000000001 R11: 0000000000000000 R12: ffffe946460f8240 R13: ffffb1b29ec3fd68 R14: 0000000000000001 R15: ffff9698beed5000 FS: 00007fcc31fee640(0000) GS:ffff9697b0000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fcc3a3a5000 CR3: 000000016e89c002 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> __handle_mm_fault+0xb87/0xff0 handle_mm_fault+0x126/0x3c0 do_user_addr_fault+0x1ed/0x690 exc_page_fault+0x84/0x2c0 asm_exc_page_fault+0x27/0x30 RIP: 0033:0x7fccfa1a1180 Code: 81 fa 80 00 00 00 76 d2 c5 fe 7f 40 40 c5 fe 7f 40 60 48 83 c7 80 48 81 fa 00 01 00 00 76 2b 48 8d 90 80 00 00 00 48 83 e2 c0 <c5> fd 7f 02 c5 fd 7f 42 20 c5 fd 7f 42 40 c5 fd 7f 42 60 48 83 ea RSP: 002b:00007fcc31fede38 EFLAGS: 00010283 RAX: 00007fcc39fff010 RBX: 000000000000002c RCX: 00007fccfa11ea3d RDX: 00007fcc3a3a5000 RSI: 0000000000000000 RDI: 00007fccf9ffef90 RBP: 00007fcc39fff010 R08: 00007fcc31fee640 R09: 00007fcc31fee640 R10: 00007ffdecef614f R11: 0000000000000246 R12: 00000000c0000000 R13: 0000000000000000 R14: 00007fccfa094850 R15: 00007ffdecef6190 This is BUG_ON(!list_empty(&migratepages)) in migrate_misplaced_page(). Regards, Bharata.
Bharata B Rao <bharata@amd.com> writes: > On 9/21/2022 11:36 AM, Huang Ying wrote: >> From: "Huang, Ying" <ying.huang@intel.com> >> >> Now, migrate_pages() migrate pages one by one, like the fake code as >> follows, >> >> for each page >> unmap >> flush TLB >> copy >> restore map >> >> If multiple pages are passed to migrate_pages(), there are >> opportunities to batch the TLB flushing and copying. That is, we can >> change the code to something as follows, >> >> for each page >> unmap >> for each page >> flush TLB >> for each page >> copy >> for each page >> restore map >> >> The total number of TLB flushing IPI can be reduced considerably. And >> we may use some hardware accelerator such as DSA to accelerate the >> page copying. >> >> So in this patch, we refactor the migrate_pages() implementation and >> implement the TLB flushing batching. Base on this, hardware >> accelerated page copying can be implemented. >> >> If too many pages are passed to migrate_pages(), in the naive batched >> implementation, we may unmap too many pages at the same time. The >> possibility for a task to wait for the migrated pages to be mapped >> again increases. So the latency may be hurt. To deal with this >> issue, the max number of pages be unmapped in batch is restricted to >> no more than HPAGE_PMD_NR. That is, the influence is at the same >> level of THP migration. > > Thanks for the patchset. I find it hitting the following BUG() when > running mmtests/autonumabench: > > kernel BUG at mm/migrate.c:2432! > invalid opcode: 0000 [#1] PREEMPT SMP NOPTI > CPU: 7 PID: 7150 Comm: numa01 Not tainted 6.0.0-rc5+ #171 > Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.5.6 10/06/2021 > RIP: 0010:migrate_misplaced_page+0x670/0x830 > Code: 36 48 8b 3c c5 e0 7a 19 8d e8 dc 10 f7 ff 4c 89 e7 e8 f4 43 f5 ff 8b 55 bc 85 d2 75 6f 48 8b 45 c0 4c 39 e8 0f 84 b0 fb ff ff <0f> 0b 48 8b 7d 90 e9 ec fc ff ff 48 83 e8 01 e9 48 fa ff ff 48 83 > RSP: 0000:ffffb1b29ec3fd38 EFLAGS: 00010202 > RAX: ffffe946460f8248 RBX: 0000000000000001 RCX: ffffe946460f8248 > RDX: 0000000000000000 RSI: ffffe946460f8248 RDI: ffffb1b29ec3fce0 > RBP: ffffb1b29ec3fda8 R08: 0000000000000000 R09: 0000000000000005 > R10: 0000000000000001 R11: 0000000000000000 R12: ffffe946460f8240 > R13: ffffb1b29ec3fd68 R14: 0000000000000001 R15: ffff9698beed5000 > FS: 00007fcc31fee640(0000) GS:ffff9697b0000000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007fcc3a3a5000 CR3: 000000016e89c002 CR4: 0000000000770ee0 > PKRU: 55555554 > Call Trace: > <TASK> > __handle_mm_fault+0xb87/0xff0 > handle_mm_fault+0x126/0x3c0 > do_user_addr_fault+0x1ed/0x690 > exc_page_fault+0x84/0x2c0 > asm_exc_page_fault+0x27/0x30 > RIP: 0033:0x7fccfa1a1180 > Code: 81 fa 80 00 00 00 76 d2 c5 fe 7f 40 40 c5 fe 7f 40 60 48 83 c7 80 48 81 fa 00 01 00 00 76 2b 48 8d 90 80 00 00 00 48 83 e2 c0 <c5> fd 7f 02 c5 fd 7f 42 20 c5 fd 7f 42 40 c5 fd 7f 42 60 48 83 ea > RSP: 002b:00007fcc31fede38 EFLAGS: 00010283 > RAX: 00007fcc39fff010 RBX: 000000000000002c RCX: 00007fccfa11ea3d > RDX: 00007fcc3a3a5000 RSI: 0000000000000000 RDI: 00007fccf9ffef90 > RBP: 00007fcc39fff010 R08: 00007fcc31fee640 R09: 00007fcc31fee640 > R10: 00007ffdecef614f R11: 0000000000000246 R12: 00000000c0000000 > R13: 0000000000000000 R14: 00007fccfa094850 R15: 00007ffdecef6190 > > This is BUG_ON(!list_empty(&migratepages)) in migrate_misplaced_page(). Thank you very much for reporting! I haven't reproduced this yet. 
But I will pay special attention to this when developing the next version, even if I cannot reproduce it in the end. Best Regards, Huang, Ying
Huang Ying <ying.huang@intel.com> writes: > From: "Huang, Ying" <ying.huang@intel.com> > > Now, migrate_pages() migrate pages one by one, like the fake code as > follows, > > for each page > unmap > flush TLB > copy > restore map > > If multiple pages are passed to migrate_pages(), there are > opportunities to batch the TLB flushing and copying. That is, we can > change the code to something as follows, > > for each page > unmap > for each page > flush TLB > for each page > copy > for each page > restore map We use a very similar sequence for the migrate_vma_*() set of calls. It would be good if we could one day consolidate the two. I believe the biggest hindrance to that is migrate_vma_*() operates on arrays of pfns rather than a list of pages. The reason for that is it needs to migrate non-lru pages and hence can't use page->lru to create a list of pages to migrate. So from my perspective I think this direction is good as it would help with that. One thing to watch out for is deadlocking if locking multiple pages though. > The total number of TLB flushing IPI can be reduced considerably. And > we may use some hardware accelerator such as DSA to accelerate the > page copying. > > So in this patch, we refactor the migrate_pages() implementation and > implement the TLB flushing batching. Base on this, hardware > accelerated page copying can be implemented. > > If too many pages are passed to migrate_pages(), in the naive batched > implementation, we may unmap too many pages at the same time. The > possibility for a task to wait for the migrated pages to be mapped > again increases. So the latency may be hurt. To deal with this > issue, the max number of pages be unmapped in batch is restricted to > no more than HPAGE_PMD_NR. That is, the influence is at the same > level of THP migration. > > We use the following test to measure the performance impact of the > patchset, > > On a 2-socket Intel server, > > - Run pmbench memory accessing benchmark > > - Run `migratepages` to migrate pages of pmbench between node 0 and > node 1 back and forth. > > With the patch, the TLB flushing IPI reduces 99.1% during the test and > the number of pages migrated successfully per second increases 291.7%. > > This patchset is based on v6.0-rc5 and the following patchset, > > [PATCH -V3 0/8] migrate_pages(): fix several bugs in error path > https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@intel.com/ > > The migrate_pages() related code is converting to folio now. So this > patchset cannot apply recent akpm/mm-unstable branch. This patchset > is used to check the basic idea. If it is OK, I will rebase the > patchset on top of folio changes. > > Best Regards, > Huang, Ying
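For readers comparing the two interfaces mentioned above, here is a rough sketch of their entry points, simplified and abbreviated from include/linux/migrate.h around v6.0; treat it as an approximation rather than the authoritative definitions.

/*
 * migrate_pages() consumes a list of pages chained through page->lru,
 * which is why it cannot describe non-lru device pages directly.
 */
int migrate_pages(struct list_head *from, new_page_t get_new_page,
		  free_page_t put_new_page, unsigned long private,
		  enum migrate_mode mode, int reason,
		  unsigned int *ret_succeeded);

/*
 * The migrate_vma_*() family instead works on arrays of pfns covering a
 * virtual address range, so it can also describe pages that sit on no
 * LRU list.
 */
struct migrate_vma {
	struct vm_area_struct	*vma;
	unsigned long		*dst;	/* one entry per page in [start, end) */
	unsigned long		*src;
	unsigned long		cpages;	/* pages successfully collected */
	unsigned long		npages;	/* pages in the range */
	unsigned long		start;
	unsigned long		end;
	/* further fields (pgmap_owner, flags, ...) omitted here */
};

The different page representations (lru list versus pfn array) are the main obstacle to consolidating the two paths, as noted above.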
On 9/23/2022 1:22 PM, Huang, Ying wrote: > Bharata B Rao <bharata@amd.com> writes: >> >> Thanks for the patchset. I find it hitting the following BUG() when >> running mmtests/autonumabench: >> >> kernel BUG at mm/migrate.c:2432! >> >> This is BUG_ON(!list_empty(&migratepages)) in migrate_misplaced_page(). > > Thank you very much for reporting! I haven't reproduced this yet. But > I will pay special attention to this when develop the next version, even > if I cannot reproduce this finally. The following change fixes the above reported BUG_ON(). diff --git a/mm/migrate.c b/mm/migrate.c index a0de0d9b4d41..c11dd82245e5 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1197,7 +1197,7 @@ static int migrate_page_unmap(new_page_t get_new_page, free_page_t put_new_page, * references and be restored. */ /* restore the page to right list. */ - if (rc != -EAGAIN) + if (rc == -EAGAIN) ret = NULL; migrate_page_undo_page(page, page_was_mapped, anon_vma, locked, ret); The pages that returned from unmapping stage with -EAGAIN used to end up on "ret" list rather than continuing on the "from" list. Regards, Bharata.
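For context, a minimal annotated sketch of the cleanup path this one-character fix targets, paraphrased from the diff above rather than taken from the patchset itself; variable names follow the diff, and the exact surrounding code may differ.

	/* End of migrate_page_unmap(), after the fix. */
	if (rc == -EAGAIN)
		ret = NULL;	/* keep the page on the "from" list for a later retry */

	/*
	 * With a non-NULL "ret", the undo helper moves the page to the list
	 * of pages whose migration attempt is finished; with NULL it leaves
	 * the page where it is, i.e. on "from".
	 */
	migrate_page_undo_page(page, page_was_mapped, anon_vma, locked, ret);

With the inverted condition in the posted version, exactly the -EAGAIN pages were moved off the "from" list instead of staying there for the retry passes, as described above.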
Hi, Huang On 2022/9/21 2:06 PM, Huang Ying wrote: > From: "Huang, Ying" <ying.huang@intel.com> > > Now, migrate_pages() migrate pages one by one, like the fake code as > follows, > > for each page > unmap > flush TLB > copy > restore map > > If multiple pages are passed to migrate_pages(), there are > opportunities to batch the TLB flushing and copying. That is, we can > change the code to something as follows, > > for each page > unmap > for each page > flush TLB > for each page > copy > for each page > restore map > > The total number of TLB flushing IPI can be reduced considerably. And > we may use some hardware accelerator such as DSA to accelerate the > page copying. > > So in this patch, we refactor the migrate_pages() implementation and > implement the TLB flushing batching. Base on this, hardware > accelerated page copying can be implemented. > > If too many pages are passed to migrate_pages(), in the naive batched > implementation, we may unmap too many pages at the same time. The > possibility for a task to wait for the migrated pages to be mapped > again increases. So the latency may be hurt. To deal with this > issue, the max number of pages be unmapped in batch is restricted to > no more than HPAGE_PMD_NR. That is, the influence is at the same > level of THP migration. > > We use the following test to measure the performance impact of the > patchset, > > On a 2-socket Intel server, > > - Run pmbench memory accessing benchmark > > - Run `migratepages` to migrate pages of pmbench between node 0 and > node 1 back and forth. > As pmbench cannot run on an arm64 machine, I used lmbench instead. I tested it like this (I am not sure whether it is reasonable, but it seems to work): ./bw_mem -N10000 10000m rd & time migratepages pid node0 node1 o/patch w/patch real 0m0.035s real 0m0.024s user 0m0.000s user 0m0.000s sys 0m0.035s sys 0m0.024s The migratepages time is reduced by more than 32%. But there is a problem: I see the batch flush is called via migrate_pages_batch -> try_to_unmap_flush -> arch_tlbbatch_flush(&tlb_ubc->arch); // this is where the batch flush really happens. But on arm64, arch_tlbbatch_flush is not supported, because arm64 does not support CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet. So the TLB batch flush does not actually flush anything; it is an empty function. Maybe this patch can help solve this problem: https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/
Bharata B Rao <bharata@amd.com> writes: > On 9/23/2022 1:22 PM, Huang, Ying wrote: >> Bharata B Rao <bharata@amd.com> writes: >>> >>> Thanks for the patchset. I find it hitting the following BUG() when >>> running mmtests/autonumabench: >>> >>> kernel BUG at mm/migrate.c:2432! >>> >>> This is BUG_ON(!list_empty(&migratepages)) in migrate_misplaced_page(). >> >> Thank you very much for reporting! I haven't reproduced this yet. But >> I will pay special attention to this when develop the next version, even >> if I cannot reproduce this finally. > > The following change fixes the above reported BUG_ON(). > > diff --git a/mm/migrate.c b/mm/migrate.c > index a0de0d9b4d41..c11dd82245e5 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1197,7 +1197,7 @@ static int migrate_page_unmap(new_page_t get_new_page, free_page_t put_new_page, > * references and be restored. > */ > /* restore the page to right list. */ > - if (rc != -EAGAIN) > + if (rc == -EAGAIN) > ret = NULL; > > migrate_page_undo_page(page, page_was_mapped, anon_vma, locked, ret); > > The pages that returned from unmapping stage with -EAGAIN used to > end up on "ret" list rather than continuing on the "from" list. Yes. You are right. Thank you very much! Digging some history, it is found that the code was correct in previous versions, but became wrong for mistake during code rebasing. Will be more careful in the future and try to organize the patchset better to make it easier to review the changes. Best Regards, Huang, Ying
haoxin <xhao@linux.alibaba.com> writes: > Hi, Huang > > ( 2022/9/21 H2:06, Huang Ying S: >> From: "Huang, Ying" <ying.huang@intel.com> >> >> Now, migrate_pages() migrate pages one by one, like the fake code as >> follows, >> >> for each page >> unmap >> flush TLB >> copy >> restore map >> >> If multiple pages are passed to migrate_pages(), there are >> opportunities to batch the TLB flushing and copying. That is, we can >> change the code to something as follows, >> >> for each page >> unmap >> for each page >> flush TLB >> for each page >> copy >> for each page >> restore map >> >> The total number of TLB flushing IPI can be reduced considerably. And >> we may use some hardware accelerator such as DSA to accelerate the >> page copying. >> >> So in this patch, we refactor the migrate_pages() implementation and >> implement the TLB flushing batching. Base on this, hardware >> accelerated page copying can be implemented. >> >> If too many pages are passed to migrate_pages(), in the naive batched >> implementation, we may unmap too many pages at the same time. The >> possibility for a task to wait for the migrated pages to be mapped >> again increases. So the latency may be hurt. To deal with this >> issue, the max number of pages be unmapped in batch is restricted to >> no more than HPAGE_PMD_NR. That is, the influence is at the same >> level of THP migration. >> >> We use the following test to measure the performance impact of the >> patchset, >> >> On a 2-socket Intel server, >> >> - Run pmbench memory accessing benchmark >> >> - Run `migratepages` to migrate pages of pmbench between node 0 and >> node 1 back and forth. >> > As the pmbench can not run on arm64 machine, so i use lmbench instead. > I test case like this: (i am not sure whether it is reasonable, but it seems worked) > ./bw_mem -N10000 10000m rd & > time migratepages pid node0 node1 > > o/patch w/patch > real 0m0.035s real 0m0.024s > user 0m0.000s user 0m0.000s > sys 0m0.035s sys 0m0.024s > > the migratepages time is reduced above 32%. > > But there has a problem, i see the batch flush is called by > migrate_pages_batch > try_to_unmap_flush > arch_tlbbatch_flush(&tlb_ubc->arch); // there batch flush really work. > > But in arm64, the arch_tlbbatch_flush are not supported, becasue it not support CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet. > > So, the tlb batch flush means no any flush is did, it is a empty func. Yes. And should_defer_flush() will always return false too. That is, the TLB will still be flushed, but will not be batched. > Maybe this patch can help solve this problem. > https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/ Yes. This will bring TLB flush batching to ARM64. Best Regards, Huang, Ying
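To make the dependency discussed above concrete, here is a simplified sketch of how the unmap path gates the deferred flush. It is paraphrased from mm/rmap.c rather than quoted exactly, but the key point holds: without CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH, should_defer_flush() is a stub returning false and every PTE is flushed immediately with ptep_clear_flush().

/* Simplified sketch, not the exact upstream code. */
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
{
	if (!(flags & TTU_BATCH_FLUSH))
		return false;
	/* Architecture-specific policy decides whether deferring pays off. */
	return true;
}
#else
static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
{
	return false;	/* e.g. arm64 before the series linked above */
}
#endif

/* In try_to_migrate_one(), roughly: */
	if (should_defer_flush(mm, flags)) {
		/* Clear the PTE now; record that a TLB flush is still pending. */
		pteval = ptep_get_and_clear(mm, address, pvmw.pte);
		set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
	} else {
		/* Flush right away, one page at a time. */
		pteval = ptep_clear_flush(vma, address, pvmw.pte);
	}

The pending flushes are later drained by try_to_unmap_flush() calling arch_tlbbatch_flush(), which is the batched path this patchset relies on and which the arm64 series linked above would enable.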
在 2022/9/28 上午10:01, Huang, Ying 写道: > haoxin <xhao@linux.alibaba.com> writes: > >> Hi, Huang >> >> ( 2022/9/21 H2:06, Huang Ying S: >>> From: "Huang, Ying" <ying.huang@intel.com> >>> >>> Now, migrate_pages() migrate pages one by one, like the fake code as >>> follows, >>> >>> for each page >>> unmap >>> flush TLB >>> copy >>> restore map >>> >>> If multiple pages are passed to migrate_pages(), there are >>> opportunities to batch the TLB flushing and copying. That is, we can >>> change the code to something as follows, >>> >>> for each page >>> unmap >>> for each page >>> flush TLB >>> for each page >>> copy >>> for each page >>> restore map >>> >>> The total number of TLB flushing IPI can be reduced considerably. And >>> we may use some hardware accelerator such as DSA to accelerate the >>> page copying. >>> >>> So in this patch, we refactor the migrate_pages() implementation and >>> implement the TLB flushing batching. Base on this, hardware >>> accelerated page copying can be implemented. >>> >>> If too many pages are passed to migrate_pages(), in the naive batched >>> implementation, we may unmap too many pages at the same time. The >>> possibility for a task to wait for the migrated pages to be mapped >>> again increases. So the latency may be hurt. To deal with this >>> issue, the max number of pages be unmapped in batch is restricted to >>> no more than HPAGE_PMD_NR. That is, the influence is at the same >>> level of THP migration. >>> >>> We use the following test to measure the performance impact of the >>> patchset, >>> >>> On a 2-socket Intel server, >>> >>> - Run pmbench memory accessing benchmark >>> >>> - Run `migratepages` to migrate pages of pmbench between node 0 and >>> node 1 back and forth. >>> >> As the pmbench can not run on arm64 machine, so i use lmbench instead. >> I test case like this: (i am not sure whether it is reasonable, but it seems worked) >> ./bw_mem -N10000 10000m rd & >> time migratepages pid node0 node1 >> >> o/patch w/patch >> real 0m0.035s real 0m0.024s >> user 0m0.000s user 0m0.000s >> sys 0m0.035s sys 0m0.024s >> >> the migratepages time is reduced above 32%. >> >> But there has a problem, i see the batch flush is called by >> migrate_pages_batch >> try_to_unmap_flush >> arch_tlbbatch_flush(&tlb_ubc->arch); // there batch flush really work. >> >> But in arm64, the arch_tlbbatch_flush are not supported, becasue it not support CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet. >> >> So, the tlb batch flush means no any flush is did, it is a empty func. > Yes. And should_defer_flush() will always return false too. That is, > the TLB will still be flushed, but will not be batched. Oh, yes, i ignore this, thank you. > >> Maybe this patch can help solve this problem. >> https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/ > Yes. This will bring TLB flush batching to ARM64. Next time, i will combine with this patch, and do some test again, do you have any suggestion about benchmark ? > > Best Regards, > Huang, Ying
haoxin <xhao@linux.alibaba.com> writes: > ( 2022/9/28 H10:01, Huang, Ying S: >> haoxin <xhao@linux.alibaba.com> writes: >> >>> Hi, Huang >>> >>> ( 2022/9/21 H2:06, Huang Ying S: >>>> From: "Huang, Ying" <ying.huang@intel.com> >>>> >>>> Now, migrate_pages() migrate pages one by one, like the fake code as >>>> follows, >>>> >>>> for each page >>>> unmap >>>> flush TLB >>>> copy >>>> restore map >>>> >>>> If multiple pages are passed to migrate_pages(), there are >>>> opportunities to batch the TLB flushing and copying. That is, we can >>>> change the code to something as follows, >>>> >>>> for each page >>>> unmap >>>> for each page >>>> flush TLB >>>> for each page >>>> copy >>>> for each page >>>> restore map >>>> >>>> The total number of TLB flushing IPI can be reduced considerably. And >>>> we may use some hardware accelerator such as DSA to accelerate the >>>> page copying. >>>> >>>> So in this patch, we refactor the migrate_pages() implementation and >>>> implement the TLB flushing batching. Base on this, hardware >>>> accelerated page copying can be implemented. >>>> >>>> If too many pages are passed to migrate_pages(), in the naive batched >>>> implementation, we may unmap too many pages at the same time. The >>>> possibility for a task to wait for the migrated pages to be mapped >>>> again increases. So the latency may be hurt. To deal with this >>>> issue, the max number of pages be unmapped in batch is restricted to >>>> no more than HPAGE_PMD_NR. That is, the influence is at the same >>>> level of THP migration. >>>> >>>> We use the following test to measure the performance impact of the >>>> patchset, >>>> >>>> On a 2-socket Intel server, >>>> >>>> - Run pmbench memory accessing benchmark >>>> >>>> - Run `migratepages` to migrate pages of pmbench between node 0 and >>>> node 1 back and forth. >>>> >>> As the pmbench can not run on arm64 machine, so i use lmbench instead. >>> I test case like this: (i am not sure whether it is reasonable, but it seems worked) >>> ./bw_mem -N10000 10000m rd & >>> time migratepages pid node0 node1 >>> >>> o/patch w/patch >>> real 0m0.035s real 0m0.024s >>> user 0m0.000s user 0m0.000s >>> sys 0m0.035s sys 0m0.024s >>> >>> the migratepages time is reduced above 32%. >>> >>> But there has a problem, i see the batch flush is called by >>> migrate_pages_batch >>> try_to_unmap_flush >>> arch_tlbbatch_flush(&tlb_ubc->arch); // there batch flush really work. >>> >>> But in arm64, the arch_tlbbatch_flush are not supported, becasue it not support CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet. >>> >>> So, the tlb batch flush means no any flush is did, it is a empty func. >> Yes. And should_defer_flush() will always return false too. That is, >> the TLB will still be flushed, but will not be batched. > Oh, yes, i ignore this, thank you. >> >>> Maybe this patch can help solve this problem. >>> https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/ >> Yes. This will bring TLB flush batching to ARM64. > Next time, i will combine with this patch, and do some test again, > do you have any suggestion about benchmark ? I think your benchmark should be OK. If multiple threads are used, the effect of patchset will be better. Best Regards, Huang, Ying
On 9/27/2022 12:21 PM, haoxin wrote: > Hi, Huang > > 在 2022/9/21 下午2:06, Huang Ying 写道: >> From: "Huang, Ying" <ying.huang@intel.com> >> >> Now, migrate_pages() migrate pages one by one, like the fake code as >> follows, >> >> for each page >> unmap >> flush TLB >> copy >> restore map >> >> If multiple pages are passed to migrate_pages(), there are >> opportunities to batch the TLB flushing and copying. That is, we can >> change the code to something as follows, >> >> for each page >> unmap >> for each page >> flush TLB >> for each page >> copy >> for each page >> restore map >> >> The total number of TLB flushing IPI can be reduced considerably. And >> we may use some hardware accelerator such as DSA to accelerate the >> page copying. >> >> So in this patch, we refactor the migrate_pages() implementation and >> implement the TLB flushing batching. Base on this, hardware >> accelerated page copying can be implemented. >> >> If too many pages are passed to migrate_pages(), in the naive batched >> implementation, we may unmap too many pages at the same time. The >> possibility for a task to wait for the migrated pages to be mapped >> again increases. So the latency may be hurt. To deal with this >> issue, the max number of pages be unmapped in batch is restricted to >> no more than HPAGE_PMD_NR. That is, the influence is at the same >> level of THP migration. >> >> We use the following test to measure the performance impact of the >> patchset, >> >> On a 2-socket Intel server, >> >> - Run pmbench memory accessing benchmark >> >> - Run `migratepages` to migrate pages of pmbench between node 0 and >> node 1 back and forth. >> > As the pmbench can not run on arm64 machine, so i use lmbench instead. > I test case like this: (i am not sure whether it is reasonable, but it seems > worked) > ./bw_mem -N10000 10000m rd & > time migratepages pid node0 node1 > FYI, I have ported pmbench to AArch64 [1]. The project seems to be abandoned on bitbucket, I wonder if it makes sense to fork it elsewhere and push the pending PRs there. [1] https://bitbucket.org/jisooy/pmbench/pull-requests/5 > o/patch w/patch > real 0m0.035s real 0m0.024s > user 0m0.000s user 0m0.000s > sys 0m0.035s sys 0m0.024s > > the migratepages time is reduced above 32%. > > But there has a problem, i see the batch flush is called by > migrate_pages_batch > try_to_unmap_flush > arch_tlbbatch_flush(&tlb_ubc->arch); // there batch flush really work. > > But in arm64, the arch_tlbbatch_flush are not supported, becasue it not > support CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet. > > So, the tlb batch flush means no any flush is did, it is a empty func. > > Maybe this patch can help solve this problem. > https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/ > > > > > > >
On 11/2/2022 3:14 AM, Huang, Ying wrote: > Hesham Almatary <hesham.almatary@huawei.com> writes: > >> On 9/27/2022 12:21 PM, haoxin wrote: >>> Hi, Huang >>> >>> ( 2022/9/21 H2:06, Huang Ying S: >>>> From: "Huang, Ying" <ying.huang@intel.com> >>>> >>>> Now, migrate_pages() migrate pages one by one, like the fake code as >>>> follows, >>>> >>>> ? for each page >>>> ?? unmap >>>> ?? flush TLB >>>> ?? copy >>>> ?? restore map >>>> >>>> If multiple pages are passed to migrate_pages(), there are >>>> opportunities to batch the TLB flushing and copying. That is, we can >>>> change the code to something as follows, >>>> >>>> ? for each page >>>> ?? unmap >>>> ? for each page >>>> ?? flush TLB >>>> ? for each page >>>> ?? copy >>>> ? for each page >>>> ?? restore map >>>> >>>> The total number of TLB flushing IPI can be reduced considerably. And >>>> we may use some hardware accelerator such as DSA to accelerate the >>>> page copying. >>>> >>>> So in this patch, we refactor the migrate_pages() implementation and >>>> implement the TLB flushing batching. Base on this, hardware >>>> accelerated page copying can be implemented. >>>> >>>> If too many pages are passed to migrate_pages(), in the naive batched >>>> implementation, we may unmap too many pages at the same time. The >>>> possibility for a task to wait for the migrated pages to be mapped >>>> again increases. So the latency may be hurt. To deal with this >>>> issue, the max number of pages be unmapped in batch is restricted to >>>> no more than HPAGE_PMD_NR. That is, the influence is at the same >>>> level of THP migration. >>>> >>>> We use the following test to measure the performance impact of the >>>> patchset, >>>> >>>> On a 2-socket Intel server, >>>> >>>> - Run pmbench memory accessing benchmark >>>> >>>> - Run `migratepages` to migrate pages of pmbench between node 0 and >>>> ? node 1 back and forth. >>>> >>> As the pmbench can not run on arm64 machine, so i use lmbench instead. >>> I test case like this: (i am not sure whether it is reasonable, >>> but it seems worked) >>> ./bw_mem -N10000 10000m rd & >>> time migratepages pid node0 node1 >>> >> FYI, I have ported pmbench to AArch64 [1]. The project seems to be >> abandoned on bitbucket, >> >> I wonder if it makes sense to fork it elsewhere and push the pending PRs there. >> >> >> [1] https://bitbucket.org/jisooy/pmbench/pull-requests/5 > Maybe try to contact the original author with email firstly? That's a good idea. I'm not planning to fork/maintain it myself, but if anyone is interested in doing so, I am happy to help out and submit PRs there. > Best Regards, > Huang, Ying > >>> o/patch w/patch >>> real? 0m0.035s?? real? 0m0.024s >>> user? 0m0.000s?? user? 0m0.000s >>> sys? 0m0.035s??? sys? 0m0.024s >>> >>> the migratepages time is reduced above 32%. >>> >>> But there has a problem, i see the batch flush is called by >>> migrate_pages_batch >>> ??try_to_unmap_flush >>> ??? arch_tlbbatch_flush(&tlb_ubc->arch); // there batch flush really work. >>> >>> But in arm64, the arch_tlbbatch_flush are not supported, becasue it >>> not support CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet. >>> >>> So, the tlb batch flush means no any flush is did, it is a empty func. >>> >>> Maybe this patch can help solve this problem. >>> https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/ >>> >>> >>> >>> >>> >>>
From: "Huang, Ying" <ying.huang@intel.com> Now, migrate_pages() migrate pages one by one, like the fake code as follows, for each page unmap flush TLB copy restore map If multiple pages are passed to migrate_pages(), there are opportunities to batch the TLB flushing and copying. That is, we can change the code to something as follows, for each page unmap for each page flush TLB for each page copy for each page restore map The total number of TLB flushing IPI can be reduced considerably. And we may use some hardware accelerator such as DSA to accelerate the page copying. So in this patch, we refactor the migrate_pages() implementation and implement the TLB flushing batching. Base on this, hardware accelerated page copying can be implemented. If too many pages are passed to migrate_pages(), in the naive batched implementation, we may unmap too many pages at the same time. The possibility for a task to wait for the migrated pages to be mapped again increases. So the latency may be hurt. To deal with this issue, the max number of pages be unmapped in batch is restricted to no more than HPAGE_PMD_NR. That is, the influence is at the same level of THP migration. We use the following test to measure the performance impact of the patchset, On a 2-socket Intel server, - Run pmbench memory accessing benchmark - Run `migratepages` to migrate pages of pmbench between node 0 and node 1 back and forth. With the patch, the TLB flushing IPI reduces 99.1% during the test and the number of pages migrated successfully per second increases 291.7%. This patchset is based on v6.0-rc5 and the following patchset, [PATCH -V3 0/8] migrate_pages(): fix several bugs in error path https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@intel.com/ The migrate_pages() related code is converting to folio now. So this patchset cannot apply recent akpm/mm-unstable branch. This patchset is used to check the basic idea. If it is OK, I will rebase the patchset on top of folio changes. Best Regards, Huang, Ying