Message ID | 20241219124717.4907-1-donettom@linux.ibm.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm: migration :shared anonymous migration test is failing | expand |
On 19.12.24 13:47, Donet Tom wrote: > The migration selftest is currently failing for shared anonymous > mappings due to a race condition. > > During migration, the source folio's PTE is unmapped by nuking the > PTE, flushing the TLB,and then marking the page for migration > (by creating the swap entries). The issue arises when, immediately > after the PTE is nuked and the TLB is flushed, but before the page > is marked for migration, another thread accesses the page. This > triggers a page fault, and the page fault handler invokes > do_pte_missing() instead of do_swap_page(), as the page is not yet > marked for migration. > > In the fault handling path, do_pte_missing() calls __do_fault() > ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry(). > This eventually calls folio_try_get(), incrementing the reference > count of the folio undergoing migration. The thread then blocks > on folio_lock(), as the migration path holds the lock. This > results in the migration failing in __migrate_folio(), which expects > the folio's reference count to be 2. However, the reference count is > incremented by the fault handler, leading to the failure. > > The issue arises because, after nuking the PTE and before marking the > page for migration, the page is accessed. To address this, we have > updated the logic to first nuke the PTE, then mark the page for > migration, and only then flush the TLB. With this patch, If the page is > accessed immediately after nuking the PTE, the TLB entry is still > valid, so no fault occurs. After marking the page for migration, > flushing the TLB ensures that the next page fault correctly triggers > do_swap_page() and waits for the migration to complete. > Does this reproduce with commit 536ab838a5b37b6ae3f8d53552560b7c51daeb41 Author: Dev Jain <dev.jain@arm.com> Date: Fri Aug 30 10:46:09 2024 +0530 selftests/mm: relax test to fail after 100 migration failures It was recently observed at [1] that during the folio unmapping stage of migration, when the PTEs are cleared, a racing thread faulting on that folio may increase the refcount of the folio, sleep on the folio lock (the migration path has the lock), and migration ultimately fails when asserting the actual refcount against the expected. Thereby, the migration selftest fails on shared-anon mappings. The above enforces the fact that migration is a best-effort service, therefore, it is wrong to fail the test for just a single failure; hence, fail the test after 100 consecutive failures (where 100 is still a subjective choice). Note that, this has no effect on the execution time of the test since that is controlled by a timeout. [1] https://lore.kernel.org/all/20240801081657.1386743-1-dev.jain@arm.com/ part of 6.12? As part of that discussion, we discussed alternatives, such as retrying migration more often internally.
On 19.12.24 13:47, Donet Tom wrote: > The migration selftest is currently failing for shared anonymous > mappings due to a race condition. > > During migration, the source folio's PTE is unmapped by nuking the > PTE, flushing the TLB,and then marking the page for migration > (by creating the swap entries). The issue arises when, immediately > after the PTE is nuked and the TLB is flushed, but before the page > is marked for migration, another thread accesses the page. This > triggers a page fault, and the page fault handler invokes > do_pte_missing() instead of do_swap_page(), as the page is not yet > marked for migration. > > In the fault handling path, do_pte_missing() calls __do_fault() > ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry(). > This eventually calls folio_try_get(), incrementing the reference > count of the folio undergoing migration. The thread then blocks > on folio_lock(), as the migration path holds the lock. This > results in the migration failing in __migrate_folio(), which expects > the folio's reference count to be 2. However, the reference count is > incremented by the fault handler, leading to the failure. > > The issue arises because, after nuking the PTE and before marking the > page for migration, the page is accessed. To address this, we have > updated the logic to first nuke the PTE, then mark the page for > migration, and only then flush the TLB. With this patch, If the page is > accessed immediately after nuking the PTE, the TLB entry is still > valid, so no fault occurs. But what about if the PTE is not in the TLB yet, and you get an access from another CPU just after clearing the PTE (but not flushing the TLB)? The other CPU will still observe PTE=none, trigger a fault etc. So I don't think what you propose rules out all cases.
On 12/19/24 18:25, David Hildenbrand wrote: > On 19.12.24 13:47, Donet Tom wrote: >> The migration selftest is currently failing for shared anonymous >> mappings due to a race condition. >> >> During migration, the source folio's PTE is unmapped by nuking the >> PTE, flushing the TLB,and then marking the page for migration >> (by creating the swap entries). The issue arises when, immediately >> after the PTE is nuked and the TLB is flushed, but before the page >> is marked for migration, another thread accesses the page. This >> triggers a page fault, and the page fault handler invokes >> do_pte_missing() instead of do_swap_page(), as the page is not yet >> marked for migration. >> >> In the fault handling path, do_pte_missing() calls __do_fault() >> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry(). >> This eventually calls folio_try_get(), incrementing the reference >> count of the folio undergoing migration. The thread then blocks >> on folio_lock(), as the migration path holds the lock. This >> results in the migration failing in __migrate_folio(), which expects >> the folio's reference count to be 2. However, the reference count is >> incremented by the fault handler, leading to the failure. >> >> The issue arises because, after nuking the PTE and before marking the >> page for migration, the page is accessed. To address this, we have >> updated the logic to first nuke the PTE, then mark the page for >> migration, and only then flush the TLB. With this patch, If the page is >> accessed immediately after nuking the PTE, the TLB entry is still >> valid, so no fault occurs. After marking the page for migration, >> flushing the TLB ensures that the next page fault correctly triggers >> do_swap_page() and waits for the migration to complete. >> > > Does this reproduce with > > commit 536ab838a5b37b6ae3f8d53552560b7c51daeb41 > Author: Dev Jain <dev.jain@arm.com> > Date: Fri Aug 30 10:46:09 2024 +0530 > > selftests/mm: relax test to fail after 100 migration failures > It was recently observed at [1] that during the folio > unmapping stage of > migration, when the PTEs are cleared, a racing thread faulting on > that > folio may increase the refcount of the folio, sleep on the folio > lock (the > migration path has the lock), and migration ultimately fails when > asserting the actual refcount against the expected. Thereby, the > migration selftest fails on shared-anon mappings. The above > enforces the > fact that migration is a best-effort service, therefore, it is > wrong to > fail the test for just a single failure; hence, fail the test > after 100 > consecutive failures (where 100 is still a subjective choice). > Note that, > this has no effect on the execution time of the test since that is > controlled by a timeout. > [1] > https://lore.kernel.org/all/20240801081657.1386743-1-dev.jain@arm.com/ > > part of 6.12? > > > As part of that discussion, we discussed alternatives, such as > retrying migration more often internally. > I was trying with this patch and was able to recreate the issue on PowerPC. I tried increasing the retry count, but the test was still failing.
On 2024/12/19 20:47, Donet Tom wrote: > The migration selftest is currently failing for shared anonymous > mappings due to a race condition. > > During migration, the source folio's PTE is unmapped by nuking the > PTE, flushing the TLB,and then marking the page for migration > (by creating the swap entries). The issue arises when, immediately > after the PTE is nuked and the TLB is flushed, but before the page > is marked for migration, another thread accesses the page. This > triggers a page fault, and the page fault handler invokes > do_pte_missing() instead of do_swap_page(), as the page is not yet > marked for migration. > > In the fault handling path, do_pte_missing() calls __do_fault() > ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry(). > This eventually calls folio_try_get(), incrementing the reference > count of the folio undergoing migration. The thread then blocks > on folio_lock(), as the migration path holds the lock. This > results in the migration failing in __migrate_folio(), which expects > the folio's reference count to be 2. However, the reference count is > incremented by the fault handler, leading to the failure. > > The issue arises because, after nuking the PTE and before marking the > page for migration, the page is accessed. To address this, we have > updated the logic to first nuke the PTE, then mark the page for > migration, and only then flush the TLB. With this patch, If the page is > accessed immediately after nuking the PTE, the TLB entry is still > valid, so no fault occurs. After marking the page for migration, IMO, I don't think this assumption is correct. At this point, the TLB entry might also be evicted, so a page fault could still occur. It's just a matter of probability. Additionally, IIUC, if another thread is accessing the shmem folio causing the migration to fail, I think this is expected, and migration failure is not a vital issue?
On 12/19/24 18:28, David Hildenbrand wrote: > On 19.12.24 13:47, Donet Tom wrote: >> The migration selftest is currently failing for shared anonymous >> mappings due to a race condition. >> >> During migration, the source folio's PTE is unmapped by nuking the >> PTE, flushing the TLB,and then marking the page for migration >> (by creating the swap entries). The issue arises when, immediately >> after the PTE is nuked and the TLB is flushed, but before the page >> is marked for migration, another thread accesses the page. This >> triggers a page fault, and the page fault handler invokes >> do_pte_missing() instead of do_swap_page(), as the page is not yet >> marked for migration. >> >> In the fault handling path, do_pte_missing() calls __do_fault() >> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry(). >> This eventually calls folio_try_get(), incrementing the reference >> count of the folio undergoing migration. The thread then blocks >> on folio_lock(), as the migration path holds the lock. This >> results in the migration failing in __migrate_folio(), which expects >> the folio's reference count to be 2. However, the reference count is >> incremented by the fault handler, leading to the failure. >> >> The issue arises because, after nuking the PTE and before marking the >> page for migration, the page is accessed. To address this, we have >> updated the logic to first nuke the PTE, then mark the page for >> migration, and only then flush the TLB. With this patch, If the page is >> accessed immediately after nuking the PTE, the TLB entry is still >> valid, so no fault occurs. > > But what about if the PTE is not in the TLB yet, and you get an access > from another CPU just after clearing the PTE (but not flushing the > TLB)? The other CPU will still observe PTE=none, trigger a fault etc. > Yes, in this scenario, the migration will fail. Do you think the migration test failure, even after a retry, should be considered a major issue that must be fixed? > So I don't think what you propose rules out all cases. >
On 12/20/24 08:01, Baolin Wang wrote: > > > On 2024/12/19 20:47, Donet Tom wrote: >> The migration selftest is currently failing for shared anonymous >> mappings due to a race condition. >> >> During migration, the source folio's PTE is unmapped by nuking the >> PTE, flushing the TLB,and then marking the page for migration >> (by creating the swap entries). The issue arises when, immediately >> after the PTE is nuked and the TLB is flushed, but before the page >> is marked for migration, another thread accesses the page. This >> triggers a page fault, and the page fault handler invokes >> do_pte_missing() instead of do_swap_page(), as the page is not yet >> marked for migration. >> >> In the fault handling path, do_pte_missing() calls __do_fault() >> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry(). >> This eventually calls folio_try_get(), incrementing the reference >> count of the folio undergoing migration. The thread then blocks >> on folio_lock(), as the migration path holds the lock. This >> results in the migration failing in __migrate_folio(), which expects >> the folio's reference count to be 2. However, the reference count is >> incremented by the fault handler, leading to the failure. >> >> The issue arises because, after nuking the PTE and before marking the >> page for migration, the page is accessed. To address this, we have >> updated the logic to first nuke the PTE, then mark the page for >> migration, and only then flush the TLB. With this patch, If the page is >> accessed immediately after nuking the PTE, the TLB entry is still >> valid, so no fault occurs. After marking the page for migration, > > IMO, I don't think this assumption is correct. At this point, the TLB > entry might also be evicted, so a page fault could still occur. It's > just a matter of probability. In this patch, we mark the page for migration before flushing the TLB. This ensures that if someone accesses the page after the TLB flush, the page fault will occur and in the page fault handler will wait for the migration to complete. So migration will not fail Without this patch, if someone accesses the page after the TLB flush but before it is marked for migration, the migration will fail. > > Additionally, IIUC, if another thread is accessing the shmem folio > causing the migration to fail, I think this is expected, and migration > failure is not a vital issue? > In my case, the shmem migration test is always failing, even after retries. Would it be correct to consider this as expected behavior?
On 2024/12/20 11:12, Donet Tom wrote: > > On 12/20/24 08:01, Baolin Wang wrote: >> >> >> On 2024/12/19 20:47, Donet Tom wrote: >>> The migration selftest is currently failing for shared anonymous >>> mappings due to a race condition. >>> >>> During migration, the source folio's PTE is unmapped by nuking the >>> PTE, flushing the TLB,and then marking the page for migration >>> (by creating the swap entries). The issue arises when, immediately >>> after the PTE is nuked and the TLB is flushed, but before the page >>> is marked for migration, another thread accesses the page. This >>> triggers a page fault, and the page fault handler invokes >>> do_pte_missing() instead of do_swap_page(), as the page is not yet >>> marked for migration. >>> >>> In the fault handling path, do_pte_missing() calls __do_fault() >>> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry(). >>> This eventually calls folio_try_get(), incrementing the reference >>> count of the folio undergoing migration. The thread then blocks >>> on folio_lock(), as the migration path holds the lock. This >>> results in the migration failing in __migrate_folio(), which expects >>> the folio's reference count to be 2. However, the reference count is >>> incremented by the fault handler, leading to the failure. >>> >>> The issue arises because, after nuking the PTE and before marking the >>> page for migration, the page is accessed. To address this, we have >>> updated the logic to first nuke the PTE, then mark the page for >>> migration, and only then flush the TLB. With this patch, If the page is >>> accessed immediately after nuking the PTE, the TLB entry is still >>> valid, so no fault occurs. After marking the page for migration, >> >> IMO, I don't think this assumption is correct. At this point, the TLB >> entry might also be evicted, so a page fault could still occur. It's >> just a matter of probability. > In this patch, we mark the page for migration before flushing the TLB. > This ensures that if someone accesses the page after the TLB flush, > the page fault will occur and in the page fault handler will wait for the > migration to complete. So migration will not fail > > Without this patch, if someone accesses the page after the TLB flush > but before it is marked for migration, the migration will fail. Actually my concern is the same as David's (I did not see David's reply before sending my comments), which is that your patch does not "rules out all cases". >> Additionally, IIUC, if another thread is accessing the shmem folio >> causing the migration to fail, I think this is expected, and migration >> failure is not a vital issue? >> > In my case, the shmem migration test is always failing, > even after retries. Would it be correct to consider this > as expected behavior? IMHO I think your test case is too aggressive and unlikely to occur in real-world scenarios. Additionally, as I mentioned, migration failure is not a vital issue in the system, and some temporary refcnt can also lead to migration failure if you want to create such test cases. So personally, I don't think it is worthy doing.
On 12/20/24 09:02, Baolin Wang wrote: > > > On 2024/12/20 11:12, Donet Tom wrote: >> >> On 12/20/24 08:01, Baolin Wang wrote: >>> >>> >>> On 2024/12/19 20:47, Donet Tom wrote: >>>> The migration selftest is currently failing for shared anonymous >>>> mappings due to a race condition. >>>> >>>> During migration, the source folio's PTE is unmapped by nuking the >>>> PTE, flushing the TLB,and then marking the page for migration >>>> (by creating the swap entries). The issue arises when, immediately >>>> after the PTE is nuked and the TLB is flushed, but before the page >>>> is marked for migration, another thread accesses the page. This >>>> triggers a page fault, and the page fault handler invokes >>>> do_pte_missing() instead of do_swap_page(), as the page is not yet >>>> marked for migration. >>>> >>>> In the fault handling path, do_pte_missing() calls __do_fault() >>>> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry(). >>>> This eventually calls folio_try_get(), incrementing the reference >>>> count of the folio undergoing migration. The thread then blocks >>>> on folio_lock(), as the migration path holds the lock. This >>>> results in the migration failing in __migrate_folio(), which expects >>>> the folio's reference count to be 2. However, the reference count is >>>> incremented by the fault handler, leading to the failure. >>>> >>>> The issue arises because, after nuking the PTE and before marking the >>>> page for migration, the page is accessed. To address this, we have >>>> updated the logic to first nuke the PTE, then mark the page for >>>> migration, and only then flush the TLB. With this patch, If the >>>> page is >>>> accessed immediately after nuking the PTE, the TLB entry is still >>>> valid, so no fault occurs. After marking the page for migration, >>> >>> IMO, I don't think this assumption is correct. At this point, the >>> TLB entry might also be evicted, so a page fault could still occur. >>> It's just a matter of probability. >> In this patch, we mark the page for migration before flushing the TLB. >> This ensures that if someone accesses the page after the TLB flush, >> the page fault will occur and in the page fault handler will wait for >> the >> migration to complete. So migration will not fail >> >> Without this patch, if someone accesses the page after the TLB flush >> but before it is marked for migration, the migration will fail. > > Actually my concern is the same as David's (I did not see David's > reply before sending my comments), which is that your patch does not > "rules out all cases". > >>> Additionally, IIUC, if another thread is accessing the shmem folio >>> causing the migration to fail, I think this is expected, and >>> migration failure is not a vital issue? >>> >> In my case, the shmem migration test is always failing, >> even after retries. Would it be correct to consider this >> as expected behavior? > > IMHO I think your test case is too aggressive and unlikely to occur in > real-world scenarios. Additionally, as I mentioned, migration failure > is not a vital issue in the system, and some temporary refcnt can also > lead to migration failure if you want to create such test cases. So > personally, I don't think it is worthy doing. Sure. Thank you Baolin.
On 20/12/24 9:02 am, Baolin Wang wrote: > > > On 2024/12/20 11:12, Donet Tom wrote: >> >> On 12/20/24 08:01, Baolin Wang wrote: >>> >>> >>> On 2024/12/19 20:47, Donet Tom wrote: >>>> The migration selftest is currently failing for shared anonymous >>>> mappings due to a race condition. >>>> >>>> During migration, the source folio's PTE is unmapped by nuking the >>>> PTE, flushing the TLB,and then marking the page for migration >>>> (by creating the swap entries). The issue arises when, immediately >>>> after the PTE is nuked and the TLB is flushed, but before the page >>>> is marked for migration, another thread accesses the page. This >>>> triggers a page fault, and the page fault handler invokes >>>> do_pte_missing() instead of do_swap_page(), as the page is not yet >>>> marked for migration. >>>> >>>> In the fault handling path, do_pte_missing() calls __do_fault() >>>> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry(). >>>> This eventually calls folio_try_get(), incrementing the reference >>>> count of the folio undergoing migration. The thread then blocks >>>> on folio_lock(), as the migration path holds the lock. This >>>> results in the migration failing in __migrate_folio(), which expects >>>> the folio's reference count to be 2. However, the reference count is >>>> incremented by the fault handler, leading to the failure. >>>> >>>> The issue arises because, after nuking the PTE and before marking the >>>> page for migration, the page is accessed. To address this, we have >>>> updated the logic to first nuke the PTE, then mark the page for >>>> migration, and only then flush the TLB. With this patch, If the >>>> page is >>>> accessed immediately after nuking the PTE, the TLB entry is still >>>> valid, so no fault occurs. After marking the page for migration, >>> >>> IMO, I don't think this assumption is correct. At this point, the >>> TLB entry might also be evicted, so a page fault could still occur. >>> It's just a matter of probability. >> In this patch, we mark the page for migration before flushing the TLB. >> This ensures that if someone accesses the page after the TLB flush, >> the page fault will occur and in the page fault handler will wait for >> the >> migration to complete. So migration will not fail >> >> Without this patch, if someone accesses the page after the TLB flush >> but before it is marked for migration, the migration will fail. > > Actually my concern is the same as David's (I did not see David's > reply before sending my comments), which is that your patch does not > "rules out all cases". I like this solution but really the proper solution for this one was to atomically set the migration entry IMHO. > >>> Additionally, IIUC, if another thread is accessing the shmem folio >>> causing the migration to fail, I think this is expected, and >>> migration failure is not a vital issue? >>> >> In my case, the shmem migration test is always failing, >> even after retries. Would it be correct to consider this >> as expected behavior? > > IMHO I think your test case is too aggressive and unlikely to occur in > real-world scenarios. Additionally, as I mentioned, migration failure > is not a vital issue in the system, and some temporary refcnt can also > lead to migration failure if you want to create such test cases. So > personally, I don't think it is worthy doing. Agreed, AFAIR the test case starts faulting exactly on those pages which we want to migrate, making this a very artificial scenario.
Hi Donet, kernel test robot noticed the following build errors: [auto build test ERROR on akpm-mm/mm-everything] url: https://github.com/intel-lab-lkp/linux/commits/Donet-Tom/mm-migration-shared-anonymous-migration-test-is-failing/20241219-204920 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20241219124717.4907-1-donettom%40linux.ibm.com patch subject: [PATCH] mm: migration :shared anonymous migration test is failing config: arc-randconfig-002-20241220 (https://download.01.org/0day-ci/archive/20241220/202412201738.gUNPwJ1x-lkp@intel.com/config) compiler: arc-elf-gcc (GCC) 13.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241220/202412201738.gUNPwJ1x-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202412201738.gUNPwJ1x-lkp@intel.com/ All errors (new ones prefixed by >>): In file included from include/linux/migrate.h:8, from mm/rmap.c:69: include/linux/hugetlb.h:1063:5: warning: no previous prototype for 'replace_free_hugepage_folios' [-Wmissing-prototypes] 1063 | int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ mm/rmap.c: In function 'try_to_migrate_one': >> mm/rmap.c:2157:34: error: implicit declaration of function 'huge_ptep_get_and_clear'; did you mean 'ptep_get_and_clear'? [-Werror=implicit-function-declaration] 2157 | pteval = huge_ptep_get_and_clear(mm, address, pvmw.pte); | ^~~~~~~~~~~~~~~~~~~~~~~ | ptep_get_and_clear >> mm/rmap.c:2157:34: error: incompatible types when assigning to type 'pte_t' from type 'int' >> mm/rmap.c:2326:33: error: implicit declaration of function 'flush_hugetlb_page'; did you mean 'flush_tlb_page'? [-Werror=implicit-function-declaration] 2326 | flush_hugetlb_page(vma, address); | ^~~~~~~~~~~~~~~~~~ | flush_tlb_page cc1: some warnings being treated as errors vim +2157 mm/rmap.c 2081 2082 /* Unexpected PMD-mapped THP? */ 2083 VM_BUG_ON_FOLIO(!pvmw.pte, folio); 2084 2085 pfn = pte_pfn(ptep_get(pvmw.pte)); 2086 2087 if (folio_is_zone_device(folio)) { 2088 /* 2089 * Our PTE is a non-present device exclusive entry and 2090 * calculating the subpage as for the common case would 2091 * result in an invalid pointer. 2092 * 2093 * Since only PAGE_SIZE pages can currently be 2094 * migrated, just set it to page. This will need to be 2095 * changed when hugepage migrations to device private 2096 * memory are supported. 2097 */ 2098 VM_BUG_ON_FOLIO(folio_nr_pages(folio) > 1, folio); 2099 subpage = &folio->page; 2100 } else { 2101 subpage = folio_page(folio, pfn - folio_pfn(folio)); 2102 } 2103 address = pvmw.address; 2104 anon_exclusive = folio_test_anon(folio) && 2105 PageAnonExclusive(subpage); 2106 2107 if (folio_test_hugetlb(folio)) { 2108 bool anon = folio_test_anon(folio); 2109 2110 /* 2111 * huge_pmd_unshare may unmap an entire PMD page. 2112 * There is no way of knowing exactly which PMDs may 2113 * be cached for this mm, so we must flush them all. 2114 * start/end were already adjusted above to cover this 2115 * range. 2116 */ 2117 flush_cache_range(vma, range.start, range.end); 2118 2119 /* 2120 * To call huge_pmd_unshare, i_mmap_rwsem must be 2121 * held in write mode. Caller needs to explicitly 2122 * do this outside rmap routines. 2123 * 2124 * We also must hold hugetlb vma_lock in write mode. 2125 * Lock order dictates acquiring vma_lock BEFORE 2126 * i_mmap_rwsem. We can only try lock here and 2127 * fail if unsuccessful. 2128 */ 2129 if (!anon) { 2130 VM_BUG_ON(!(flags & TTU_RMAP_LOCKED)); 2131 if (!hugetlb_vma_trylock_write(vma)) { 2132 page_vma_mapped_walk_done(&pvmw); 2133 ret = false; 2134 break; 2135 } 2136 if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) { 2137 hugetlb_vma_unlock_write(vma); 2138 flush_tlb_range(vma, 2139 range.start, range.end); 2140 2141 /* 2142 * The ref count of the PMD page was 2143 * dropped which is part of the way map 2144 * counting is done for shared PMDs. 2145 * Return 'true' here. When there is 2146 * no other sharing, huge_pmd_unshare 2147 * returns false and we will unmap the 2148 * actual page and drop map count 2149 * to zero. 2150 */ 2151 page_vma_mapped_walk_done(&pvmw); 2152 break; 2153 } 2154 hugetlb_vma_unlock_write(vma); 2155 } 2156 /* Nuke the hugetlb page table entry */ > 2157 pteval = huge_ptep_get_and_clear(mm, address, pvmw.pte); 2158 } else { 2159 flush_cache_page(vma, address, pfn); 2160 /* Nuke the page table entry. */ 2161 if (should_defer_flush(mm, flags)) { 2162 /* 2163 * We clear the PTE but do not flush so potentially 2164 * a remote CPU could still be writing to the folio. 2165 * If the entry was previously clean then the 2166 * architecture must guarantee that a clear->dirty 2167 * transition on a cached TLB entry is written through 2168 * and traps if the PTE is unmapped. 2169 */ 2170 pteval = ptep_get_and_clear(mm, address, pvmw.pte); 2171 2172 set_tlb_ubc_flush_pending(mm, pteval, address); 2173 } else { 2174 pteval = ptep_get_and_clear(mm, address, pvmw.pte); 2175 } 2176 } 2177 2178 /* Set the dirty flag on the folio now the pte is gone. */ 2179 if (pte_dirty(pteval)) 2180 folio_mark_dirty(folio); 2181 2182 /* Update high watermark before we lower rss */ 2183 update_hiwater_rss(mm); 2184 2185 if (folio_is_device_private(folio)) { 2186 unsigned long pfn = folio_pfn(folio); 2187 swp_entry_t entry; 2188 pte_t swp_pte; 2189 2190 if (anon_exclusive) 2191 WARN_ON_ONCE(folio_try_share_anon_rmap_pte(folio, 2192 subpage)); 2193 2194 /* 2195 * Store the pfn of the page in a special migration 2196 * pte. do_swap_page() will wait until the migration 2197 * pte is removed and then restart fault handling. 2198 */ 2199 entry = pte_to_swp_entry(pteval); 2200 if (is_writable_device_private_entry(entry)) 2201 entry = make_writable_migration_entry(pfn); 2202 else if (anon_exclusive) 2203 entry = make_readable_exclusive_migration_entry(pfn); 2204 else 2205 entry = make_readable_migration_entry(pfn); 2206 swp_pte = swp_entry_to_pte(entry); 2207 2208 /* 2209 * pteval maps a zone device page and is therefore 2210 * a swap pte. 2211 */ 2212 if (pte_swp_soft_dirty(pteval)) 2213 swp_pte = pte_swp_mksoft_dirty(swp_pte); 2214 if (pte_swp_uffd_wp(pteval)) 2215 swp_pte = pte_swp_mkuffd_wp(swp_pte); 2216 set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); 2217 trace_set_migration_pte(pvmw.address, pte_val(swp_pte), 2218 folio_order(folio)); 2219 /* 2220 * No need to invalidate here it will synchronize on 2221 * against the special swap migration pte. 2222 */ 2223 } else if (PageHWPoison(subpage)) { 2224 pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); 2225 if (folio_test_hugetlb(folio)) { 2226 hugetlb_count_sub(folio_nr_pages(folio), mm); 2227 set_huge_pte_at(mm, address, pvmw.pte, pteval, 2228 hsz); 2229 } else { 2230 dec_mm_counter(mm, mm_counter(folio)); 2231 set_pte_at(mm, address, pvmw.pte, pteval); 2232 } 2233 2234 } else if (pte_unused(pteval) && !userfaultfd_armed(vma)) { 2235 /* 2236 * The guest indicated that the page content is of no 2237 * interest anymore. Simply discard the pte, vmscan 2238 * will take care of the rest. 2239 * A future reference will then fault in a new zero 2240 * page. When userfaultfd is active, we must not drop 2241 * this page though, as its main user (postcopy 2242 * migration) will not expect userfaults on already 2243 * copied pages. 2244 */ 2245 dec_mm_counter(mm, mm_counter(folio)); 2246 } else { 2247 swp_entry_t entry; 2248 pte_t swp_pte; 2249 2250 if (arch_unmap_one(mm, vma, address, pteval) < 0) { 2251 if (folio_test_hugetlb(folio)) 2252 set_huge_pte_at(mm, address, pvmw.pte, 2253 pteval, hsz); 2254 else 2255 set_pte_at(mm, address, pvmw.pte, pteval); 2256 ret = false; 2257 page_vma_mapped_walk_done(&pvmw); 2258 break; 2259 } 2260 VM_BUG_ON_PAGE(pte_write(pteval) && folio_test_anon(folio) && 2261 !anon_exclusive, subpage); 2262 2263 /* See folio_try_share_anon_rmap_pte(): clear PTE first. */ 2264 if (folio_test_hugetlb(folio)) { 2265 if (anon_exclusive && 2266 hugetlb_try_share_anon_rmap(folio)) { 2267 set_huge_pte_at(mm, address, pvmw.pte, 2268 pteval, hsz); 2269 ret = false; 2270 page_vma_mapped_walk_done(&pvmw); 2271 break; 2272 } 2273 } else if (anon_exclusive && 2274 folio_try_share_anon_rmap_pte(folio, subpage)) { 2275 set_pte_at(mm, address, pvmw.pte, pteval); 2276 ret = false; 2277 page_vma_mapped_walk_done(&pvmw); 2278 break; 2279 } 2280 2281 /* 2282 * Store the pfn of the page in a special migration 2283 * pte. do_swap_page() will wait until the migration 2284 * pte is removed and then restart fault handling. 2285 */ 2286 if (pte_write(pteval)) 2287 entry = make_writable_migration_entry( 2288 page_to_pfn(subpage)); 2289 else if (anon_exclusive) 2290 entry = make_readable_exclusive_migration_entry( 2291 page_to_pfn(subpage)); 2292 else 2293 entry = make_readable_migration_entry( 2294 page_to_pfn(subpage)); 2295 if (pte_young(pteval)) 2296 entry = make_migration_entry_young(entry); 2297 if (pte_dirty(pteval)) 2298 entry = make_migration_entry_dirty(entry); 2299 swp_pte = swp_entry_to_pte(entry); 2300 if (pte_soft_dirty(pteval)) 2301 swp_pte = pte_swp_mksoft_dirty(swp_pte); 2302 if (pte_uffd_wp(pteval)) 2303 swp_pte = pte_swp_mkuffd_wp(swp_pte); 2304 if (folio_test_hugetlb(folio)) 2305 set_huge_pte_at(mm, address, pvmw.pte, swp_pte, 2306 hsz); 2307 else 2308 set_pte_at(mm, address, pvmw.pte, swp_pte); 2309 trace_set_migration_pte(address, pte_val(swp_pte), 2310 folio_order(folio)); 2311 /* 2312 * No need to invalidate here it will synchronize on 2313 * against the special swap migration pte. 2314 */ 2315 } 2316 2317 if (unlikely(folio_test_hugetlb(folio))) 2318 hugetlb_remove_rmap(folio); 2319 else 2320 folio_remove_rmap_pte(folio, subpage, vma); 2321 if (vma->vm_flags & VM_LOCKED) 2322 mlock_drain_local(); 2323 2324 if (!should_defer_flush(mm, flags)) { 2325 if (folio_test_hugetlb(folio)) > 2326 flush_hugetlb_page(vma, address); 2327 else 2328 flush_tlb_page(vma, address); 2329 } 2330 2331 folio_put(folio); 2332 } 2333 2334 mmu_notifier_invalidate_range_end(&range); 2335 2336 return ret; 2337 } 2338
On 20.12.24 03:55, Donet Tom wrote: > > On 12/19/24 18:28, David Hildenbrand wrote: >> On 19.12.24 13:47, Donet Tom wrote: >>> The migration selftest is currently failing for shared anonymous >>> mappings due to a race condition. >>> >>> During migration, the source folio's PTE is unmapped by nuking the >>> PTE, flushing the TLB,and then marking the page for migration >>> (by creating the swap entries). The issue arises when, immediately >>> after the PTE is nuked and the TLB is flushed, but before the page >>> is marked for migration, another thread accesses the page. This >>> triggers a page fault, and the page fault handler invokes >>> do_pte_missing() instead of do_swap_page(), as the page is not yet >>> marked for migration. >>> >>> In the fault handling path, do_pte_missing() calls __do_fault() >>> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry(). >>> This eventually calls folio_try_get(), incrementing the reference >>> count of the folio undergoing migration. The thread then blocks >>> on folio_lock(), as the migration path holds the lock. This >>> results in the migration failing in __migrate_folio(), which expects >>> the folio's reference count to be 2. However, the reference count is >>> incremented by the fault handler, leading to the failure. >>> >>> The issue arises because, after nuking the PTE and before marking the >>> page for migration, the page is accessed. To address this, we have >>> updated the logic to first nuke the PTE, then mark the page for >>> migration, and only then flush the TLB. With this patch, If the page is >>> accessed immediately after nuking the PTE, the TLB entry is still >>> valid, so no fault occurs. >> >> But what about if the PTE is not in the TLB yet, and you get an access >> from another CPU just after clearing the PTE (but not flushing the >> TLB)? The other CPU will still observe PTE=none, trigger a fault etc. >> > Yes, in this scenario, the migration will fail. Do you think the > migration test > failure, even after a retry, should be considered a major issue that > must be fixed? I think it is something we should definitely improve, but I think our page migration should handle this in a better way, not the unmap logic. I recall we discussed with Dev some ideas on how to improve that? I'm pretty sure one can trigger similar case using a tmpfs file and using read/write in a loop instead of memory access -> page faults. So where racing with page faults is completely out of the picture.
Hi Donet, kernel test robot noticed the following build errors: [auto build test ERROR on akpm-mm/mm-everything] url: https://github.com/intel-lab-lkp/linux/commits/Donet-Tom/mm-migration-shared-anonymous-migration-test-is-failing/20241219-204920 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20241219124717.4907-1-donettom%40linux.ibm.com patch subject: [PATCH] mm: migration :shared anonymous migration test is failing config: arm-randconfig-001-20241220 (https://download.01.org/0day-ci/archive/20241220/202412201828.3GvmHte5-lkp@intel.com/config) compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241220/202412201828.3GvmHte5-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202412201828.3GvmHte5-lkp@intel.com/ All errors (new ones prefixed by >>): In file included from mm/rmap.c:69: In file included from include/linux/migrate.h:8: include/linux/hugetlb.h:1063:5: warning: no previous prototype for function 'replace_free_hugepage_folios' [-Wmissing-prototypes] 1063 | int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn) | ^ include/linux/hugetlb.h:1063:1: note: declare 'static' if the function is not intended to be used outside of this translation unit 1063 | int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn) | ^ | static In file included from mm/rmap.c:76: include/linux/mm_inline.h:47:41: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion] 47 | __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages); | ~~~~~~~~~~~ ^ ~~~ include/linux/mm_inline.h:49:22: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum lru_list') [-Wenum-enum-conversion] 49 | NR_ZONE_LRU_BASE + lru, nr_pages); | ~~~~~~~~~~~~~~~~ ^ ~~~ >> mm/rmap.c:2157:13: error: call to undeclared function 'huge_ptep_get_and_clear'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] 2157 | pteval = huge_ptep_get_and_clear(mm, address, pvmw.pte); | ^ mm/rmap.c:2157:13: note: did you mean 'ptep_get_and_clear'? include/linux/pgtable.h:478:21: note: 'ptep_get_and_clear' declared here 478 | static inline pte_t ptep_get_and_clear(struct mm_struct *mm, | ^ >> mm/rmap.c:2326:5: error: call to undeclared function 'flush_hugetlb_page'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] 2326 | flush_hugetlb_page(vma, address); | ^ mm/rmap.c:2326:5: note: did you mean 'is_vm_hugetlb_page'? include/linux/hugetlb_inline.h:16:20: note: 'is_vm_hugetlb_page' declared here 16 | static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma) | ^ 3 warnings and 2 errors generated. vim +/huge_ptep_get_and_clear +2157 mm/rmap.c 2081 2082 /* Unexpected PMD-mapped THP? */ 2083 VM_BUG_ON_FOLIO(!pvmw.pte, folio); 2084 2085 pfn = pte_pfn(ptep_get(pvmw.pte)); 2086 2087 if (folio_is_zone_device(folio)) { 2088 /* 2089 * Our PTE is a non-present device exclusive entry and 2090 * calculating the subpage as for the common case would 2091 * result in an invalid pointer. 2092 * 2093 * Since only PAGE_SIZE pages can currently be 2094 * migrated, just set it to page. This will need to be 2095 * changed when hugepage migrations to device private 2096 * memory are supported. 2097 */ 2098 VM_BUG_ON_FOLIO(folio_nr_pages(folio) > 1, folio); 2099 subpage = &folio->page; 2100 } else { 2101 subpage = folio_page(folio, pfn - folio_pfn(folio)); 2102 } 2103 address = pvmw.address; 2104 anon_exclusive = folio_test_anon(folio) && 2105 PageAnonExclusive(subpage); 2106 2107 if (folio_test_hugetlb(folio)) { 2108 bool anon = folio_test_anon(folio); 2109 2110 /* 2111 * huge_pmd_unshare may unmap an entire PMD page. 2112 * There is no way of knowing exactly which PMDs may 2113 * be cached for this mm, so we must flush them all. 2114 * start/end were already adjusted above to cover this 2115 * range. 2116 */ 2117 flush_cache_range(vma, range.start, range.end); 2118 2119 /* 2120 * To call huge_pmd_unshare, i_mmap_rwsem must be 2121 * held in write mode. Caller needs to explicitly 2122 * do this outside rmap routines. 2123 * 2124 * We also must hold hugetlb vma_lock in write mode. 2125 * Lock order dictates acquiring vma_lock BEFORE 2126 * i_mmap_rwsem. We can only try lock here and 2127 * fail if unsuccessful. 2128 */ 2129 if (!anon) { 2130 VM_BUG_ON(!(flags & TTU_RMAP_LOCKED)); 2131 if (!hugetlb_vma_trylock_write(vma)) { 2132 page_vma_mapped_walk_done(&pvmw); 2133 ret = false; 2134 break; 2135 } 2136 if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) { 2137 hugetlb_vma_unlock_write(vma); 2138 flush_tlb_range(vma, 2139 range.start, range.end); 2140 2141 /* 2142 * The ref count of the PMD page was 2143 * dropped which is part of the way map 2144 * counting is done for shared PMDs. 2145 * Return 'true' here. When there is 2146 * no other sharing, huge_pmd_unshare 2147 * returns false and we will unmap the 2148 * actual page and drop map count 2149 * to zero. 2150 */ 2151 page_vma_mapped_walk_done(&pvmw); 2152 break; 2153 } 2154 hugetlb_vma_unlock_write(vma); 2155 } 2156 /* Nuke the hugetlb page table entry */ > 2157 pteval = huge_ptep_get_and_clear(mm, address, pvmw.pte); 2158 } else { 2159 flush_cache_page(vma, address, pfn); 2160 /* Nuke the page table entry. */ 2161 if (should_defer_flush(mm, flags)) { 2162 /* 2163 * We clear the PTE but do not flush so potentially 2164 * a remote CPU could still be writing to the folio. 2165 * If the entry was previously clean then the 2166 * architecture must guarantee that a clear->dirty 2167 * transition on a cached TLB entry is written through 2168 * and traps if the PTE is unmapped. 2169 */ 2170 pteval = ptep_get_and_clear(mm, address, pvmw.pte); 2171 2172 set_tlb_ubc_flush_pending(mm, pteval, address); 2173 } else { 2174 pteval = ptep_get_and_clear(mm, address, pvmw.pte); 2175 } 2176 } 2177 2178 /* Set the dirty flag on the folio now the pte is gone. */ 2179 if (pte_dirty(pteval)) 2180 folio_mark_dirty(folio); 2181 2182 /* Update high watermark before we lower rss */ 2183 update_hiwater_rss(mm); 2184 2185 if (folio_is_device_private(folio)) { 2186 unsigned long pfn = folio_pfn(folio); 2187 swp_entry_t entry; 2188 pte_t swp_pte; 2189 2190 if (anon_exclusive) 2191 WARN_ON_ONCE(folio_try_share_anon_rmap_pte(folio, 2192 subpage)); 2193 2194 /* 2195 * Store the pfn of the page in a special migration 2196 * pte. do_swap_page() will wait until the migration 2197 * pte is removed and then restart fault handling. 2198 */ 2199 entry = pte_to_swp_entry(pteval); 2200 if (is_writable_device_private_entry(entry)) 2201 entry = make_writable_migration_entry(pfn); 2202 else if (anon_exclusive) 2203 entry = make_readable_exclusive_migration_entry(pfn); 2204 else 2205 entry = make_readable_migration_entry(pfn); 2206 swp_pte = swp_entry_to_pte(entry); 2207 2208 /* 2209 * pteval maps a zone device page and is therefore 2210 * a swap pte. 2211 */ 2212 if (pte_swp_soft_dirty(pteval)) 2213 swp_pte = pte_swp_mksoft_dirty(swp_pte); 2214 if (pte_swp_uffd_wp(pteval)) 2215 swp_pte = pte_swp_mkuffd_wp(swp_pte); 2216 set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); 2217 trace_set_migration_pte(pvmw.address, pte_val(swp_pte), 2218 folio_order(folio)); 2219 /* 2220 * No need to invalidate here it will synchronize on 2221 * against the special swap migration pte. 2222 */ 2223 } else if (PageHWPoison(subpage)) { 2224 pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); 2225 if (folio_test_hugetlb(folio)) { 2226 hugetlb_count_sub(folio_nr_pages(folio), mm); 2227 set_huge_pte_at(mm, address, pvmw.pte, pteval, 2228 hsz); 2229 } else { 2230 dec_mm_counter(mm, mm_counter(folio)); 2231 set_pte_at(mm, address, pvmw.pte, pteval); 2232 } 2233 2234 } else if (pte_unused(pteval) && !userfaultfd_armed(vma)) { 2235 /* 2236 * The guest indicated that the page content is of no 2237 * interest anymore. Simply discard the pte, vmscan 2238 * will take care of the rest. 2239 * A future reference will then fault in a new zero 2240 * page. When userfaultfd is active, we must not drop 2241 * this page though, as its main user (postcopy 2242 * migration) will not expect userfaults on already 2243 * copied pages. 2244 */ 2245 dec_mm_counter(mm, mm_counter(folio)); 2246 } else { 2247 swp_entry_t entry; 2248 pte_t swp_pte; 2249 2250 if (arch_unmap_one(mm, vma, address, pteval) < 0) { 2251 if (folio_test_hugetlb(folio)) 2252 set_huge_pte_at(mm, address, pvmw.pte, 2253 pteval, hsz); 2254 else 2255 set_pte_at(mm, address, pvmw.pte, pteval); 2256 ret = false; 2257 page_vma_mapped_walk_done(&pvmw); 2258 break; 2259 } 2260 VM_BUG_ON_PAGE(pte_write(pteval) && folio_test_anon(folio) && 2261 !anon_exclusive, subpage); 2262 2263 /* See folio_try_share_anon_rmap_pte(): clear PTE first. */ 2264 if (folio_test_hugetlb(folio)) { 2265 if (anon_exclusive && 2266 hugetlb_try_share_anon_rmap(folio)) { 2267 set_huge_pte_at(mm, address, pvmw.pte, 2268 pteval, hsz); 2269 ret = false; 2270 page_vma_mapped_walk_done(&pvmw); 2271 break; 2272 } 2273 } else if (anon_exclusive && 2274 folio_try_share_anon_rmap_pte(folio, subpage)) { 2275 set_pte_at(mm, address, pvmw.pte, pteval); 2276 ret = false; 2277 page_vma_mapped_walk_done(&pvmw); 2278 break; 2279 } 2280 2281 /* 2282 * Store the pfn of the page in a special migration 2283 * pte. do_swap_page() will wait until the migration 2284 * pte is removed and then restart fault handling. 2285 */ 2286 if (pte_write(pteval)) 2287 entry = make_writable_migration_entry( 2288 page_to_pfn(subpage)); 2289 else if (anon_exclusive) 2290 entry = make_readable_exclusive_migration_entry( 2291 page_to_pfn(subpage)); 2292 else 2293 entry = make_readable_migration_entry( 2294 page_to_pfn(subpage)); 2295 if (pte_young(pteval)) 2296 entry = make_migration_entry_young(entry); 2297 if (pte_dirty(pteval)) 2298 entry = make_migration_entry_dirty(entry); 2299 swp_pte = swp_entry_to_pte(entry); 2300 if (pte_soft_dirty(pteval)) 2301 swp_pte = pte_swp_mksoft_dirty(swp_pte); 2302 if (pte_uffd_wp(pteval)) 2303 swp_pte = pte_swp_mkuffd_wp(swp_pte); 2304 if (folio_test_hugetlb(folio)) 2305 set_huge_pte_at(mm, address, pvmw.pte, swp_pte, 2306 hsz); 2307 else 2308 set_pte_at(mm, address, pvmw.pte, swp_pte); 2309 trace_set_migration_pte(address, pte_val(swp_pte), 2310 folio_order(folio)); 2311 /* 2312 * No need to invalidate here it will synchronize on 2313 * against the special swap migration pte. 2314 */ 2315 } 2316 2317 if (unlikely(folio_test_hugetlb(folio))) 2318 hugetlb_remove_rmap(folio); 2319 else 2320 folio_remove_rmap_pte(folio, subpage, vma); 2321 if (vma->vm_flags & VM_LOCKED) 2322 mlock_drain_local(); 2323 2324 if (!should_defer_flush(mm, flags)) { 2325 if (folio_test_hugetlb(folio)) > 2326 flush_hugetlb_page(vma, address); 2327 else 2328 flush_tlb_page(vma, address); 2329 } 2330 2331 folio_put(folio); 2332 } 2333 2334 mmu_notifier_invalidate_range_end(&range); 2335 2336 return ret; 2337 } 2338
On 12/20/24 10:07, Dev Jain wrote: > > On 20/12/24 9:02 am, Baolin Wang wrote: >> >> >> On 2024/12/20 11:12, Donet Tom wrote: >>> >>> On 12/20/24 08:01, Baolin Wang wrote: >>>> >>>> >>>> On 2024/12/19 20:47, Donet Tom wrote: >>>>> The migration selftest is currently failing for shared anonymous >>>>> mappings due to a race condition. >>>>> >>>>> During migration, the source folio's PTE is unmapped by nuking the >>>>> PTE, flushing the TLB,and then marking the page for migration >>>>> (by creating the swap entries). The issue arises when, immediately >>>>> after the PTE is nuked and the TLB is flushed, but before the page >>>>> is marked for migration, another thread accesses the page. This >>>>> triggers a page fault, and the page fault handler invokes >>>>> do_pte_missing() instead of do_swap_page(), as the page is not yet >>>>> marked for migration. >>>>> >>>>> In the fault handling path, do_pte_missing() calls __do_fault() >>>>> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry(). >>>>> This eventually calls folio_try_get(), incrementing the reference >>>>> count of the folio undergoing migration. The thread then blocks >>>>> on folio_lock(), as the migration path holds the lock. This >>>>> results in the migration failing in __migrate_folio(), which expects >>>>> the folio's reference count to be 2. However, the reference count is >>>>> incremented by the fault handler, leading to the failure. >>>>> >>>>> The issue arises because, after nuking the PTE and before marking the >>>>> page for migration, the page is accessed. To address this, we have >>>>> updated the logic to first nuke the PTE, then mark the page for >>>>> migration, and only then flush the TLB. With this patch, If the >>>>> page is >>>>> accessed immediately after nuking the PTE, the TLB entry is still >>>>> valid, so no fault occurs. After marking the page for migration, >>>> >>>> IMO, I don't think this assumption is correct. At this point, the >>>> TLB entry might also be evicted, so a page fault could still occur. >>>> It's just a matter of probability. >>> In this patch, we mark the page for migration before flushing the TLB. >>> This ensures that if someone accesses the page after the TLB flush, >>> the page fault will occur and in the page fault handler will wait >>> for the >>> migration to complete. So migration will not fail >>> >>> Without this patch, if someone accesses the page after the TLB flush >>> but before it is marked for migration, the migration will fail. >> >> Actually my concern is the same as David's (I did not see David's >> reply before sending my comments), which is that your patch does not >> "rules out all cases". > > > I like this solution but really the proper solution for this one was > to atomically set the migration entry IMHO. > Thank you Dev. I will check and come back on this. > >> >>>> Additionally, IIUC, if another thread is accessing the shmem folio >>>> causing the migration to fail, I think this is expected, and >>>> migration failure is not a vital issue? >>>> >>> In my case, the shmem migration test is always failing, >>> even after retries. Would it be correct to consider this >>> as expected behavior? >> >> IMHO I think your test case is too aggressive and unlikely to occur >> in real-world scenarios. Additionally, as I mentioned, migration >> failure is not a vital issue in the system, and some temporary refcnt >> can also lead to migration failure if you want to create such test >> cases. So personally, I don't think it is worthy doing. > > Agreed, AFAIR the test case starts faulting exactly on those pages > which we want to migrate, making this a very artificial scenario. >
On 12/20/24 15:41, David Hildenbrand wrote: > On 20.12.24 03:55, Donet Tom wrote: >> >> On 12/19/24 18:28, David Hildenbrand wrote: >>> On 19.12.24 13:47, Donet Tom wrote: >>>> The migration selftest is currently failing for shared anonymous >>>> mappings due to a race condition. >>>> >>>> During migration, the source folio's PTE is unmapped by nuking the >>>> PTE, flushing the TLB,and then marking the page for migration >>>> (by creating the swap entries). The issue arises when, immediately >>>> after the PTE is nuked and the TLB is flushed, but before the page >>>> is marked for migration, another thread accesses the page. This >>>> triggers a page fault, and the page fault handler invokes >>>> do_pte_missing() instead of do_swap_page(), as the page is not yet >>>> marked for migration. >>>> >>>> In the fault handling path, do_pte_missing() calls __do_fault() >>>> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry(). >>>> This eventually calls folio_try_get(), incrementing the reference >>>> count of the folio undergoing migration. The thread then blocks >>>> on folio_lock(), as the migration path holds the lock. This >>>> results in the migration failing in __migrate_folio(), which expects >>>> the folio's reference count to be 2. However, the reference count is >>>> incremented by the fault handler, leading to the failure. >>>> >>>> The issue arises because, after nuking the PTE and before marking the >>>> page for migration, the page is accessed. To address this, we have >>>> updated the logic to first nuke the PTE, then mark the page for >>>> migration, and only then flush the TLB. With this patch, If the >>>> page is >>>> accessed immediately after nuking the PTE, the TLB entry is still >>>> valid, so no fault occurs. >>> >>> But what about if the PTE is not in the TLB yet, and you get an access >>> from another CPU just after clearing the PTE (but not flushing the >>> TLB)? The other CPU will still observe PTE=none, trigger a fault etc. >>> >> Yes, in this scenario, the migration will fail. Do you think the >> migration test >> failure, even after a retry, should be considered a major issue that >> must be fixed? > > I think it is something we should definitely improve, but I think our > page migration should handle this in a better way, not the unmap > logic. I recall we discussed with Dev some ideas on how to improve that? > > > I'm pretty sure one can trigger similar case using a tmpfs file and > using read/write in a loop instead of memory access -> page faults. So > where racing with page faults is completely out of the picture. Thank you David. I will try this scenario as well and come back with some ideas for improvement.
diff --git a/mm/rmap.c b/mm/rmap.c index c6c4d4ea29a7..920ae46e977f 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -2154,7 +2154,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, hugetlb_vma_unlock_write(vma); } /* Nuke the hugetlb page table entry */ - pteval = huge_ptep_clear_flush(vma, address, pvmw.pte); + pteval = huge_ptep_get_and_clear(mm, address, pvmw.pte); } else { flush_cache_page(vma, address, pfn); /* Nuke the page table entry. */ @@ -2171,7 +2171,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, set_tlb_ubc_flush_pending(mm, pteval, address); } else { - pteval = ptep_clear_flush(vma, address, pvmw.pte); + pteval = ptep_get_and_clear(mm, address, pvmw.pte); } } @@ -2320,6 +2320,14 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, folio_remove_rmap_pte(folio, subpage, vma); if (vma->vm_flags & VM_LOCKED) mlock_drain_local(); + + if (!should_defer_flush(mm, flags)) { + if (folio_test_hugetlb(folio)) + flush_hugetlb_page(vma, address); + else + flush_tlb_page(vma, address); + } + folio_put(folio); }
The migration selftest is currently failing for shared anonymous mappings due to a race condition. During migration, the source folio's PTE is unmapped by nuking the PTE, flushing the TLB,and then marking the page for migration (by creating the swap entries). The issue arises when, immediately after the PTE is nuked and the TLB is flushed, but before the page is marked for migration, another thread accesses the page. This triggers a page fault, and the page fault handler invokes do_pte_missing() instead of do_swap_page(), as the page is not yet marked for migration. In the fault handling path, do_pte_missing() calls __do_fault() ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry(). This eventually calls folio_try_get(), incrementing the reference count of the folio undergoing migration. The thread then blocks on folio_lock(), as the migration path holds the lock. This results in the migration failing in __migrate_folio(), which expects the folio's reference count to be 2. However, the reference count is incremented by the fault handler, leading to the failure. The issue arises because, after nuking the PTE and before marking the page for migration, the page is accessed. To address this, we have updated the logic to first nuke the PTE, then mark the page for migration, and only then flush the TLB. With this patch, If the page is accessed immediately after nuking the PTE, the TLB entry is still valid, so no fault occurs. After marking the page for migration, flushing the TLB ensures that the next page fault correctly triggers do_swap_page() and waits for the migration to complete. Test Result without this patch ============================== # ./tools/testing/selftests/mm/migration TAP version 13 1..3 # Starting 3 tests from 1 test cases. # RUN migration.private_anon ... # OK migration.private_anon ok 1 migration.private_anon # RUN migration.shared_anon ... Didn't migrate 1 pages # migration.c:175:shared_anon:Expected migrate(ptr, self->n1, self->n2) (-2) == 0 (0) # shared_anon: Test terminated by assertion # FAIL migration.shared_anon not ok 2 migration.shared_anon # RUN migration.private_anon_thp ... # OK migration.private_anon_thp ok 3 migration.private_anon_thp # FAILED: 2 / 3 tests passed. # Totals: pass:2 fail:1 xfail:0 xpass:0 skip:0 error:0 Test result with this patch =========================== # ./tools/testing/selftests/mm/migration TAP version 13 1..3 # Starting 3 tests from 1 test cases. # RUN migration.private_anon ... # OK migration.private_anon ok 1 migration.private_anon # RUN migration.shared_anon ... # OK migration.shared_anon ok 2 migration.shared_anon # RUN migration.private_anon_thp ... # OK migration.private_anon_thp ok 3 migration.private_anon_thp # PASSED: 3 / 3 tests passed. # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0 Signed-off-by: Donet Tom <donettom@linux.ibm.com> --- mm/rmap.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-)