diff mbox series

mm: migration :shared anonymous migration test is failing

Message ID 20241219124717.4907-1-donettom@linux.ibm.com (mailing list archive)
State New
Headers show
Series mm: migration :shared anonymous migration test is failing | expand

Commit Message

Donet Tom Dec. 19, 2024, 12:47 p.m. UTC
The migration selftest is currently failing for shared anonymous
mappings due to a race condition.

During migration, the source folio's PTE is unmapped by nuking the
PTE, flushing the TLB,and then marking the page for migration
(by creating the swap entries). The issue arises when, immediately
after the PTE is nuked and the TLB is flushed, but before the page
is marked for migration, another thread accesses the page. This
triggers a page fault, and the page fault handler invokes
do_pte_missing() instead of do_swap_page(), as the page is not yet
marked for migration.

In the fault handling path, do_pte_missing() calls __do_fault()
->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry().
This eventually calls folio_try_get(), incrementing the reference
count of the folio undergoing migration. The thread then blocks
on folio_lock(), as the migration path holds the lock. This
results in the migration failing in __migrate_folio(), which expects
the folio's reference count to be 2. However, the reference count is
incremented by the fault handler, leading to the failure.

The issue arises because, after nuking the PTE and before marking the
page for migration, the page is accessed. To address this, we have
updated the logic to first nuke the PTE, then mark the page for
migration, and only then flush the TLB. With this patch, If the page is
accessed immediately after nuking the PTE, the TLB entry is still
valid, so no fault occurs. After marking the page for migration,
flushing the TLB ensures that the next page fault correctly triggers
do_swap_page() and waits for the migration to complete.

Test Result without this patch
==============================
 # ./tools/testing/selftests/mm/migration
 TAP version 13
 1..3
 # Starting 3 tests from 1 test cases.
 #  RUN           migration.private_anon ...
 #            OK  migration.private_anon
 ok 1 migration.private_anon
 #  RUN           migration.shared_anon ...
 Didn't migrate 1 pages
 # migration.c:175:shared_anon:Expected migrate(ptr,
   self->n1, self->n2) (-2) == 0 (0)
 # shared_anon: Test terminated by assertion
 #          FAIL  migration.shared_anon
 not ok 2 migration.shared_anon
 #  RUN           migration.private_anon_thp ...
 #            OK  migration.private_anon_thp
 ok 3 migration.private_anon_thp
 # FAILED: 2 / 3 tests passed.
 # Totals: pass:2 fail:1 xfail:0 xpass:0 skip:0 error:0

Test result with this patch
===========================
 # ./tools/testing/selftests/mm/migration
 TAP version 13
 1..3
 # Starting 3 tests from 1 test cases.
 #  RUN           migration.private_anon ...
 #            OK  migration.private_anon
 ok 1 migration.private_anon
 #  RUN           migration.shared_anon ...
 #            OK  migration.shared_anon
 ok 2 migration.shared_anon
 #  RUN           migration.private_anon_thp ...
 #            OK  migration.private_anon_thp
 ok 3 migration.private_anon_thp
 # PASSED: 3 / 3 tests passed.
 # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0

Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
 mm/rmap.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Comments

David Hildenbrand Dec. 19, 2024, 12:55 p.m. UTC | #1
On 19.12.24 13:47, Donet Tom wrote:
> The migration selftest is currently failing for shared anonymous
> mappings due to a race condition.
> 
> During migration, the source folio's PTE is unmapped by nuking the
> PTE, flushing the TLB,and then marking the page for migration
> (by creating the swap entries). The issue arises when, immediately
> after the PTE is nuked and the TLB is flushed, but before the page
> is marked for migration, another thread accesses the page. This
> triggers a page fault, and the page fault handler invokes
> do_pte_missing() instead of do_swap_page(), as the page is not yet
> marked for migration.
> 
> In the fault handling path, do_pte_missing() calls __do_fault()
> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry().
> This eventually calls folio_try_get(), incrementing the reference
> count of the folio undergoing migration. The thread then blocks
> on folio_lock(), as the migration path holds the lock. This
> results in the migration failing in __migrate_folio(), which expects
> the folio's reference count to be 2. However, the reference count is
> incremented by the fault handler, leading to the failure.
> 
> The issue arises because, after nuking the PTE and before marking the
> page for migration, the page is accessed. To address this, we have
> updated the logic to first nuke the PTE, then mark the page for
> migration, and only then flush the TLB. With this patch, If the page is
> accessed immediately after nuking the PTE, the TLB entry is still
> valid, so no fault occurs. After marking the page for migration,
> flushing the TLB ensures that the next page fault correctly triggers
> do_swap_page() and waits for the migration to complete.
> 

Does this reproduce with

commit 536ab838a5b37b6ae3f8d53552560b7c51daeb41
Author: Dev Jain <dev.jain@arm.com>
Date:   Fri Aug 30 10:46:09 2024 +0530

     selftests/mm: relax test to fail after 100 migration failures
     
     It was recently observed at [1] that during the folio unmapping stage of
     migration, when the PTEs are cleared, a racing thread faulting on that
     folio may increase the refcount of the folio, sleep on the folio lock (the
     migration path has the lock), and migration ultimately fails when
     asserting the actual refcount against the expected.  Thereby, the
     migration selftest fails on shared-anon mappings.  The above enforces the
     fact that migration is a best-effort service, therefore, it is wrong to
     fail the test for just a single failure; hence, fail the test after 100
     consecutive failures (where 100 is still a subjective choice).  Note that,
     this has no effect on the execution time of the test since that is
     controlled by a timeout.
     
     [1] https://lore.kernel.org/all/20240801081657.1386743-1-dev.jain@arm.com/
     

part of 6.12?


As part of that discussion, we discussed alternatives, such as
retrying migration more often internally.
David Hildenbrand Dec. 19, 2024, 12:58 p.m. UTC | #2
On 19.12.24 13:47, Donet Tom wrote:
> The migration selftest is currently failing for shared anonymous
> mappings due to a race condition.
> 
> During migration, the source folio's PTE is unmapped by nuking the
> PTE, flushing the TLB,and then marking the page for migration
> (by creating the swap entries). The issue arises when, immediately
> after the PTE is nuked and the TLB is flushed, but before the page
> is marked for migration, another thread accesses the page. This
> triggers a page fault, and the page fault handler invokes
> do_pte_missing() instead of do_swap_page(), as the page is not yet
> marked for migration.
> 
> In the fault handling path, do_pte_missing() calls __do_fault()
> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry().
> This eventually calls folio_try_get(), incrementing the reference
> count of the folio undergoing migration. The thread then blocks
> on folio_lock(), as the migration path holds the lock. This
> results in the migration failing in __migrate_folio(), which expects
> the folio's reference count to be 2. However, the reference count is
> incremented by the fault handler, leading to the failure.
> 
> The issue arises because, after nuking the PTE and before marking the
> page for migration, the page is accessed. To address this, we have
> updated the logic to first nuke the PTE, then mark the page for
> migration, and only then flush the TLB. With this patch, If the page is
> accessed immediately after nuking the PTE, the TLB entry is still
> valid, so no fault occurs.

But what about if the PTE is not in the TLB yet, and you get an access 
from another CPU just after clearing the PTE (but not flushing the TLB)? 
The other CPU will still observe PTE=none, trigger a fault etc.

So I don't think what you propose rules out all cases.
Donet Tom Dec. 20, 2024, 2:16 a.m. UTC | #3
On 12/19/24 18:25, David Hildenbrand wrote:
> On 19.12.24 13:47, Donet Tom wrote:
>> The migration selftest is currently failing for shared anonymous
>> mappings due to a race condition.
>>
>> During migration, the source folio's PTE is unmapped by nuking the
>> PTE, flushing the TLB,and then marking the page for migration
>> (by creating the swap entries). The issue arises when, immediately
>> after the PTE is nuked and the TLB is flushed, but before the page
>> is marked for migration, another thread accesses the page. This
>> triggers a page fault, and the page fault handler invokes
>> do_pte_missing() instead of do_swap_page(), as the page is not yet
>> marked for migration.
>>
>> In the fault handling path, do_pte_missing() calls __do_fault()
>> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry().
>> This eventually calls folio_try_get(), incrementing the reference
>> count of the folio undergoing migration. The thread then blocks
>> on folio_lock(), as the migration path holds the lock. This
>> results in the migration failing in __migrate_folio(), which expects
>> the folio's reference count to be 2. However, the reference count is
>> incremented by the fault handler, leading to the failure.
>>
>> The issue arises because, after nuking the PTE and before marking the
>> page for migration, the page is accessed. To address this, we have
>> updated the logic to first nuke the PTE, then mark the page for
>> migration, and only then flush the TLB. With this patch, If the page is
>> accessed immediately after nuking the PTE, the TLB entry is still
>> valid, so no fault occurs. After marking the page for migration,
>> flushing the TLB ensures that the next page fault correctly triggers
>> do_swap_page() and waits for the migration to complete.
>>
>
> Does this reproduce with
>
> commit 536ab838a5b37b6ae3f8d53552560b7c51daeb41
> Author: Dev Jain <dev.jain@arm.com>
> Date:   Fri Aug 30 10:46:09 2024 +0530
>
>     selftests/mm: relax test to fail after 100 migration failures
>         It was recently observed at [1] that during the folio 
> unmapping stage of
>     migration, when the PTEs are cleared, a racing thread faulting on 
> that
>     folio may increase the refcount of the folio, sleep on the folio 
> lock (the
>     migration path has the lock), and migration ultimately fails when
>     asserting the actual refcount against the expected.  Thereby, the
>     migration selftest fails on shared-anon mappings.  The above 
> enforces the
>     fact that migration is a best-effort service, therefore, it is 
> wrong to
>     fail the test for just a single failure; hence, fail the test 
> after 100
>     consecutive failures (where 100 is still a subjective choice).  
> Note that,
>     this has no effect on the execution time of the test since that is
>     controlled by a timeout.
>         [1] 
> https://lore.kernel.org/all/20240801081657.1386743-1-dev.jain@arm.com/
>
> part of 6.12?
>
>
> As part of that discussion, we discussed alternatives, such as
> retrying migration more often internally.
>
I was trying with this patch and was able to recreate the issue on PowerPC.
I tried increasing the retry count, but the test was still failing.
Baolin Wang Dec. 20, 2024, 2:31 a.m. UTC | #4
On 2024/12/19 20:47, Donet Tom wrote:
> The migration selftest is currently failing for shared anonymous
> mappings due to a race condition.
> 
> During migration, the source folio's PTE is unmapped by nuking the
> PTE, flushing the TLB,and then marking the page for migration
> (by creating the swap entries). The issue arises when, immediately
> after the PTE is nuked and the TLB is flushed, but before the page
> is marked for migration, another thread accesses the page. This
> triggers a page fault, and the page fault handler invokes
> do_pte_missing() instead of do_swap_page(), as the page is not yet
> marked for migration.
> 
> In the fault handling path, do_pte_missing() calls __do_fault()
> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry().
> This eventually calls folio_try_get(), incrementing the reference
> count of the folio undergoing migration. The thread then blocks
> on folio_lock(), as the migration path holds the lock. This
> results in the migration failing in __migrate_folio(), which expects
> the folio's reference count to be 2. However, the reference count is
> incremented by the fault handler, leading to the failure.
> 
> The issue arises because, after nuking the PTE and before marking the
> page for migration, the page is accessed. To address this, we have
> updated the logic to first nuke the PTE, then mark the page for
> migration, and only then flush the TLB. With this patch, If the page is
> accessed immediately after nuking the PTE, the TLB entry is still
> valid, so no fault occurs. After marking the page for migration,

IMO, I don't think this assumption is correct. At this point, the TLB 
entry might also be evicted, so a page fault could still occur. It's 
just a matter of probability.

Additionally, IIUC, if another thread is accessing the shmem folio 
causing the migration to fail, I think this is expected, and migration 
failure is not a vital issue?
Donet Tom Dec. 20, 2024, 2:55 a.m. UTC | #5
On 12/19/24 18:28, David Hildenbrand wrote:
> On 19.12.24 13:47, Donet Tom wrote:
>> The migration selftest is currently failing for shared anonymous
>> mappings due to a race condition.
>>
>> During migration, the source folio's PTE is unmapped by nuking the
>> PTE, flushing the TLB,and then marking the page for migration
>> (by creating the swap entries). The issue arises when, immediately
>> after the PTE is nuked and the TLB is flushed, but before the page
>> is marked for migration, another thread accesses the page. This
>> triggers a page fault, and the page fault handler invokes
>> do_pte_missing() instead of do_swap_page(), as the page is not yet
>> marked for migration.
>>
>> In the fault handling path, do_pte_missing() calls __do_fault()
>> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry().
>> This eventually calls folio_try_get(), incrementing the reference
>> count of the folio undergoing migration. The thread then blocks
>> on folio_lock(), as the migration path holds the lock. This
>> results in the migration failing in __migrate_folio(), which expects
>> the folio's reference count to be 2. However, the reference count is
>> incremented by the fault handler, leading to the failure.
>>
>> The issue arises because, after nuking the PTE and before marking the
>> page for migration, the page is accessed. To address this, we have
>> updated the logic to first nuke the PTE, then mark the page for
>> migration, and only then flush the TLB. With this patch, If the page is
>> accessed immediately after nuking the PTE, the TLB entry is still
>> valid, so no fault occurs.
>
> But what about if the PTE is not in the TLB yet, and you get an access 
> from another CPU just after clearing the PTE (but not flushing the 
> TLB)? The other CPU will still observe PTE=none, trigger a fault etc.
>
Yes, in this scenario, the migration will fail. Do you think the 
migration test
failure, even after a retry, should be considered a major issue that 
must be fixed?


> So I don't think what you propose rules out all cases.
>
Donet Tom Dec. 20, 2024, 3:12 a.m. UTC | #6
On 12/20/24 08:01, Baolin Wang wrote:
>
>
> On 2024/12/19 20:47, Donet Tom wrote:
>> The migration selftest is currently failing for shared anonymous
>> mappings due to a race condition.
>>
>> During migration, the source folio's PTE is unmapped by nuking the
>> PTE, flushing the TLB,and then marking the page for migration
>> (by creating the swap entries). The issue arises when, immediately
>> after the PTE is nuked and the TLB is flushed, but before the page
>> is marked for migration, another thread accesses the page. This
>> triggers a page fault, and the page fault handler invokes
>> do_pte_missing() instead of do_swap_page(), as the page is not yet
>> marked for migration.
>>
>> In the fault handling path, do_pte_missing() calls __do_fault()
>> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry().
>> This eventually calls folio_try_get(), incrementing the reference
>> count of the folio undergoing migration. The thread then blocks
>> on folio_lock(), as the migration path holds the lock. This
>> results in the migration failing in __migrate_folio(), which expects
>> the folio's reference count to be 2. However, the reference count is
>> incremented by the fault handler, leading to the failure.
>>
>> The issue arises because, after nuking the PTE and before marking the
>> page for migration, the page is accessed. To address this, we have
>> updated the logic to first nuke the PTE, then mark the page for
>> migration, and only then flush the TLB. With this patch, If the page is
>> accessed immediately after nuking the PTE, the TLB entry is still
>> valid, so no fault occurs. After marking the page for migration,
>
> IMO, I don't think this assumption is correct. At this point, the TLB 
> entry might also be evicted, so a page fault could still occur. It's 
> just a matter of probability.
In this patch, we mark the page for migration before flushing the TLB.
This ensures that if someone accesses the page after the TLB flush,
the page fault will occur and in the page fault handler will wait for the
migration to complete. So migration will not fail

Without this patch, if someone accesses the page after the TLB flush
but before it is marked for migration, the migration will fail.
>
> Additionally, IIUC, if another thread is accessing the shmem folio 
> causing the migration to fail, I think this is expected, and migration 
> failure is not a vital issue?
>
In my case, the shmem migration test is always failing,
even after retries. Would it be correct to consider this
as expected behavior?
Baolin Wang Dec. 20, 2024, 3:32 a.m. UTC | #7
On 2024/12/20 11:12, Donet Tom wrote:
> 
> On 12/20/24 08:01, Baolin Wang wrote:
>>
>>
>> On 2024/12/19 20:47, Donet Tom wrote:
>>> The migration selftest is currently failing for shared anonymous
>>> mappings due to a race condition.
>>>
>>> During migration, the source folio's PTE is unmapped by nuking the
>>> PTE, flushing the TLB,and then marking the page for migration
>>> (by creating the swap entries). The issue arises when, immediately
>>> after the PTE is nuked and the TLB is flushed, but before the page
>>> is marked for migration, another thread accesses the page. This
>>> triggers a page fault, and the page fault handler invokes
>>> do_pte_missing() instead of do_swap_page(), as the page is not yet
>>> marked for migration.
>>>
>>> In the fault handling path, do_pte_missing() calls __do_fault()
>>> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry().
>>> This eventually calls folio_try_get(), incrementing the reference
>>> count of the folio undergoing migration. The thread then blocks
>>> on folio_lock(), as the migration path holds the lock. This
>>> results in the migration failing in __migrate_folio(), which expects
>>> the folio's reference count to be 2. However, the reference count is
>>> incremented by the fault handler, leading to the failure.
>>>
>>> The issue arises because, after nuking the PTE and before marking the
>>> page for migration, the page is accessed. To address this, we have
>>> updated the logic to first nuke the PTE, then mark the page for
>>> migration, and only then flush the TLB. With this patch, If the page is
>>> accessed immediately after nuking the PTE, the TLB entry is still
>>> valid, so no fault occurs. After marking the page for migration,
>>
>> IMO, I don't think this assumption is correct. At this point, the TLB 
>> entry might also be evicted, so a page fault could still occur. It's 
>> just a matter of probability.
> In this patch, we mark the page for migration before flushing the TLB.
> This ensures that if someone accesses the page after the TLB flush,
> the page fault will occur and in the page fault handler will wait for the
> migration to complete. So migration will not fail
> 
> Without this patch, if someone accesses the page after the TLB flush
> but before it is marked for migration, the migration will fail.

Actually my concern is the same as David's (I did not see David's reply 
before sending my comments), which is that your patch does not "rules 
out all cases".

>> Additionally, IIUC, if another thread is accessing the shmem folio 
>> causing the migration to fail, I think this is expected, and migration 
>> failure is not a vital issue?
>>
> In my case, the shmem migration test is always failing,
> even after retries. Would it be correct to consider this
> as expected behavior?

IMHO I think your test case is too aggressive and unlikely to occur in 
real-world scenarios. Additionally, as I mentioned, migration failure is 
not a vital issue in the system, and some temporary refcnt can also lead 
to migration failure if you want to create such test cases. So 
personally, I don't think it is worthy doing.
Donet Tom Dec. 20, 2024, 4:30 a.m. UTC | #8
On 12/20/24 09:02, Baolin Wang wrote:
>
>
> On 2024/12/20 11:12, Donet Tom wrote:
>>
>> On 12/20/24 08:01, Baolin Wang wrote:
>>>
>>>
>>> On 2024/12/19 20:47, Donet Tom wrote:
>>>> The migration selftest is currently failing for shared anonymous
>>>> mappings due to a race condition.
>>>>
>>>> During migration, the source folio's PTE is unmapped by nuking the
>>>> PTE, flushing the TLB,and then marking the page for migration
>>>> (by creating the swap entries). The issue arises when, immediately
>>>> after the PTE is nuked and the TLB is flushed, but before the page
>>>> is marked for migration, another thread accesses the page. This
>>>> triggers a page fault, and the page fault handler invokes
>>>> do_pte_missing() instead of do_swap_page(), as the page is not yet
>>>> marked for migration.
>>>>
>>>> In the fault handling path, do_pte_missing() calls __do_fault()
>>>> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry().
>>>> This eventually calls folio_try_get(), incrementing the reference
>>>> count of the folio undergoing migration. The thread then blocks
>>>> on folio_lock(), as the migration path holds the lock. This
>>>> results in the migration failing in __migrate_folio(), which expects
>>>> the folio's reference count to be 2. However, the reference count is
>>>> incremented by the fault handler, leading to the failure.
>>>>
>>>> The issue arises because, after nuking the PTE and before marking the
>>>> page for migration, the page is accessed. To address this, we have
>>>> updated the logic to first nuke the PTE, then mark the page for
>>>> migration, and only then flush the TLB. With this patch, If the 
>>>> page is
>>>> accessed immediately after nuking the PTE, the TLB entry is still
>>>> valid, so no fault occurs. After marking the page for migration,
>>>
>>> IMO, I don't think this assumption is correct. At this point, the 
>>> TLB entry might also be evicted, so a page fault could still occur. 
>>> It's just a matter of probability.
>> In this patch, we mark the page for migration before flushing the TLB.
>> This ensures that if someone accesses the page after the TLB flush,
>> the page fault will occur and in the page fault handler will wait for 
>> the
>> migration to complete. So migration will not fail
>>
>> Without this patch, if someone accesses the page after the TLB flush
>> but before it is marked for migration, the migration will fail.
>
> Actually my concern is the same as David's (I did not see David's 
> reply before sending my comments), which is that your patch does not 
> "rules out all cases".
>
>>> Additionally, IIUC, if another thread is accessing the shmem folio 
>>> causing the migration to fail, I think this is expected, and 
>>> migration failure is not a vital issue?
>>>
>> In my case, the shmem migration test is always failing,
>> even after retries. Would it be correct to consider this
>> as expected behavior?
>
> IMHO I think your test case is too aggressive and unlikely to occur in 
> real-world scenarios. Additionally, as I mentioned, migration failure 
> is not a vital issue in the system, and some temporary refcnt can also 
> lead to migration failure if you want to create such test cases. So 
> personally, I don't think it is worthy doing.
Sure. Thank you Baolin.
Dev Jain Dec. 20, 2024, 4:37 a.m. UTC | #9
On 20/12/24 9:02 am, Baolin Wang wrote:
>
>
> On 2024/12/20 11:12, Donet Tom wrote:
>>
>> On 12/20/24 08:01, Baolin Wang wrote:
>>>
>>>
>>> On 2024/12/19 20:47, Donet Tom wrote:
>>>> The migration selftest is currently failing for shared anonymous
>>>> mappings due to a race condition.
>>>>
>>>> During migration, the source folio's PTE is unmapped by nuking the
>>>> PTE, flushing the TLB,and then marking the page for migration
>>>> (by creating the swap entries). The issue arises when, immediately
>>>> after the PTE is nuked and the TLB is flushed, but before the page
>>>> is marked for migration, another thread accesses the page. This
>>>> triggers a page fault, and the page fault handler invokes
>>>> do_pte_missing() instead of do_swap_page(), as the page is not yet
>>>> marked for migration.
>>>>
>>>> In the fault handling path, do_pte_missing() calls __do_fault()
>>>> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry().
>>>> This eventually calls folio_try_get(), incrementing the reference
>>>> count of the folio undergoing migration. The thread then blocks
>>>> on folio_lock(), as the migration path holds the lock. This
>>>> results in the migration failing in __migrate_folio(), which expects
>>>> the folio's reference count to be 2. However, the reference count is
>>>> incremented by the fault handler, leading to the failure.
>>>>
>>>> The issue arises because, after nuking the PTE and before marking the
>>>> page for migration, the page is accessed. To address this, we have
>>>> updated the logic to first nuke the PTE, then mark the page for
>>>> migration, and only then flush the TLB. With this patch, If the 
>>>> page is
>>>> accessed immediately after nuking the PTE, the TLB entry is still
>>>> valid, so no fault occurs. After marking the page for migration,
>>>
>>> IMO, I don't think this assumption is correct. At this point, the 
>>> TLB entry might also be evicted, so a page fault could still occur. 
>>> It's just a matter of probability.
>> In this patch, we mark the page for migration before flushing the TLB.
>> This ensures that if someone accesses the page after the TLB flush,
>> the page fault will occur and in the page fault handler will wait for 
>> the
>> migration to complete. So migration will not fail
>>
>> Without this patch, if someone accesses the page after the TLB flush
>> but before it is marked for migration, the migration will fail.
>
> Actually my concern is the same as David's (I did not see David's 
> reply before sending my comments), which is that your patch does not 
> "rules out all cases".


I like this solution but really the proper solution for this one was to 
atomically set the migration entry IMHO.

>
>>> Additionally, IIUC, if another thread is accessing the shmem folio 
>>> causing the migration to fail, I think this is expected, and 
>>> migration failure is not a vital issue?
>>>
>> In my case, the shmem migration test is always failing,
>> even after retries. Would it be correct to consider this
>> as expected behavior?
>
> IMHO I think your test case is too aggressive and unlikely to occur in 
> real-world scenarios. Additionally, as I mentioned, migration failure 
> is not a vital issue in the system, and some temporary refcnt can also 
> lead to migration failure if you want to create such test cases. So 
> personally, I don't think it is worthy doing.

Agreed, AFAIR the test case starts faulting exactly on those pages which 
we want to migrate, making this a very artificial scenario.
kernel test robot Dec. 20, 2024, 10:05 a.m. UTC | #10
Hi Donet,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Donet-Tom/mm-migration-shared-anonymous-migration-test-is-failing/20241219-204920
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20241219124717.4907-1-donettom%40linux.ibm.com
patch subject: [PATCH] mm: migration :shared anonymous migration test is failing
config: arc-randconfig-002-20241220 (https://download.01.org/0day-ci/archive/20241220/202412201738.gUNPwJ1x-lkp@intel.com/config)
compiler: arc-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241220/202412201738.gUNPwJ1x-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412201738.gUNPwJ1x-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from include/linux/migrate.h:8,
                    from mm/rmap.c:69:
   include/linux/hugetlb.h:1063:5: warning: no previous prototype for 'replace_free_hugepage_folios' [-Wmissing-prototypes]
    1063 | int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)
         |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
   mm/rmap.c: In function 'try_to_migrate_one':
>> mm/rmap.c:2157:34: error: implicit declaration of function 'huge_ptep_get_and_clear'; did you mean 'ptep_get_and_clear'? [-Werror=implicit-function-declaration]
    2157 |                         pteval = huge_ptep_get_and_clear(mm, address, pvmw.pte);
         |                                  ^~~~~~~~~~~~~~~~~~~~~~~
         |                                  ptep_get_and_clear
>> mm/rmap.c:2157:34: error: incompatible types when assigning to type 'pte_t' from type 'int'
>> mm/rmap.c:2326:33: error: implicit declaration of function 'flush_hugetlb_page'; did you mean 'flush_tlb_page'? [-Werror=implicit-function-declaration]
    2326 |                                 flush_hugetlb_page(vma, address);
         |                                 ^~~~~~~~~~~~~~~~~~
         |                                 flush_tlb_page
   cc1: some warnings being treated as errors


vim +2157 mm/rmap.c

  2081	
  2082			/* Unexpected PMD-mapped THP? */
  2083			VM_BUG_ON_FOLIO(!pvmw.pte, folio);
  2084	
  2085			pfn = pte_pfn(ptep_get(pvmw.pte));
  2086	
  2087			if (folio_is_zone_device(folio)) {
  2088				/*
  2089				 * Our PTE is a non-present device exclusive entry and
  2090				 * calculating the subpage as for the common case would
  2091				 * result in an invalid pointer.
  2092				 *
  2093				 * Since only PAGE_SIZE pages can currently be
  2094				 * migrated, just set it to page. This will need to be
  2095				 * changed when hugepage migrations to device private
  2096				 * memory are supported.
  2097				 */
  2098				VM_BUG_ON_FOLIO(folio_nr_pages(folio) > 1, folio);
  2099				subpage = &folio->page;
  2100			} else {
  2101				subpage = folio_page(folio, pfn - folio_pfn(folio));
  2102			}
  2103			address = pvmw.address;
  2104			anon_exclusive = folio_test_anon(folio) &&
  2105					 PageAnonExclusive(subpage);
  2106	
  2107			if (folio_test_hugetlb(folio)) {
  2108				bool anon = folio_test_anon(folio);
  2109	
  2110				/*
  2111				 * huge_pmd_unshare may unmap an entire PMD page.
  2112				 * There is no way of knowing exactly which PMDs may
  2113				 * be cached for this mm, so we must flush them all.
  2114				 * start/end were already adjusted above to cover this
  2115				 * range.
  2116				 */
  2117				flush_cache_range(vma, range.start, range.end);
  2118	
  2119				/*
  2120				 * To call huge_pmd_unshare, i_mmap_rwsem must be
  2121				 * held in write mode.  Caller needs to explicitly
  2122				 * do this outside rmap routines.
  2123				 *
  2124				 * We also must hold hugetlb vma_lock in write mode.
  2125				 * Lock order dictates acquiring vma_lock BEFORE
  2126				 * i_mmap_rwsem.  We can only try lock here and
  2127				 * fail if unsuccessful.
  2128				 */
  2129				if (!anon) {
  2130					VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
  2131					if (!hugetlb_vma_trylock_write(vma)) {
  2132						page_vma_mapped_walk_done(&pvmw);
  2133						ret = false;
  2134						break;
  2135					}
  2136					if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
  2137						hugetlb_vma_unlock_write(vma);
  2138						flush_tlb_range(vma,
  2139							range.start, range.end);
  2140	
  2141						/*
  2142						 * The ref count of the PMD page was
  2143						 * dropped which is part of the way map
  2144						 * counting is done for shared PMDs.
  2145						 * Return 'true' here.  When there is
  2146						 * no other sharing, huge_pmd_unshare
  2147						 * returns false and we will unmap the
  2148						 * actual page and drop map count
  2149						 * to zero.
  2150						 */
  2151						page_vma_mapped_walk_done(&pvmw);
  2152						break;
  2153					}
  2154					hugetlb_vma_unlock_write(vma);
  2155				}
  2156				/* Nuke the hugetlb page table entry */
> 2157				pteval = huge_ptep_get_and_clear(mm, address, pvmw.pte);
  2158			} else {
  2159				flush_cache_page(vma, address, pfn);
  2160				/* Nuke the page table entry. */
  2161				if (should_defer_flush(mm, flags)) {
  2162					/*
  2163					 * We clear the PTE but do not flush so potentially
  2164					 * a remote CPU could still be writing to the folio.
  2165					 * If the entry was previously clean then the
  2166					 * architecture must guarantee that a clear->dirty
  2167					 * transition on a cached TLB entry is written through
  2168					 * and traps if the PTE is unmapped.
  2169					 */
  2170					pteval = ptep_get_and_clear(mm, address, pvmw.pte);
  2171	
  2172					set_tlb_ubc_flush_pending(mm, pteval, address);
  2173				} else {
  2174					pteval = ptep_get_and_clear(mm, address, pvmw.pte);
  2175				}
  2176			}
  2177	
  2178			/* Set the dirty flag on the folio now the pte is gone. */
  2179			if (pte_dirty(pteval))
  2180				folio_mark_dirty(folio);
  2181	
  2182			/* Update high watermark before we lower rss */
  2183			update_hiwater_rss(mm);
  2184	
  2185			if (folio_is_device_private(folio)) {
  2186				unsigned long pfn = folio_pfn(folio);
  2187				swp_entry_t entry;
  2188				pte_t swp_pte;
  2189	
  2190				if (anon_exclusive)
  2191					WARN_ON_ONCE(folio_try_share_anon_rmap_pte(folio,
  2192										   subpage));
  2193	
  2194				/*
  2195				 * Store the pfn of the page in a special migration
  2196				 * pte. do_swap_page() will wait until the migration
  2197				 * pte is removed and then restart fault handling.
  2198				 */
  2199				entry = pte_to_swp_entry(pteval);
  2200				if (is_writable_device_private_entry(entry))
  2201					entry = make_writable_migration_entry(pfn);
  2202				else if (anon_exclusive)
  2203					entry = make_readable_exclusive_migration_entry(pfn);
  2204				else
  2205					entry = make_readable_migration_entry(pfn);
  2206				swp_pte = swp_entry_to_pte(entry);
  2207	
  2208				/*
  2209				 * pteval maps a zone device page and is therefore
  2210				 * a swap pte.
  2211				 */
  2212				if (pte_swp_soft_dirty(pteval))
  2213					swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2214				if (pte_swp_uffd_wp(pteval))
  2215					swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2216				set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
  2217				trace_set_migration_pte(pvmw.address, pte_val(swp_pte),
  2218							folio_order(folio));
  2219				/*
  2220				 * No need to invalidate here it will synchronize on
  2221				 * against the special swap migration pte.
  2222				 */
  2223			} else if (PageHWPoison(subpage)) {
  2224				pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
  2225				if (folio_test_hugetlb(folio)) {
  2226					hugetlb_count_sub(folio_nr_pages(folio), mm);
  2227					set_huge_pte_at(mm, address, pvmw.pte, pteval,
  2228							hsz);
  2229				} else {
  2230					dec_mm_counter(mm, mm_counter(folio));
  2231					set_pte_at(mm, address, pvmw.pte, pteval);
  2232				}
  2233	
  2234			} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
  2235				/*
  2236				 * The guest indicated that the page content is of no
  2237				 * interest anymore. Simply discard the pte, vmscan
  2238				 * will take care of the rest.
  2239				 * A future reference will then fault in a new zero
  2240				 * page. When userfaultfd is active, we must not drop
  2241				 * this page though, as its main user (postcopy
  2242				 * migration) will not expect userfaults on already
  2243				 * copied pages.
  2244				 */
  2245				dec_mm_counter(mm, mm_counter(folio));
  2246			} else {
  2247				swp_entry_t entry;
  2248				pte_t swp_pte;
  2249	
  2250				if (arch_unmap_one(mm, vma, address, pteval) < 0) {
  2251					if (folio_test_hugetlb(folio))
  2252						set_huge_pte_at(mm, address, pvmw.pte,
  2253								pteval, hsz);
  2254					else
  2255						set_pte_at(mm, address, pvmw.pte, pteval);
  2256					ret = false;
  2257					page_vma_mapped_walk_done(&pvmw);
  2258					break;
  2259				}
  2260				VM_BUG_ON_PAGE(pte_write(pteval) && folio_test_anon(folio) &&
  2261					       !anon_exclusive, subpage);
  2262	
  2263				/* See folio_try_share_anon_rmap_pte(): clear PTE first. */
  2264				if (folio_test_hugetlb(folio)) {
  2265					if (anon_exclusive &&
  2266					    hugetlb_try_share_anon_rmap(folio)) {
  2267						set_huge_pte_at(mm, address, pvmw.pte,
  2268								pteval, hsz);
  2269						ret = false;
  2270						page_vma_mapped_walk_done(&pvmw);
  2271						break;
  2272					}
  2273				} else if (anon_exclusive &&
  2274					   folio_try_share_anon_rmap_pte(folio, subpage)) {
  2275					set_pte_at(mm, address, pvmw.pte, pteval);
  2276					ret = false;
  2277					page_vma_mapped_walk_done(&pvmw);
  2278					break;
  2279				}
  2280	
  2281				/*
  2282				 * Store the pfn of the page in a special migration
  2283				 * pte. do_swap_page() will wait until the migration
  2284				 * pte is removed and then restart fault handling.
  2285				 */
  2286				if (pte_write(pteval))
  2287					entry = make_writable_migration_entry(
  2288								page_to_pfn(subpage));
  2289				else if (anon_exclusive)
  2290					entry = make_readable_exclusive_migration_entry(
  2291								page_to_pfn(subpage));
  2292				else
  2293					entry = make_readable_migration_entry(
  2294								page_to_pfn(subpage));
  2295				if (pte_young(pteval))
  2296					entry = make_migration_entry_young(entry);
  2297				if (pte_dirty(pteval))
  2298					entry = make_migration_entry_dirty(entry);
  2299				swp_pte = swp_entry_to_pte(entry);
  2300				if (pte_soft_dirty(pteval))
  2301					swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2302				if (pte_uffd_wp(pteval))
  2303					swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2304				if (folio_test_hugetlb(folio))
  2305					set_huge_pte_at(mm, address, pvmw.pte, swp_pte,
  2306							hsz);
  2307				else
  2308					set_pte_at(mm, address, pvmw.pte, swp_pte);
  2309				trace_set_migration_pte(address, pte_val(swp_pte),
  2310							folio_order(folio));
  2311				/*
  2312				 * No need to invalidate here it will synchronize on
  2313				 * against the special swap migration pte.
  2314				 */
  2315			}
  2316	
  2317			if (unlikely(folio_test_hugetlb(folio)))
  2318				hugetlb_remove_rmap(folio);
  2319			else
  2320				folio_remove_rmap_pte(folio, subpage, vma);
  2321			if (vma->vm_flags & VM_LOCKED)
  2322				mlock_drain_local();
  2323	
  2324			if (!should_defer_flush(mm, flags)) {
  2325				if (folio_test_hugetlb(folio))
> 2326					flush_hugetlb_page(vma, address);
  2327				else
  2328					flush_tlb_page(vma, address);
  2329			}
  2330	
  2331			folio_put(folio);
  2332		}
  2333	
  2334		mmu_notifier_invalidate_range_end(&range);
  2335	
  2336		return ret;
  2337	}
  2338
David Hildenbrand Dec. 20, 2024, 10:11 a.m. UTC | #11
On 20.12.24 03:55, Donet Tom wrote:
> 
> On 12/19/24 18:28, David Hildenbrand wrote:
>> On 19.12.24 13:47, Donet Tom wrote:
>>> The migration selftest is currently failing for shared anonymous
>>> mappings due to a race condition.
>>>
>>> During migration, the source folio's PTE is unmapped by nuking the
>>> PTE, flushing the TLB,and then marking the page for migration
>>> (by creating the swap entries). The issue arises when, immediately
>>> after the PTE is nuked and the TLB is flushed, but before the page
>>> is marked for migration, another thread accesses the page. This
>>> triggers a page fault, and the page fault handler invokes
>>> do_pte_missing() instead of do_swap_page(), as the page is not yet
>>> marked for migration.
>>>
>>> In the fault handling path, do_pte_missing() calls __do_fault()
>>> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry().
>>> This eventually calls folio_try_get(), incrementing the reference
>>> count of the folio undergoing migration. The thread then blocks
>>> on folio_lock(), as the migration path holds the lock. This
>>> results in the migration failing in __migrate_folio(), which expects
>>> the folio's reference count to be 2. However, the reference count is
>>> incremented by the fault handler, leading to the failure.
>>>
>>> The issue arises because, after nuking the PTE and before marking the
>>> page for migration, the page is accessed. To address this, we have
>>> updated the logic to first nuke the PTE, then mark the page for
>>> migration, and only then flush the TLB. With this patch, If the page is
>>> accessed immediately after nuking the PTE, the TLB entry is still
>>> valid, so no fault occurs.
>>
>> But what about if the PTE is not in the TLB yet, and you get an access
>> from another CPU just after clearing the PTE (but not flushing the
>> TLB)? The other CPU will still observe PTE=none, trigger a fault etc.
>>
> Yes, in this scenario, the migration will fail. Do you think the
> migration test
> failure, even after a retry, should be considered a major issue that
> must be fixed?

I think it is something we should definitely improve, but I think our 
page migration should handle this in a better way, not the unmap logic. 
I recall we discussed with Dev some ideas on how to improve that?


I'm pretty sure one can trigger similar case using a tmpfs file and 
using read/write in a loop instead of memory access -> page faults. So 
where racing with page faults is completely out of the picture.
kernel test robot Dec. 20, 2024, 10:17 a.m. UTC | #12
Hi Donet,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Donet-Tom/mm-migration-shared-anonymous-migration-test-is-failing/20241219-204920
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20241219124717.4907-1-donettom%40linux.ibm.com
patch subject: [PATCH] mm: migration :shared anonymous migration test is failing
config: arm-randconfig-001-20241220 (https://download.01.org/0day-ci/archive/20241220/202412201828.3GvmHte5-lkp@intel.com/config)
compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241220/202412201828.3GvmHte5-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412201828.3GvmHte5-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from mm/rmap.c:69:
   In file included from include/linux/migrate.h:8:
   include/linux/hugetlb.h:1063:5: warning: no previous prototype for function 'replace_free_hugepage_folios' [-Wmissing-prototypes]
    1063 | int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)
         |     ^
   include/linux/hugetlb.h:1063:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
    1063 | int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)
         | ^
         | static 
   In file included from mm/rmap.c:76:
   include/linux/mm_inline.h:47:41: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
      47 |         __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
         |                                    ~~~~~~~~~~~ ^ ~~~
   include/linux/mm_inline.h:49:22: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
      49 |                                 NR_ZONE_LRU_BASE + lru, nr_pages);
         |                                 ~~~~~~~~~~~~~~~~ ^ ~~~
>> mm/rmap.c:2157:13: error: call to undeclared function 'huge_ptep_get_and_clear'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    2157 |                         pteval = huge_ptep_get_and_clear(mm, address, pvmw.pte);
         |                                  ^
   mm/rmap.c:2157:13: note: did you mean 'ptep_get_and_clear'?
   include/linux/pgtable.h:478:21: note: 'ptep_get_and_clear' declared here
     478 | static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
         |                     ^
>> mm/rmap.c:2326:5: error: call to undeclared function 'flush_hugetlb_page'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    2326 |                                 flush_hugetlb_page(vma, address);
         |                                 ^
   mm/rmap.c:2326:5: note: did you mean 'is_vm_hugetlb_page'?
   include/linux/hugetlb_inline.h:16:20: note: 'is_vm_hugetlb_page' declared here
      16 | static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
         |                    ^
   3 warnings and 2 errors generated.


vim +/huge_ptep_get_and_clear +2157 mm/rmap.c

  2081	
  2082			/* Unexpected PMD-mapped THP? */
  2083			VM_BUG_ON_FOLIO(!pvmw.pte, folio);
  2084	
  2085			pfn = pte_pfn(ptep_get(pvmw.pte));
  2086	
  2087			if (folio_is_zone_device(folio)) {
  2088				/*
  2089				 * Our PTE is a non-present device exclusive entry and
  2090				 * calculating the subpage as for the common case would
  2091				 * result in an invalid pointer.
  2092				 *
  2093				 * Since only PAGE_SIZE pages can currently be
  2094				 * migrated, just set it to page. This will need to be
  2095				 * changed when hugepage migrations to device private
  2096				 * memory are supported.
  2097				 */
  2098				VM_BUG_ON_FOLIO(folio_nr_pages(folio) > 1, folio);
  2099				subpage = &folio->page;
  2100			} else {
  2101				subpage = folio_page(folio, pfn - folio_pfn(folio));
  2102			}
  2103			address = pvmw.address;
  2104			anon_exclusive = folio_test_anon(folio) &&
  2105					 PageAnonExclusive(subpage);
  2106	
  2107			if (folio_test_hugetlb(folio)) {
  2108				bool anon = folio_test_anon(folio);
  2109	
  2110				/*
  2111				 * huge_pmd_unshare may unmap an entire PMD page.
  2112				 * There is no way of knowing exactly which PMDs may
  2113				 * be cached for this mm, so we must flush them all.
  2114				 * start/end were already adjusted above to cover this
  2115				 * range.
  2116				 */
  2117				flush_cache_range(vma, range.start, range.end);
  2118	
  2119				/*
  2120				 * To call huge_pmd_unshare, i_mmap_rwsem must be
  2121				 * held in write mode.  Caller needs to explicitly
  2122				 * do this outside rmap routines.
  2123				 *
  2124				 * We also must hold hugetlb vma_lock in write mode.
  2125				 * Lock order dictates acquiring vma_lock BEFORE
  2126				 * i_mmap_rwsem.  We can only try lock here and
  2127				 * fail if unsuccessful.
  2128				 */
  2129				if (!anon) {
  2130					VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
  2131					if (!hugetlb_vma_trylock_write(vma)) {
  2132						page_vma_mapped_walk_done(&pvmw);
  2133						ret = false;
  2134						break;
  2135					}
  2136					if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
  2137						hugetlb_vma_unlock_write(vma);
  2138						flush_tlb_range(vma,
  2139							range.start, range.end);
  2140	
  2141						/*
  2142						 * The ref count of the PMD page was
  2143						 * dropped which is part of the way map
  2144						 * counting is done for shared PMDs.
  2145						 * Return 'true' here.  When there is
  2146						 * no other sharing, huge_pmd_unshare
  2147						 * returns false and we will unmap the
  2148						 * actual page and drop map count
  2149						 * to zero.
  2150						 */
  2151						page_vma_mapped_walk_done(&pvmw);
  2152						break;
  2153					}
  2154					hugetlb_vma_unlock_write(vma);
  2155				}
  2156				/* Nuke the hugetlb page table entry */
> 2157				pteval = huge_ptep_get_and_clear(mm, address, pvmw.pte);
  2158			} else {
  2159				flush_cache_page(vma, address, pfn);
  2160				/* Nuke the page table entry. */
  2161				if (should_defer_flush(mm, flags)) {
  2162					/*
  2163					 * We clear the PTE but do not flush so potentially
  2164					 * a remote CPU could still be writing to the folio.
  2165					 * If the entry was previously clean then the
  2166					 * architecture must guarantee that a clear->dirty
  2167					 * transition on a cached TLB entry is written through
  2168					 * and traps if the PTE is unmapped.
  2169					 */
  2170					pteval = ptep_get_and_clear(mm, address, pvmw.pte);
  2171	
  2172					set_tlb_ubc_flush_pending(mm, pteval, address);
  2173				} else {
  2174					pteval = ptep_get_and_clear(mm, address, pvmw.pte);
  2175				}
  2176			}
  2177	
  2178			/* Set the dirty flag on the folio now the pte is gone. */
  2179			if (pte_dirty(pteval))
  2180				folio_mark_dirty(folio);
  2181	
  2182			/* Update high watermark before we lower rss */
  2183			update_hiwater_rss(mm);
  2184	
  2185			if (folio_is_device_private(folio)) {
  2186				unsigned long pfn = folio_pfn(folio);
  2187				swp_entry_t entry;
  2188				pte_t swp_pte;
  2189	
  2190				if (anon_exclusive)
  2191					WARN_ON_ONCE(folio_try_share_anon_rmap_pte(folio,
  2192										   subpage));
  2193	
  2194				/*
  2195				 * Store the pfn of the page in a special migration
  2196				 * pte. do_swap_page() will wait until the migration
  2197				 * pte is removed and then restart fault handling.
  2198				 */
  2199				entry = pte_to_swp_entry(pteval);
  2200				if (is_writable_device_private_entry(entry))
  2201					entry = make_writable_migration_entry(pfn);
  2202				else if (anon_exclusive)
  2203					entry = make_readable_exclusive_migration_entry(pfn);
  2204				else
  2205					entry = make_readable_migration_entry(pfn);
  2206				swp_pte = swp_entry_to_pte(entry);
  2207	
  2208				/*
  2209				 * pteval maps a zone device page and is therefore
  2210				 * a swap pte.
  2211				 */
  2212				if (pte_swp_soft_dirty(pteval))
  2213					swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2214				if (pte_swp_uffd_wp(pteval))
  2215					swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2216				set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
  2217				trace_set_migration_pte(pvmw.address, pte_val(swp_pte),
  2218							folio_order(folio));
  2219				/*
  2220				 * No need to invalidate here it will synchronize on
  2221				 * against the special swap migration pte.
  2222				 */
  2223			} else if (PageHWPoison(subpage)) {
  2224				pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
  2225				if (folio_test_hugetlb(folio)) {
  2226					hugetlb_count_sub(folio_nr_pages(folio), mm);
  2227					set_huge_pte_at(mm, address, pvmw.pte, pteval,
  2228							hsz);
  2229				} else {
  2230					dec_mm_counter(mm, mm_counter(folio));
  2231					set_pte_at(mm, address, pvmw.pte, pteval);
  2232				}
  2233	
  2234			} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
  2235				/*
  2236				 * The guest indicated that the page content is of no
  2237				 * interest anymore. Simply discard the pte, vmscan
  2238				 * will take care of the rest.
  2239				 * A future reference will then fault in a new zero
  2240				 * page. When userfaultfd is active, we must not drop
  2241				 * this page though, as its main user (postcopy
  2242				 * migration) will not expect userfaults on already
  2243				 * copied pages.
  2244				 */
  2245				dec_mm_counter(mm, mm_counter(folio));
  2246			} else {
  2247				swp_entry_t entry;
  2248				pte_t swp_pte;
  2249	
  2250				if (arch_unmap_one(mm, vma, address, pteval) < 0) {
  2251					if (folio_test_hugetlb(folio))
  2252						set_huge_pte_at(mm, address, pvmw.pte,
  2253								pteval, hsz);
  2254					else
  2255						set_pte_at(mm, address, pvmw.pte, pteval);
  2256					ret = false;
  2257					page_vma_mapped_walk_done(&pvmw);
  2258					break;
  2259				}
  2260				VM_BUG_ON_PAGE(pte_write(pteval) && folio_test_anon(folio) &&
  2261					       !anon_exclusive, subpage);
  2262	
  2263				/* See folio_try_share_anon_rmap_pte(): clear PTE first. */
  2264				if (folio_test_hugetlb(folio)) {
  2265					if (anon_exclusive &&
  2266					    hugetlb_try_share_anon_rmap(folio)) {
  2267						set_huge_pte_at(mm, address, pvmw.pte,
  2268								pteval, hsz);
  2269						ret = false;
  2270						page_vma_mapped_walk_done(&pvmw);
  2271						break;
  2272					}
  2273				} else if (anon_exclusive &&
  2274					   folio_try_share_anon_rmap_pte(folio, subpage)) {
  2275					set_pte_at(mm, address, pvmw.pte, pteval);
  2276					ret = false;
  2277					page_vma_mapped_walk_done(&pvmw);
  2278					break;
  2279				}
  2280	
  2281				/*
  2282				 * Store the pfn of the page in a special migration
  2283				 * pte. do_swap_page() will wait until the migration
  2284				 * pte is removed and then restart fault handling.
  2285				 */
  2286				if (pte_write(pteval))
  2287					entry = make_writable_migration_entry(
  2288								page_to_pfn(subpage));
  2289				else if (anon_exclusive)
  2290					entry = make_readable_exclusive_migration_entry(
  2291								page_to_pfn(subpage));
  2292				else
  2293					entry = make_readable_migration_entry(
  2294								page_to_pfn(subpage));
  2295				if (pte_young(pteval))
  2296					entry = make_migration_entry_young(entry);
  2297				if (pte_dirty(pteval))
  2298					entry = make_migration_entry_dirty(entry);
  2299				swp_pte = swp_entry_to_pte(entry);
  2300				if (pte_soft_dirty(pteval))
  2301					swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2302				if (pte_uffd_wp(pteval))
  2303					swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2304				if (folio_test_hugetlb(folio))
  2305					set_huge_pte_at(mm, address, pvmw.pte, swp_pte,
  2306							hsz);
  2307				else
  2308					set_pte_at(mm, address, pvmw.pte, swp_pte);
  2309				trace_set_migration_pte(address, pte_val(swp_pte),
  2310							folio_order(folio));
  2311				/*
  2312				 * No need to invalidate here it will synchronize on
  2313				 * against the special swap migration pte.
  2314				 */
  2315			}
  2316	
  2317			if (unlikely(folio_test_hugetlb(folio)))
  2318				hugetlb_remove_rmap(folio);
  2319			else
  2320				folio_remove_rmap_pte(folio, subpage, vma);
  2321			if (vma->vm_flags & VM_LOCKED)
  2322				mlock_drain_local();
  2323	
  2324			if (!should_defer_flush(mm, flags)) {
  2325				if (folio_test_hugetlb(folio))
> 2326					flush_hugetlb_page(vma, address);
  2327				else
  2328					flush_tlb_page(vma, address);
  2329			}
  2330	
  2331			folio_put(folio);
  2332		}
  2333	
  2334		mmu_notifier_invalidate_range_end(&range);
  2335	
  2336		return ret;
  2337	}
  2338
Donet Tom Dec. 23, 2024, 12:02 p.m. UTC | #13
On 12/20/24 10:07, Dev Jain wrote:
>
> On 20/12/24 9:02 am, Baolin Wang wrote:
>>
>>
>> On 2024/12/20 11:12, Donet Tom wrote:
>>>
>>> On 12/20/24 08:01, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/12/19 20:47, Donet Tom wrote:
>>>>> The migration selftest is currently failing for shared anonymous
>>>>> mappings due to a race condition.
>>>>>
>>>>> During migration, the source folio's PTE is unmapped by nuking the
>>>>> PTE, flushing the TLB,and then marking the page for migration
>>>>> (by creating the swap entries). The issue arises when, immediately
>>>>> after the PTE is nuked and the TLB is flushed, but before the page
>>>>> is marked for migration, another thread accesses the page. This
>>>>> triggers a page fault, and the page fault handler invokes
>>>>> do_pte_missing() instead of do_swap_page(), as the page is not yet
>>>>> marked for migration.
>>>>>
>>>>> In the fault handling path, do_pte_missing() calls __do_fault()
>>>>> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry().
>>>>> This eventually calls folio_try_get(), incrementing the reference
>>>>> count of the folio undergoing migration. The thread then blocks
>>>>> on folio_lock(), as the migration path holds the lock. This
>>>>> results in the migration failing in __migrate_folio(), which expects
>>>>> the folio's reference count to be 2. However, the reference count is
>>>>> incremented by the fault handler, leading to the failure.
>>>>>
>>>>> The issue arises because, after nuking the PTE and before marking the
>>>>> page for migration, the page is accessed. To address this, we have
>>>>> updated the logic to first nuke the PTE, then mark the page for
>>>>> migration, and only then flush the TLB. With this patch, If the 
>>>>> page is
>>>>> accessed immediately after nuking the PTE, the TLB entry is still
>>>>> valid, so no fault occurs. After marking the page for migration,
>>>>
>>>> IMO, I don't think this assumption is correct. At this point, the 
>>>> TLB entry might also be evicted, so a page fault could still occur. 
>>>> It's just a matter of probability.
>>> In this patch, we mark the page for migration before flushing the TLB.
>>> This ensures that if someone accesses the page after the TLB flush,
>>> the page fault will occur and in the page fault handler will wait 
>>> for the
>>> migration to complete. So migration will not fail
>>>
>>> Without this patch, if someone accesses the page after the TLB flush
>>> but before it is marked for migration, the migration will fail.
>>
>> Actually my concern is the same as David's (I did not see David's 
>> reply before sending my comments), which is that your patch does not 
>> "rules out all cases".
>
>
> I like this solution but really the proper solution for this one was 
> to atomically set the migration entry IMHO.
>
Thank you Dev. I will check and come back on this.
>
>>
>>>> Additionally, IIUC, if another thread is accessing the shmem folio 
>>>> causing the migration to fail, I think this is expected, and 
>>>> migration failure is not a vital issue?
>>>>
>>> In my case, the shmem migration test is always failing,
>>> even after retries. Would it be correct to consider this
>>> as expected behavior?
>>
>> IMHO I think your test case is too aggressive and unlikely to occur 
>> in real-world scenarios. Additionally, as I mentioned, migration 
>> failure is not a vital issue in the system, and some temporary refcnt 
>> can also lead to migration failure if you want to create such test 
>> cases. So personally, I don't think it is worthy doing.
>
> Agreed, AFAIR the test case starts faulting exactly on those pages 
> which we want to migrate, making this a very artificial scenario.
>
Donet Tom Dec. 23, 2024, 12:08 p.m. UTC | #14
On 12/20/24 15:41, David Hildenbrand wrote:
> On 20.12.24 03:55, Donet Tom wrote:
>>
>> On 12/19/24 18:28, David Hildenbrand wrote:
>>> On 19.12.24 13:47, Donet Tom wrote:
>>>> The migration selftest is currently failing for shared anonymous
>>>> mappings due to a race condition.
>>>>
>>>> During migration, the source folio's PTE is unmapped by nuking the
>>>> PTE, flushing the TLB,and then marking the page for migration
>>>> (by creating the swap entries). The issue arises when, immediately
>>>> after the PTE is nuked and the TLB is flushed, but before the page
>>>> is marked for migration, another thread accesses the page. This
>>>> triggers a page fault, and the page fault handler invokes
>>>> do_pte_missing() instead of do_swap_page(), as the page is not yet
>>>> marked for migration.
>>>>
>>>> In the fault handling path, do_pte_missing() calls __do_fault()
>>>> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry().
>>>> This eventually calls folio_try_get(), incrementing the reference
>>>> count of the folio undergoing migration. The thread then blocks
>>>> on folio_lock(), as the migration path holds the lock. This
>>>> results in the migration failing in __migrate_folio(), which expects
>>>> the folio's reference count to be 2. However, the reference count is
>>>> incremented by the fault handler, leading to the failure.
>>>>
>>>> The issue arises because, after nuking the PTE and before marking the
>>>> page for migration, the page is accessed. To address this, we have
>>>> updated the logic to first nuke the PTE, then mark the page for
>>>> migration, and only then flush the TLB. With this patch, If the 
>>>> page is
>>>> accessed immediately after nuking the PTE, the TLB entry is still
>>>> valid, so no fault occurs.
>>>
>>> But what about if the PTE is not in the TLB yet, and you get an access
>>> from another CPU just after clearing the PTE (but not flushing the
>>> TLB)? The other CPU will still observe PTE=none, trigger a fault etc.
>>>
>> Yes, in this scenario, the migration will fail. Do you think the
>> migration test
>> failure, even after a retry, should be considered a major issue that
>> must be fixed?
>
> I think it is something we should definitely improve, but I think our 
> page migration should handle this in a better way, not the unmap 
> logic. I recall we discussed with Dev some ideas on how to improve that?
>
>
> I'm pretty sure one can trigger similar case using a tmpfs file and 
> using read/write in a loop instead of memory access -> page faults. So 
> where racing with page faults is completely out of the picture.

Thank you David. I will try this scenario as well and come back with 
some ideas for improvement.
diff mbox series

Patch

diff --git a/mm/rmap.c b/mm/rmap.c
index c6c4d4ea29a7..920ae46e977f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2154,7 +2154,7 @@  static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				hugetlb_vma_unlock_write(vma);
 			}
 			/* Nuke the hugetlb page table entry */
-			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
+			pteval = huge_ptep_get_and_clear(mm, address, pvmw.pte);
 		} else {
 			flush_cache_page(vma, address, pfn);
 			/* Nuke the page table entry. */
@@ -2171,7 +2171,7 @@  static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 
 				set_tlb_ubc_flush_pending(mm, pteval, address);
 			} else {
-				pteval = ptep_clear_flush(vma, address, pvmw.pte);
+				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 			}
 		}
 
@@ -2320,6 +2320,14 @@  static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			folio_remove_rmap_pte(folio, subpage, vma);
 		if (vma->vm_flags & VM_LOCKED)
 			mlock_drain_local();
+
+		if (!should_defer_flush(mm, flags)) {
+			if (folio_test_hugetlb(folio))
+				flush_hugetlb_page(vma, address);
+			else
+				flush_tlb_page(vma, address);
+		}
+
 		folio_put(folio);
 	}