
[v3,1/3] mm/migrate_device.c: Flush TLB while holding PTL

Message ID 3b01af093515ce2960ac39bb16ff77473150d179.1661309831.git-series.apopple@nvidia.com (mailing list archive)
State New
Series [v3,1/3] mm/migrate_device.c: Flush TLB while holding PTL

Commit Message

Alistair Popple Aug. 24, 2022, 3:03 a.m. UTC
When clearing a PTE the TLB should be flushed whilst still holding the
PTL to avoid a potential race with madvise/munmap/etc. For example
consider the following sequence:

  CPU0                          CPU1
  ----                          ----

  migrate_vma_collect_pmd()
  pte_unmap_unlock()
                                madvise(MADV_DONTNEED)
                                -> zap_pte_range()
                                pte_offset_map_lock()
                                [ PTE not present, TLB not flushed ]
                                pte_unmap_unlock()
                                [ page is still accessible via stale TLB ]
  flush_tlb_range()

In this case the page may still be accessed via the stale TLB entry
after madvise returns. Fix this by flushing the TLB while holding the
PTL.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
Cc: stable@vger.kernel.org

---

Changes for v3:

 - New for v3
---
 mm/migrate_device.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)


base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
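
For readers following the fix, here is a minimal sketch of the ordering the
patch enforces (illustrative only; the real migrate_vma_collect_pmd() in
mm/migrate_device.c also handles device-private entries, anon-exclusive
pages and error paths that are elided here):

	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
	arch_enter_lazy_mmu_mode();
	for (; addr < end; addr += PAGE_SIZE, ptep++) {
		/* ... decide this page is migratable ... */
		pte = ptep_get_and_clear(mm, addr, ptep); /* PTE cleared, TLB not yet flushed */
		/* ... install the migration entry with set_pte_at() ... */
		unmapped++;
	}

	/*
	 * Flush stale TLB entries before dropping the PTL: a concurrent
	 * zap_pte_range() that finds the PTE already cleared will skip its
	 * own flush, so the flush must complete while the lock is still held.
	 */
	if (unmapped)
		flush_tlb_range(walk->vma, start, end);

	arch_leave_lazy_mmu_mode();
	pte_unmap_unlock(ptep - 1, ptl);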

Comments

David Hildenbrand Aug. 24, 2022, 8:21 a.m. UTC | #1
On 24.08.22 05:03, Alistair Popple wrote:
> When clearing a PTE the TLB should be flushed whilst still holding the
> PTL to avoid a potential race with madvise/munmap/etc. For example
> consider the following sequence:
> 
>   CPU0                          CPU1
>   ----                          ----
> 
>   migrate_vma_collect_pmd()
>   pte_unmap_unlock()
>                                 madvise(MADV_DONTNEED)
>                                 -> zap_pte_range()
>                                 pte_offset_map_lock()
>                                 [ PTE not present, TLB not flushed ]
>                                 pte_unmap_unlock()
>                                 [ page is still accessible via stale TLB ]
>   flush_tlb_range()
> 
> In this case the page may still be accessed via the stale TLB entry
> after madvise returns. Fix this by flushing the TLB while holding the
> PTL.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reported-by: Nadav Amit <nadav.amit@gmail.com>
> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
> Cc: stable@vger.kernel.org
> 
> ---
> 
> Changes for v3:
> 
>  - New for v3
> ---
>  mm/migrate_device.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 27fb37d..6a5ef9f 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  		migrate->dst[migrate->npages] = 0;
>  		migrate->src[migrate->npages++] = mpfn;
>  	}
> -	arch_leave_lazy_mmu_mode();
> -	pte_unmap_unlock(ptep - 1, ptl);
>  
>  	/* Only flush the TLB if we actually modified any entries */
>  	if (unmapped)
>  		flush_tlb_range(walk->vma, start, end);
>  
> +	arch_leave_lazy_mmu_mode();
> +	pte_unmap_unlock(ptep - 1, ptl);
> +
>  	return 0;
>  }
>  
> 
> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136

I'm not a TLB-flushing expert, but this matches my understanding (and
some Linux TLB-flushing documentation I stumbled over a while ago but
cannot quickly find).

In the ordinary try_to_migrate_one() path, flushing would happen via
ptep_clear_flush() (just like we do for the anon_exclusive case here as
well), correct?
Alistair Popple Aug. 24, 2022, 12:26 p.m. UTC | #2
David Hildenbrand <david@redhat.com> writes:

> On 24.08.22 05:03, Alistair Popple wrote:
>> When clearing a PTE the TLB should be flushed whilst still holding the
>> PTL to avoid a potential race with madvise/munmap/etc. For example
>> consider the following sequence:
>>
>>   CPU0                          CPU1
>>   ----                          ----
>>
>>   migrate_vma_collect_pmd()
>>   pte_unmap_unlock()
>>                                 madvise(MADV_DONTNEED)
>>                                 -> zap_pte_range()
>>                                 pte_offset_map_lock()
>>                                 [ PTE not present, TLB not flushed ]
>>                                 pte_unmap_unlock()
>>                                 [ page is still accessible via stale TLB ]
>>   flush_tlb_range()
>>
>> In this case the page may still be accessed via the stale TLB entry
>> after madvise returns. Fix this by flushing the TLB while holding the
>> PTL.
>>
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> Reported-by: Nadav Amit <nadav.amit@gmail.com>
>> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
>> Cc: stable@vger.kernel.org
>>
>> ---
>>
>> Changes for v3:
>>
>>  - New for v3
>> ---
>>  mm/migrate_device.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 27fb37d..6a5ef9f 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>  		migrate->dst[migrate->npages] = 0;
>>  		migrate->src[migrate->npages++] = mpfn;
>>  	}
>> -	arch_leave_lazy_mmu_mode();
>> -	pte_unmap_unlock(ptep - 1, ptl);
>>
>>  	/* Only flush the TLB if we actually modified any entries */
>>  	if (unmapped)
>>  		flush_tlb_range(walk->vma, start, end);
>>
>> +	arch_leave_lazy_mmu_mode();
>> +	pte_unmap_unlock(ptep - 1, ptl);
>> +
>>  	return 0;
>>  }
>>
>>
>> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
>
> I'm not a TLB-flushing expert, but this matches my understanding (and
> some Linux TLB-flushing documentation I stumbled over a while ago but
> cannot quickly find).
>
> In the ordinary try_to_migrate_one() path, flushing would happen via
> ptep_clear_flush() (just like we do for the anon_exclusive case here as
> well), correct?

Correct.
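
(For context: the generic ptep_clear_flush() in mm/pgtable-generic.c, which
architectures may override, clears the PTE and flushes the TLB entry for
that address before returning, so the flush happens while the PTL is still
held. Simplified:)

	pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
			       pte_t *ptep)
	{
		struct mm_struct *mm = (vma)->vm_mm;
		pte_t pte;

		pte = ptep_get_and_clear(mm, address, ptep);
		if (pte_accessible(mm, pte))
			flush_tlb_page(vma, address);
		return pte;
	}
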
David Hildenbrand Aug. 24, 2022, 12:35 p.m. UTC | #3
On 24.08.22 14:26, Alistair Popple wrote:
> 
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 24.08.22 05:03, Alistair Popple wrote:
>>> When clearing a PTE the TLB should be flushed whilst still holding the
>>> PTL to avoid a potential race with madvise/munmap/etc. For example
>>> consider the following sequence:
>>>
>>>   CPU0                          CPU1
>>>   ----                          ----
>>>
>>>   migrate_vma_collect_pmd()
>>>   pte_unmap_unlock()
>>>                                 madvise(MADV_DONTNEED)
>>>                                 -> zap_pte_range()
>>>                                 pte_offset_map_lock()
>>>                                 [ PTE not present, TLB not flushed ]
>>>                                 pte_unmap_unlock()
>>>                                 [ page is still accessible via stale TLB ]
>>>   flush_tlb_range()
>>>
>>> In this case the page may still be accessed via the stale TLB entry
>>> after madvise returns. Fix this by flushing the TLB while holding the
>>> PTL.
>>>
>>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>>> Reported-by: Nadav Amit <nadav.amit@gmail.com>
>>> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
>>> Cc: stable@vger.kernel.org
>>>
>>> ---
>>>
>>> Changes for v3:
>>>
>>>  - New for v3
>>> ---
>>>  mm/migrate_device.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> index 27fb37d..6a5ef9f 100644
>>> --- a/mm/migrate_device.c
>>> +++ b/mm/migrate_device.c
>>> @@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>  		migrate->dst[migrate->npages] = 0;
>>>  		migrate->src[migrate->npages++] = mpfn;
>>>  	}
>>> -	arch_leave_lazy_mmu_mode();
>>> -	pte_unmap_unlock(ptep - 1, ptl);
>>>
>>>  	/* Only flush the TLB if we actually modified any entries */
>>>  	if (unmapped)
>>>  		flush_tlb_range(walk->vma, start, end);
>>>
>>> +	arch_leave_lazy_mmu_mode();
>>> +	pte_unmap_unlock(ptep - 1, ptl);
>>> +
>>>  	return 0;
>>>  }
>>>
>>>
>>> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
>>
>> I'm not a TLB-flushing expert, but this matches my understanding (and
>> some Linux TLB-flushing documentation I stumbled over a while ago but
>> cannot quickly find).
>>
>> In the ordinary try_to_migrate_one() path, flushing would happen via
>> ptep_clear_flush() (just like we do for the anon_exclusive case here as
>> well), correct?
> 
> Correct.
> 

Acked-by: David Hildenbrand <david@redhat.com>
Huang, Ying Aug. 25, 2022, 1:36 a.m. UTC | #4
Alistair Popple <apopple@nvidia.com> writes:

> When clearing a PTE the TLB should be flushed whilst still holding the
> PTL to avoid a potential race with madvise/munmap/etc. For example
> consider the following sequence:
>
>   CPU0                          CPU1
>   ----                          ----
>
>   migrate_vma_collect_pmd()
>   pte_unmap_unlock()
>                                 madvise(MADV_DONTNEED)
>                                 -> zap_pte_range()
>                                 pte_offset_map_lock()
>                                 [ PTE not present, TLB not flushed ]
>                                 pte_unmap_unlock()
>                                 [ page is still accessible via stale TLB ]
>   flush_tlb_range()
>
> In this case the page may still be accessed via the stale TLB entry
> after madvise returns. Fix this by flushing the TLB while holding the
> PTL.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reported-by: Nadav Amit <nadav.amit@gmail.com>
> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
> Cc: stable@vger.kernel.org
>
> ---
>
> Changes for v3:
>
>  - New for v3
> ---
>  mm/migrate_device.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 27fb37d..6a5ef9f 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  		migrate->dst[migrate->npages] = 0;
>  		migrate->src[migrate->npages++] = mpfn;
>  	}
> -	arch_leave_lazy_mmu_mode();
> -	pte_unmap_unlock(ptep - 1, ptl);
>  
>  	/* Only flush the TLB if we actually modified any entries */
>  	if (unmapped)
>  		flush_tlb_range(walk->vma, start, end);

It appears that we need to increment "unmapped" only if ptep_get_and_clear()
is used?

Best Regards,
Huang, Ying

> +	arch_leave_lazy_mmu_mode();
> +	pte_unmap_unlock(ptep - 1, ptl);
> +
>  	return 0;
>  }
>  
>
> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
Alistair Popple Aug. 25, 2022, 10:35 p.m. UTC | #5
"Huang, Ying" <ying.huang@intel.com> writes:

> Alistair Popple <apopple@nvidia.com> writes:
>
>> When clearing a PTE the TLB should be flushed whilst still holding the
>> PTL to avoid a potential race with madvise/munmap/etc. For example
>> consider the following sequence:
>>
>>   CPU0                          CPU1
>>   ----                          ----
>>
>>   migrate_vma_collect_pmd()
>>   pte_unmap_unlock()
>>                                 madvise(MADV_DONTNEED)
>>                                 -> zap_pte_range()
>>                                 pte_offset_map_lock()
>>                                 [ PTE not present, TLB not flushed ]
>>                                 pte_unmap_unlock()
>>                                 [ page is still accessible via stale TLB ]
>>   flush_tlb_range()
>>
>> In this case the page may still be accessed via the stale TLB entry
>> after madvise returns. Fix this by flushing the TLB while holding the
>> PTL.
>>
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> Reported-by: Nadav Amit <nadav.amit@gmail.com>
>> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
>> Cc: stable@vger.kernel.org
>>
>> ---
>>
>> Changes for v3:
>>
>>  - New for v3
>> ---
>>  mm/migrate_device.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 27fb37d..6a5ef9f 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>  		migrate->dst[migrate->npages] = 0;
>>  		migrate->src[migrate->npages++] = mpfn;
>>  	}
>> -	arch_leave_lazy_mmu_mode();
>> -	pte_unmap_unlock(ptep - 1, ptl);
>>
>>  	/* Only flush the TLB if we actually modified any entries */
>>  	if (unmapped)
>>  		flush_tlb_range(walk->vma, start, end);
>
> It appears that we need to increment "unmapped" only if ptep_get_and_clear()
> is used?

In other words, you mean we only need to increment unmapped if pte_present
&& !anon_exclusive?

Agree, that's a good optimisation to make. However, I'm just trying to
solve a data corruption issue (pages not being dirtied) here, so I will
post that as a separate optimisation patch. Thanks.

 - Alistair

> Best Regards,
> Huang, Ying
>
>> +	arch_leave_lazy_mmu_mode();
>> +	pte_unmap_unlock(ptep - 1, ptl);
>> +
>>  	return 0;
>>  }
>>
>>
>> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
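
(The follow-up optimisation discussed above would look roughly like the
following inside the collection loop. This is a hypothetical sketch, not
code from this series: the anon-exclusive path flushes immediately via
ptep_clear_flush(), and non-present device-private entries leave nothing
in the TLB, so only the deferred-flush path needs to feed "unmapped".)

	if (anon_exclusive) {
		pte = ptep_clear_flush(vma, addr, ptep);  /* flushes this entry itself */
		/* ... */
	} else {
		pte = ptep_get_and_clear(mm, addr, ptep); /* flush deferred */
		unmapped++;                               /* only this path needs flush_tlb_range() */
	}
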
Huang, Ying Aug. 26, 2022, 12:56 a.m. UTC | #6
Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>>> When clearing a PTE the TLB should be flushed whilst still holding the
>>> PTL to avoid a potential race with madvise/munmap/etc. For example
>>> consider the following sequence:
>>>
>>>   CPU0                          CPU1
>>>   ----                          ----
>>>
>>>   migrate_vma_collect_pmd()
>>>   pte_unmap_unlock()
>>>                                 madvise(MADV_DONTNEED)
>>>                                 -> zap_pte_range()
>>>                                 pte_offset_map_lock()
>>>                                 [ PTE not present, TLB not flushed ]
>>>                                 pte_unmap_unlock()
>>>                                 [ page is still accessible via stale TLB ]
>>>   flush_tlb_range()
>>>
>>> In this case the page may still be accessed via the stale TLB entry
>>> after madvise returns. Fix this by flushing the TLB while holding the
>>> PTL.
>>>
>>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>>> Reported-by: Nadav Amit <nadav.amit@gmail.com>
>>> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
>>> Cc: stable@vger.kernel.org
>>>
>>> ---
>>>
>>> Changes for v3:
>>>
>>>  - New for v3
>>> ---
>>>  mm/migrate_device.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> index 27fb37d..6a5ef9f 100644
>>> --- a/mm/migrate_device.c
>>> +++ b/mm/migrate_device.c
>>> @@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>  		migrate->dst[migrate->npages] = 0;
>>>  		migrate->src[migrate->npages++] = mpfn;
>>>  	}
>>> -	arch_leave_lazy_mmu_mode();
>>> -	pte_unmap_unlock(ptep - 1, ptl);
>>>
>>>  	/* Only flush the TLB if we actually modified any entries */
>>>  	if (unmapped)
>>>  		flush_tlb_range(walk->vma, start, end);
>>
>> It appears that we need to increment "unmapped" only if ptep_get_and_clear()
>> is used?
>
> In other words, you mean we only need to increment unmapped if pte_present
> && !anon_exclusive?
>
> Agree, that's a good optimisation to make. However, I'm just trying to
> solve a data corruption issue (pages not being dirtied) here, so I will
> post that as a separate optimisation patch. Thanks.

OK.  Then the patch looks good to me.  Feel free to add my,

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

Best Regards,
Huang, Ying

>>
>>> +	arch_leave_lazy_mmu_mode();
>>> +	pte_unmap_unlock(ptep - 1, ptl);
>>> +
>>>  	return 0;
>>>  }
>>>
>>>
>>> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136

Patch

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 27fb37d..6a5ef9f 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		migrate->dst[migrate->npages] = 0;
 		migrate->src[migrate->npages++] = mpfn;
 	}
-	arch_leave_lazy_mmu_mode();
-	pte_unmap_unlock(ptep - 1, ptl);
 
 	/* Only flush the TLB if we actually modified any entries */
 	if (unmapped)
 		flush_tlb_range(walk->vma, start, end);
 
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(ptep - 1, ptl);
+
 	return 0;
 }