[v2] mm/hugetlb: Fix calculation of adjust_range_if_pmd_sharing_possible

Message ID 20200730201636.74778-1-peterx@redhat.com (mailing list archive)
State New, archived
Series [v2] mm/hugetlb: Fix calculation of adjust_range_if_pmd_sharing_possible

Commit Message

Peter Xu July 30, 2020, 8:16 p.m. UTC
This was found by code inspection only.

Firstly, the worst case scenario should assume the whole range is covered by
pmd sharing.  The old algorithm might not work as expected for ranges
like (1g-2m, 1g+2m), where the old code adjusts the range to (0, 1g+2m) but
the expected range is (0, 2g).

While at it, remove the loop, since it is not required.  With that, the
new code should also be faster when the range being invalidated is huge.

CC: Andrea Arcangeli <aarcange@redhat.com>
CC: Mike Kravetz <mike.kravetz@oracle.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Matthew Wilcox <willy@infradead.org>
CC: linux-mm@kvack.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
v2:
- use min/max instead of custom MIN/MAX [Matthew]
---
 mm/hugetlb.c | 24 ++++++++++--------------
 1 file changed, 10 insertions(+), 14 deletions(-)

Comments

Mike Kravetz July 30, 2020, 9:49 p.m. UTC | #1
On 7/30/20 1:16 PM, Peter Xu wrote:
> This was found by code inspection only.
> 
> Firstly, the worst case scenario should assume the whole range is covered by
> pmd sharing.  The old algorithm might not work as expected for ranges
> like (1g-2m, 1g+2m), where the old code adjusts the range to (0, 1g+2m) but
> the expected range is (0, 2g).
> 
> While at it, remove the loop, since it is not required.  With that, the
> new code should also be faster when the range being invalidated is huge.

Thanks Peter!

That is certainly much simpler than the loop in current code.  You say there
are instances where old code 'might not work' for ranges like (1g-2m, 1g+2m).
Not sure I understand what you mean by adjusted and expected ranges in the
message.  Both are possible 'adjusted' ranges depending on vma size.

Just trying to figure out if there is an actual problem in the existing code
that needs to be fixed in stable.  I think the existing code is correct, just
inefficient.

Peter Xu July 30, 2020, 11:26 p.m. UTC | #2
Hi, Mike,

On Thu, Jul 30, 2020 at 02:49:18PM -0700, Mike Kravetz wrote:
> On 7/30/20 1:16 PM, Peter Xu wrote:
> > This was found by code inspection only.
> > 
> > Firstly, the worst case scenario should assume the whole range is covered by
> > pmd sharing.  The old algorithm might not work as expected for ranges
> > like (1g-2m, 1g+2m), where the old code adjusts the range to (0, 1g+2m) but
> > the expected range is (0, 2g).
> > 
> > While at it, remove the loop, since it is not required.  With that, the
> > new code should also be faster when the range being invalidated is huge.
> 
> Thanks Peter!
> 
> That is certainly much simpler than the loop in current code.  You say there
> are instances where old code 'might not work' for ranges like (1g-2m, 1g+2m).
> Not sure I understand what you mean by adjusted and expected ranges in the
> message.  Both are possible 'adjusted' ranges depending on vma size.
> 
> Just trying to figure out if there is an actual problem in the existing code
> that needs to be fixed in stable.  I think the existing code is correct, just
> inefficient.

Thanks for the quick review!

I'm not sure whether that will cause a real problem, but iiuc in my previous
example of (1g-2m, 1g+2m) in the commit message, the old code will extend the
range to (0, 1g+2m).  In this case, if unluckily the (1g, 2g) range is a pud
with shared pmd, then imho we face the risk of partial tlb flushing with the
old code, because it will only flush tlb for range (0, 1g+2m) but not (0, 2g).
If that's the case, it may be worth cc'ing stable.

Anyway, I'd like to double-check with you in case I missed something.

Thanks,

Mike Kravetz July 31, 2020, 12:28 a.m. UTC | #3
On 7/30/20 4:26 PM, Peter Xu wrote:
> Hi, Mike,
> 
> On Thu, Jul 30, 2020 at 02:49:18PM -0700, Mike Kravetz wrote:
>> On 7/30/20 1:16 PM, Peter Xu wrote:
>>> This was found by code inspection only.
>>>
>>> Firstly, the worst case scenario should assume the whole range is covered by
>>> pmd sharing.  The old algorithm might not work as expected for ranges
>>> like (1g-2m, 1g+2m), where the old code adjusts the range to (0, 1g+2m) but
>>> the expected range is (0, 2g).
>>>
>>> While at it, remove the loop, since it is not required.  With that, the
>>> new code should also be faster when the range being invalidated is huge.
>>
>> Thanks Peter!
>>
>> That is certainly much simpler than the loop in current code.  You say there
>> are instances where old code 'might not work' for ranges like (1g-2m, 1g+2m).
>> Not sure I understand what you mean by adjusted and expected ranges in the
>> message.  Both are possible 'adjusted' ranges depending on vma size.
>>
>> Just trying to figure out if there is an actual problem in the existing code
>> that needs to be fixed in stable.  I think the existing code is correct, just
>> inefficient.
> 
> Thanks for the quick review!
> 
> I'm not sure whether that will cause a real problem, but iiuc in my previous
> example of (1g-2m, 1g+2m) in the commit message, the old code will extend the
> range to (0, 1g+2m).  In this case, if unluckily the (1g, 2g) range is a pud
> with shared pmd, then imho we face the risk of partial tlb flushing with the
> old code, because it will only flush tlb for range (0, 1g+2m) but not (0, 2g).
> If that's the case, it may be worth cc'ing stable.
> 
> Anyway, I'd like to double-check with you in case I missed something.

You are correct.  With range (1g-2m, 1g+2m) within a vma (0, 2g) the existing
code will only adjust to (0, 1g+2m) which is incorrect.

We should cc stable.  The original reason for adjusting the range was to
prevent data corruption (getting wrong page).  Since the range is not always
adjusted correctly, the potential for corruption still exists.

However, I am fairly confident that adjust_range_if_pmd_sharing_possible is
only going to be called in two cases:
1) for a single page
2) for range == entire vma
In those cases, the current code should produce the correct results.
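
The (1g-2m, 1g+2m) counter-example and the two call patterns listed above can
be checked with a small userspace model.  This is only a sketch, not kernel
code: the sketch_* names are hypothetical, PUD_SIZE is assumed to be 1 GiB,
the vma is reduced to its start/end, and VM_MAYSHARE is assumed to be set.

```c
#include <stdbool.h>

/* Assumption: 1 GiB PUD; the real value is architecture dependent. */
#define SKETCH_PUD_SIZE		(1UL << 30)
#define SKETCH_PUD_MASK		(~(SKETCH_PUD_SIZE - 1))
#define SKETCH_ALIGN_DOWN(x, a)	((x) & ~((a) - 1))
#define SKETCH_ALIGN(x, a)	(((x) + (a) - 1) & ~((a) - 1))

/* Hypothetical stand-in for struct vm_area_struct (VM_MAYSHARE assumed). */
struct sketch_vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

static bool sketch_range_in_vma(const struct sketch_vma *vma,
				unsigned long start, unsigned long end)
{
	return vma->vm_start <= start && end <= vma->vm_end;
}

/* Model of the loop-based code removed by the patch. */
static void sketch_adjust_old(const struct sketch_vma *vma,
			      unsigned long *start, unsigned long *end)
{
	unsigned long check_addr;

	for (check_addr = *start; check_addr < *end;
	     check_addr += SKETCH_PUD_SIZE) {
		unsigned long a_start = check_addr & SKETCH_PUD_MASK;
		unsigned long a_end = a_start + SKETCH_PUD_SIZE;

		/* Only widen when the whole PUD lies inside the vma. */
		if (sketch_range_in_vma(vma, a_start, a_end)) {
			if (a_start < *start)
				*start = a_start;
			if (a_end > *end)
				*end = a_end;
		}
	}
}

/* Model of the patched code: PUD-align, then clamp to the vma. */
static void sketch_adjust_new(const struct sketch_vma *vma,
			      unsigned long *start, unsigned long *end)
{
	unsigned long a_start = SKETCH_ALIGN_DOWN(*start, SKETCH_PUD_SIZE);
	unsigned long a_end = SKETCH_ALIGN(*end, SKETCH_PUD_SIZE);

	*start = vma->vm_start > a_start ? vma->vm_start : a_start;
	*end = vma->vm_end < a_end ? vma->vm_end : a_end;
}

/* True when both models adjust (start, end) to the same range. */
static bool sketch_same_result(const struct sketch_vma *vma,
			       unsigned long start, unsigned long end)
{
	unsigned long os = start, oe = end, ns = start, ne = end;

	sketch_adjust_old(vma, &os, &oe);
	sketch_adjust_new(vma, &ns, &ne);
	return os == ns && oe == ne;
}
```

With a vma of (0, 2g), calling sketch_same_result() on (1g-2m, 1g+2m) shows
the two models disagree, since the old loop never visits the second PUD.  For
a single 2m page or for the entire vma, both models agree under these
assumptions.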

To be safe, let's just cc stable.

Also,
Fixes: 017b1660df89 ("mm: migration: fix migration of huge PMD shared pages")
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4645f1441d32..7332f3c4b8ec 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5321,25 +5321,21 @@  static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
 				unsigned long *start, unsigned long *end)
 {
-	unsigned long check_addr;
+	unsigned long a_start, a_end;
 
 	if (!(vma->vm_flags & VM_MAYSHARE))
 		return;
 
-	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
-		unsigned long a_start = check_addr & PUD_MASK;
-		unsigned long a_end = a_start + PUD_SIZE;
+	/* Extend the range to be PUD aligned for a worst case scenario */
+	a_start = ALIGN_DOWN(*start, PUD_SIZE);
+	a_end = ALIGN(*end, PUD_SIZE);
 
-		/*
-		 * If sharing is possible, adjust start/end if necessary.
-		 */
-		if (range_in_vma(vma, a_start, a_end)) {
-			if (a_start < *start)
-				*start = a_start;
-			if (a_end > *end)
-				*end = a_end;
-		}
-	}
+	/*
+	 * Intersect the range with the vma range, since pmd sharing won't be
+	 * across vma after all
+	 */
+	*start = max(vma->vm_start, a_start);
+	*end = min(vma->vm_end, a_end);
 }
 
 /*