mm: use correct VMA flags when freeing page-tables

Message ID 20211021122322.592822-1-namit@vmware.com (mailing list archive)
State New
Series: mm: use correct VMA flags when freeing page-tables

Commit Message

Nadav Amit Oct. 21, 2021, 12:23 p.m. UTC
From: Nadav Amit <namit@vmware.com>

Consistent use of the mmu_gather interface requires a call to
tlb_start_vma() and tlb_end_vma() for each VMA. free_pgtables() does not
follow this pattern.

Certain architectures need tlb_start_vma() to be called in order for
tlb_update_vma_flags() to update the VMA flags (tlb->vma_exec and
tlb->vma_huge), which are later used for the proper TLB flush to be
issued. Since tlb_start_vma() is not called, this can lead to the wrong
VMA flags being used when the flush is performed.
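
For reference, the relevant generic helpers look roughly like this (a
simplified sketch based on include/asm-generic/tlb.h of that era;
architectures may override these and details vary between versions):

static inline void tlb_update_vma_flags(struct mmu_gather *tlb,
					struct vm_area_struct *vma)
{
	/* Record the VMA properties that the eventual flush consults. */
	tlb->vma_huge = is_vm_hugetlb_page(vma);
	tlb->vma_exec = !!(vma->vm_flags & VM_EXEC);
}

static inline void tlb_start_vma(struct mmu_gather *tlb,
				 struct vm_area_struct *vma)
{
	if (tlb->fullmm)
		return;

	tlb_update_vma_flags(tlb, vma);
	flush_cache_range(vma, vma->vm_start, vma->vm_end);
}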

Specifically, the munmap syscall would call unmap_region(), which unmaps
the VMAs and then frees the page-tables. A flush is needed after
the page-tables are removed to prevent page-walk caches from holding
stale entries, but this flush would use the VMA flags of the last VMA
that was flushed. This does not appear to be right.

Use tlb_start_vma() and tlb_end_vma() to prevent this from happening.
This might lead to unnecessary calls to flush_cache_range() on certain
arch's. If needed, a new flag can be added to mmu_gather to indicate
that the flush is not needed.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/memory.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Andrew Morton Oct. 22, 2021, 3:06 a.m. UTC | #1
On Thu, 21 Oct 2021 05:23:22 -0700 Nadav Amit <nadav.amit@gmail.com> wrote:

> From: Nadav Amit <namit@vmware.com>
> 
> Consistent use of the mmu_gather interface requires a call to
> tlb_start_vma() and tlb_end_vma() for each VMA. free_pgtables() does not
> follow this pattern.
> 
> Certain architectures need tlb_start_vma() to be called in order for
> tlb_update_vma_flags() to update the VMA flags (tlb->vma_exec and
> tlb->vma_huge), which are later used for the proper TLB flush to be
> issued. Since tlb_start_vma() is not called, this can lead to the wrong
> VMA flags being used when the flush is performed.
> 
> Specifically, the munmap syscall would call unmap_region(), which unmaps
> the VMAs and then frees the page-tables. A flush is needed after
> the page-tables are removed to prevent page-walk caches from holding
> stale entries, but this flush would use the VMA flags of the last VMA
> that was flushed. This does not appear to be right.

Any thoughts on what the worst-case end-user visible effects of this
would be?

Again, I'm wondering about the desirability of a -stable backport.

> Use tlb_start_vma() and tlb_end_vma() to prevent this from happening.
> This might lead to unnecessary calls to flush_cache_range() on certain
> arch's. If needed, a new flag can be added to mmu_gather to indicate
> that the flush is not needed.
Nadav Amit Oct. 22, 2021, 3:37 p.m. UTC | #2
> On Oct 21, 2021, at 8:06 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> On Thu, 21 Oct 2021 05:23:22 -0700 Nadav Amit <nadav.amit@gmail.com> wrote:
> 
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Consistent use of the mmu_gather interface requires a call to
>> tlb_start_vma() and tlb_end_vma() for each VMA. free_pgtables() does not
>> follow this pattern.
>> 
>> Certain architectures need tlb_start_vma() to be called in order for
>> tlb_update_vma_flags() to update the VMA flags (tlb->vma_exec and
>> tlb->vma_huge), which are later used for the proper TLB flush to be
>> issued. Since tlb_start_vma() is not called, this can lead to the wrong
>> VMA flags being used when the flush is performed.
>> 
>> Specifically, the munmap syscall would call unmap_region(), which unmaps
>> the VMAs and then frees the page-tables. A flush is needed after
>> the page-tables are removed to prevent page-walk caches from holding
>> stale entries, but this flush would use the VMA flags of the last VMA
>> that was flushed. This does not appear to be right.
> 
> Any thoughts on what the worst-case end-user visible effects of this
> would be?
> 
> Again, I'm wondering about the desirability of a -stable backport.

This issue is not relevant for x86, which I am most familiar with, hence my
limited understanding of the impact on specific architectures.

In general, a TLB flush is needed after the page-tables are released
(in contrast to PTE removal) to prevent a speculative page-walk that might
access IO pages or install bogus entries. Such speculative page-walks have
been reported (on x86) as causing a machine-check. [1]

If a certain architecture has different page-walk caches for executable and
non-executable pages (i.e., different page-walk caches for iTLB and dTLB) or
for different page sizes, it might not perform the necessary TLB flush on the
proper TLB.
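
To illustrate where those flags end up: the generic tlb_flush() (a
simplified sketch from include/asm-generic/tlb.h, assuming an
architecture that uses the generic version) feeds the recorded bits
back into flush_tlb_range() through a temporary on-stack VMA, so stale
vma_exec/vma_huge values select the wrong kind of flush:

static inline void tlb_flush(struct mmu_gather *tlb)
{
	if (tlb->fullmm || tlb->need_flush_all) {
		flush_tlb_mm(tlb->mm);
	} else if (tlb->end) {
		/* Rebuild a VMA that carries the recorded flags. */
		struct vm_area_struct vma = {
			.vm_mm    = tlb->mm,
			.vm_flags = (tlb->vma_exec ? VM_EXEC    : 0) |
				    (tlb->vma_huge ? VM_HUGETLB : 0),
		};

		flush_tlb_range(&vma, tlb->start, tlb->end);
	}
}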

Looking at the code, we can see MIPS’s flow of:

tlb_flush()
->flush_tlb_range()
->local_flush_tlb_page()
->flush_micro_tlb_vm()

which calls flush_micro_tlb() only if (vma->vm_flags & VM_EXEC). So
MIPS cares about VM_EXEC. Yet, it is not certain that MIPS is
affected.
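
For concreteness, that last helper is short (approximately, from
arch/mips/mm/tlb-r4k.c):

static void flush_micro_tlb_vm(struct vm_area_struct *vma)
{
	/* The micro-TLB is flushed only for executable mappings. */
	if (vma->vm_flags & VM_EXEC)
		flush_micro_tlb();
}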

For an architecture to be affected it needs to have all the following
properties:

1. Be able to experience hardware failure/exceptions on speculative
   page-walks.

2. Have separate page-walk caches for exec/data or huge/small pages.

3. Flush the TLBs based on VM_EXEC and VM_HUGETLB.

From the code, MIPS has the 3rd property and presumably the 2nd, but it is
not certain it has the 1st.

I did not mark it as stable, since I am not sure such an architecture
exists.



[1] https://lore.kernel.org/lkml/tip-b956575bed91ecfb136a8300742ecbbf451471ab@git.kernel.org/
Hugh Dickins Nov. 1, 2021, 7:28 a.m. UTC | #3
On Thu, 21 Oct 2021, Nadav Amit wrote:

> From: Nadav Amit <namit@vmware.com>
> 
> Consistent use of the mmu_gather interface requires a call to
> tlb_start_vma() and tlb_end_vma() for each VMA. free_pgtables() does not
> follow this pattern.
> 
> Certain architectures need tlb_start_vma() to be called in order for
> tlb_update_vma_flags() to update the VMA flags (tlb->vma_exec and
> tlb->vma_huge), which are later used for the proper TLB flush to be
> issued. Since tlb_start_vma() is not called, this can lead to the wrong
> VMA flags being used when the flush is performed.
> 
> Specifically, the munmap syscall would call unmap_region(), which unmaps
> the VMAs and then frees the page-tables. A flush is needed after
> the page-tables are removed to prevent page-walk caches from holding
> stale entries, but this flush would use the VMA flags of the last VMA
> that was flushed. This does not appear to be right.
> 
> Use tlb_start_vma() and tlb_end_vma() to prevent this from happening.
> This might lead to unnecessary calls to flush_cache_range() on certain
> arch's. If needed, a new flag can be added to mmu_gather to indicate
> that the flush is not needed.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Cc: x86@kernel.org
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  mm/memory.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 12a7b2094434..056fbfdd3c1f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -412,6 +412,8 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		unlink_anon_vmas(vma);
>  		unlink_file_vma(vma);
>  
> +		tlb_start_vma(tlb, vma);
> +
>  		if (is_vm_hugetlb_page(vma)) {
>  			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
>  				floor, next ? next->vm_start : ceiling);
> @@ -429,6 +431,8 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  			free_pgd_range(tlb, addr, vma->vm_end,
>  				floor, next ? next->vm_start : ceiling);
>  		}
> +
> +		tlb_end_vma(tlb, vma);
>  		vma = next;
>  	}
>  }
> -- 
> 2.25.1

No.

This is an experiment to see whether reviewers look at a wider context
than is seen in the patch itself?  Let's take a look:

		tlb_start_vma(tlb, vma);

		if (is_vm_hugetlb_page(vma)) {
			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
				floor, next ? next->vm_start : ceiling);
		} else {
			/*
			 * Optimization: gather nearby vmas into one call down
			 */
			while (next && next->vm_start <= vma->vm_end + PMD_SIZE
			       && !is_vm_hugetlb_page(next)) {
				vma = next;
				next = vma->vm_next;
				unlink_anon_vmas(vma);
				unlink_file_vma(vma);
			}
			free_pgd_range(tlb, addr, vma->vm_end,
				floor, next ? next->vm_start : ceiling);
		}

		tlb_end_vma(tlb, vma);
		vma = next;

So, the vma may well have changed in between the new tlb_start_vma()
and tlb_end_vma(): which defeats the intent of the patch.

And I doubt that optimization should be dropped to suit the patch:
you admit to limited understanding of those architectures which need
tlb_start_vma(), me too; but they seem to have survived many years
without it there in free_pgtables(), and I think that tlb_start_vma()
is for when freeing pages, not for when freeing page tables.  Surely
all architectures have to accommodate the fact that a single page
table can be occupied by many different kinds of vma.

(Sorry, I'm being totally unresponsive at present, needing to focus
on something else; but thought I'd better get these comments in before
mmotm's mm-use-correct-vma-flags-when-freeing-page-tables.patch goes to 5.16.)

Hugh
Nadav Amit Nov. 1, 2021, 3:45 p.m. UTC | #4
> On Nov 1, 2021, at 12:28 AM, Hugh Dickins <hughd@google.com> wrote:
> 
> On Thu, 21 Oct 2021, Nadav Amit wrote:
> 
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Consistent use of the mmu_gather interface requires a call to
>> tlb_start_vma() and tlb_end_vma() for each VMA. free_pgtables() does not
>> follow this pattern.
>> 
>> Certain architectures need tlb_start_vma() to be called in order for
>> tlb_update_vma_flags() to update the VMA flags (tlb->vma_exec and
>> tlb->vma_huge), which are later used for the proper TLB flush to be
>> issued. Since tlb_start_vma() is not called, this can lead to the wrong
>> VMA flags being used when the flush is performed.
>> 
>> Specifically, the munmap syscall would call unmap_region(), which unmaps
>> the VMAs and then frees the page-tables. A flush is needed after
>> the page-tables are removed to prevent page-walk caches from holding
>> stale entries, but this flush would use the VMA flags of the last VMA
>> that was flushed. This does not appear to be right.
>> 
>> Use tlb_start_vma() and tlb_end_vma() to prevent this from happening.
>> This might lead to unnecessary calls to flush_cache_range() on certain
>> arch's. If needed, a new flag can be added to mmu_gather to indicate
>> that the flush is not needed.
>> 
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Yu Zhao <yuzhao@google.com>
>> Cc: Nick Piggin <npiggin@gmail.com>
>> Cc: x86@kernel.org
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>> mm/memory.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>> 
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 12a7b2094434..056fbfdd3c1f 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -412,6 +412,8 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> 		unlink_anon_vmas(vma);
>> 		unlink_file_vma(vma);
>> 
>> +		tlb_start_vma(tlb, vma);
>> +
>> 		if (is_vm_hugetlb_page(vma)) {
>> 			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
>> 				floor, next ? next->vm_start : ceiling);
>> @@ -429,6 +431,8 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> 			free_pgd_range(tlb, addr, vma->vm_end,
>> 				floor, next ? next->vm_start : ceiling);
>> 		}
>> +
>> +		tlb_end_vma(tlb, vma);
>> 		vma = next;
>> 	}
>> }
>> -- 
>> 2.25.1
> 
> No.
> 
> This is an experiment to see whether reviewers look at a wider context
> than is seen in the patch itself?  Let's take a look:
> 
> 		tlb_start_vma(tlb, vma);
> 
> 		if (is_vm_hugetlb_page(vma)) {
> 			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
> 				floor, next ? next->vm_start : ceiling);
> 		} else {
> 			/*
> 			 * Optimization: gather nearby vmas into one call down
> 			 */
> 			while (next && next->vm_start <= vma->vm_end + PMD_SIZE
> 			       && !is_vm_hugetlb_page(next)) {
> 				vma = next;
> 				next = vma->vm_next;
> 				unlink_anon_vmas(vma);
> 				unlink_file_vma(vma);
> 			}
> 			free_pgd_range(tlb, addr, vma->vm_end,
> 				floor, next ? next->vm_start : ceiling);
> 		}
> 
> 		tlb_end_vma(tlb, vma);
> 		vma = next;
> 
> So, the vma may well have changed in between the new tlb_start_vma()
> and tlb_end_vma(): which defeats the intent of the patch.

Indeed, I made an embarrassing mistake. Having said that, I do not
understand the experiment and whether I conducted it or was the
object of it.

> 
> And I doubt that optimization should be dropped to suit the patch:
> you admit to limited understanding of those architectures which need
> tlb_start_vma(), me too; but they seem to have survived many years
> without it there in free_pgtables(), and I think that tlb_start_vma()
> is for when freeing pages, not for when freeing page tables.  Surely
> all architectures have to accommodate the fact that a single page
> table can be occupied by many different kinds of vma.

When it comes to TLB flushing, the assumption that if something is
in the code for many years it must be working is questionable, and
I have already encountered (and fixed) many such cases before.

As I mentioned before, a flush when freeing page-tables is needed to
invalidate the page-walk caches after the page-tables are dropped,
in case a speculative page-walk takes place. I can post v2 and fix my
embarrassing bug. I am not going to force this patch - I just
encountered the issue while modifying a different piece of code, and
the behavior seems very inconsistent.
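
For illustration, a hypothetical v2 could look like the sketch below
(only a sketch, not a posted patch): keep the gathering optimization,
but refresh the recorded flags for every VMA the inner loop consumes,
so the flags used at flush time match the VMAs actually covered:

	tlb_start_vma(tlb, vma);

	if (is_vm_hugetlb_page(vma)) {
		hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
			floor, next ? next->vm_start : ceiling);
	} else {
		/*
		 * Optimization: gather nearby vmas into one call down
		 */
		while (next && next->vm_start <= vma->vm_end + PMD_SIZE
		       && !is_vm_hugetlb_page(next)) {
			vma = next;
			next = vma->vm_next;
			unlink_anon_vmas(vma);
			unlink_file_vma(vma);
			/* Hypothetical: track the VMA actually gathered. */
			tlb_update_vma_flags(tlb, vma);
		}
		free_pgd_range(tlb, addr, vma->vm_end,
			floor, next ? next->vm_start : ceiling);
	}

	tlb_end_vma(tlb, vma);
	vma = next;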

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 12a7b2094434..056fbfdd3c1f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -412,6 +412,8 @@  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		unlink_anon_vmas(vma);
 		unlink_file_vma(vma);
 
+		tlb_start_vma(tlb, vma);
+
 		if (is_vm_hugetlb_page(vma)) {
 			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next ? next->vm_start : ceiling);
@@ -429,6 +431,8 @@  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next ? next->vm_start : ceiling);
 		}
+
+		tlb_end_vma(tlb, vma);
 		vma = next;
 	}
 }