diff mbox series

[1/4] mm: Export flush_vm_area() to sync the PTEs upon construction

Message ID 20200821085011.28878-1-chris@chris-wilson.co.uk (mailing list archive)
State New, archived

Commit Message

Chris Wilson Aug. 21, 2020, 8:50 a.m. UTC
alloc_vm_area() is another way for drivers to build a kernel mapping,
alongside vmap() and map_kernel_range(), but it uses apply_to_page_range()
rather than the direct vmalloc walkers. That path lacks the page-table
modification tracking, and with it the ability to synchronize the PTE
updates afterwards. Provide flush_vm_area() for users of alloc_vm_area();
it assumes the worst and ensures that the page directories are correctly
flushed upon construction.

The impact is most pronounced on x86_32 due to the delayed set_pmd().

Reported-by: Pavel Machek <pavel@ucw.cz>
References: 2ba3e6947aed ("mm/vmalloc: track which page-table levels were modified")
References: 86cf69f1d893 ("x86/mm/32: implement arch_sync_kernel_mappings()")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: <stable@vger.kernel.org> # v5.8+
---
 include/linux/vmalloc.h |  1 +
 mm/vmalloc.c            | 16 ++++++++++++++++
 2 files changed, 17 insertions(+)

Comments

Joerg Roedel Aug. 21, 2020, 9:51 a.m. UTC | #1
On Fri, Aug 21, 2020 at 09:50:08AM +0100, Chris Wilson wrote:
> The alloc_vm_area() is another method for drivers to
> vmap/map_kernel_range that uses apply_to_page_range() rather than the
> direct vmalloc walkers. This is missing the page table modification
> tracking, and the ability to synchronize the PTE updates afterwards.
> Provide flush_vm_area() for the users of alloc_vm_area() that assumes
> the worst and ensures that the page directories are correctly flushed
> upon construction.
> 
> The impact is most pronounced on x86_32 due to the delayed set_pmd().
> 
> Reported-by: Pavel Machek <pavel@ucw.cz>
> References: 2ba3e6947aed ("mm/vmalloc: track which page-table levels were modified")
> References: 86cf69f1d893 ("x86/mm/32: implement arch_sync_kernel_mappings()")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Joerg Roedel <jroedel@suse.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Pavel Machek <pavel@ucw.cz>
> Cc: David Vrabel <david.vrabel@citrix.com>
> Cc: <stable@vger.kernel.org> # v5.8+
> ---
>  include/linux/vmalloc.h |  1 +
>  mm/vmalloc.c            | 16 ++++++++++++++++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 0221f852a7e1..a253b27df0ac 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -204,6 +204,7 @@ static inline void set_vm_flush_reset_perms(void *addr)
>  
>  /* Allocate/destroy a 'vmalloc' VM area. */
>  extern struct vm_struct *alloc_vm_area(size_t size, pte_t **ptes);
> +extern void flush_vm_area(struct vm_struct *area);
>  extern void free_vm_area(struct vm_struct *area);
>  
>  /* for /dev/kmem */
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index b482d240f9a2..c41934486031 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3078,6 +3078,22 @@ struct vm_struct *alloc_vm_area(size_t size, pte_t **ptes)
>  }
>  EXPORT_SYMBOL_GPL(alloc_vm_area);
>  
> +void flush_vm_area(struct vm_struct *area)
> +{
> +	unsigned long addr = (unsigned long)area->addr;
> +
> +	/* apply_to_page_range() doesn't track the damage, assume the worst */
> +	if (ARCH_PAGE_TABLE_SYNC_MASK & (PGTBL_PTE_MODIFIED |
> +					 PGTBL_PMD_MODIFIED |
> +					 PGTBL_PUD_MODIFIED |
> +					 PGTBL_P4D_MODIFIED |
> +					 PGTBL_PGD_MODIFIED))
> +		arch_sync_kernel_mappings(addr, addr + area->size);

This should happen in __apply_to_page_range() directly and look like
this:

	if (ARCH_PAGE_TABLE_SYNC_MASK && create)
		arch_sync_kernel_mappings(addr, addr + size);

Or even better, track whether something had to be allocated in the
__apply_to_page_range() path and check for:

	if (ARCH_PAGE_TABLE_SYNC_MASK & mask)
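
The tracking idea suggested above can be illustrated with a small userspace model. This is only a sketch and not kernel code: every `demo_`/`DEMO_` name is hypothetical, the bit values are illustrative, and the two-level table merely stands in for the real page-table levels. It shows a walker that records which levels it had to populate and syncs only when that mask intersects the set of levels the architecture cares about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel's PGTBL_*_MODIFIED bits. */
#define DEMO_PTE_MODIFIED (1u << 0)
#define DEMO_PMD_MODIFIED (1u << 1)

/* Hypothetical stand-in for ARCH_PAGE_TABLE_SYNC_MASK: sync PMD changes only. */
#define DEMO_SYNC_MASK DEMO_PMD_MODIFIED

#define DEMO_LEAVES   4 /* directory slots */
#define DEMO_PER_LEAF 8 /* entries per leaf table */

struct demo_pgtable {
	unsigned long *leaf[DEMO_LEAVES]; /* NULL until a leaf table exists */
	unsigned int sync_calls;          /* number of simulated syncs */
};

/* Stand-in for arch_sync_kernel_mappings(); it only records the call. */
static void demo_sync(struct demo_pgtable *pt)
{
	pt->sync_calls++;
}

/*
 * Store val in entries [start, end). With create set, missing leaf tables
 * are allocated and the allocation is recorded in the mask. Returns 0 on
 * success, -1 on a bad range, a missing table, or allocation failure.
 */
static int demo_apply_range(struct demo_pgtable *pt, size_t start, size_t end,
			    unsigned long val, bool create)
{
	unsigned int mask = 0;
	int err = 0;

	assert(pt != NULL);
	if (start > end || end > DEMO_LEAVES * DEMO_PER_LEAF)
		return -1;

	for (size_t i = start; i < end; i++) {
		size_t dir = i / DEMO_PER_LEAF;

		if (!pt->leaf[dir]) {
			if (!create) {
				err = -1;
				break;
			}
			pt->leaf[dir] = calloc(DEMO_PER_LEAF, sizeof(unsigned long));
			if (!pt->leaf[dir]) {
				err = -1;
				break;
			}
			mask |= DEMO_PMD_MODIFIED; /* a directory slot was filled */
		}
		pt->leaf[dir][i % DEMO_PER_LEAF] = val;
		mask |= DEMO_PTE_MODIFIED;
	}

	/* Sync only when a level this "arch" cares about was touched. */
	if (DEMO_SYNC_MASK & mask)
		demo_sync(pt);

	return err;
}
```

In this model, the first call that has to allocate a leaf table triggers one sync, while later writes into the same, already-populated table do not. That is the saving over unconditionally assuming the worst.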
Chris Wilson Aug. 21, 2020, 9:54 a.m. UTC | #2
Quoting Joerg Roedel (2020-08-21 10:51:29)
> On Fri, Aug 21, 2020 at 09:50:08AM +0100, Chris Wilson wrote:
> > The alloc_vm_area() is another method for drivers to
> > vmap/map_kernel_range that uses apply_to_page_range() rather than the
> > direct vmalloc walkers. This is missing the page table modification
> > tracking, and the ability to synchronize the PTE updates afterwards.
> > Provide flush_vm_area() for the users of alloc_vm_area() that assumes
> > the worst and ensures that the page directories are correctly flushed
> > upon construction.
> > 
> > The impact is most pronounced on x86_32 due to the delayed set_pmd().
> > 
> > Reported-by: Pavel Machek <pavel@ucw.cz>
> > References: 2ba3e6947aed ("mm/vmalloc: track which page-table levels were modified")
> > References: 86cf69f1d893 ("x86/mm/32: implement arch_sync_kernel_mappings()")
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Joerg Roedel <jroedel@suse.de>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Dave Airlie <airlied@redhat.com>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > Cc: Pavel Machek <pavel@ucw.cz>
> > Cc: David Vrabel <david.vrabel@citrix.com>
> > Cc: <stable@vger.kernel.org> # v5.8+
> > ---
> >  include/linux/vmalloc.h |  1 +
> >  mm/vmalloc.c            | 16 ++++++++++++++++
> >  2 files changed, 17 insertions(+)
> > 
> > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > index 0221f852a7e1..a253b27df0ac 100644
> > --- a/include/linux/vmalloc.h
> > +++ b/include/linux/vmalloc.h
> > @@ -204,6 +204,7 @@ static inline void set_vm_flush_reset_perms(void *addr)
> >  
> >  /* Allocate/destroy a 'vmalloc' VM area. */
> >  extern struct vm_struct *alloc_vm_area(size_t size, pte_t **ptes);
> > +extern void flush_vm_area(struct vm_struct *area);
> >  extern void free_vm_area(struct vm_struct *area);
> >  
> >  /* for /dev/kmem */
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index b482d240f9a2..c41934486031 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3078,6 +3078,22 @@ struct vm_struct *alloc_vm_area(size_t size, pte_t **ptes)
> >  }
> >  EXPORT_SYMBOL_GPL(alloc_vm_area);
> >  
> > +void flush_vm_area(struct vm_struct *area)
> > +{
> > +     unsigned long addr = (unsigned long)area->addr;
> > +
> > +     /* apply_to_page_range() doesn't track the damage, assume the worst */
> > +     if (ARCH_PAGE_TABLE_SYNC_MASK & (PGTBL_PTE_MODIFIED |
> > +                                      PGTBL_PMD_MODIFIED |
> > +                                      PGTBL_PUD_MODIFIED |
> > +                                      PGTBL_P4D_MODIFIED |
> > +                                      PGTBL_PGD_MODIFIED))
> > +             arch_sync_kernel_mappings(addr, addr + area->size);
> 
> This should happen in __apply_to_page_range() directly and look like
> this:

Ok. I thought it had to be after assigning the *ptep. If we apply the
sync first, do we not have to worry about PGTBL_PTE_MODIFIED from the
*ptep?
-Chris
Joerg Roedel Aug. 21, 2020, 10:22 a.m. UTC | #3
On Fri, Aug 21, 2020 at 10:54:22AM +0100, Chris Wilson wrote:
> Ok. I thought it had to be after assigning the *ptep. If we apply the
> sync first, do we not have to worry about PGTBL_PTE_MODIFIED from the
> *ptep?

Hmm, if I understand the code correctly, you are re-implementing some
generic ioremap/vmalloc mapping logic in the i915 driver. I don't know
the reason, but if it is valid you need to manually call
arch_sync_kernel_mappings() from your driver like this to be correct:

	if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PTE_MODIFIED)
		arch_sync_kernel_mappings();

In practice this is a no-op, because nobody sets PGTBL_PTE_MODIFIED in
ARCH_PAGE_TABLE_SYNC_MASK, so the above code would be optimized away.
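
Why such a check folds away can be shown with a userspace sketch. All `toy_`/`TOY_` names below are hypothetical and the bit values are illustrative; the only assumption carried over from the discussion is that the architecture's sync mask does not contain the PTE bit. Both the mask and the tested bit are compile-time constants, so the PTE-only condition is constant false, while an "assume the worst" mask that includes every level still reaches the sync.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's PGTBL_*_MODIFIED bits. */
#define TOY_PTE_MODIFIED (1u << 0)
#define TOY_PMD_MODIFIED (1u << 1)
#define TOY_ALL_MODIFIED (TOY_PTE_MODIFIED | TOY_PMD_MODIFIED)

/* Hypothetical stand-in for ARCH_PAGE_TABLE_SYNC_MASK (no PTE bit). */
#define TOY_SYNC_MASK TOY_PMD_MODIFIED

/* The PTE-only test is a constant zero, so compilers can drop the branch. */
static_assert((TOY_SYNC_MASK & TOY_PTE_MODIFIED) == 0,
	      "PTE bit is not part of the toy sync mask");

static unsigned int toy_sync_calls;
static unsigned long toy_last_start, toy_last_end;

/* Stand-in for arch_sync_kernel_mappings(start, end); records its arguments. */
static void toy_sync(unsigned long start, unsigned long end)
{
	toy_sync_calls++;
	toy_last_start = start;
	toy_last_end = end;
}

/* Checks only the PTE bit: never syncs with the mask defined above. */
static void toy_flush_pte_only(unsigned long addr, unsigned long size)
{
	if (TOY_SYNC_MASK & TOY_PTE_MODIFIED)
		toy_sync(addr, addr + size);
}

/* Assumes every level was modified: syncs whenever the mask is non-zero. */
static void toy_flush_assume_worst(unsigned long addr, unsigned long size)
{
	if (TOY_SYNC_MASK & TOY_ALL_MODIFIED)
		toy_sync(addr, addr + size);
}
```

Calling `toy_flush_pte_only()` leaves `toy_sync_calls` untouched, whereas `toy_flush_assume_worst()` increments it and records the range `[addr, addr + size)`.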

But what you really care about is the tracking in apply_to_page_range(),
as that allocates the !pte levels of your page-table, which needs
synchronization on x86-32.

Btw, what are the reasons you can't use generic vmalloc/ioremap
interfaces to map the range?

Regards,

	Joerg
Chris Wilson Aug. 21, 2020, 10:36 a.m. UTC | #4
Quoting Joerg Roedel (2020-08-21 11:22:04)
> On Fri, Aug 21, 2020 at 10:54:22AM +0100, Chris Wilson wrote:
> > Ok. I thought it had to be after assigning the *ptep. If we apply the
> > sync first, do we not have to worry about PGTBL_PTE_MODIFIED from the
> > *ptep?
> 
> Hmm, if I understand the code correctly, you are re-implementing some
> generic ioremap/vmalloc mapping logic in the i915 driver. I don't know
> the reason, but if it is valid you need to manually call
> arch_sync_kernel_mappings() from your driver like this to be correct:
> 
>         if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PTE_MODIFIED)
>                 arch_sync_kernel_mappings();
> 
> In practice this is a no-op, because nobody sets PGTBL_PTE_MODIFIED in
> ARCH_PAGE_TABLE_SYNC_MASK, so the above code would be optimized away.
> 
> But what you really care about is the tracking in apply_to_page_range(),
> as that allocates the !pte levels of your page-table, which needs
> synchronization on x86-32.
> 
> Btw, what are the reasons you can't use generic vmalloc/ioremap
> interfaces to map the range?

ioremap_prot and ioremap_page_range assume a contiguous IO address. So
we needed to allocate the vmalloc area [and would then need to iterate
over the discontiguous iomem chunks with ioremap_page_range], and since
alloc_vm_area returned the ptep, it looked clearer to then assign those
according to whether we wanted ioremapping or a plain page. So we ended
up with one call to the core to return us a vm_struct and a pte array
that worked for either backing store.
-Chris

Patch

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 0221f852a7e1..a253b27df0ac 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -204,6 +204,7 @@ static inline void set_vm_flush_reset_perms(void *addr)
 
 /* Allocate/destroy a 'vmalloc' VM area. */
 extern struct vm_struct *alloc_vm_area(size_t size, pte_t **ptes);
+extern void flush_vm_area(struct vm_struct *area);
 extern void free_vm_area(struct vm_struct *area);
 
 /* for /dev/kmem */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b482d240f9a2..c41934486031 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3078,6 +3078,22 @@ struct vm_struct *alloc_vm_area(size_t size, pte_t **ptes)
 }
 EXPORT_SYMBOL_GPL(alloc_vm_area);
 
+void flush_vm_area(struct vm_struct *area)
+{
+	unsigned long addr = (unsigned long)area->addr;
+
+	/* apply_to_page_range() doesn't track the damage, assume the worst */
+	if (ARCH_PAGE_TABLE_SYNC_MASK & (PGTBL_PTE_MODIFIED |
+					 PGTBL_PMD_MODIFIED |
+					 PGTBL_PUD_MODIFIED |
+					 PGTBL_P4D_MODIFIED |
+					 PGTBL_PGD_MODIFIED))
+		arch_sync_kernel_mappings(addr, addr + area->size);
+
+	flush_cache_vmap(addr, area->size);
+}
+EXPORT_SYMBOL_GPL(flush_vm_area);
+
 void free_vm_area(struct vm_struct *area)
 {
 	struct vm_struct *ret;