| Message ID | 20190717071439.14261-4-joro@8bytes.org (mailing list archive) |
|---|---|
| State | New, archived |
| Series | Sync unmappings in vmalloc/ioremap areas |
On Wed, Jul 17, 2019 at 12:14 AM Joerg Roedel <joro@8bytes.org> wrote:
>
> From: Joerg Roedel <jroedel@suse.de>
>
> On x86-32 with PTI enabled, parts of the kernel page-tables are not
> shared between processes. This can cause mappings in the
> vmalloc/ioremap area to persist in some page-tables after the region
> is unmapped and released.
>
> When the region is re-used, the processes with the old mappings do not
> fault in the new mappings but still access the old ones.
>
> This causes undefined behavior, in practice often data corruption,
> kernel oopses and panics, and even spontaneous reboots.
>
> Fix this problem by actively syncing unmaps in the vmalloc/ioremap
> area to all page-tables in the system.
>
> References: https://bugzilla.suse.com/show_bug.cgi?id=1118689
> Fixes: 5d72b4fba40ef ('x86, mm: support huge I/O mapping capability I/F')
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  mm/vmalloc.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 4fa8d84599b0..322b11a374fd 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -132,6 +132,8 @@ static void vunmap_page_range(unsigned long addr, unsigned long end)
>  			continue;
>  		vunmap_p4d_range(pgd, addr, next);
>  	} while (pgd++, addr = next, addr != end);
> +
> +	vmalloc_sync_all();
>  }

I'm confused. Shouldn't the code in _vm_unmap_aliases() handle this? As
it stands, won't your patch hurt performance on x86_64? If x86_32 is a
special snowflake here, maybe flush_tlb_kernel_range() should handle
this?

Even if your patch is correct, a comment would be nice.
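For reference, the x86-32 side of vmalloc_sync_all() that this patch
relies on walks the vmalloc range one PMD at a time and propagates the
init_mm entries into every page-table on the pgd_list. The following is
a simplified sketch, reconstructed from the arch/x86/mm/fault.c of this
era; the exact loop bounds and the Xen-specific pgt_lock handling are
trimmed, so treat it as approximate rather than as the code under
discussion:

```c
/*
 * Simplified sketch of the 32-bit vmalloc_sync_all() (after
 * arch/x86/mm/fault.c, circa v5.2). For every PMD-sized slot of the
 * vmalloc range, propagate the init_mm entry into each PGD in the
 * system.
 */
void vmalloc_sync_all(void)
{
	unsigned long address;

	if (SHARED_KERNEL_PMD)
		return;		/* kernel PMDs are shared: nothing to sync */

	for (address = VMALLOC_START & PMD_MASK;
	     address >= TASK_SIZE_MAX && address < FIXADDR_TOP;
	     address += PMD_SIZE) {
		struct page *page;

		spin_lock(&pgd_lock);
		list_for_each_entry(page, &pgd_list, lru) {
			/* Copy the init_mm PMD entry into this PGD. */
			vmalloc_sync_one(page_address(page), address);
		}
		spin_unlock(&pgd_lock);
	}
}
```

With the hunk above, this full walk runs on every vunmap_page_range()
call, which is what prompts the performance question.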
Hi Andy,

On Wed, Jul 17, 2019 at 02:24:09PM -0700, Andy Lutomirski wrote:
> On Wed, Jul 17, 2019 at 12:14 AM Joerg Roedel <joro@8bytes.org> wrote:
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 4fa8d84599b0..322b11a374fd 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -132,6 +132,8 @@ static void vunmap_page_range(unsigned long addr, unsigned long end)
> >  			continue;
> >  		vunmap_p4d_range(pgd, addr, next);
> >  	} while (pgd++, addr = next, addr != end);
> > +
> > +	vmalloc_sync_all();
> >  }
>
> I'm confused. Shouldn't the code in _vm_unmap_aliases() handle this?
> As it stands, won't your patch hurt performance on x86_64? If x86_32
> is a special snowflake here, maybe flush_tlb_kernel_range() should
> handle this?

Imo this is the logical place to handle this. The code first unmaps the
area from the init_mm page-table and then syncs that page-table to all
other page-tables in the system, so there is one single place where the
page-tables get updated.

Performance-wise it makes no difference whether we put this into
_vm_unmap_aliases() instead, as that is called in the vunmap path too.
But it is true that vunmap/iounmap performance on x86-64 will decrease
to some degree. If that is a problem for some workloads, I can also
implement a completely separate code-path which only syncs unmappings
and is only implemented for x86-32 with !SHARED_KERNEL_PMD.

Regards,

	Joerg
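Such a separate code-path could be shaped so that it collapses to a
no-op everywhere except on x86-32 kernels with unshared kernel PMDs. A
hypothetical sketch of that shape; the hook name
vmalloc_sync_unmappings() and the weak-default pattern are illustrative
assumptions, not something posted in this series:

```c
/*
 * Hypothetical sketch, not from the posted series.
 *
 * Generic side (mm/vmalloc.c): a weak no-op, so architectures that
 * share kernel page-tables pay nothing on the unmap path.
 */
void __weak vmalloc_sync_unmappings(void)
{
}

/*
 * x86-32 override (arch/x86/mm/fault.c): only unshared kernel PMDs
 * need the unmaps propagated to the other page-tables.
 */
void vmalloc_sync_unmappings(void)
{
	if (SHARED_KERNEL_PMD)
		return;		/* unmaps are already visible everywhere */

	/* Propagate cleared PMDs in the vmalloc range to all PGDs. */
	vmalloc_sync_all();
}
```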
On Thu, Jul 18, 2019 at 2:17 AM Joerg Roedel <jroedel@suse.de> wrote:
>
> Hi Andy,
>
> On Wed, Jul 17, 2019 at 02:24:09PM -0700, Andy Lutomirski wrote:
> > On Wed, Jul 17, 2019 at 12:14 AM Joerg Roedel <joro@8bytes.org> wrote:
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 4fa8d84599b0..322b11a374fd 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -132,6 +132,8 @@ static void vunmap_page_range(unsigned long addr, unsigned long end)
> > >  			continue;
> > >  		vunmap_p4d_range(pgd, addr, next);
> > >  	} while (pgd++, addr = next, addr != end);
> > > +
> > > +	vmalloc_sync_all();
> > >  }
> >
> > I'm confused. Shouldn't the code in _vm_unmap_aliases() handle this?
> > As it stands, won't your patch hurt performance on x86_64? If x86_32
> > is a special snowflake here, maybe flush_tlb_kernel_range() should
> > handle this?
>
> Imo this is the logical place to handle this. The code first unmaps the
> area from the init_mm page-table and then syncs that page-table to all
> other page-tables in the system, so there is one single place where the
> page-tables get updated.

I find it problematic that there is no meaningful documentation as to
what vmalloc_sync_all() is supposed to do. The closest I can find, by
following the x86_64 code (which calls sync_global_pgds()), is this
comment:

/*
 * When memory was added make sure all the processes MM have
 * suitable PGD entries in the local PGD level page.
 */
void sync_global_pgds(unsigned long start, unsigned long end)
{

Which is obviously entirely inapplicable. If I'm understanding
correctly, the underlying issue here is that the vmalloc fault
mechanism can propagate PGD entry *addition*, but nothing (not even
flush_tlb_kernel_range()) propagates PGD entry *removal*.

I find it suspicious that only x86 has this. How do other architectures
handle this?

At the very least, I think this series needs a comment in
vmalloc_sync_all() explaining exactly what the function promises to do.
But maybe a better fix is to add code to flush_tlb_kernel_range() to
sync the vmalloc area if the flushed range overlaps the vmalloc area.
Or, even better, improve x86_32 the way we did x86_64: adjust the
memory mapping code such that top-level paging entries are never
deleted in the first place.
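The addition-only propagation Andy refers to is the 32-bit
vmalloc_fault() handler: when a process takes a kernel fault inside the
vmalloc area, the missing entry is lazily copied from init_mm into the
faulting page-table. A simplified sketch, after the arch/x86/mm/fault.c
of this era (details approximate):

```c
/*
 * Simplified sketch of the 32-bit vmalloc_fault() (after
 * arch/x86/mm/fault.c, circa v5.2). It can install an entry that
 * appeared in init_mm, but nothing ever runs to remove an entry that
 * init_mm has since cleared -- which is the bug in this thread.
 */
static noinline int vmalloc_fault(unsigned long address)
{
	unsigned long pgd_paddr;
	pmd_t *pmd_k;
	pte_t *pte_k;

	/* Only handle faults in the vmalloc area. */
	if (!(address >= VMALLOC_START && address < VMALLOC_END))
		return -1;

	/* Copy the missing PMD entry from init_mm into this page-table. */
	pgd_paddr = read_cr3_pa();
	pmd_k = vmalloc_sync_one(__va(pgd_paddr), address);
	if (!pmd_k)
		return -1;

	if (pmd_large(*pmd_k))
		return 0;

	pte_k = pte_offset_kernel(pmd_k, address);
	if (!pte_present(*pte_k))
		return -1;

	return 0;
}
```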
On Thu, Jul 18, 2019 at 12:04:49PM -0700, Andy Lutomirski wrote:
> I find it problematic that there is no meaningful documentation as to
> what vmalloc_sync_all() is supposed to do.

Yeah, I found that too; there is no real design around
vmalloc_sync_all(). It looks like it was just added to fit the purpose
on x86-32. That also makes it hard to find all necessary call-sites.

> Which is obviously entirely inapplicable. If I'm understanding
> correctly, the underlying issue here is that the vmalloc fault
> mechanism can propagate PGD entry *addition*, but nothing (not even
> flush_tlb_kernel_range()) propagates PGD entry *removal*.

Close, but the underlying issue is not about PGD entries; it is about
PMD entry addition/removal on x86-32 PAE systems.

> I find it suspicious that only x86 has this. How do other
> architectures handle this?

The problem on x86-PAE arises from the !SHARED_KERNEL_PMD case, which
was introduced by the Xen-PV patches and then re-used for the x86-32
PTI enablement to be able to map the LDT into user-space at a fixed
address.

Other architectures probably don't have a !SHARED_KERNEL_PMD case (or
they don't unshare kernel page-tables at any level where a huge page
could be mapped).

> At the very least, I think this series needs a comment in
> vmalloc_sync_all() explaining exactly what the function promises to
> do.

Okay. As it stands, it promises to sync mappings for the vmalloc area
between all PGDs in the system. I will add that as a comment.

> But maybe a better fix is to add code to flush_tlb_kernel_range()
> to sync the vmalloc area if the flushed range overlaps the vmalloc
> area.

That would also cause needless overhead on x86-64, because the vmalloc
area doesn't need syncing there. I could make it x86-32 only, but that
is not a clean solution imo.

> Or, even better, improve x86_32 the way we did x86_64: adjust
> the memory mapping code such that top-level paging entries are never
> deleted in the first place.

There is not enough address space on x86-32 to partition it like on
x86-64. In the default PAE configuration there are _four_ PGD entries,
usually one for the kernel, and each of them points to 512 PMD entries.
Partitioning therefore happens at the PMD level; for example, there is
one PMD entry (2MB of address space) reserved for the user-space LDT
mapping.

Regards,

	Joerg
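For concreteness, the PAE layout described above works out as follows;
the constant names below are illustrative, not kernel definitions:

```c
/*
 * Worked example of the default x86-32 PAE split (figures from the
 * mail above, names made up for illustration):
 *
 *   4 GiB virtual address space = 4 PGD entries x 1 GiB
 *   1 GiB per PGD entry         = 512 PMD entries x 2 MiB
 *
 * With only four top-level entries there is nothing to partition at
 * the PGD level, so unsharing has to happen per 2 MiB PMD slot --
 * e.g. one slot for the fixed user-space LDT mapping under PTI.
 */
#define PAE_PGD_ENTRIES		4
#define PAE_PMDS_PER_PGD	512
#define PAE_PMD_SPAN		(1UL << 21)				/* 2 MiB */
#define PAE_PGD_SPAN		(PAE_PMDS_PER_PGD * PAE_PMD_SPAN)	/* 1 GiB */
```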
On Fri, Jul 19, 2019 at 5:21 AM Joerg Roedel <jroedel@suse.de> wrote:
>
> On Thu, Jul 18, 2019 at 12:04:49PM -0700, Andy Lutomirski wrote:
> > I find it problematic that there is no meaningful documentation as to
> > what vmalloc_sync_all() is supposed to do.
>
> Yeah, I found that too; there is no real design around
> vmalloc_sync_all(). It looks like it was just added to fit the purpose
> on x86-32. That also makes it hard to find all necessary call-sites.
>
> > Which is obviously entirely inapplicable. If I'm understanding
> > correctly, the underlying issue here is that the vmalloc fault
> > mechanism can propagate PGD entry *addition*, but nothing (not even
> > flush_tlb_kernel_range()) propagates PGD entry *removal*.
>
> Close, but the underlying issue is not about PGD entries; it is about
> PMD entry addition/removal on x86-32 PAE systems.
>
> > I find it suspicious that only x86 has this. How do other
> > architectures handle this?
>
> The problem on x86-PAE arises from the !SHARED_KERNEL_PMD case, which
> was introduced by the Xen-PV patches and then re-used for the x86-32
> PTI enablement to be able to map the LDT into user-space at a fixed
> address.
>
> Other architectures probably don't have a !SHARED_KERNEL_PMD case (or
> they don't unshare kernel page-tables at any level where a huge page
> could be mapped).
>
> > At the very least, I think this series needs a comment in
> > vmalloc_sync_all() explaining exactly what the function promises to
> > do.
>
> Okay. As it stands, it promises to sync mappings for the vmalloc area
> between all PGDs in the system. I will add that as a comment.
>
> > But maybe a better fix is to add code to flush_tlb_kernel_range()
> > to sync the vmalloc area if the flushed range overlaps the vmalloc
> > area.
>
> That would also cause needless overhead on x86-64, because the vmalloc
> area doesn't need syncing there. I could make it x86-32 only, but that
> is not a clean solution imo.

Could you move the vmalloc_sync_all() call to the lazy purge path,
though? If nothing else, it will cause it to be called fewer times
under any given workload, and it looks like it could be rather slow on
x86_32.

> > Or, even better, improve x86_32 the way we did x86_64: adjust
> > the memory mapping code such that top-level paging entries are never
> > deleted in the first place.
>
> There is not enough address space on x86-32 to partition it like on
> x86-64. In the default PAE configuration there are _four_ PGD entries,
> usually one for the kernel, and each of them points to 512 PMD entries.
> Partitioning therefore happens at the PMD level; for example, there is
> one PMD entry (2MB of address space) reserved for the user-space LDT
> mapping.

Ugh, fair enough.
On Fri, Jul 19, 2019 at 05:24:03AM -0700, Andy Lutomirski wrote:
> Could you move the vmalloc_sync_all() call to the lazy purge path,
> though? If nothing else, it will cause it to be called fewer times
> under any given workload, and it looks like it could be rather slow on
> x86_32.

Okay, I'll move it to __purge_vmap_area_lazy(). That looks like the
right place.

Thanks,

	Joerg
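The agreed placement would look roughly like the sketch below. This is
an illustration of the direction against the mm/vmalloc.c of this era,
not the posted follow-up diff; the purge internals are elided:

```c
/*
 * Sketch of the follow-up direction: sync once per lazy purge instead
 * of once per vunmap_page_range() call.
 */
static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
{
	/*
	 * Make sure the unmaps are propagated to all page-tables in the
	 * system before the underlying pages can be freed and re-used.
	 */
	vmalloc_sync_all();

	/*
	 * ... existing body: walk the purge list, then
	 * flush_tlb_kernel_range(start, end) and free the areas ...
	 */
	return true;
}
```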
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4fa8d84599b0..322b11a374fd 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -132,6 +132,8 @@ static void vunmap_page_range(unsigned long addr, unsigned long end)
 			continue;
 		vunmap_p4d_range(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
+
+	vmalloc_sync_all();
 }
 
 static int vmap_pte_range(pmd_t *pmd, unsigned long addr,