Message ID | a578ee9bb656234d3a19bf9e97c3012378d31a19.1727440966.git.lorenzo.stoakes@oracle.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | implement lightweight guard pages | expand |
Hi Lorenzo, Please add me to this series, I 'm interested in everything related to mseal :-), thanks. I also added Kees into the cc, since mseal is a security feature. On Fri, Sep 27, 2024 at 5:52 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > Implement a new lightweight guard page feature, that is regions of userland > virtual memory that, when accessed, cause a fatal signal to arise. > > Currently users must establish PROT_NONE ranges to achieve this. > > However this is very costly memory-wise - we need a VMA for each and every > one of these regions AND they become unmergeable with surrounding VMAs. > > In addition repeated mmap() calls require repeated kernel context switches > and contention of the mmap lock to install these ranges, potentially also > having to unmap memory if installed over existing ranges. > > The lightweight guard approach eliminates the VMA cost altogether - rather > than establishing a PROT_NONE VMA, it operates at the level of page table > entries - poisoning PTEs such that accesses to them cause a fault followed > by a SIGSGEV signal being raised. > > This is achieved through the PTE marker mechanism, which a previous commit > in this series extended to permit this to be done, installed via the > generic page walking logic, also extended by a prior commit for this > purpose. > > These poison ranges are established with MADV_GUARD_POISON, and if the > range in which they are installed contain any existing mappings, they will > be zapped, i.e. free the range and unmap memory (thus mimicking the > behaviour of MADV_DONTNEED in this respect). > > Any existing poison entries will be left untouched. There is no nesting of > poisoned pages. > > Poisoned ranges are NOT cleared by MADV_DONTNEED, as this would be rather > unexpected behaviour, but are cleared on process teardown or unmapping of > memory ranges. > > Ranges can have the poison property removed by MADV_GUARD_UNPOISON - > 'remedying' the poisoning. The ranges over which this is applied, should > they contain non-poison entries, will be untouched, only poison entries > will be cleared. > > We permit this operation on anonymous memory only, and only VMAs which are > non-special, non-huge and not mlock()'d (if we permitted this we'd have to > drop locked pages which would be rather counterintuitive). > > The poisoning of the range must be performed under mmap write lock as we > have to install an anon_vma to ensure correct behaviour on fork. > > Suggested-by: Vlastimil Babka <vbabka@suze.cz> > Suggested-by: Jann Horn <jannh@google.com> > Suggested-by: David Hildenbrand <david@redhat.com> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > --- > arch/alpha/include/uapi/asm/mman.h | 3 + > arch/mips/include/uapi/asm/mman.h | 3 + > arch/parisc/include/uapi/asm/mman.h | 3 + > arch/xtensa/include/uapi/asm/mman.h | 3 + > include/uapi/asm-generic/mman-common.h | 3 + > mm/madvise.c | 158 +++++++++++++++++++++++++ > mm/mprotect.c | 3 +- > mm/mseal.c | 1 + > 8 files changed, 176 insertions(+), 1 deletion(-) > > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h > index 763929e814e9..71e13f27742d 100644 > --- a/arch/alpha/include/uapi/asm/mman.h > +++ b/arch/alpha/include/uapi/asm/mman.h > @@ -78,6 +78,9 @@ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ > +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ > + > /* compatibility flags */ > #define MAP_FILE 0 > > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h > index 9c48d9a21aa0..1a2222322f77 100644 > --- a/arch/mips/include/uapi/asm/mman.h > +++ b/arch/mips/include/uapi/asm/mman.h > @@ -105,6 +105,9 @@ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ > +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ > + > /* compatibility flags */ > #define MAP_FILE 0 > > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h > index 68c44f99bc93..380905522397 100644 > --- a/arch/parisc/include/uapi/asm/mman.h > +++ b/arch/parisc/include/uapi/asm/mman.h > @@ -75,6 +75,9 @@ > #define MADV_HWPOISON 100 /* poison a page for testing */ > #define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */ > > +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ > +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ > + > /* compatibility flags */ > #define MAP_FILE 0 > > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h > index 1ff0c858544f..e8d5affceb28 100644 > --- a/arch/xtensa/include/uapi/asm/mman.h > +++ b/arch/xtensa/include/uapi/asm/mman.h > @@ -113,6 +113,9 @@ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ > +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ > + > /* compatibility flags */ > #define MAP_FILE 0 > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h > index 6ce1f1ceb432..5dfd3d442de4 100644 > --- a/include/uapi/asm-generic/mman-common.h > +++ b/include/uapi/asm-generic/mman-common.h > @@ -79,6 +79,9 @@ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ > +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ > + > /* compatibility flags */ > #define MAP_FILE 0 > > diff --git a/mm/madvise.c b/mm/madvise.c > index e871a72a6c32..7216e10723ae 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior) > case MADV_POPULATE_READ: > case MADV_POPULATE_WRITE: > case MADV_COLLAPSE: > + case MADV_GUARD_UNPOISON: /* Only poisoning needs a write lock. */ > return 0; > default: > /* be safe, default to 1. list exceptions explicitly */ > @@ -1017,6 +1018,157 @@ static long madvise_remove(struct vm_area_struct *vma, > return error; > } > > +static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked) > +{ > + vm_flags_t disallowed = VM_SPECIAL | VM_HUGETLB; > + > + /* > + * A user could lock after poisoning but that's fine, as they'd not be > + * able to fault in. The issue arises when we try to zap existing locked > + * VMAs. We don't want to do that. > + */ > + if (!allow_locked) > + disallowed |= VM_LOCKED; > + > + if (!vma_is_anonymous(vma)) > + return false; > + > + if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE) > + return false; > + > + return true; > +} > + > +static int guard_poison_install_pte(unsigned long addr, unsigned long next, > + pte_t *ptep, struct mm_walk *walk) > +{ > + unsigned long *num_installed = (unsigned long *)walk->private; > + > + (*num_installed)++; > + /* Simply install a PTE marker, this causes segfault on access. */ > + *ptep = make_pte_marker(PTE_MARKER_GUARD); > + > + return 0; > +} > + > +static bool is_guard_pte_marker(pte_t ptent) > +{ > + return is_pte_marker(ptent) && > + is_guard_swp_entry(pte_to_swp_entry(ptent)); > +} > + > +static int guard_poison_pte_entry(pte_t *pte, unsigned long addr, > + unsigned long next, struct mm_walk *walk) > +{ > + pte_t ptent = ptep_get(pte); > + > + /* > + * If not a guard marker, simply abort the operation. We return a value > + * > 0 indicating a non-error abort. > + */ > + return !is_guard_pte_marker(ptent); > +} > + > +static const struct mm_walk_ops guard_poison_walk_ops = { > + .install_pte = guard_poison_install_pte, > + .pte_entry = guard_poison_pte_entry, > + /* We might need to install an anon_vma. */ > + .walk_lock = PGWALK_WRLOCK, > +}; > + > +static long madvise_guard_poison(struct vm_area_struct *vma, > + struct vm_area_struct **prev, > + unsigned long start, unsigned long end) > +{ > + long err; > + bool retried = false; > + > + *prev = vma; > + if (!is_valid_guard_vma(vma, /* allow_locked = */false)) > + return -EINVAL; > + > + /* > + * Optimistically try to install the guard poison pages first. If any > + * non-guard pages are encountered, give up and zap the range before > + * trying again. > + */ > + while (true) { > + unsigned long num_installed = 0; > + > + /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */ > + err = walk_page_range_mm(vma->vm_mm, start, end, > + &guard_poison_walk_ops, > + &num_installed); > + /* > + * If we install poison markers, then the range is no longer > + * empty from a page table perspective and therefore it's > + * appropriate to have an anon_vma. > + * > + * This ensures that on fork, we copy page tables correctly. > + */ > + if (err >= 0 && num_installed > 0) { > + int err_anon = anon_vma_prepare(vma); > + > + if (err_anon) > + err = err_anon; > + } > + > + if (err <= 0) > + return err; > + > + if (!retried) > + /* > + * OK some of the range have non-guard pages mapped, zap > + * them. This leaves existing guard pages in place. > + */ > + zap_page_range_single(vma, start, end - start, NULL); > + else > + /* > + * If we reach here, then there is a racing fault that > + * has populated the PTE after we zapped. Give up and > + * let the user know to try again. > + */ > + return -EAGAIN; > + > + retried = true; > + } > +} > + > +static int guard_unpoison_pte_entry(pte_t *pte, unsigned long addr, > + unsigned long next, struct mm_walk *walk) > +{ > + pte_t ptent = ptep_get(pte); > + > + if (is_guard_pte_marker(ptent)) { > + /* Simply clear the PTE marker. */ > + pte_clear_not_present_full(walk->mm, addr, pte, true); > + update_mmu_cache(walk->vma, addr, pte); > + } > + > + return 0; > +} > + > +static const struct mm_walk_ops guard_unpoison_walk_ops = { > + .pte_entry = guard_unpoison_pte_entry, > + .walk_lock = PGWALK_RDLOCK, > +}; > + > +static long madvise_guard_unpoison(struct vm_area_struct *vma, > + struct vm_area_struct **prev, > + unsigned long start, unsigned long end) > +{ > + *prev = vma; > + /* > + * We're ok with unpoisoning mlock()'d ranges, as this is a > + * non-destructive action. > + */ > + if (!is_valid_guard_vma(vma, /* allow_locked = */true)) > + return -EINVAL; > + > + return walk_page_range(vma->vm_mm, start, end, > + &guard_unpoison_walk_ops, NULL); > +} > + > /* > * Apply an madvise behavior to a region of a vma. madvise_update_vma > * will handle splitting a vm area into separate areas, each area with its own > @@ -1098,6 +1250,10 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, > break; > case MADV_COLLAPSE: > return madvise_collapse(vma, prev, start, end); > + case MADV_GUARD_POISON: > + return madvise_guard_poison(vma, prev, start, end); > + case MADV_GUARD_UNPOISON: > + return madvise_guard_unpoison(vma, prev, start, end); > } > > anon_name = anon_vma_name(vma); > @@ -1197,6 +1353,8 @@ madvise_behavior_valid(int behavior) > case MADV_DODUMP: > case MADV_WIPEONFORK: > case MADV_KEEPONFORK: > + case MADV_GUARD_POISON: > + case MADV_GUARD_UNPOISON: > #ifdef CONFIG_MEMORY_FAILURE > case MADV_SOFT_OFFLINE: > case MADV_HWPOISON: > diff --git a/mm/mprotect.c b/mm/mprotect.c > index 0c5d6d06107d..d0e3ebfadef8 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -236,7 +236,8 @@ static long change_pte_range(struct mmu_gather *tlb, > } else if (is_pte_marker_entry(entry)) { > /* > * Ignore error swap entries unconditionally, > - * because any access should sigbus anyway. > + * because any access should sigbus/sigsegv > + * anyway. > */ > if (is_poisoned_swp_entry(entry)) > continue; > diff --git a/mm/mseal.c b/mm/mseal.c > index ece977bd21e1..21bf5534bcf5 100644 > --- a/mm/mseal.c > +++ b/mm/mseal.c > @@ -30,6 +30,7 @@ static bool is_madv_discard(int behavior) > case MADV_REMOVE: > case MADV_DONTFORK: > case MADV_WIPEONFORK: > + case MADV_GUARD_POISON: Can you please describe the rationale to add this to the existing mseal's semantic ? I didn't not find any description from the cover letter or this patch's description, hence asking. Thanks -Jeff > return true; > } > > -- > 2.46.2 > >
On Fri, Oct 04, 2024 at 11:17:13AM -0700, Jeff Xu wrote: > Hi Lorenzo, > > Please add me to this series, I 'm interested in everything related to > mseal :-), thanks. Hi Jeff, more than happy to cc you on this going forward :) The only change to mseal is a trivial change because the poison operation discards, wasn't intentional, but apologies, I should have cc'd you regardless! Will do so on any such interaction with mseal moving forward. > > I also added Kees into the cc, since mseal is a security feature. Sure no problem happy to keep Kees cc-d too (Kees - ping me if you'd prefer not :>), however a note on this - guard pages _themselves_ are emphatically NOT a security feature, and make no guarantees on this front, but rather are a convenience/effiency thing. Obviously however I am adding madvise() functionality here, and all such functionality must take into account whether or not they are discard operations as to ensure mseal semantics are obeyed - see below for my argument as to why I feel the poison operation falls under this. > > > On Fri, Sep 27, 2024 at 5:52 AM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > > > Implement a new lightweight guard page feature, that is regions of userland > > virtual memory that, when accessed, cause a fatal signal to arise. > > > > Currently users must establish PROT_NONE ranges to achieve this. > > > > However this is very costly memory-wise - we need a VMA for each and every > > one of these regions AND they become unmergeable with surrounding VMAs. > > > > In addition repeated mmap() calls require repeated kernel context switches > > and contention of the mmap lock to install these ranges, potentially also > > having to unmap memory if installed over existing ranges. > > > > The lightweight guard approach eliminates the VMA cost altogether - rather > > than establishing a PROT_NONE VMA, it operates at the level of page table > > entries - poisoning PTEs such that accesses to them cause a fault followed > > by a SIGSGEV signal being raised. > > > > This is achieved through the PTE marker mechanism, which a previous commit > > in this series extended to permit this to be done, installed via the > > generic page walking logic, also extended by a prior commit for this > > purpose. > > > > These poison ranges are established with MADV_GUARD_POISON, and if the > > range in which they are installed contain any existing mappings, they will > > be zapped, i.e. free the range and unmap memory (thus mimicking the > > behaviour of MADV_DONTNEED in this respect). > > > > Any existing poison entries will be left untouched. There is no nesting of > > poisoned pages. > > > > Poisoned ranges are NOT cleared by MADV_DONTNEED, as this would be rather > > unexpected behaviour, but are cleared on process teardown or unmapping of > > memory ranges. > > > > Ranges can have the poison property removed by MADV_GUARD_UNPOISON - > > 'remedying' the poisoning. The ranges over which this is applied, should > > they contain non-poison entries, will be untouched, only poison entries > > will be cleared. > > > > We permit this operation on anonymous memory only, and only VMAs which are > > non-special, non-huge and not mlock()'d (if we permitted this we'd have to > > drop locked pages which would be rather counterintuitive). > > > > The poisoning of the range must be performed under mmap write lock as we > > have to install an anon_vma to ensure correct behaviour on fork. > > > > Suggested-by: Vlastimil Babka <vbabka@suze.cz> > > Suggested-by: Jann Horn <jannh@google.com> > > Suggested-by: David Hildenbrand <david@redhat.com> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > --- > > arch/alpha/include/uapi/asm/mman.h | 3 + > > arch/mips/include/uapi/asm/mman.h | 3 + > > arch/parisc/include/uapi/asm/mman.h | 3 + > > arch/xtensa/include/uapi/asm/mman.h | 3 + > > include/uapi/asm-generic/mman-common.h | 3 + > > mm/madvise.c | 158 +++++++++++++++++++++++++ > > mm/mprotect.c | 3 +- > > mm/mseal.c | 1 + > > 8 files changed, 176 insertions(+), 1 deletion(-) > > > > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h > > index 763929e814e9..71e13f27742d 100644 > > --- a/arch/alpha/include/uapi/asm/mman.h > > +++ b/arch/alpha/include/uapi/asm/mman.h > > @@ -78,6 +78,9 @@ > > > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > > > +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ > > +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ > > + > > /* compatibility flags */ > > #define MAP_FILE 0 > > > > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h > > index 9c48d9a21aa0..1a2222322f77 100644 > > --- a/arch/mips/include/uapi/asm/mman.h > > +++ b/arch/mips/include/uapi/asm/mman.h > > @@ -105,6 +105,9 @@ > > > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > > > +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ > > +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ > > + > > /* compatibility flags */ > > #define MAP_FILE 0 > > > > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h > > index 68c44f99bc93..380905522397 100644 > > --- a/arch/parisc/include/uapi/asm/mman.h > > +++ b/arch/parisc/include/uapi/asm/mman.h > > @@ -75,6 +75,9 @@ > > #define MADV_HWPOISON 100 /* poison a page for testing */ > > #define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */ > > > > +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ > > +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ > > + > > /* compatibility flags */ > > #define MAP_FILE 0 > > > > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h > > index 1ff0c858544f..e8d5affceb28 100644 > > --- a/arch/xtensa/include/uapi/asm/mman.h > > +++ b/arch/xtensa/include/uapi/asm/mman.h > > @@ -113,6 +113,9 @@ > > > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > > > +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ > > +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ > > + > > /* compatibility flags */ > > #define MAP_FILE 0 > > > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h > > index 6ce1f1ceb432..5dfd3d442de4 100644 > > --- a/include/uapi/asm-generic/mman-common.h > > +++ b/include/uapi/asm-generic/mman-common.h > > @@ -79,6 +79,9 @@ > > > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > > > +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ > > +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ > > + > > /* compatibility flags */ > > #define MAP_FILE 0 > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > index e871a72a6c32..7216e10723ae 100644 > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior) > > case MADV_POPULATE_READ: > > case MADV_POPULATE_WRITE: > > case MADV_COLLAPSE: > > + case MADV_GUARD_UNPOISON: /* Only poisoning needs a write lock. */ > > return 0; > > default: > > /* be safe, default to 1. list exceptions explicitly */ > > @@ -1017,6 +1018,157 @@ static long madvise_remove(struct vm_area_struct *vma, > > return error; > > } > > > > +static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked) > > +{ > > + vm_flags_t disallowed = VM_SPECIAL | VM_HUGETLB; > > + > > + /* > > + * A user could lock after poisoning but that's fine, as they'd not be > > + * able to fault in. The issue arises when we try to zap existing locked > > + * VMAs. We don't want to do that. > > + */ > > + if (!allow_locked) > > + disallowed |= VM_LOCKED; > > + > > + if (!vma_is_anonymous(vma)) > > + return false; > > + > > + if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE) > > + return false; > > + > > + return true; > > +} > > + > > +static int guard_poison_install_pte(unsigned long addr, unsigned long next, > > + pte_t *ptep, struct mm_walk *walk) > > +{ > > + unsigned long *num_installed = (unsigned long *)walk->private; > > + > > + (*num_installed)++; > > + /* Simply install a PTE marker, this causes segfault on access. */ > > + *ptep = make_pte_marker(PTE_MARKER_GUARD); > > + > > + return 0; > > +} > > + > > +static bool is_guard_pte_marker(pte_t ptent) > > +{ > > + return is_pte_marker(ptent) && > > + is_guard_swp_entry(pte_to_swp_entry(ptent)); > > +} > > + > > +static int guard_poison_pte_entry(pte_t *pte, unsigned long addr, > > + unsigned long next, struct mm_walk *walk) > > +{ > > + pte_t ptent = ptep_get(pte); > > + > > + /* > > + * If not a guard marker, simply abort the operation. We return a value > > + * > 0 indicating a non-error abort. > > + */ > > + return !is_guard_pte_marker(ptent); > > +} > > + > > +static const struct mm_walk_ops guard_poison_walk_ops = { > > + .install_pte = guard_poison_install_pte, > > + .pte_entry = guard_poison_pte_entry, > > + /* We might need to install an anon_vma. */ > > + .walk_lock = PGWALK_WRLOCK, > > +}; > > + > > +static long madvise_guard_poison(struct vm_area_struct *vma, > > + struct vm_area_struct **prev, > > + unsigned long start, unsigned long end) > > +{ > > + long err; > > + bool retried = false; > > + > > + *prev = vma; > > + if (!is_valid_guard_vma(vma, /* allow_locked = */false)) > > + return -EINVAL; > > + > > + /* > > + * Optimistically try to install the guard poison pages first. If any > > + * non-guard pages are encountered, give up and zap the range before > > + * trying again. > > + */ > > + while (true) { > > + unsigned long num_installed = 0; > > + > > + /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */ > > + err = walk_page_range_mm(vma->vm_mm, start, end, > > + &guard_poison_walk_ops, > > + &num_installed); > > + /* > > + * If we install poison markers, then the range is no longer > > + * empty from a page table perspective and therefore it's > > + * appropriate to have an anon_vma. > > + * > > + * This ensures that on fork, we copy page tables correctly. > > + */ > > + if (err >= 0 && num_installed > 0) { > > + int err_anon = anon_vma_prepare(vma); > > + > > + if (err_anon) > > + err = err_anon; > > + } > > + > > + if (err <= 0) > > + return err; > > + > > + if (!retried) > > + /* > > + * OK some of the range have non-guard pages mapped, zap > > + * them. This leaves existing guard pages in place. > > + */ > > + zap_page_range_single(vma, start, end - start, NULL); > > + else > > + /* > > + * If we reach here, then there is a racing fault that > > + * has populated the PTE after we zapped. Give up and > > + * let the user know to try again. > > + */ > > + return -EAGAIN; > > + > > + retried = true; > > + } > > +} > > + > > +static int guard_unpoison_pte_entry(pte_t *pte, unsigned long addr, > > + unsigned long next, struct mm_walk *walk) > > +{ > > + pte_t ptent = ptep_get(pte); > > + > > + if (is_guard_pte_marker(ptent)) { > > + /* Simply clear the PTE marker. */ > > + pte_clear_not_present_full(walk->mm, addr, pte, true); > > + update_mmu_cache(walk->vma, addr, pte); > > + } > > + > > + return 0; > > +} > > + > > +static const struct mm_walk_ops guard_unpoison_walk_ops = { > > + .pte_entry = guard_unpoison_pte_entry, > > + .walk_lock = PGWALK_RDLOCK, > > +}; > > + > > +static long madvise_guard_unpoison(struct vm_area_struct *vma, > > + struct vm_area_struct **prev, > > + unsigned long start, unsigned long end) > > +{ > > + *prev = vma; > > + /* > > + * We're ok with unpoisoning mlock()'d ranges, as this is a > > + * non-destructive action. > > + */ > > + if (!is_valid_guard_vma(vma, /* allow_locked = */true)) > > + return -EINVAL; > > + > > + return walk_page_range(vma->vm_mm, start, end, > > + &guard_unpoison_walk_ops, NULL); > > +} > > + > > /* > > * Apply an madvise behavior to a region of a vma. madvise_update_vma > > * will handle splitting a vm area into separate areas, each area with its own > > @@ -1098,6 +1250,10 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, > > break; > > case MADV_COLLAPSE: > > return madvise_collapse(vma, prev, start, end); > > + case MADV_GUARD_POISON: > > + return madvise_guard_poison(vma, prev, start, end); > > + case MADV_GUARD_UNPOISON: > > + return madvise_guard_unpoison(vma, prev, start, end); > > } > > > > anon_name = anon_vma_name(vma); > > @@ -1197,6 +1353,8 @@ madvise_behavior_valid(int behavior) > > case MADV_DODUMP: > > case MADV_WIPEONFORK: > > case MADV_KEEPONFORK: > > + case MADV_GUARD_POISON: > > + case MADV_GUARD_UNPOISON: > > #ifdef CONFIG_MEMORY_FAILURE > > case MADV_SOFT_OFFLINE: > > case MADV_HWPOISON: > > diff --git a/mm/mprotect.c b/mm/mprotect.c > > index 0c5d6d06107d..d0e3ebfadef8 100644 > > --- a/mm/mprotect.c > > +++ b/mm/mprotect.c > > @@ -236,7 +236,8 @@ static long change_pte_range(struct mmu_gather *tlb, > > } else if (is_pte_marker_entry(entry)) { > > /* > > * Ignore error swap entries unconditionally, > > - * because any access should sigbus anyway. > > + * because any access should sigbus/sigsegv > > + * anyway. > > */ > > if (is_poisoned_swp_entry(entry)) > > continue; > > diff --git a/mm/mseal.c b/mm/mseal.c > > index ece977bd21e1..21bf5534bcf5 100644 > > --- a/mm/mseal.c > > +++ b/mm/mseal.c > > @@ -30,6 +30,7 @@ static bool is_madv_discard(int behavior) > > case MADV_REMOVE: > > case MADV_DONTFORK: > > case MADV_WIPEONFORK: > > + case MADV_GUARD_POISON: > > Can you please describe the rationale to add this to the existing > mseal's semantic ? > > I didn't not find any description from the cover letter or this > patch's description, hence asking. Sure, this is because when you guard-poison ranges that have existing mappings, it zaps them, which performs basically the exact same operation as MADV_DONTNEED, and obviously discards any underlying data in doing so. As a result, I felt it was correct to add this operation to the list of discard operations from the perspective of mseal. > > Thanks > -Jeff > > > return true; > > } > > > > -- > > 2.46.2 > > > >
Hi Lorenzo On Fri, Oct 4, 2024 at 11:26 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > On Fri, Oct 04, 2024 at 11:17:13AM -0700, Jeff Xu wrote: > > Hi Lorenzo, > > > > Please add me to this series, I 'm interested in everything related to > > mseal :-), thanks. > > Hi Jeff, more than happy to cc you on this going forward :) > > The only change to mseal is a trivial change because the poison operation > discards, wasn't intentional, but apologies, I should have cc'd you > regardless! Will do so on any such interaction with mseal moving forward. > No problems :-). I do sometimes scan the emails to search for mseal keyword and that is how I find this patch series. > > > > I also added Kees into the cc, since mseal is a security feature. > > Sure no problem happy to keep Kees cc-d too (Kees - ping me if you'd prefer > not :>), however a note on this - guard pages _themselves_ are emphatically > NOT a security feature, and make no guarantees on this front, but rather > are a convenience/effiency thing. > It is a nice feature nevertheless. I imagine the guide page can detect cases such as trying to go-over the main stack ? > Obviously however I am adding madvise() functionality here, and all such > functionality must take into account whether or not they are discard > operations as to ensure mseal semantics are obeyed - see below for my > argument as to why I feel the poison operation falls under this. > > > > > > > On Fri, Sep 27, 2024 at 5:52 AM Lorenzo Stoakes > > <lorenzo.stoakes@oracle.com> wrote: > > > > > > Implement a new lightweight guard page feature, that is regions of userland > > > virtual memory that, when accessed, cause a fatal signal to arise. > > > > > > Currently users must establish PROT_NONE ranges to achieve this. > > > > > > However this is very costly memory-wise - we need a VMA for each and every > > > one of these regions AND they become unmergeable with surrounding VMAs. > > > > > > In addition repeated mmap() calls require repeated kernel context switches > > > and contention of the mmap lock to install these ranges, potentially also > > > having to unmap memory if installed over existing ranges. > > > > > > The lightweight guard approach eliminates the VMA cost altogether - rather > > > than establishing a PROT_NONE VMA, it operates at the level of page table > > > entries - poisoning PTEs such that accesses to them cause a fault followed > > > by a SIGSGEV signal being raised. > > > > > > This is achieved through the PTE marker mechanism, which a previous commit > > > in this series extended to permit this to be done, installed via the > > > generic page walking logic, also extended by a prior commit for this > > > purpose. > > > > > > These poison ranges are established with MADV_GUARD_POISON, and if the > > > range in which they are installed contain any existing mappings, they will > > > be zapped, i.e. free the range and unmap memory (thus mimicking the > > > behaviour of MADV_DONTNEED in this respect). > > > > > > Any existing poison entries will be left untouched. There is no nesting of > > > poisoned pages. > > > > > > Poisoned ranges are NOT cleared by MADV_DONTNEED, as this would be rather > > > unexpected behaviour, but are cleared on process teardown or unmapping of > > > memory ranges. > > > > > > Ranges can have the poison property removed by MADV_GUARD_UNPOISON - > > > 'remedying' the poisoning. The ranges over which this is applied, should > > > they contain non-poison entries, will be untouched, only poison entries > > > will be cleared. > > > > > > We permit this operation on anonymous memory only, and only VMAs which are > > > non-special, non-huge and not mlock()'d (if we permitted this we'd have to > > > drop locked pages which would be rather counterintuitive). > > > > > > The poisoning of the range must be performed under mmap write lock as we > > > have to install an anon_vma to ensure correct behaviour on fork. > > > > > > Suggested-by: Vlastimil Babka <vbabka@suze.cz> > > > Suggested-by: Jann Horn <jannh@google.com> > > > Suggested-by: David Hildenbrand <david@redhat.com> > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > > --- > > > arch/alpha/include/uapi/asm/mman.h | 3 + > > > arch/mips/include/uapi/asm/mman.h | 3 + > > > arch/parisc/include/uapi/asm/mman.h | 3 + > > > arch/xtensa/include/uapi/asm/mman.h | 3 + > > > include/uapi/asm-generic/mman-common.h | 3 + > > > mm/madvise.c | 158 +++++++++++++++++++++++++ > > > mm/mprotect.c | 3 +- > > > mm/mseal.c | 1 + > > > 8 files changed, 176 insertions(+), 1 deletion(-) > > > > > > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h > > > index 763929e814e9..71e13f27742d 100644 > > > --- a/arch/alpha/include/uapi/asm/mman.h > > > +++ b/arch/alpha/include/uapi/asm/mman.h > > > @@ -78,6 +78,9 @@ > > > > > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > > > > > +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ > > > +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ > > > + > > > /* compatibility flags */ > > > #define MAP_FILE 0 > > > > > > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h > > > index 9c48d9a21aa0..1a2222322f77 100644 > > > --- a/arch/mips/include/uapi/asm/mman.h > > > +++ b/arch/mips/include/uapi/asm/mman.h > > > @@ -105,6 +105,9 @@ > > > > > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > > > > > +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ > > > +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ > > > + > > > /* compatibility flags */ > > > #define MAP_FILE 0 > > > > > > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h > > > index 68c44f99bc93..380905522397 100644 > > > --- a/arch/parisc/include/uapi/asm/mman.h > > > +++ b/arch/parisc/include/uapi/asm/mman.h > > > @@ -75,6 +75,9 @@ > > > #define MADV_HWPOISON 100 /* poison a page for testing */ > > > #define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */ > > > > > > +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ > > > +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ > > > + > > > /* compatibility flags */ > > > #define MAP_FILE 0 > > > > > > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h > > > index 1ff0c858544f..e8d5affceb28 100644 > > > --- a/arch/xtensa/include/uapi/asm/mman.h > > > +++ b/arch/xtensa/include/uapi/asm/mman.h > > > @@ -113,6 +113,9 @@ > > > > > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > > > > > +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ > > > +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ > > > + > > > /* compatibility flags */ > > > #define MAP_FILE 0 > > > > > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h > > > index 6ce1f1ceb432..5dfd3d442de4 100644 > > > --- a/include/uapi/asm-generic/mman-common.h > > > +++ b/include/uapi/asm-generic/mman-common.h > > > @@ -79,6 +79,9 @@ > > > > > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > > > > > +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ > > > +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ > > > + > > > /* compatibility flags */ > > > #define MAP_FILE 0 > > > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > > index e871a72a6c32..7216e10723ae 100644 > > > --- a/mm/madvise.c > > > +++ b/mm/madvise.c > > > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior) > > > case MADV_POPULATE_READ: > > > case MADV_POPULATE_WRITE: > > > case MADV_COLLAPSE: > > > + case MADV_GUARD_UNPOISON: /* Only poisoning needs a write lock. */ > > > return 0; > > > default: > > > /* be safe, default to 1. list exceptions explicitly */ > > > @@ -1017,6 +1018,157 @@ static long madvise_remove(struct vm_area_struct *vma, > > > return error; > > > } > > > > > > +static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked) > > > +{ > > > + vm_flags_t disallowed = VM_SPECIAL | VM_HUGETLB; > > > + > > > + /* > > > + * A user could lock after poisoning but that's fine, as they'd not be > > > + * able to fault in. The issue arises when we try to zap existing locked > > > + * VMAs. We don't want to do that. > > > + */ > > > + if (!allow_locked) > > > + disallowed |= VM_LOCKED; > > > + > > > + if (!vma_is_anonymous(vma)) > > > + return false; > > > + > > > + if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE) > > > + return false; > > > + > > > + return true; > > > +} > > > + > > > +static int guard_poison_install_pte(unsigned long addr, unsigned long next, > > > + pte_t *ptep, struct mm_walk *walk) > > > +{ > > > + unsigned long *num_installed = (unsigned long *)walk->private; > > > + > > > + (*num_installed)++; > > > + /* Simply install a PTE marker, this causes segfault on access. */ > > > + *ptep = make_pte_marker(PTE_MARKER_GUARD); > > > + > > > + return 0; > > > +} > > > + > > > +static bool is_guard_pte_marker(pte_t ptent) > > > +{ > > > + return is_pte_marker(ptent) && > > > + is_guard_swp_entry(pte_to_swp_entry(ptent)); > > > +} > > > + > > > +static int guard_poison_pte_entry(pte_t *pte, unsigned long addr, > > > + unsigned long next, struct mm_walk *walk) > > > +{ > > > + pte_t ptent = ptep_get(pte); > > > + > > > + /* > > > + * If not a guard marker, simply abort the operation. We return a value > > > + * > 0 indicating a non-error abort. > > > + */ > > > + return !is_guard_pte_marker(ptent); > > > +} > > > + > > > +static const struct mm_walk_ops guard_poison_walk_ops = { > > > + .install_pte = guard_poison_install_pte, > > > + .pte_entry = guard_poison_pte_entry, > > > + /* We might need to install an anon_vma. */ > > > + .walk_lock = PGWALK_WRLOCK, > > > +}; > > > + > > > +static long madvise_guard_poison(struct vm_area_struct *vma, > > > + struct vm_area_struct **prev, > > > + unsigned long start, unsigned long end) > > > +{ > > > + long err; > > > + bool retried = false; > > > + > > > + *prev = vma; > > > + if (!is_valid_guard_vma(vma, /* allow_locked = */false)) > > > + return -EINVAL; > > > + > > > + /* > > > + * Optimistically try to install the guard poison pages first. If any > > > + * non-guard pages are encountered, give up and zap the range before > > > + * trying again. > > > + */ > > > + while (true) { > > > + unsigned long num_installed = 0; > > > + > > > + /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */ > > > + err = walk_page_range_mm(vma->vm_mm, start, end, > > > + &guard_poison_walk_ops, > > > + &num_installed); > > > + /* > > > + * If we install poison markers, then the range is no longer > > > + * empty from a page table perspective and therefore it's > > > + * appropriate to have an anon_vma. > > > + * > > > + * This ensures that on fork, we copy page tables correctly. > > > + */ > > > + if (err >= 0 && num_installed > 0) { > > > + int err_anon = anon_vma_prepare(vma); > > > + > > > + if (err_anon) > > > + err = err_anon; > > > + } > > > + > > > + if (err <= 0) > > > + return err; > > > + > > > + if (!retried) > > > + /* > > > + * OK some of the range have non-guard pages mapped, zap > > > + * them. This leaves existing guard pages in place. > > > + */ > > > + zap_page_range_single(vma, start, end - start, NULL); > > > + else > > > + /* > > > + * If we reach here, then there is a racing fault that > > > + * has populated the PTE after we zapped. Give up and > > > + * let the user know to try again. > > > + */ > > > + return -EAGAIN; > > > + > > > + retried = true; > > > + } > > > +} > > > + > > > +static int guard_unpoison_pte_entry(pte_t *pte, unsigned long addr, > > > + unsigned long next, struct mm_walk *walk) > > > +{ > > > + pte_t ptent = ptep_get(pte); > > > + > > > + if (is_guard_pte_marker(ptent)) { > > > + /* Simply clear the PTE marker. */ > > > + pte_clear_not_present_full(walk->mm, addr, pte, true); > > > + update_mmu_cache(walk->vma, addr, pte); > > > + } > > > + > > > + return 0; > > > +} > > > + > > > +static const struct mm_walk_ops guard_unpoison_walk_ops = { > > > + .pte_entry = guard_unpoison_pte_entry, > > > + .walk_lock = PGWALK_RDLOCK, > > > +}; > > > + > > > +static long madvise_guard_unpoison(struct vm_area_struct *vma, > > > + struct vm_area_struct **prev, > > > + unsigned long start, unsigned long end) > > > +{ > > > + *prev = vma; > > > + /* > > > + * We're ok with unpoisoning mlock()'d ranges, as this is a > > > + * non-destructive action. > > > + */ > > > + if (!is_valid_guard_vma(vma, /* allow_locked = */true)) > > > + return -EINVAL; > > > + > > > + return walk_page_range(vma->vm_mm, start, end, > > > + &guard_unpoison_walk_ops, NULL); > > > +} > > > + > > > /* > > > * Apply an madvise behavior to a region of a vma. madvise_update_vma > > > * will handle splitting a vm area into separate areas, each area with its own > > > @@ -1098,6 +1250,10 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, > > > break; > > > case MADV_COLLAPSE: > > > return madvise_collapse(vma, prev, start, end); > > > + case MADV_GUARD_POISON: > > > + return madvise_guard_poison(vma, prev, start, end); > > > + case MADV_GUARD_UNPOISON: > > > + return madvise_guard_unpoison(vma, prev, start, end); > > > } > > > > > > anon_name = anon_vma_name(vma); > > > @@ -1197,6 +1353,8 @@ madvise_behavior_valid(int behavior) > > > case MADV_DODUMP: > > > case MADV_WIPEONFORK: > > > case MADV_KEEPONFORK: > > > + case MADV_GUARD_POISON: > > > + case MADV_GUARD_UNPOISON: > > > #ifdef CONFIG_MEMORY_FAILURE > > > case MADV_SOFT_OFFLINE: > > > case MADV_HWPOISON: > > > diff --git a/mm/mprotect.c b/mm/mprotect.c > > > index 0c5d6d06107d..d0e3ebfadef8 100644 > > > --- a/mm/mprotect.c > > > +++ b/mm/mprotect.c > > > @@ -236,7 +236,8 @@ static long change_pte_range(struct mmu_gather *tlb, > > > } else if (is_pte_marker_entry(entry)) { > > > /* > > > * Ignore error swap entries unconditionally, > > > - * because any access should sigbus anyway. > > > + * because any access should sigbus/sigsegv > > > + * anyway. > > > */ > > > if (is_poisoned_swp_entry(entry)) > > > continue; > > > diff --git a/mm/mseal.c b/mm/mseal.c > > > index ece977bd21e1..21bf5534bcf5 100644 > > > --- a/mm/mseal.c > > > +++ b/mm/mseal.c > > > @@ -30,6 +30,7 @@ static bool is_madv_discard(int behavior) > > > case MADV_REMOVE: > > > case MADV_DONTFORK: > > > case MADV_WIPEONFORK: > > > + case MADV_GUARD_POISON: > > > > Can you please describe the rationale to add this to the existing > > mseal's semantic ? > > > > I didn't not find any description from the cover letter or this > > patch's description, hence asking. > > Sure, this is because when you guard-poison ranges that have existing > mappings, it zaps them, which performs basically the exact same operation > as MADV_DONTNEED, and obviously discards any underlying data in doing so. > > As a result, I felt it was correct to add this operation to the list of > discard operations from the perspective of mseal. > That makes sense. Thanks for thinking about memory sealing when adding new features. If possible, please add the reasoning here in the commit description in the next version, for future reference. As far as I am concerned, the mseal.c changes LGTM. Thanks -Jeff > > > > Thanks > > -Jeff > > > > > return true; > > > } > > > > > > -- > > > 2.46.2 > > > > > >
On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > Implement a new lightweight guard page feature, that is regions of userland > virtual memory that, when accessed, cause a fatal signal to arise. [...] > --- > arch/alpha/include/uapi/asm/mman.h | 3 + > arch/mips/include/uapi/asm/mman.h | 3 + > arch/parisc/include/uapi/asm/mman.h | 3 + > arch/xtensa/include/uapi/asm/mman.h | 3 + > include/uapi/asm-generic/mman-common.h | 3 + I kinda wonder if we could start moving the parts of those headers that are the same for all architectures to include/uapi/linux/mman.h instead... but that's maybe out of scope for this series. [...] > diff --git a/mm/madvise.c b/mm/madvise.c > index e871a72a6c32..7216e10723ae 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior) > case MADV_POPULATE_READ: > case MADV_POPULATE_WRITE: > case MADV_COLLAPSE: > + case MADV_GUARD_UNPOISON: /* Only poisoning needs a write lock. */ What does poisoning need a write lock for? anon_vma_prepare() doesn't need it (it only needs mmap_lock held for reading), zap_page_range_single() doesn't need it, and pagewalk also doesn't need it as long as the range being walked is covered by a VMA, which it is... I see you set PGWALK_WRLOCK in guard_poison_walk_ops with a comment saying "We might need to install an anon_vma" - is that referring to an older version of the patch where the anon_vma_prepare() call was inside the pagewalk callback or something like that? Either way, anon_vma_prepare() doesn't need write locks (it can't, it has to work from the page fault handling path). > return 0; > default: > /* be safe, default to 1. list exceptions explicitly */ [...] > +static long madvise_guard_poison(struct vm_area_struct *vma, > + struct vm_area_struct **prev, > + unsigned long start, unsigned long end) > +{ > + long err; > + bool retried = false; > + > + *prev = vma; > + if (!is_valid_guard_vma(vma, /* allow_locked = */false)) > + return -EINVAL; > + > + /* > + * Optimistically try to install the guard poison pages first. If any > + * non-guard pages are encountered, give up and zap the range before > + * trying again. > + */ > + while (true) { > + unsigned long num_installed = 0; > + > + /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */ > + err = walk_page_range_mm(vma->vm_mm, start, end, > + &guard_poison_walk_ops, > + &num_installed); > + /* > + * If we install poison markers, then the range is no longer > + * empty from a page table perspective and therefore it's > + * appropriate to have an anon_vma. > + * > + * This ensures that on fork, we copy page tables correctly. > + */ > + if (err >= 0 && num_installed > 0) { > + int err_anon = anon_vma_prepare(vma); I'd move this up, to before we create poison PTEs. There's no harm in attaching an anon_vma to the VMA even if the rest of the operation fails; and I think it would be weird to have error paths that don't attach an anon_vma even though they . > + if (err_anon) > + err = err_anon; > + } > + > + if (err <= 0) > + return err; > + > + if (!retried) > + /* > + * OK some of the range have non-guard pages mapped, zap > + * them. This leaves existing guard pages in place. > + */ > + zap_page_range_single(vma, start, end - start, NULL); > + else > + /* > + * If we reach here, then there is a racing fault that > + * has populated the PTE after we zapped. Give up and > + * let the user know to try again. > + */ > + return -EAGAIN; Hmm, yeah, it would be nice if we could avoid telling userspace to loop on -EAGAIN but I guess we don't have any particularly good options here? Well, we could bail out with -EINTR if a (fatal?) signal is pending and otherwise keep looping... if we'd tell userspace "try again on -EAGAIN", we might as well do that in the kernel... (Personally I would put curly braces around these branches because they occupy multiple lines, though the coding style doesn't explicitly say that, so I guess maybe it's a matter of personal preference... adding curly braces here would match what is done, for example, in relocate_vma_down().) > + retried = true; > + } > +} > + > +static int guard_unpoison_pte_entry(pte_t *pte, unsigned long addr, > + unsigned long next, struct mm_walk *walk) > +{ > + pte_t ptent = ptep_get(pte); > + > + if (is_guard_pte_marker(ptent)) { > + /* Simply clear the PTE marker. */ > + pte_clear_not_present_full(walk->mm, addr, pte, true); I think that last parameter probably should be "false"? The sparc code calls it "fullmm", which is a term the MM code uses when talking about operations that remove all mappings in the entire mm_struct because the process has died, which allows using some faster special-case version of TLB shootdown or something along those lines. > + update_mmu_cache(walk->vma, addr, pte); > + } > + > + return 0; > +} > + > +static const struct mm_walk_ops guard_unpoison_walk_ops = { > + .pte_entry = guard_unpoison_pte_entry, > + .walk_lock = PGWALK_RDLOCK, > +}; It is a _little_ weird that unpoisoning creates page tables when they don't already exist, which will also prevent creating THP entries on fault in such areas afterwards... but I guess it doesn't really matter given that poisoning has that effect, too, and you probably usually won't call MADV_GUARD_UNPOISON on an area that hasn't been poisoned before... so I guess this is not an actionable comment.
On Fri, Oct 11, 2024 at 11:12 AM Jann Horn <jannh@google.com> wrote: > > On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > Implement a new lightweight guard page feature, that is regions of userland > > virtual memory that, when accessed, cause a fatal signal to arise. > [...] > > --- > > arch/alpha/include/uapi/asm/mman.h | 3 + > > arch/mips/include/uapi/asm/mman.h | 3 + > > arch/parisc/include/uapi/asm/mman.h | 3 + > > arch/xtensa/include/uapi/asm/mman.h | 3 + > > include/uapi/asm-generic/mman-common.h | 3 + > > I kinda wonder if we could start moving the parts of those headers > that are the same for all architectures to include/uapi/linux/mman.h > instead... but that's maybe out of scope for this series. > > [...] > > diff --git a/mm/madvise.c b/mm/madvise.c > > index e871a72a6c32..7216e10723ae 100644 > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior) > > case MADV_POPULATE_READ: > > case MADV_POPULATE_WRITE: > > case MADV_COLLAPSE: > > + case MADV_GUARD_UNPOISON: /* Only poisoning needs a write lock. */ > > What does poisoning need a write lock for? anon_vma_prepare() doesn't > need it (it only needs mmap_lock held for reading), > zap_page_range_single() doesn't need it, and pagewalk also doesn't > need it as long as the range being walked is covered by a VMA, which > it is... > > I see you set PGWALK_WRLOCK in guard_poison_walk_ops with a comment > saying "We might need to install an anon_vma" - is that referring to > an older version of the patch where the anon_vma_prepare() call was > inside the pagewalk callback or something like that? Either way, > anon_vma_prepare() doesn't need write locks (it can't, it has to work > from the page fault handling path). I was wondering about that too and I can't find any reason for write-locking the mm for this operation. PGWALK_WRLOCK should also be changed to PGWALK_RDLOCK as we are not modifying the VMA. BTW, I'm testing your patchset on Android and so far it is stable! > > > return 0; > > default: > > /* be safe, default to 1. list exceptions explicitly */ > [...] > > +static long madvise_guard_poison(struct vm_area_struct *vma, > > + struct vm_area_struct **prev, > > + unsigned long start, unsigned long end) > > +{ > > + long err; > > + bool retried = false; > > + > > + *prev = vma; > > + if (!is_valid_guard_vma(vma, /* allow_locked = */false)) > > + return -EINVAL; > > + > > + /* > > + * Optimistically try to install the guard poison pages first. If any > > + * non-guard pages are encountered, give up and zap the range before > > + * trying again. > > + */ > > + while (true) { > > + unsigned long num_installed = 0; > > + > > + /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */ > > + err = walk_page_range_mm(vma->vm_mm, start, end, > > + &guard_poison_walk_ops, > > + &num_installed); > > + /* > > + * If we install poison markers, then the range is no longer > > + * empty from a page table perspective and therefore it's > > + * appropriate to have an anon_vma. > > + * > > + * This ensures that on fork, we copy page tables correctly. > > + */ > > + if (err >= 0 && num_installed > 0) { > > + int err_anon = anon_vma_prepare(vma); > > I'd move this up, to before we create poison PTEs. There's no harm in > attaching an anon_vma to the VMA even if the rest of the operation > fails; and I think it would be weird to have error paths that don't > attach an anon_vma even though they . > > > + if (err_anon) > > + err = err_anon; > > + } > > + > > + if (err <= 0) > > + return err; > > + > > + if (!retried) > > + /* > > + * OK some of the range have non-guard pages mapped, zap > > + * them. This leaves existing guard pages in place. > > + */ > > + zap_page_range_single(vma, start, end - start, NULL); > > + else > > + /* > > + * If we reach here, then there is a racing fault that > > + * has populated the PTE after we zapped. Give up and > > + * let the user know to try again. > > + */ > > + return -EAGAIN; > > Hmm, yeah, it would be nice if we could avoid telling userspace to > loop on -EAGAIN but I guess we don't have any particularly good > options here? Well, we could bail out with -EINTR if a (fatal?) signal > is pending and otherwise keep looping... if we'd tell userspace "try > again on -EAGAIN", we might as well do that in the kernel... > > (Personally I would put curly braces around these branches because > they occupy multiple lines, though the coding style doesn't explicitly > say that, so I guess maybe it's a matter of personal preference... > adding curly braces here would match what is done, for example, in > relocate_vma_down().) > > > + retried = true; > > + } > > +} > > + > > +static int guard_unpoison_pte_entry(pte_t *pte, unsigned long addr, > > + unsigned long next, struct mm_walk *walk) > > +{ > > + pte_t ptent = ptep_get(pte); > > + > > + if (is_guard_pte_marker(ptent)) { > > + /* Simply clear the PTE marker. */ > > + pte_clear_not_present_full(walk->mm, addr, pte, true); > > I think that last parameter probably should be "false"? The sparc code > calls it "fullmm", which is a term the MM code uses when talking about > operations that remove all mappings in the entire mm_struct because > the process has died, which allows using some faster special-case > version of TLB shootdown or something along those lines. > > > + update_mmu_cache(walk->vma, addr, pte); > > + } > > + > > + return 0; > > +} > > + > > +static const struct mm_walk_ops guard_unpoison_walk_ops = { > > + .pte_entry = guard_unpoison_pte_entry, > > + .walk_lock = PGWALK_RDLOCK, > > +}; > > It is a _little_ weird that unpoisoning creates page tables when they > don't already exist, which will also prevent creating THP entries on > fault in such areas afterwards... but I guess it doesn't really matter > given that poisoning has that effect, too, and you probably usually > won't call MADV_GUARD_UNPOISON on an area that hasn't been poisoned > before... so I guess this is not an actionable comment.
On Fri, Oct 11, 2024 at 08:11:36PM +0200, Jann Horn wrote: > On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > Implement a new lightweight guard page feature, that is regions of userland > > virtual memory that, when accessed, cause a fatal signal to arise. > [...] > > --- > > arch/alpha/include/uapi/asm/mman.h | 3 + > > arch/mips/include/uapi/asm/mman.h | 3 + > > arch/parisc/include/uapi/asm/mman.h | 3 + > > arch/xtensa/include/uapi/asm/mman.h | 3 + > > include/uapi/asm-generic/mman-common.h | 3 + > > I kinda wonder if we could start moving the parts of those headers > that are the same for all architectures to include/uapi/linux/mman.h > instead... but that's maybe out of scope for this series. Arnd already had a look at this in a recent series. I had the same feeling doing this... > > [...] > > diff --git a/mm/madvise.c b/mm/madvise.c > > index e871a72a6c32..7216e10723ae 100644 > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior) > > case MADV_POPULATE_READ: > > case MADV_POPULATE_WRITE: > > case MADV_COLLAPSE: > > + case MADV_GUARD_UNPOISON: /* Only poisoning needs a write lock. */ > > What does poisoning need a write lock for? anon_vma_prepare() doesn't > need it (it only needs mmap_lock held for reading), > zap_page_range_single() doesn't need it, and pagewalk also doesn't > need it as long as the range being walked is covered by a VMA, which > it is... > > I see you set PGWALK_WRLOCK in guard_poison_walk_ops with a comment > saying "We might need to install an anon_vma" - is that referring to > an older version of the patch where the anon_vma_prepare() call was > inside the pagewalk callback or something like that? Either way, > anon_vma_prepare() doesn't need write locks (it can't, it has to work > from the page fault handling path). OK this was a misunderstanding. Actually there have been more than one, at first I thought a write lock would protect us against racing faults (nope, due to RCU vma locking now :) and then I had assumed literally changing a vma field _surely_ must require a write lock, also it appears no as __anon_vma_prepare(), amusingly, uses the mm->page_table_lock to protect against accesses to vma->anon_vma. And yes you're right it is triggered on the fault path so has to work that way. TL;DR will change to read lock. > > > return 0; > > default: > > /* be safe, default to 1. list exceptions explicitly */ > [...] > > +static long madvise_guard_poison(struct vm_area_struct *vma, > > + struct vm_area_struct **prev, > > + unsigned long start, unsigned long end) > > +{ > > + long err; > > + bool retried = false; > > + > > + *prev = vma; > > + if (!is_valid_guard_vma(vma, /* allow_locked = */false)) > > + return -EINVAL; > > + > > + /* > > + * Optimistically try to install the guard poison pages first. If any > > + * non-guard pages are encountered, give up and zap the range before > > + * trying again. > > + */ > > + while (true) { > > + unsigned long num_installed = 0; > > + > > + /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */ > > + err = walk_page_range_mm(vma->vm_mm, start, end, > > + &guard_poison_walk_ops, > > + &num_installed); > > + /* > > + * If we install poison markers, then the range is no longer > > + * empty from a page table perspective and therefore it's > > + * appropriate to have an anon_vma. > > + * > > + * This ensures that on fork, we copy page tables correctly. > > + */ > > + if (err >= 0 && num_installed > 0) { > > + int err_anon = anon_vma_prepare(vma); > > I'd move this up, to before we create poison PTEs. There's no harm in > attaching an anon_vma to the VMA even if the rest of the operation > fails; and I think it would be weird to have error paths that don't > attach an anon_vma even though they . I think you didn't finish this sentence :) I disagree, we might have absolutely no need to do it, and I'd rather only do so _if_ we have to. It feels like the logical spot to do it and, while the cases where it wouldn't happen are ones where pages are already poisoned (the vma->anon_vma == NULL test will fail so basically a no-op) or error on page walk. > > > + if (err_anon) > > + err = err_anon; > > + } > > + > > + if (err <= 0) > > + return err; > > + > > + if (!retried) > > + /* > > + * OK some of the range have non-guard pages mapped, zap > > + * them. This leaves existing guard pages in place. > > + */ > > + zap_page_range_single(vma, start, end - start, NULL); > > + else > > + /* > > + * If we reach here, then there is a racing fault that > > + * has populated the PTE after we zapped. Give up and > > + * let the user know to try again. > > + */ > > + return -EAGAIN; > > Hmm, yeah, it would be nice if we could avoid telling userspace to > loop on -EAGAIN but I guess we don't have any particularly good > options here? Well, we could bail out with -EINTR if a (fatal?) signal > is pending and otherwise keep looping... if we'd tell userspace "try > again on -EAGAIN", we might as well do that in the kernel... The problem is you could conceivably go on for quite some time, while holding and contending a HIGHLY contended lock (mm->mmap_lock) so I'd really rather let userspace take care of it. You could avoid this by having the walker be a _replace_ operation, that is - if we encounter an existing mapping, replace in-place with a poison marker rather than install marker/zap. However doing that would involve either completely abstracting such logic from scratch (a significant task in itself) to avoid duplication which be hugely off-topic for the patch set or worse, duplicating a whole bunch of page walking logic once again. By being optimistic and simply having the user having to handle looping which seems reasonable (again, it's weird if you're installing poison markers and another thread could be racing you) we avoid all of that. > > (Personally I would put curly braces around these branches because > they occupy multiple lines, though the coding style doesn't explicitly > say that, so I guess maybe it's a matter of personal preference... > adding curly braces here would match what is done, for example, in > relocate_vma_down().) Hey I wrote that too! ;) Sure I can change that. > > > + retried = true; > > + } > > +} > > + > > +static int guard_unpoison_pte_entry(pte_t *pte, unsigned long addr, > > + unsigned long next, struct mm_walk *walk) > > +{ > > + pte_t ptent = ptep_get(pte); > > + > > + if (is_guard_pte_marker(ptent)) { > > + /* Simply clear the PTE marker. */ > > + pte_clear_not_present_full(walk->mm, addr, pte, true); > > I think that last parameter probably should be "false"? The sparc code > calls it "fullmm", which is a term the MM code uses when talking about > operations that remove all mappings in the entire mm_struct because > the process has died, which allows using some faster special-case > version of TLB shootdown or something along those lines. Yeah I think you're right. Will change. > > > + update_mmu_cache(walk->vma, addr, pte); > > + } > > + > > + return 0; > > +} > > + > > +static const struct mm_walk_ops guard_unpoison_walk_ops = { > > + .pte_entry = guard_unpoison_pte_entry, > > + .walk_lock = PGWALK_RDLOCK, > > +}; > > It is a _little_ weird that unpoisoning creates page tables when they > don't already exist, which will also prevent creating THP entries on > fault in such areas afterwards... but I guess it doesn't really matter > given that poisoning has that effect, too, and you probably usually > won't call MADV_GUARD_UNPOISON on an area that hasn't been poisoned > before... so I guess this is not an actionable comment. It doesn't? There's no .install_pte so if an entries are non-present we ignore. HOWEVER, we do split THP. I don't think there's any way around it unless we extended the page walker to handle this more gracefully (pmd level being able to hint that we shouldn't do that or something), but that's really out of scope here. The idea is that a caller can lazily call MADV_GUARD_UNPOISON on a range knowing things stay as they were, I guess we can add to the manpage a note that this will split THP?
On Fri, Oct 11, 2024 at 01:55:42PM -0700, Suren Baghdasaryan wrote: > On Fri, Oct 11, 2024 at 11:12 AM Jann Horn <jannh@google.com> wrote: > > > > On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes > > <lorenzo.stoakes@oracle.com> wrote: > > > Implement a new lightweight guard page feature, that is regions of userland > > > virtual memory that, when accessed, cause a fatal signal to arise. > > [...] > > > --- > > > arch/alpha/include/uapi/asm/mman.h | 3 + > > > arch/mips/include/uapi/asm/mman.h | 3 + > > > arch/parisc/include/uapi/asm/mman.h | 3 + > > > arch/xtensa/include/uapi/asm/mman.h | 3 + > > > include/uapi/asm-generic/mman-common.h | 3 + > > > > I kinda wonder if we could start moving the parts of those headers > > that are the same for all architectures to include/uapi/linux/mman.h > > instead... but that's maybe out of scope for this series. > > > > [...] > > > diff --git a/mm/madvise.c b/mm/madvise.c > > > index e871a72a6c32..7216e10723ae 100644 > > > --- a/mm/madvise.c > > > +++ b/mm/madvise.c > > > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior) > > > case MADV_POPULATE_READ: > > > case MADV_POPULATE_WRITE: > > > case MADV_COLLAPSE: > > > + case MADV_GUARD_UNPOISON: /* Only poisoning needs a write lock. */ > > > > What does poisoning need a write lock for? anon_vma_prepare() doesn't > > need it (it only needs mmap_lock held for reading), > > zap_page_range_single() doesn't need it, and pagewalk also doesn't > > need it as long as the range being walked is covered by a VMA, which > > it is... > > > > I see you set PGWALK_WRLOCK in guard_poison_walk_ops with a comment > > saying "We might need to install an anon_vma" - is that referring to > > an older version of the patch where the anon_vma_prepare() call was > > inside the pagewalk callback or something like that? Either way, > > anon_vma_prepare() doesn't need write locks (it can't, it has to work > > from the page fault handling path). > > I was wondering about that too and I can't find any reason for > write-locking the mm for this operation. PGWALK_WRLOCK should also be > changed to PGWALK_RDLOCK as we are not modifying the VMA. Indeed, as I said to Jann you're right and I was in error to use this, will change! > > BTW, I'm testing your patchset on Android and so far it is stable! Thanks! As there is no significant conceptual pushback to this series, I will un-RFC and post a version with fixes for the issues Jann raised, as well as a fix for some xtensa et al. issues with header includes. > > > > > > return 0; > > > default: > > > /* be safe, default to 1. list exceptions explicitly */ > > [...] > > > +static long madvise_guard_poison(struct vm_area_struct *vma, > > > + struct vm_area_struct **prev, > > > + unsigned long start, unsigned long end) > > > +{ > > > + long err; > > > + bool retried = false; > > > + > > > + *prev = vma; > > > + if (!is_valid_guard_vma(vma, /* allow_locked = */false)) > > > + return -EINVAL; > > > + > > > + /* > > > + * Optimistically try to install the guard poison pages first. If any > > > + * non-guard pages are encountered, give up and zap the range before > > > + * trying again. > > > + */ > > > + while (true) { > > > + unsigned long num_installed = 0; > > > + > > > + /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */ > > > + err = walk_page_range_mm(vma->vm_mm, start, end, > > > + &guard_poison_walk_ops, > > > + &num_installed); > > > + /* > > > + * If we install poison markers, then the range is no longer > > > + * empty from a page table perspective and therefore it's > > > + * appropriate to have an anon_vma. > > > + * > > > + * This ensures that on fork, we copy page tables correctly. > > > + */ > > > + if (err >= 0 && num_installed > 0) { > > > + int err_anon = anon_vma_prepare(vma); > > > > I'd move this up, to before we create poison PTEs. There's no harm in > > attaching an anon_vma to the VMA even if the rest of the operation > > fails; and I think it would be weird to have error paths that don't > > attach an anon_vma even though they . > > > > > + if (err_anon) > > > + err = err_anon; > > > + } > > > + > > > + if (err <= 0) > > > + return err; > > > + > > > + if (!retried) > > > + /* > > > + * OK some of the range have non-guard pages mapped, zap > > > + * them. This leaves existing guard pages in place. > > > + */ > > > + zap_page_range_single(vma, start, end - start, NULL); > > > + else > > > + /* > > > + * If we reach here, then there is a racing fault that > > > + * has populated the PTE after we zapped. Give up and > > > + * let the user know to try again. > > > + */ > > > + return -EAGAIN; > > > > Hmm, yeah, it would be nice if we could avoid telling userspace to > > loop on -EAGAIN but I guess we don't have any particularly good > > options here? Well, we could bail out with -EINTR if a (fatal?) signal > > is pending and otherwise keep looping... if we'd tell userspace "try > > again on -EAGAIN", we might as well do that in the kernel... > > > > (Personally I would put curly braces around these branches because > > they occupy multiple lines, though the coding style doesn't explicitly > > say that, so I guess maybe it's a matter of personal preference... > > adding curly braces here would match what is done, for example, in > > relocate_vma_down().) > > > > > + retried = true; > > > + } > > > +} > > > + > > > +static int guard_unpoison_pte_entry(pte_t *pte, unsigned long addr, > > > + unsigned long next, struct mm_walk *walk) > > > +{ > > > + pte_t ptent = ptep_get(pte); > > > + > > > + if (is_guard_pte_marker(ptent)) { > > > + /* Simply clear the PTE marker. */ > > > + pte_clear_not_present_full(walk->mm, addr, pte, true); > > > > I think that last parameter probably should be "false"? The sparc code > > calls it "fullmm", which is a term the MM code uses when talking about > > operations that remove all mappings in the entire mm_struct because > > the process has died, which allows using some faster special-case > > version of TLB shootdown or something along those lines. > > > > > + update_mmu_cache(walk->vma, addr, pte); > > > + } > > > + > > > + return 0; > > > +} > > > + > > > +static const struct mm_walk_ops guard_unpoison_walk_ops = { > > > + .pte_entry = guard_unpoison_pte_entry, > > > + .walk_lock = PGWALK_RDLOCK, > > > +}; > > > > It is a _little_ weird that unpoisoning creates page tables when they > > don't already exist, which will also prevent creating THP entries on > > fault in such areas afterwards... but I guess it doesn't really matter > > given that poisoning has that effect, too, and you probably usually > > won't call MADV_GUARD_UNPOISON on an area that hasn't been poisoned > > before... so I guess this is not an actionable comment.
On Mon, Oct 14, 2024 at 1:09 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > On Fri, Oct 11, 2024 at 08:11:36PM +0200, Jann Horn wrote: > > On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > > return 0; > > > default: > > > /* be safe, default to 1. list exceptions explicitly */ > > [...] > > > +static long madvise_guard_poison(struct vm_area_struct *vma, > > > + struct vm_area_struct **prev, > > > + unsigned long start, unsigned long end) > > > +{ > > > + long err; > > > + bool retried = false; > > > + > > > + *prev = vma; > > > + if (!is_valid_guard_vma(vma, /* allow_locked = */false)) > > > + return -EINVAL; > > > + > > > + /* > > > + * Optimistically try to install the guard poison pages first. If any > > > + * non-guard pages are encountered, give up and zap the range before > > > + * trying again. > > > + */ > > > + while (true) { > > > + unsigned long num_installed = 0; > > > + > > > + /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */ > > > + err = walk_page_range_mm(vma->vm_mm, start, end, > > > + &guard_poison_walk_ops, > > > + &num_installed); > > > + /* > > > + * If we install poison markers, then the range is no longer > > > + * empty from a page table perspective and therefore it's > > > + * appropriate to have an anon_vma. > > > + * > > > + * This ensures that on fork, we copy page tables correctly. > > > + */ > > > + if (err >= 0 && num_installed > 0) { > > > + int err_anon = anon_vma_prepare(vma); > > > > I'd move this up, to before we create poison PTEs. There's no harm in > > attaching an anon_vma to the VMA even if the rest of the operation > > fails; and I think it would be weird to have error paths that don't > > attach an anon_vma even though they . > > I think you didn't finish this sentence :) Oops... > I disagree, we might have absolutely no need to do it, and I'd rather only > do so _if_ we have to. But there's no downside to erroring out after having installed an anon_vma, right? > It feels like the logical spot to do it and, while the cases where it > wouldn't happen are ones where pages are already poisoned (the > vma->anon_vma == NULL test will fail so basically a no-op) or error on page > walk. My understanding is that some of the MM code basically assumes that a VMA without an anon_vma and without userfault-WP can't contain any state that needs to be preserved; or something along those lines. As you pointed out, fork() is one such case (which maybe doesn't matter so much because it can't race with this operation). khugepaged also relies on this assumption in retract_page_tables(), though that function is not used on anonymous VMAs. If MADVISE_GUARD is extended to cover file VMAs in the future, then I think we could race with retract_page_tables() in a functionally relevant way even when MADVISE_GUARD succeeds: If khugepaged preempts us between the page walk and installing the anon_vma, retract_page_tables() could observe that we don't have an anon_vma yet and throw away a page table in which we just installed guard PTEs. Though I guess really that's not the main reason why I'm saying this; my main reason is that almost any other path that has to ensure an anon_vma is present does that part first (usually because the ordering matters and this way around is more or less the only possible ordering). So even if there are some specific reasons why you can do the ordering the other way around here, it kinda stands out to me as being weird... > > > + if (err_anon) > > > + err = err_anon; > > > + } > > > + > > > + if (err <= 0) > > > + return err; > > > + > > > + if (!retried) > > > + /* > > > + * OK some of the range have non-guard pages mapped, zap > > > + * them. This leaves existing guard pages in place. > > > + */ > > > + zap_page_range_single(vma, start, end - start, NULL); > > > + else > > > + /* > > > + * If we reach here, then there is a racing fault that > > > + * has populated the PTE after we zapped. Give up and > > > + * let the user know to try again. > > > + */ > > > + return -EAGAIN; > > > > Hmm, yeah, it would be nice if we could avoid telling userspace to > > loop on -EAGAIN but I guess we don't have any particularly good > > options here? Well, we could bail out with -EINTR if a (fatal?) signal > > is pending and otherwise keep looping... if we'd tell userspace "try > > again on -EAGAIN", we might as well do that in the kernel... > > The problem is you could conceivably go on for quite some time, while > holding and contending a HIGHLY contended lock (mm->mmap_lock) so I'd > really rather let userspace take care of it. Hmm... so if the retry was handled in-kernel, you'd basically ideally have the retry happen all the way up in do_madvise(), where the mmap lock can be dropped and re-taken? > You could avoid this by having the walker be a _replace_ operation, that is > - if we encounter an existing mapping, replace in-place with a poison > marker rather than install marker/zap. > > However doing that would involve either completely abstracting such logic > from scratch (a significant task in itself) to avoid duplication which be > hugely off-topic for the patch set or worse, duplicating a whole bunch of > page walking logic once again. Mmh, yeah, you'd have to extract the locked part of zap_pte_range() and add your own copy of all the stuff that happens higher up for setting up TLB flushes and such... I see how that would be a massive pain and error-prone. > By being optimistic and simply having the user having to handle looping > which seems reasonable (again, it's weird if you're installing poison > markers and another thread could be racing you) we avoid all of that. I guess one case in which that could happen legitimately is if you race a MADV_POISON on the area 0x1ff000-0x200100 (first page is populated, second page is not, pmd entry corresponding to 0x200000 is clear) with a page fault at 0x200200? So you could have a scenario like: 1. MADV_POISON starts walk_page_range() 2. MADV_POISON sees non-zero, non-poison PTE at 0x1ff000, stops the walk 3. MADV_POISON does zap_page_range_single() 4. pagefault at 0x200200 happens and populates with a hugepage 5. MADV_POISON enters walk_page_range() 6. MADV_POISON splits the THP 7. MADV_POISON sees a populated PTE > > > + update_mmu_cache(walk->vma, addr, pte); > > > + } > > > + > > > + return 0; > > > +} > > > + > > > +static const struct mm_walk_ops guard_unpoison_walk_ops = { > > > + .pte_entry = guard_unpoison_pte_entry, > > > + .walk_lock = PGWALK_RDLOCK, > > > +}; > > > > It is a _little_ weird that unpoisoning creates page tables when they > > don't already exist, which will also prevent creating THP entries on > > fault in such areas afterwards... but I guess it doesn't really matter > > given that poisoning has that effect, too, and you probably usually > > won't call MADV_GUARD_UNPOISON on an area that hasn't been poisoned > > before... so I guess this is not an actionable comment. > > It doesn't? There's no .install_pte so if an entries are non-present we > ignore. Ah, right, of course. Nevermind. > HOWEVER, we do split THP. I don't think there's any way around it unless we > extended the page walker to handle this more gracefully (pmd level being > able to hint that we shouldn't do that or something), but that's really out > of scope here. I think the `walk->action == ACTION_CONTINUE` check in walk_pmd_range() would let you do that, see wp_clean_pmd_entry() for an example. But yeah I guess that might just be unnecessary complexity. > The idea is that a caller can lazily call MADV_GUARD_UNPOISON on a range > knowing things stay as they were, I guess we can add to the manpage a note > that this will split THP? Yeah, might make sense...
On Mon, Oct 14, 2024 at 05:56:50PM +0200, Jann Horn wrote: > On Mon, Oct 14, 2024 at 1:09 PM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > On Fri, Oct 11, 2024 at 08:11:36PM +0200, Jann Horn wrote: > > > On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > > > return 0; > > > > default: > > > > /* be safe, default to 1. list exceptions explicitly */ > > > [...] > > > > +static long madvise_guard_poison(struct vm_area_struct *vma, > > > > + struct vm_area_struct **prev, > > > > + unsigned long start, unsigned long end) > > > > +{ > > > > + long err; > > > > + bool retried = false; > > > > + > > > > + *prev = vma; > > > > + if (!is_valid_guard_vma(vma, /* allow_locked = */false)) > > > > + return -EINVAL; > > > > + > > > > + /* > > > > + * Optimistically try to install the guard poison pages first. If any > > > > + * non-guard pages are encountered, give up and zap the range before > > > > + * trying again. > > > > + */ > > > > + while (true) { > > > > + unsigned long num_installed = 0; > > > > + > > > > + /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */ > > > > + err = walk_page_range_mm(vma->vm_mm, start, end, > > > > + &guard_poison_walk_ops, > > > > + &num_installed); > > > > + /* > > > > + * If we install poison markers, then the range is no longer > > > > + * empty from a page table perspective and therefore it's > > > > + * appropriate to have an anon_vma. > > > > + * > > > > + * This ensures that on fork, we copy page tables correctly. > > > > + */ > > > > + if (err >= 0 && num_installed > 0) { > > > > + int err_anon = anon_vma_prepare(vma); > > > > > > I'd move this up, to before we create poison PTEs. There's no harm in > > > attaching an anon_vma to the VMA even if the rest of the operation > > > fails; and I think it would be weird to have error paths that don't > > > attach an anon_vma even though they . > > > > I think you didn't finish this sentence :) > > Oops... > > > I disagree, we might have absolutely no need to do it, and I'd rather only > > do so _if_ we have to. > > But there's no downside to erroring out after having installed an > anon_vma, right? We then use a resource we don't have to. I think it's more logical to only take that action when we know we need to. > > > It feels like the logical spot to do it and, while the cases where it > > wouldn't happen are ones where pages are already poisoned (the > > vma->anon_vma == NULL test will fail so basically a no-op) or error on page > > walk. > > My understanding is that some of the MM code basically assumes that a > VMA without an anon_vma and without userfault-WP can't contain any > state that needs to be preserved; or something along those lines. As > you pointed out, fork() is one such case (which maybe doesn't matter > so much because it can't race with this operation). > > khugepaged also relies on this assumption in retract_page_tables(), > though that function is not used on anonymous VMAs. If MADVISE_GUARD > is extended to cover file VMAs in the future, then I think we could > race with retract_page_tables() in a functionally relevant way even > when MADVISE_GUARD succeeds: If khugepaged preempts us between the > page walk and installing the anon_vma, retract_page_tables() could > observe that we don't have an anon_vma yet and throw away a page table > in which we just installed guard PTEs. Well for one retract_page_tables() seems to require the VMA to be file-backed :) So we can disregard this at this stage. We enter into a slightly strange scenario with file-backed as to how we manifest memory poisoning, because a file will have backing in the page cache or an anon page for shmem and it seems that khugepage() operates at this level and simply remaps at the higher order. But we then introduce a way the _mapping_ can be different and we have to correctly handle that. So I think actually you'd see this break there too? Interesting that we special-case uffd-wp, which similarly uses PTE markers and this is commented in retract_page_tables(): /* * When a vma is registered with uffd-wp, we cannot recycle * the page table because there may be pte markers installed. * Other vmas can still have the same file mapped hugely, but * skip this one: it will always be mapped in small page size * for uffd-wp registered ranges. */ if (userfaultfd_wp(vma)) continue; So this is something (one of many) I will note down to think about when we come on to file-backed guard pages. > > Though I guess really that's not the main reason why I'm saying this; > my main reason is that almost any other path that has to ensure an > anon_vma is present does that part first (usually because the ordering > matters and this way around is more or less the only possible > ordering). So even if there are some specific reasons why you can do > the ordering the other way around here, it kinda stands out to me as > being weird... I mean, fair enough, on the basis of convention and to avoid future issues with this I'll move it. > > > > > + if (err_anon) > > > > + err = err_anon; > > > > + } > > > > + > > > > + if (err <= 0) > > > > + return err; > > > > + > > > > + if (!retried) > > > > + /* > > > > + * OK some of the range have non-guard pages mapped, zap > > > > + * them. This leaves existing guard pages in place. > > > > + */ > > > > + zap_page_range_single(vma, start, end - start, NULL); > > > > + else > > > > + /* > > > > + * If we reach here, then there is a racing fault that > > > > + * has populated the PTE after we zapped. Give up and > > > > + * let the user know to try again. > > > > + */ > > > > + return -EAGAIN; > > > > > > Hmm, yeah, it would be nice if we could avoid telling userspace to > > > loop on -EAGAIN but I guess we don't have any particularly good > > > options here? Well, we could bail out with -EINTR if a (fatal?) signal > > > is pending and otherwise keep looping... if we'd tell userspace "try > > > again on -EAGAIN", we might as well do that in the kernel... > > > > The problem is you could conceivably go on for quite some time, while > > holding and contending a HIGHLY contended lock (mm->mmap_lock) so I'd > > really rather let userspace take care of it. > > Hmm... so if the retry was handled in-kernel, you'd basically ideally > have the retry happen all the way up in do_madvise(), where the mmap > lock can be dropped and re-taken? Yeah perhaps, but that gets (really) horrible. > > > You could avoid this by having the walker be a _replace_ operation, that is > > - if we encounter an existing mapping, replace in-place with a poison > > marker rather than install marker/zap. > > > > However doing that would involve either completely abstracting such logic > > from scratch (a significant task in itself) to avoid duplication which be > > hugely off-topic for the patch set or worse, duplicating a whole bunch of > > page walking logic once again. > > Mmh, yeah, you'd have to extract the locked part of zap_pte_range() > and add your own copy of all the stuff that happens higher up for > setting up TLB flushes and such... I see how that would be a massive > pain and error-prone. Yep, I'd really, really like to avoid doing that, the solution we have now is neat and avoids such duplication. > > > By being optimistic and simply having the user having to handle looping > > which seems reasonable (again, it's weird if you're installing poison > > markers and another thread could be racing you) we avoid all of that. > > I guess one case in which that could happen legitimately is if you > race a MADV_POISON on the area 0x1ff000-0x200100 (first page is > populated, second page is not, pmd entry corresponding to 0x200000 is > clear) with a page fault at 0x200200? So you could have a scenario > like: > > 1. MADV_POISON starts walk_page_range() > 2. MADV_POISON sees non-zero, non-poison PTE at 0x1ff000, stops the walk > 3. MADV_POISON does zap_page_range_single() > 4. pagefault at 0x200200 happens and populates with a hugepage > 5. MADV_POISON enters walk_page_range() > 6. MADV_POISON splits the THP > 7. MADV_POISON sees a populated PTE You really shouldn't be seeing page faults in the range you are setting up poison markers for _at all_ :) it's something you'd do ahead of time. But of course it's possible some scenario could arise like that, that's what the EAGAIN is for. I just really don't want to get into a realm of trying to prove absolutely under all circumstances that we can't go on forever in a loop like that. If you drop the lock for contention then you up the risk of that, it just feels dangerous. A userland program can however live with a 'if EAGAIN try again' situation. An alternative approach to this might be to try to take the VMA lock, but given the fraught situation with locking elsewhere I wonder if we should. Also, you have to be realy unlucky with timing for this to happen, even in the scenario you mention (where you'd have to be unlucky with alignment too), unless you're _heavily_ page faulting in the range, either way a userland loop checking EAGAIN doesn't seem unreasonable. > > > > > + update_mmu_cache(walk->vma, addr, pte); > > > > + } > > > > + > > > > + return 0; > > > > +} > > > > + > > > > +static const struct mm_walk_ops guard_unpoison_walk_ops = { > > > > + .pte_entry = guard_unpoison_pte_entry, > > > > + .walk_lock = PGWALK_RDLOCK, > > > > +}; > > > > > > It is a _little_ weird that unpoisoning creates page tables when they > > > don't already exist, which will also prevent creating THP entries on > > > fault in such areas afterwards... but I guess it doesn't really matter > > > given that poisoning has that effect, too, and you probably usually > > > won't call MADV_GUARD_UNPOISON on an area that hasn't been poisoned > > > before... so I guess this is not an actionable comment. > > > > It doesn't? There's no .install_pte so if an entries are non-present we > > ignore. > > Ah, right, of course. Nevermind. > > > HOWEVER, we do split THP. I don't think there's any way around it unless we > > extended the page walker to handle this more gracefully (pmd level being > > able to hint that we shouldn't do that or something), but that's really out > > of scope here. > > I think the `walk->action == ACTION_CONTINUE` check in > walk_pmd_range() would let you do that, see wp_clean_pmd_entry() for > an example. But yeah I guess that might just be unnecessary > complexity. Ah yeah... cool, actually think I will add that then, I hadn't noticed you could update that _in a callback_, as I first thought it was something you could set ahead of time then noticed the walker code resets it and... yeah cool. > > > The idea is that a caller can lazily call MADV_GUARD_UNPOISON on a range > > knowing things stay as they were, I guess we can add to the manpage a note > > that this will split THP? > > Yeah, might make sense... No need then :)
On Mon, Oct 14, 2024 at 7:02 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > On Mon, Oct 14, 2024 at 05:56:50PM +0200, Jann Horn wrote: > > On Mon, Oct 14, 2024 at 1:09 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > > On Fri, Oct 11, 2024 at 08:11:36PM +0200, Jann Horn wrote: > > > > On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > > By being optimistic and simply having the user having to handle looping > > > which seems reasonable (again, it's weird if you're installing poison > > > markers and another thread could be racing you) we avoid all of that. > > > > I guess one case in which that could happen legitimately is if you > > race a MADV_POISON on the area 0x1ff000-0x200100 (first page is > > populated, second page is not, pmd entry corresponding to 0x200000 is > > clear) with a page fault at 0x200200? So you could have a scenario > > like: > > > > 1. MADV_POISON starts walk_page_range() > > 2. MADV_POISON sees non-zero, non-poison PTE at 0x1ff000, stops the walk > > 3. MADV_POISON does zap_page_range_single() > > 4. pagefault at 0x200200 happens and populates with a hugepage > > 5. MADV_POISON enters walk_page_range() > > 6. MADV_POISON splits the THP > > 7. MADV_POISON sees a populated PTE > > You really shouldn't be seeing page faults in the range you are setting up > poison markers for _at all_ :) it's something you'd do ahead of time. But that's not what happens in my example - the address where the fault happens (0x200200) *is not* in the address range that MADV_POISON is called on (0x1ff000-0x200100). The fault and the MADV_POISON are in different 4KiB pages. What causes the conflict is that the fault and the MADV_POISON overlap the same *2MiB region* (both are in the region 0x200000-0x400000), and so THP stuff can effectively cause "page faults in the range you are setting up poison markers for". > But of course it's possible some scenario could arise like that, that's > what the EAGAIN is for. > > I just really don't want to get into a realm of trying to prove absolutely > under all circumstances that we can't go on forever in a loop like that. We can have a bailout on signal_pending() or something like that, and a cond_resched(). Then as far as I know, it won't really make a difference in behavior whether the loop is in the kernel or in userspace code that's following what the manpage tells it to do - either way, the program will loop until it either finishes its work or is interrupted by a signal, and either way it can get preempted. (Well, except under PREEMPT_NONE, but that is basically asking for long scheduling delays.) And we do have other codepaths that have to loop endlessly if they keep racing with page table updates the wrong way, though I guess those loops are not going to always scan over a large address range over and over again... Maybe something like this would be good enough, and mirror what you'd otherwise tell userspace to do? @@ -1598,6 +1598,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh return madvise_inject_error(behavior, start, start + len_in); #endif +retry: write = madvise_need_mmap_write(behavior); if (write) { if (mmap_write_lock_killable(mm)) @@ -1627,6 +1628,12 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh else mmap_read_unlock(mm); + if (error == <<<some special value>>>) { + if (!signal_pending(current)) + goto retry; + error = -ERESTARTNOINTR; + } + return error; } Buuut, heh, actually, I just realized: You could even omit this and simply replace -EINTR with -ERESTARTNOINTR in your code as the error value, and then the kernel would automatically go back into the syscall for you after going through signal handing and such, without userspace noticing. https://lore.kernel.org/all/20121206220955.GZ4939@ZenIV.linux.org.uk/ has some explanation on how this works. Basically it tells the architecture's syscall entry code to move the userspace instruction pointer back to the syscall instruction, so as soon as execution returns to userspace, the first userspace instruction that executes will immediately re-do the syscall. That might be the easiest way, even if it is maybe a *little* bit of an API abuse to use this thing without having a pending signal... > If you drop the lock for contention then you up the risk of that, it just > feels dangerous. > > A userland program can however live with a 'if EAGAIN try again' situation. > > An alternative approach to this might be to try to take the VMA lock, but > given the fraught situation with locking elsewhere I wonder if we should. > > Also, you have to be realy unlucky with timing for this to happen, even in > the scenario you mention (where you'd have to be unlucky with alignment > too), unless you're _heavily_ page faulting in the range, either way a > userland loop checking EAGAIN doesn't seem unreasonable. Yes, we could do -EINTR and document that for userspace, and as long as everyone using this properly reads the documentation, it will be fine. Though I imagine that from the userspace programmer perspective that's a weird API design - as in, if this error code always means I have to try again, why can't the kernel do that internally. It's kind of leaking an implementation detail into the UAPI.
On Mon, Oct 14, 2024 at 08:14:26PM +0200, Jann Horn wrote: > On Mon, Oct 14, 2024 at 7:02 PM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > On Mon, Oct 14, 2024 at 05:56:50PM +0200, Jann Horn wrote: > > > On Mon, Oct 14, 2024 at 1:09 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > > > On Fri, Oct 11, 2024 at 08:11:36PM +0200, Jann Horn wrote: > > > > > On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > > > By being optimistic and simply having the user having to handle looping > > > > which seems reasonable (again, it's weird if you're installing poison > > > > markers and another thread could be racing you) we avoid all of that. > > > > > > I guess one case in which that could happen legitimately is if you > > > race a MADV_POISON on the area 0x1ff000-0x200100 (first page is > > > populated, second page is not, pmd entry corresponding to 0x200000 is > > > clear) with a page fault at 0x200200? So you could have a scenario > > > like: > > > > > > 1. MADV_POISON starts walk_page_range() > > > 2. MADV_POISON sees non-zero, non-poison PTE at 0x1ff000, stops the walk > > > 3. MADV_POISON does zap_page_range_single() > > > 4. pagefault at 0x200200 happens and populates with a hugepage > > > 5. MADV_POISON enters walk_page_range() > > > 6. MADV_POISON splits the THP > > > 7. MADV_POISON sees a populated PTE > > > > You really shouldn't be seeing page faults in the range you are setting up > > poison markers for _at all_ :) it's something you'd do ahead of time. > > But that's not what happens in my example - the address where the > fault happens (0x200200) *is not* in the address range that > MADV_POISON is called on (0x1ff000-0x200100). The fault and the > MADV_POISON are in different 4KiB pages. What causes the conflict is > that the fault and the MADV_POISON overlap the same *2MiB region* > (both are in the region 0x200000-0x400000), and so THP stuff can > effectively cause "page faults in the range you are setting up poison > markers for". Right sorry maybe I wasn't clear in what I said - there should not be faults in the 'vicinity' of the poison pages, that is the range which you potentially intend to protect with the poison markers. HOWEVER this is problematic clearly for something like a userspace allocator where you might be allocating small ranges that might fit within a huge page. At the same time, I think you'd have to get pretty unlucky - you'd need to have faulted in enough for a huge page to be collapsed by THP, immediately adjacent to where you are installing this poison range, which spans multiple adjacent pages for some reason over the 2 MiB boundary (I assume you mean 0x201000 not 0x200100 btw :P). Anyway I think this is moot as I am warming to the idea of us just looping to be honest. There's a limit to how much can be faulted in (i.e. everything), and we hold the lock. The user is _choosing_ to call this function and if there happens to be enormously huge amounts of faulting going on then so be it. > > > But of course it's possible some scenario could arise like that, that's > > what the EAGAIN is for. > > > > I just really don't want to get into a realm of trying to prove absolutely > > under all circumstances that we can't go on forever in a loop like that. > > We can have a bailout on signal_pending() or something like that, and > a cond_resched(). Then as far as I know, it won't really make a > difference in behavior whether the loop is in the kernel or in > userspace code that's following what the manpage tells it to do - > either way, the program will loop until it either finishes its work or > is interrupted by a signal, and either way it can get preempted. > (Well, except under PREEMPT_NONE, but that is basically asking for > long scheduling delays.) > > And we do have other codepaths that have to loop endlessly if they > keep racing with page table updates the wrong way, though I guess > those loops are not going to always scan over a large address range > over and over again... > > Maybe something like this would be good enough, and mirror what you'd > otherwise tell userspace to do? > > > @@ -1598,6 +1598,7 @@ int do_madvise(struct mm_struct *mm, unsigned > long start, size_t len_in, int beh > return madvise_inject_error(behavior, start, start + len_in); > #endif > > +retry: > write = madvise_need_mmap_write(behavior); > if (write) { > if (mmap_write_lock_killable(mm)) > @@ -1627,6 +1628,12 @@ int do_madvise(struct mm_struct *mm, unsigned > long start, size_t len_in, int beh > else > mmap_read_unlock(mm); > > + if (error == <<<some special value>>>) { > + if (!signal_pending(current)) > + goto retry; > + error = -ERESTARTNOINTR; > + } > + > return error; > } > > Buuut, heh, actually, I just realized: You could even omit this and > simply replace -EINTR with -ERESTARTNOINTR in your code as the error Interesting that that exists had no idea :) I think I'd rather avoid it as it looks so specific and a lot more like asking for trouble than simply looping. > value, and then the kernel would automatically go back into the > syscall for you after going through signal handing and such, without > userspace noticing. > https://lore.kernel.org/all/20121206220955.GZ4939@ZenIV.linux.org.uk/ > has some explanation on how this works. Basically it tells the > architecture's syscall entry code to move the userspace instruction > pointer back to the syscall instruction, so as soon as execution > returns to userspace, the first userspace instruction that executes > will immediately re-do the syscall. That might be the easiest way, > even if it is maybe a *little* bit of an API abuse to use this thing > without having a pending signal... > > > > If you drop the lock for contention then you up the risk of that, it just > > feels dangerous. > > > > A userland program can however live with a 'if EAGAIN try again' situation. > > > > An alternative approach to this might be to try to take the VMA lock, but > > given the fraught situation with locking elsewhere I wonder if we should. > > > > Also, you have to be realy unlucky with timing for this to happen, even in > > the scenario you mention (where you'd have to be unlucky with alignment > > too), unless you're _heavily_ page faulting in the range, either way a > > userland loop checking EAGAIN doesn't seem unreasonable. > > Yes, we could do -EINTR and document that for userspace, and as long > as everyone using this properly reads the documentation, it will be > fine. Though I imagine that from the userspace programmer perspective > that's a weird API design - as in, if this error code always means I > have to try again, why can't the kernel do that internally. It's kind > of leaking an implementation detail into the UAPI. Overall I am warming to us just looping. I mean it's hard to argue against this being 'surprising' behaviour - the user expects to be able to install poison markers and for that to just be applied regardless of faulting. And it's not exactly a huge amount more effort for us to simply loop, I just wanted to avoid it to avoid having to think about whether there's cases that could result in an eternal loop... We'd need a write lock to VMA lock the VMAs and prevent racing faults (or in the case of non-VMA lock kernels to simplly prevent them) which adds to contention issues arguably a lot more than simply looping under the read lock. Let me think about this but I think I will go ahead and try to add something simple that loops + checks for pending fatal signals for the next iteration of this series.
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h index 763929e814e9..71e13f27742d 100644 --- a/arch/alpha/include/uapi/asm/mman.h +++ b/arch/alpha/include/uapi/asm/mman.h @@ -78,6 +78,9 @@ #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h index 9c48d9a21aa0..1a2222322f77 100644 --- a/arch/mips/include/uapi/asm/mman.h +++ b/arch/mips/include/uapi/asm/mman.h @@ -105,6 +105,9 @@ #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h index 68c44f99bc93..380905522397 100644 --- a/arch/parisc/include/uapi/asm/mman.h +++ b/arch/parisc/include/uapi/asm/mman.h @@ -75,6 +75,9 @@ #define MADV_HWPOISON 100 /* poison a page for testing */ #define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */ +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h index 1ff0c858544f..e8d5affceb28 100644 --- a/arch/xtensa/include/uapi/asm/mman.h +++ b/arch/xtensa/include/uapi/asm/mman.h @@ -113,6 +113,9 @@ #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index 6ce1f1ceb432..5dfd3d442de4 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -79,6 +79,9 @@ #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ +#define MADV_GUARD_POISON 102 /* fatal signal on access to range */ +#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/mm/madvise.c b/mm/madvise.c index e871a72a6c32..7216e10723ae 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior) case MADV_POPULATE_READ: case MADV_POPULATE_WRITE: case MADV_COLLAPSE: + case MADV_GUARD_UNPOISON: /* Only poisoning needs a write lock. */ return 0; default: /* be safe, default to 1. list exceptions explicitly */ @@ -1017,6 +1018,157 @@ static long madvise_remove(struct vm_area_struct *vma, return error; } +static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked) +{ + vm_flags_t disallowed = VM_SPECIAL | VM_HUGETLB; + + /* + * A user could lock after poisoning but that's fine, as they'd not be + * able to fault in. The issue arises when we try to zap existing locked + * VMAs. We don't want to do that. + */ + if (!allow_locked) + disallowed |= VM_LOCKED; + + if (!vma_is_anonymous(vma)) + return false; + + if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE) + return false; + + return true; +} + +static int guard_poison_install_pte(unsigned long addr, unsigned long next, + pte_t *ptep, struct mm_walk *walk) +{ + unsigned long *num_installed = (unsigned long *)walk->private; + + (*num_installed)++; + /* Simply install a PTE marker, this causes segfault on access. */ + *ptep = make_pte_marker(PTE_MARKER_GUARD); + + return 0; +} + +static bool is_guard_pte_marker(pte_t ptent) +{ + return is_pte_marker(ptent) && + is_guard_swp_entry(pte_to_swp_entry(ptent)); +} + +static int guard_poison_pte_entry(pte_t *pte, unsigned long addr, + unsigned long next, struct mm_walk *walk) +{ + pte_t ptent = ptep_get(pte); + + /* + * If not a guard marker, simply abort the operation. We return a value + * > 0 indicating a non-error abort. + */ + return !is_guard_pte_marker(ptent); +} + +static const struct mm_walk_ops guard_poison_walk_ops = { + .install_pte = guard_poison_install_pte, + .pte_entry = guard_poison_pte_entry, + /* We might need to install an anon_vma. */ + .walk_lock = PGWALK_WRLOCK, +}; + +static long madvise_guard_poison(struct vm_area_struct *vma, + struct vm_area_struct **prev, + unsigned long start, unsigned long end) +{ + long err; + bool retried = false; + + *prev = vma; + if (!is_valid_guard_vma(vma, /* allow_locked = */false)) + return -EINVAL; + + /* + * Optimistically try to install the guard poison pages first. If any + * non-guard pages are encountered, give up and zap the range before + * trying again. + */ + while (true) { + unsigned long num_installed = 0; + + /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */ + err = walk_page_range_mm(vma->vm_mm, start, end, + &guard_poison_walk_ops, + &num_installed); + /* + * If we install poison markers, then the range is no longer + * empty from a page table perspective and therefore it's + * appropriate to have an anon_vma. + * + * This ensures that on fork, we copy page tables correctly. + */ + if (err >= 0 && num_installed > 0) { + int err_anon = anon_vma_prepare(vma); + + if (err_anon) + err = err_anon; + } + + if (err <= 0) + return err; + + if (!retried) + /* + * OK some of the range have non-guard pages mapped, zap + * them. This leaves existing guard pages in place. + */ + zap_page_range_single(vma, start, end - start, NULL); + else + /* + * If we reach here, then there is a racing fault that + * has populated the PTE after we zapped. Give up and + * let the user know to try again. + */ + return -EAGAIN; + + retried = true; + } +} + +static int guard_unpoison_pte_entry(pte_t *pte, unsigned long addr, + unsigned long next, struct mm_walk *walk) +{ + pte_t ptent = ptep_get(pte); + + if (is_guard_pte_marker(ptent)) { + /* Simply clear the PTE marker. */ + pte_clear_not_present_full(walk->mm, addr, pte, true); + update_mmu_cache(walk->vma, addr, pte); + } + + return 0; +} + +static const struct mm_walk_ops guard_unpoison_walk_ops = { + .pte_entry = guard_unpoison_pte_entry, + .walk_lock = PGWALK_RDLOCK, +}; + +static long madvise_guard_unpoison(struct vm_area_struct *vma, + struct vm_area_struct **prev, + unsigned long start, unsigned long end) +{ + *prev = vma; + /* + * We're ok with unpoisoning mlock()'d ranges, as this is a + * non-destructive action. + */ + if (!is_valid_guard_vma(vma, /* allow_locked = */true)) + return -EINVAL; + + return walk_page_range(vma->vm_mm, start, end, + &guard_unpoison_walk_ops, NULL); +} + /* * Apply an madvise behavior to a region of a vma. madvise_update_vma * will handle splitting a vm area into separate areas, each area with its own @@ -1098,6 +1250,10 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, break; case MADV_COLLAPSE: return madvise_collapse(vma, prev, start, end); + case MADV_GUARD_POISON: + return madvise_guard_poison(vma, prev, start, end); + case MADV_GUARD_UNPOISON: + return madvise_guard_unpoison(vma, prev, start, end); } anon_name = anon_vma_name(vma); @@ -1197,6 +1353,8 @@ madvise_behavior_valid(int behavior) case MADV_DODUMP: case MADV_WIPEONFORK: case MADV_KEEPONFORK: + case MADV_GUARD_POISON: + case MADV_GUARD_UNPOISON: #ifdef CONFIG_MEMORY_FAILURE case MADV_SOFT_OFFLINE: case MADV_HWPOISON: diff --git a/mm/mprotect.c b/mm/mprotect.c index 0c5d6d06107d..d0e3ebfadef8 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -236,7 +236,8 @@ static long change_pte_range(struct mmu_gather *tlb, } else if (is_pte_marker_entry(entry)) { /* * Ignore error swap entries unconditionally, - * because any access should sigbus anyway. + * because any access should sigbus/sigsegv + * anyway. */ if (is_poisoned_swp_entry(entry)) continue; diff --git a/mm/mseal.c b/mm/mseal.c index ece977bd21e1..21bf5534bcf5 100644 --- a/mm/mseal.c +++ b/mm/mseal.c @@ -30,6 +30,7 @@ static bool is_madv_discard(int behavior) case MADV_REMOVE: case MADV_DONTFORK: case MADV_WIPEONFORK: + case MADV_GUARD_POISON: return true; }
Implement a new lightweight guard page feature, that is regions of userland virtual memory that, when accessed, cause a fatal signal to arise. Currently users must establish PROT_NONE ranges to achieve this. However this is very costly memory-wise - we need a VMA for each and every one of these regions AND they become unmergeable with surrounding VMAs. In addition repeated mmap() calls require repeated kernel context switches and contention of the mmap lock to install these ranges, potentially also having to unmap memory if installed over existing ranges. The lightweight guard approach eliminates the VMA cost altogether - rather than establishing a PROT_NONE VMA, it operates at the level of page table entries - poisoning PTEs such that accesses to them cause a fault followed by a SIGSGEV signal being raised. This is achieved through the PTE marker mechanism, which a previous commit in this series extended to permit this to be done, installed via the generic page walking logic, also extended by a prior commit for this purpose. These poison ranges are established with MADV_GUARD_POISON, and if the range in which they are installed contain any existing mappings, they will be zapped, i.e. free the range and unmap memory (thus mimicking the behaviour of MADV_DONTNEED in this respect). Any existing poison entries will be left untouched. There is no nesting of poisoned pages. Poisoned ranges are NOT cleared by MADV_DONTNEED, as this would be rather unexpected behaviour, but are cleared on process teardown or unmapping of memory ranges. Ranges can have the poison property removed by MADV_GUARD_UNPOISON - 'remedying' the poisoning. The ranges over which this is applied, should they contain non-poison entries, will be untouched, only poison entries will be cleared. We permit this operation on anonymous memory only, and only VMAs which are non-special, non-huge and not mlock()'d (if we permitted this we'd have to drop locked pages which would be rather counterintuitive). The poisoning of the range must be performed under mmap write lock as we have to install an anon_vma to ensure correct behaviour on fork. Suggested-by: Vlastimil Babka <vbabka@suze.cz> Suggested-by: Jann Horn <jannh@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> --- arch/alpha/include/uapi/asm/mman.h | 3 + arch/mips/include/uapi/asm/mman.h | 3 + arch/parisc/include/uapi/asm/mman.h | 3 + arch/xtensa/include/uapi/asm/mman.h | 3 + include/uapi/asm-generic/mman-common.h | 3 + mm/madvise.c | 158 +++++++++++++++++++++++++ mm/mprotect.c | 3 +- mm/mseal.c | 1 + 8 files changed, 176 insertions(+), 1 deletion(-)