Message ID | 1433942810-7852-2-git-send-email-emunson@akamai.com |
---|---|
State | Awaiting Upstream |
[Sorry for the late reply - I meant to answer in the previous threads
but something always preempted me from that]

On Wed 10-06-15 09:26:48, Eric B Munson wrote:
> The cost of faulting in all memory to be locked can be very high when
> working with large mappings. If only portions of the mapping will be
> used this can incur a high penalty for locking.
>
> For the example of a large file, this is the usage pattern for a large
> statistical language model (probably applies to other statistical or
> graphical models as well). For the security example, any application
> transacting in data that cannot be swapped out (credit card data,
> medical records, etc).

Such a use case makes some sense to me but I am not sure the way you
implement it is the right one. This is another mlock related flag for
mmap with a different semantic. You do not want to prefault, but is,
e.g., readahead or fault around acceptable? I do not see anything in
your patch to handle those...

Wouldn't it be much more reasonable and straightforward to have
MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would
explicitly disallow any form of pre-faulting? It would be usable for
other use cases than the MAP_LOCKED combination.

> This patch introduces the ability to request that pages are not
> pre-faulted, but are placed on the unevictable LRU when they are
> finally faulted in.
>
> To keep accounting checks out of the page fault path, users are billed
> for the entire mapping lock as if MAP_LOCKED was used.
>
> Signed-off-by: Eric B Munson <emunson@akamai.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: linux-alpha@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mips@linux-mips.org
> Cc: linux-parisc@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: sparclinux@vger.kernel.org
> Cc: linux-xtensa@linux-xtensa.org
> Cc: linux-mm@kvack.org
> Cc: linux-arch@vger.kernel.org
> Cc: linux-api@vger.kernel.org
> ---
>  arch/alpha/include/uapi/asm/mman.h   | 1 +
>  arch/mips/include/uapi/asm/mman.h    | 1 +
>  arch/parisc/include/uapi/asm/mman.h  | 1 +
>  arch/powerpc/include/uapi/asm/mman.h | 1 +
>  arch/sparc/include/uapi/asm/mman.h   | 1 +
>  arch/tile/include/uapi/asm/mman.h    | 1 +
>  arch/xtensa/include/uapi/asm/mman.h  | 1 +
>  include/linux/mm.h                   | 1 +
>  include/linux/mman.h                 | 3 ++-
>  include/uapi/asm-generic/mman.h      | 1 +
>  mm/mmap.c                            | 4 ++--
>  mm/swap.c                            | 3 ++-
>  12 files changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 0086b47..15e96e1 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -30,6 +30,7 @@
>  #define MAP_NONBLOCK	0x40000		/* do not block on IO */
>  #define MAP_STACK	0x80000		/* give out an address that is best suited for process/thread stacks */
>  #define MAP_HUGETLB	0x100000	/* create a huge page mapping */
> +#define MAP_LOCKONFAULT	0x200000	/* Lock pages after they are faulted in, do not prefault */
>
>  #define MS_ASYNC	1		/* sync memory asynchronously */
>  #define MS_SYNC	2		/* synchronous memory sync */
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index cfcb876..47846a5 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -48,6 +48,7 @@
>  #define MAP_NONBLOCK	0x20000		/* do not block on IO */
>  #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
>  #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
> +#define MAP_LOCKONFAULT	0x100000	/* Lock pages after they are faulted in, do not prefault */
>
>  /*
>   * Flags for msync
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index 294d251..1514cd7 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -24,6 +24,7 @@
>  #define MAP_NONBLOCK	0x20000		/* do not block on IO */
>  #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
>  #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
> +#define MAP_LOCKONFAULT	0x100000	/* Lock pages after they are faulted in, do not prefault */
>
>  #define MS_SYNC		1		/* synchronous memory sync */
>  #define MS_ASYNC	2		/* sync memory asynchronously */
> diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
> index 6ea26df..fce74fe 100644
> --- a/arch/powerpc/include/uapi/asm/mman.h
> +++ b/arch/powerpc/include/uapi/asm/mman.h
> @@ -27,5 +27,6 @@
>  #define MAP_NONBLOCK	0x10000		/* do not block on IO */
>  #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
>  #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
> +#define MAP_LOCKONFAULT	0x80000		/* Lock pages after they are faulted in, do not prefault */
>
>  #endif /* _UAPI_ASM_POWERPC_MMAN_H */
> diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
> index 0b14df3..12425d8 100644
> --- a/arch/sparc/include/uapi/asm/mman.h
> +++ b/arch/sparc/include/uapi/asm/mman.h
> @@ -22,6 +22,7 @@
>  #define MAP_NONBLOCK	0x10000		/* do not block on IO */
>  #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
>  #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
> +#define MAP_LOCKONFAULT	0x80000		/* Lock pages after they are faulted in, do not prefault */
>
>
>  #endif /* _UAPI__SPARC_MMAN_H__ */
> diff --git a/arch/tile/include/uapi/asm/mman.h b/arch/tile/include/uapi/asm/mman.h
> index 81b8fc3..ec04eaf 100644
> --- a/arch/tile/include/uapi/asm/mman.h
> +++ b/arch/tile/include/uapi/asm/mman.h
> @@ -29,6 +29,7 @@
>  #define MAP_DENYWRITE	0x0800		/* ETXTBSY */
>  #define MAP_EXECUTABLE	0x1000		/* mark it as an executable */
>  #define MAP_HUGETLB	0x4000		/* create a huge page mapping */
> +#define MAP_LOCKONFAULT	0x8000		/* Lock pages after they are faulted in, do not prefault */
>
>
>  /*
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index 201aec0..42d43cc 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -55,6 +55,7 @@
>  #define MAP_NONBLOCK	0x20000		/* do not block on IO */
>  #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
>  #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
> +#define MAP_LOCKONFAULT	0x100000	/* Lock pages after they are faulted in, do not prefault */
>  #ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
>  # define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
>  					 * uninitialized */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0755b9f..3e31457 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -126,6 +126,7 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
>  #define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
>
> +#define VM_LOCKONFAULT	0x00001000	/* Lock the pages covered when they are faulted in */
>  #define VM_LOCKED	0x00002000
>  #define VM_IO		0x00004000	/* Memory mapped I/O or similar */
>
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index 16373c8..437264b 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -86,7 +86,8 @@ calc_vm_flag_bits(unsigned long flags)
>  {
>  	return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
>  	       _calc_vm_trans(flags, MAP_DENYWRITE,  VM_DENYWRITE ) |
> -	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    );
> +	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
> +	       _calc_vm_trans(flags, MAP_LOCKONFAULT,VM_LOCKONFAULT);
>  }
>
>  unsigned long vm_commit_limit(void);
> diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
> index e9fe6fd..fc4e586 100644
> --- a/include/uapi/asm-generic/mman.h
> +++ b/include/uapi/asm-generic/mman.h
> @@ -12,6 +12,7 @@
>  #define MAP_NONBLOCK	0x10000		/* do not block on IO */
>  #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
>  #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
> +#define MAP_LOCKONFAULT	0x80000		/* Lock pages after they are faulted in, do not prefault */
>
>  /* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index bb50cac..ba1a6bf 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1233,7 +1233,7 @@ static inline int mlock_future_check(struct mm_struct *mm,
>  	unsigned long locked, lock_limit;
>
>  	/*  mlock MCL_FUTURE? */
> -	if (flags & VM_LOCKED) {
> +	if (flags & (VM_LOCKED | VM_LOCKONFAULT)) {
>  		locked = len >> PAGE_SHIFT;
>  		locked += mm->locked_vm;
>  		lock_limit = rlimit(RLIMIT_MEMLOCK);
> @@ -1301,7 +1301,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
>  	vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
>  			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
>
> -	if (flags & MAP_LOCKED)
> +	if (flags & (MAP_LOCKED | MAP_LOCKONFAULT))
>  		if (!can_do_mlock())
>  			return -EPERM;
>
> diff --git a/mm/swap.c b/mm/swap.c
> index a7251a8..07c905e 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -711,7 +711,8 @@ void lru_cache_add_active_or_unevictable(struct page *page,
>  {
>  	VM_BUG_ON_PAGE(PageLRU(page), page);
>
> -	if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
> +	if (likely((vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) == 0) ||
> +			(vma->vm_flags & VM_SPECIAL)) {
>  		SetPageActive(page);
>  		lru_cache_add(page);
>  		return;
> --
> 1.9.1
On Thu, 18 Jun 2015, Michal Hocko wrote:

> [Sorry for the late reply - I meant to answer in the previous threads
> but something always preempted me from that]
>
> On Wed 10-06-15 09:26:48, Eric B Munson wrote:
> > The cost of faulting in all memory to be locked can be very high when
> > working with large mappings. If only portions of the mapping will be
> > used this can incur a high penalty for locking.
> >
> > For the example of a large file, this is the usage pattern for a large
> > statistical language model (probably applies to other statistical or
> > graphical models as well). For the security example, any application
> > transacting in data that cannot be swapped out (credit card data,
> > medical records, etc).
>
> Such a use case makes some sense to me but I am not sure the way you
> implement it is the right one. This is another mlock related flag for
> mmap with a different semantic. You do not want to prefault, but is,
> e.g., readahead or fault around acceptable? I do not see anything in
> your patch to handle those...

We haven't bumped into readahead or fault around causing performance
problems for us. If they cause problems for users when LOCKONFAULT is
in use, then we can address them.

> Wouldn't it be much more reasonable and straightforward to have
> MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would
> explicitly disallow any form of pre-faulting? It would be usable for
> other use cases than the MAP_LOCKED combination.

I don't see a clear case for it being more reasonable; it is one
possible way to solve the problem. But I think it leaves us in an even
more awkward state WRT VMA flags. As you noted in your fix for the
mmap() man page, one can get into a state where a VMA is VM_LOCKED, but
not present. Having VM_LOCKONFAULT states that this was intentional. If
we go to using MAP_FAULTPOPULATE instead of MAP_LOCKONFAULT, we no
longer set VM_LOCKONFAULT (unless we want to start mapping it to the
presence of two MAP_ flags). This can make detecting the MAP_LOCKED +
populate failure state harder.

If this is the preferred path for mmap(), I am fine with that. However,
I would like to see the new system calls that Andrew mentioned (and that
I am testing patches for) go in as well. That way we give users the
ability to request VM_LOCKONFAULT for memory allocated using something
other than mmap.

> > This patch introduces the ability to request that pages are not
> > pre-faulted, but are placed on the unevictable LRU when they are
> > finally faulted in.
> >
> > To keep accounting checks out of the page fault path, users are billed
> > for the entire mapping lock as if MAP_LOCKED was used.
> >
> > Signed-off-by: Eric B Munson <emunson@akamai.com>
> > [...]
> > ---
> > [...]
>
> --
> Michal Hocko
> SUSE Labs
On Thu 18-06-15 16:30:48, Eric B Munson wrote:
> On Thu, 18 Jun 2015, Michal Hocko wrote:
[...]
> > Wouldn't it be much more reasonable and straightforward to have
> > MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would
> > explicitly disallow any form of pre-faulting? It would be usable for
> > other use cases than the MAP_LOCKED combination.
>
> I don't see a clear case for it being more reasonable, it is one
> possible way to solve the problem.

MAP_FAULTPOPULATE would be usable for other cases as well. E.g. fault
around is an all or nothing feature: either all mappings (which support
it) fault around or none do. There is no way to tell the kernel that a
particular mapping shouldn't fault around. I haven't seen such a
request yet, but we have seen requests for a way to opt out of a global
policy in the past (e.g. per-process opt out from THP), so I can
imagine somebody will come with a request to opt out of any speculative
operations on the mapped area in the future.

> But I think it leaves us in an even
> more awkward state WRT VMA flags. As you noted in your fix for the
> mmap() man page, one can get into a state where a VMA is VM_LOCKED, but
> not present. Having VM_LOCKONFAULT states that this was intentional. If
> we go to using MAP_FAULTPOPULATE instead of MAP_LOCKONFAULT, we no
> longer set VM_LOCKONFAULT (unless we want to start mapping it to the
> presence of two MAP_ flags). This can make detecting the MAP_LOCKED +
> populate failure state harder.

I am not sure I understand your point here. Could you be more specific
about how you would check for that, and what for?

From my understanding MAP_LOCKONFAULT is essentially
MAP_FAULTPOPULATE|MAP_LOCKED with a quite obvious semantic (unlike
plain MAP_LOCKED, unfortunately). I would love to also have
MAP_LOCKED|MAP_POPULATE (aka full mlock semantic), but I am really
skeptical considering how my previous attempt to make MAP_POPULATE
reasonable went.

> If this is the preferred path for mmap(), I am fine with that.
> However, I would like to see the new system calls that Andrew
> mentioned (and that I am testing patches for) go in as well.

mlock with flags sounds like a good step but I am not sure it will make
sense in the future. POSIX has screwed that up and I am not sure how
many applications would use it. That ship sailed a long time ago.

> That way we give users the
> ability to request VM_LOCKONFAULT for memory allocated using something
> other than mmap.

mmap(MAP_FAULTPOPULATE); mlock() would have the same semantic even
without changing the mlock syscall.

> > This patch introduces the ability to request that pages are not
> > pre-faulted, but are placed on the unevictable LRU when they are
> > finally faulted in.
> >
> > To keep accounting checks out of the page fault path, users are billed
> > for the entire mapping lock as if MAP_LOCKED was used.
> >
> > Signed-off-by: Eric B Munson <emunson@akamai.com>
> > [...]
> > ---
[...]
On Fri, 19 Jun 2015, Michal Hocko wrote:

> On Thu 18-06-15 16:30:48, Eric B Munson wrote:
> > On Thu, 18 Jun 2015, Michal Hocko wrote:
> [...]
> > > Wouldn't it be much more reasonable and straightforward to have
> > > MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would
> > > explicitly disallow any form of pre-faulting? It would be usable for
> > > other use cases than the MAP_LOCKED combination.
> >
> > I don't see a clear case for it being more reasonable, it is one
> > possible way to solve the problem.
>
> MAP_FAULTPOPULATE would be usable for other cases as well. E.g. fault
> around is an all or nothing feature: either all mappings (which support
> it) fault around or none do. There is no way to tell the kernel that a
> particular mapping shouldn't fault around. I haven't seen such a
> request yet, but we have seen requests for a way to opt out of a global
> policy in the past (e.g. per-process opt out from THP), so I can
> imagine somebody will come with a request to opt out of any speculative
> operations on the mapped area in the future.
>
> > But I think it leaves us in an even
> > more awkward state WRT VMA flags. As you noted in your fix for the
> > mmap() man page, one can get into a state where a VMA is VM_LOCKED, but
> > not present. Having VM_LOCKONFAULT states that this was intentional. If
> > we go to using MAP_FAULTPOPULATE instead of MAP_LOCKONFAULT, we no
> > longer set VM_LOCKONFAULT (unless we want to start mapping it to the
> > presence of two MAP_ flags). This can make detecting the MAP_LOCKED +
> > populate failure state harder.
>
> I am not sure I understand your point here. Could you be more specific
> about how you would check for that, and what for?

My thought on detecting was that someone might want to know if they had
a VMA that was VM_LOCKED but had not been made present because of a
failure in mmap. We don't have a way today, but adding VM_LOCKONFAULT
is at least explicit about what is happening, which would make detecting
the VM_LOCKED but not present state easier. This assumes that
MAP_FAULTPOPULATE does not translate to a VMA flag, but it sounds like
it would have to.

> From my understanding MAP_LOCKONFAULT is essentially
> MAP_FAULTPOPULATE|MAP_LOCKED with a quite obvious semantic (unlike
> plain MAP_LOCKED, unfortunately). I would love to also have
> MAP_LOCKED|MAP_POPULATE (aka full mlock semantic), but I am really
> skeptical considering how my previous attempt to make MAP_POPULATE
> reasonable went.

Are you objecting to the addition of the VMA flag VM_LOCKONFAULT, or the
new MAP_LOCKONFAULT flag (or both)? If you prefer that MAP_LOCKED |
MAP_FAULTPOPULATE means that VM_LOCKONFAULT is set, I am fine with that
instead of introducing MAP_LOCKONFAULT. I went with the new flag
because, to date, we have a one to one mapping of MAP_* to VM_* flags.

> > If this is the preferred path for mmap(), I am fine with that.
> > However, I would like to see the new system calls that Andrew
> > mentioned (and that I am testing patches for) go in as well.
>
> mlock with flags sounds like a good step but I am not sure it will make
> sense in the future. POSIX has screwed that up and I am not sure how
> many applications would use it. That ship sailed a long time ago.

I don't know either, but the code is the question, right? I know that
we have at least one team that wants it here.

> > That way we give users the
> > ability to request VM_LOCKONFAULT for memory allocated using something
> > other than mmap.
>
> mmap(MAP_FAULTPOPULATE); mlock() would have the same semantic even
> without changing the mlock syscall.

That is true as long as MAP_FAULTPOPULATE sets a flag in the VMA(s). It
doesn't cover the actual case I was asking about, which is how do I get
lock on fault on malloc'd memory?

> > > > This patch introduces the ability to request that pages are not
> > > > pre-faulted, but are placed on the unevictable LRU when they are
> > > > finally faulted in.
> > > > [...]
> [...]
> --
> Michal Hocko
> SUSE Labs
On Fri 19-06-15 12:43:33, Eric B Munson wrote:
> On Fri, 19 Jun 2015, Michal Hocko wrote:
[...]
> > I am not sure I understand your point here. Could you be more specific
> > about how you would check for that, and what for?
>
> My thought on detecting was that someone might want to know if they had
> a VMA that was VM_LOCKED but had not been made present because of a
> failure in mmap. We don't have a way today, but adding VM_LOCKONFAULT
> is at least explicit about what is happening, which would make detecting
> the VM_LOCKED but not present state easier.

One could use /proc/<pid>/pagemap to query the residency.

> This assumes that
> MAP_FAULTPOPULATE does not translate to a VMA flag, but it sounds like
> it would have to.

Yes, it would have to have a VM flag for the vma.

> > From my understanding MAP_LOCKONFAULT is essentially
> > MAP_FAULTPOPULATE|MAP_LOCKED with a quite obvious semantic (unlike
> > plain MAP_LOCKED, unfortunately). [...]
>
> Are you objecting to the addition of the VMA flag VM_LOCKONFAULT, or the
> new MAP_LOCKONFAULT flag (or both)?

I thought MAP_FAULTPOPULATE (or any other better name) would directly
translate into VM_FAULTPOPULATE and wouldn't be tied to the locked
semantic; we already have VM_LOCKED for that. The direct effect of the
flag would be to prevent population other than by direct page fault -
including any speculative actions like fault around or read-ahead.

> If you prefer that MAP_LOCKED |
> MAP_FAULTPOPULATE means that VM_LOCKONFAULT is set, I am fine with that
> instead of introducing MAP_LOCKONFAULT. I went with the new flag
> because, to date, we have a one to one mapping of MAP_* to VM_* flags.
>
> > > If this is the preferred path for mmap(), I am fine with that.
> > > However, I would like to see the new system calls that Andrew
> > > mentioned (and that I am testing patches for) go in as well.
> >
> > mlock with flags sounds like a good step but I am not sure it will make
> > sense in the future. [...]
>
> I don't know either, but the code is the question, right? I know that
> we have at least one team that wants it here.
>
> > > That way we give users the
> > > ability to request VM_LOCKONFAULT for memory allocated using something
> > > other than mmap.
> >
> > mmap(MAP_FAULTPOPULATE); mlock() would have the same semantic even
> > without changing the mlock syscall.
>
> That is true as long as MAP_FAULTPOPULATE sets a flag in the VMA(s). It
> doesn't cover the actual case I was asking about, which is how do I get
> lock on fault on malloc'd memory?

OK, I see your point now. We would indeed need a flag argument for
mlock.
On Mon, 22 Jun 2015, Michal Hocko wrote:
> On Fri 19-06-15 12:43:33, Eric B Munson wrote:
> > On Fri, 19 Jun 2015, Michal Hocko wrote:
> > > On Thu 18-06-15 16:30:48, Eric B Munson wrote:
> > > > On Thu, 18 Jun 2015, Michal Hocko wrote:
> > > [...]
> > > > > Wouldn't it be much more reasonable and straightforward to have MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would explicitly disallow any form of pre-faulting? It would be usable for other usecases than with MAP_LOCKED combination.
>
> > > > I don't see a clear case for it being more reasonable, it is one possible way to solve the problem.
>
> > > MAP_FAULTPOPULATE would be usable for other cases as well. E.g. fault around is an all-or-nothing feature. Either all mappings (which support this) fault around or none. There is no way to tell the kernel that this particular mapping shouldn't fault around. I haven't seen such a request yet but we have seen requests to have a way to opt out from a global policy in the past (e.g. per-process opt out from THP). So I can imagine somebody will come with a request to opt out from any speculative operations on the mapped area in the future.
>
> > > > But I think it leaves us in an even more awkward state WRT VMA flags. As you noted in your fix for the mmap() man page, one can get into a state where a VMA is VM_LOCKED, but not present. Having VM_LOCKONFAULT states that this was intentional; if we go to using MAP_FAULTPOPULATE instead of MAP_LOCKONFAULT, we no longer set VM_LOCKONFAULT (unless we want to start mapping it to the presence of two MAP_ flags). This can make detecting the MAP_LOCKED + populate failure state harder.
>
> > > I am not sure I understand your point here. Could you be more specific how would you check for that and what for?
>
> > My thought on detecting was that someone might want to know if they had a VMA that was VM_LOCKED but had not been made present because of a failure in mmap. We don't have a way today, but adding VM_LOCKONFAULT is at least explicit about what is happening which would make detecting the VM_LOCKED but not present state easier.
>
> One could use /proc/<pid>/pagemap to query the residency.
>
> > This assumes that MAP_FAULTPOPULATE does not translate to a VMA flag, but it sounds like it would have to.
>
> Yes, it would have to have a VM flag for the vma.
>
> > > From my understanding MAP_LOCKONFAULT is essentially MAP_FAULTPOPULATE|MAP_LOCKED with a quite obvious semantic (unlike single MAP_LOCKED unfortunately). I would love to also have MAP_LOCKED|MAP_POPULATE (aka full mlock semantic) but I am really skeptical considering how my previous attempt to make MAP_POPULATE reasonable went.
>
> > Are you objecting to the addition of the VMA flag VM_LOCKONFAULT, or the new MAP_LOCKONFAULT flag (or both)?
>
> I thought the MAP_FAULTPOPULATE (or any other better name) would directly translate into VM_FAULTPOPULATE and wouldn't be tied to the locked semantic. We already have VM_LOCKED for that. The direct effect of the flag would be to prevent population other than by the direct page fault - including any speculative actions like fault around or read-ahead.

I like the ability to control other speculative population, but I am not sure about overloading it with the VM_LOCKONFAULT case. Here is my concern. If we are using VM_FAULTPOPULATE | VM_LOCKED to denote LOCKONFAULT, how can we tell the difference between someone that wants to avoid read-ahead and someone that wants to use mlock()? This might lead to some interesting states with mlock() and munlock() that take flags. For instance, using VM_LOCKONFAULT, mlock(MLOCK_ONFAULT) followed by munlock(MLOCK_LOCKED) leaves the VMAs in the same state with VM_LOCKONFAULT set. If we use VM_FAULTPOPULATE, the same pair of calls would clear VM_LOCKED, but leave VM_FAULTPOPULATE. It may not matter in the end, but I am concerned about the subtleties here.

> > If you prefer that MAP_LOCKED | MAP_FAULTPOPULATE means that VM_LOCKONFAULT is set, I am fine with that instead of introducing MAP_LOCKONFAULT. I went with the new flag because to date, we have a one to one mapping of MAP_* to VM_* flags.
>
> > > > If this is the preferred path for mmap(), I am fine with that. However, I would like to see the new system calls that Andrew mentioned (and that I am testing patches for) go in as well.
>
> > > mlock with flags sounds like a good step but I am not sure it will make sense in the future. POSIX has screwed that and I am not sure how many applications would use it. This ship has sailed a long time ago.
>
> > I don't know either, but the code is the question, right? I know that we have at least one team that wants it here.
>
> > > > That way we give users the ability to request VM_LOCKONFAULT for memory allocated using something other than mmap.
>
> > > mmap(MAP_FAULTPOPULATE); mlock() would have the same semantic even without changing mlock syscall.
>
> > That is true as long as MAP_FAULTPOPULATE set a flag in the VMA(s). It doesn't cover the actual case I was asking about, which is how do I get lock on fault on malloc'd memory?
>
> OK I see your point now. We would indeed need a flag argument for mlock.
> --
> Michal Hocko
> SUSE Labs
On 06/22/2015 04:18 PM, Eric B Munson wrote:
> On Mon, 22 Jun 2015, Michal Hocko wrote:
>> On Fri 19-06-15 12:43:33, Eric B Munson wrote:
>>> On Fri, 19 Jun 2015, Michal Hocko wrote:
>>>> On Thu 18-06-15 16:30:48, Eric B Munson wrote:
>>>>> On Thu, 18 Jun 2015, Michal Hocko wrote:
>>>> [...]
>>>>>> Wouldn't it be much more reasonable and straightforward to have MAP_FAULTPOPULATE as a counterpart for MAP_POPULATE which would explicitly disallow any form of pre-faulting? It would be usable for other usecases than with MAP_LOCKED combination.

>>>>> I don't see a clear case for it being more reasonable, it is one possible way to solve the problem.

>>>> MAP_FAULTPOPULATE would be usable for other cases as well. E.g. fault around is an all-or-nothing feature. Either all mappings (which support this) fault around or none. There is no way to tell the kernel that this particular mapping shouldn't fault around. I haven't seen such a request yet but we have seen requests to have a way to opt out from a global policy in the past (e.g. per-process opt out from THP). So I can imagine somebody will come with a request to opt out from any speculative operations on the mapped area in the future.

That sounds like something where a new madvise() flag would make more sense than a new mmap flag, and conflating it with locking behavior would lead to all kinds of weird corner cases as Eric mentioned.

>>>>> But I think it leaves us in an even more awkward state WRT VMA flags. As you noted in your fix for the mmap() man page, one can get into a state where a VMA is VM_LOCKED, but not present. Having VM_LOCKONFAULT states that this was intentional; if we go to using MAP_FAULTPOPULATE instead of MAP_LOCKONFAULT, we no longer set VM_LOCKONFAULT (unless we want to start mapping it to the presence of two MAP_ flags). This can make detecting the MAP_LOCKED + populate failure state harder.

>>>> I am not sure I understand your point here. Could you be more specific how would you check for that and what for?

>>> My thought on detecting was that someone might want to know if they had a VMA that was VM_LOCKED but had not been made present because of a failure in mmap. We don't have a way today, but adding VM_LOCKONFAULT is at least explicit about what is happening which would make detecting the VM_LOCKED but not present state easier.

>> One could use /proc/<pid>/pagemap to query the residency.

I think that's much too complex a scenario for a little gain. If someone knows that mmap(MAP_LOCKED|MAP_POPULATE) is not perfect, he should either mlock() separately from mmap(), or fault the range manually with a for loop. Why try to detect if the corner case was hit?

>>> This assumes that MAP_FAULTPOPULATE does not translate to a VMA flag, but it sounds like it would have to.

>> Yes, it would have to have a VM flag for the vma.

So with your approach, the VM_LOCKED flag is enough, right? The new MAP_ / MLOCK_ flags just cause setting VM_LOCKED to not fault in the whole vma, but otherwise nothing changes. If that's true, I think it's better than a new vma flag.

>>>> From my understanding MAP_LOCKONFAULT is essentially MAP_FAULTPOPULATE|MAP_LOCKED with a quite obvious semantic (unlike single MAP_LOCKED unfortunately). I would love to also have MAP_LOCKED|MAP_POPULATE (aka full mlock semantic) but I am really skeptical considering how my previous attempt to make MAP_POPULATE reasonable went.

>>> Are you objecting to the addition of the VMA flag VM_LOCKONFAULT, or the new MAP_LOCKONFAULT flag (or both)?

>> I thought the MAP_FAULTPOPULATE (or any other better name) would directly translate into VM_FAULTPOPULATE and wouldn't be tied to the locked semantic. We already have VM_LOCKED for that. The direct effect of the flag would be to prevent population other than by the direct page fault - including any speculative actions like fault around or read-ahead.

> I like the ability to control other speculative population, but I am not sure about overloading it with the VM_LOCKONFAULT case. Here is my concern. If we are using VM_FAULTPOPULATE | VM_LOCKED to denote LOCKONFAULT, how can we tell the difference between someone that wants to avoid read-ahead and someone that wants to use mlock()? This might lead to some interesting states with mlock() and munlock() that take flags. For instance, using VM_LOCKONFAULT, mlock(MLOCK_ONFAULT) followed by munlock(MLOCK_LOCKED) leaves the VMAs in the same state with VM_LOCKONFAULT set. If we use VM_FAULTPOPULATE, the same pair of calls would clear VM_LOCKED, but leave VM_FAULTPOPULATE. It may not matter in the end, but I am concerned about the subtleties here.

Right.

>>> If you prefer that MAP_LOCKED | MAP_FAULTPOPULATE means that VM_LOCKONFAULT is set, I am fine with that instead of introducing MAP_LOCKONFAULT. I went with the new flag because to date, we have a one to one mapping of MAP_* to VM_* flags.

>>>>> If this is the preferred path for mmap(), I am fine with that. However, I would like to see the new system calls that Andrew mentioned (and that I am testing patches for) go in as well.

>>>> mlock with flags sounds like a good step but I am not sure it will make sense in the future. POSIX has screwed that and I am not sure how many applications would use it. This ship has sailed a long time ago.

>>> I don't know either, but the code is the question, right? I know that we have at least one team that wants it here.

>>>>> That way we give users the ability to request VM_LOCKONFAULT for memory allocated using something other than mmap.

>>>> mmap(MAP_FAULTPOPULATE); mlock() would have the same semantic even without changing mlock syscall.

>>> That is true as long as MAP_FAULTPOPULATE set a flag in the VMA(s). It doesn't cover the actual case I was asking about, which is how do I get lock on fault on malloc'd memory?

>> OK I see your point now. We would indeed need a flag argument for mlock.
>> --
>> Michal Hocko
>> SUSE Labs
On Mon 22-06-15 10:18:06, Eric B Munson wrote:
> On Mon, 22 Jun 2015, Michal Hocko wrote:
> > On Fri 19-06-15 12:43:33, Eric B Munson wrote:
[...]
> > > Are you objecting to the addition of the VMA flag VM_LOCKONFAULT, or the new MAP_LOCKONFAULT flag (or both)?
>
> > I thought the MAP_FAULTPOPULATE (or any other better name) would directly translate into VM_FAULTPOPULATE and wouldn't be tied to the locked semantic. We already have VM_LOCKED for that. The direct effect of the flag would be to prevent population other than by the direct page fault - including any speculative actions like fault around or read-ahead.
>
> I like the ability to control other speculative population, but I am not sure about overloading it with the VM_LOCKONFAULT case. Here is my concern. If we are using VM_FAULTPOPULATE | VM_LOCKED to denote LOCKONFAULT, how can we tell the difference between someone that wants to avoid read-ahead and someone that wants to use mlock()?

Not sure I understand. Something like?
	addr = mmap(VM_FAULTPOPULATE)	# To prevent speculative mappings into the vma
	[...]
	mlock(addr, len)		# Now I want the full mlock semantic
and the latter to have the full mlock semantic and populate the given area regardless of VM_FAULTPOPULATE being set on the vma? This would be an interesting question because the mlock man page clearly states the semantic and that is to _always_ populate or fail. So I originally thought that it would obey VM_FAULTPOPULATE but this needs more thinking.

> This might lead to some interesting states with mlock() and munlock() that take flags. For instance, using VM_LOCKONFAULT, mlock(MLOCK_ONFAULT) followed by munlock(MLOCK_LOCKED) leaves the VMAs in the same state with VM_LOCKONFAULT set.

This is really confusing. Let me try to rephrase that. So you have
	mlock(addr, len, MLOCK_ONFAULT)
	munlock(addr, len, MLOCK_LOCKED)
IIUC you would expect the vma still being MLOCK_ONFAULT, right? Isn't that behavior strange and unexpected? First of all, munlock has traditionally dropped the lock on the address range (e.g. what should happen if you did plain old munlock(addr, len)). But even without that, you are trying to unlock something that hasn't been locked the same way. So I would expect -EINVAL at least, if the two modes should really be represented by different flags.

Or did you mean both types of lock, like:
	mlock(addr, len, MLOCK_ONFAULT) | mmap(MAP_LOCKONFAULT)
	mlock(addr, len, MLOCK_LOCKED)
	munlock(addr, len, MLOCK_LOCKED)
and that should keep MLOCK_ONFAULT? This sounds even more weird to me because that means that the vma in question would be locked by two different mechanisms. MLOCK_LOCKED with the "always populate" semantic would rule out MLOCK_ONFAULT so what would be the meaning of the other flag then? Also what should regular munlock(addr, len) without flags unlock? Both?

> If we use VM_FAULTPOPULATE, the same pair of calls would clear VM_LOCKED, but leave VM_FAULTPOPULATE. It may not matter in the end, but I am concerned about the subtleties here.

This sounds like the proper behavior to me. munlock should simply always drop VM_LOCKED and VM_FAULTPOPULATE can live its separate life. Btw. could you be more specific about the semantic of m{un}lock(addr, len, flags) you want to propose? The more I think about it the more unclear I am about it, especially munlock behavior and possible flags.
On Tue 23-06-15 14:45:17, Vlastimil Babka wrote:
> On 06/22/2015 04:18 PM, Eric B Munson wrote:
> >On Mon, 22 Jun 2015, Michal Hocko wrote:
> >>On Fri 19-06-15 12:43:33, Eric B Munson wrote:
[...]
> >>>My thought on detecting was that someone might want to know if they had a VMA that was VM_LOCKED but had not been made present because of a failure in mmap. We don't have a way today, but adding VM_LOCKONFAULT is at least explicit about what is happening which would make detecting the VM_LOCKED but not present state easier.
>
> >>One could use /proc/<pid>/pagemap to query the residency.
>
> I think that's much too complex a scenario for a little gain. If someone knows that mmap(MAP_LOCKED|MAP_POPULATE) is not perfect, he should either mlock() separately from mmap(), or fault the range manually with a for loop. Why try to detect if the corner case was hit?

No idea. I have just offered a way to do that. I do not think it is anyhow useful but who knows... I do agree that mlock should be used for the full mlock semantic.

> >>>This assumes that MAP_FAULTPOPULATE does not translate to a VMA flag, but it sounds like it would have to.
>
> >>Yes, it would have to have a VM flag for the vma.
>
> So with your approach, the VM_LOCKED flag is enough, right? The new MAP_ / MLOCK_ flags just cause setting VM_LOCKED to not fault in the whole vma, but otherwise nothing changes.

VM_FAULTPOPULATE would have to be sticky to prevent other speculative population of the mapping. I mean, is it OK to have a new mlock semantic (on fault) which might still populate&lock memory which hasn't been faulted directly? Who knows what kind of speculative things we will do in the future and then find out that the semantic of lock-on-fault is not usable anymore.

[...]
On Wed, 24 Jun 2015, Michal Hocko wrote:
> On Mon 22-06-15 10:18:06, Eric B Munson wrote:
> > On Mon, 22 Jun 2015, Michal Hocko wrote:
> > > On Fri 19-06-15 12:43:33, Eric B Munson wrote:
> [...]
> > > > Are you objecting to the addition of the VMA flag VM_LOCKONFAULT, or the new MAP_LOCKONFAULT flag (or both)?
>
> > > I thought the MAP_FAULTPOPULATE (or any other better name) would directly translate into VM_FAULTPOPULATE and wouldn't be tied to the locked semantic. We already have VM_LOCKED for that. The direct effect of the flag would be to prevent population other than by the direct page fault - including any speculative actions like fault around or read-ahead.
>
> > I like the ability to control other speculative population, but I am not sure about overloading it with the VM_LOCKONFAULT case. Here is my concern. If we are using VM_FAULTPOPULATE | VM_LOCKED to denote LOCKONFAULT, how can we tell the difference between someone that wants to avoid read-ahead and someone that wants to use mlock()?
>
> Not sure I understand. Something like?
>	addr = mmap(VM_FAULTPOPULATE)	# To prevent speculative mappings into the vma
>	[...]
>	mlock(addr, len)		# Now I want the full mlock semantic

So this leaves us without the LOCKONFAULT semantics? That is not at all what I am looking for. What I want is a way to express 3 possible states of a VMA WRT locking: locked (populated and all pages on the unevictable LRU), lock on fault (populated by page fault, pages that are present are on the unevictable LRU, newly faulted pages are added to same), and not locked.

> and the latter to have the full mlock semantic and populate the given area regardless of VM_FAULTPOPULATE being set on the vma? This would be an interesting question because the mlock man page clearly states the semantic and that is to _always_ populate or fail. So I originally thought that it would obey VM_FAULTPOPULATE but this needs more thinking.
>
> > This might lead to some interesting states with mlock() and munlock() that take flags. For instance, using VM_LOCKONFAULT, mlock(MLOCK_ONFAULT) followed by munlock(MLOCK_LOCKED) leaves the VMAs in the same state with VM_LOCKONFAULT set.
>
> This is really confusing. Let me try to rephrase that. So you have
>	mlock(addr, len, MLOCK_ONFAULT)
>	munlock(addr, len, MLOCK_LOCKED)
> IIUC you would expect the vma still being MLOCK_ONFAULT, right? Isn't that behavior strange and unexpected? First of all, munlock has traditionally dropped the lock on the address range (e.g. what should happen if you did plain old munlock(addr, len)). But even without that, you are trying to unlock something that hasn't been locked the same way. So I would expect -EINVAL at least, if the two modes should really be represented by different flags.

I would expect it to remain MLOCK_LOCKONFAULT because the user requested munlock(addr, len, MLOCK_LOCKED). It is not currently an error to unlock memory that is not locked. We do this because we do not require the user to track what areas are locked. It is acceptable to have a mostly locked area with holes unlocked with a single call to munlock that spans the entire area. The same semantics should hold for munlock with flags. If I have an area with MLOCK_LOCKED and MLOCK_ONFAULT interleaved, it should be acceptable to clear the MLOCK_ONFAULT flag from those areas with a single munlock call that spans the area.

On top of continuing with munlock semantics, the implementation would need the ability to roll back a munlock call if it failed after altering VMAs. If we have the same interleaved area as before and we go to return -EINVAL the first time we hit an area that was MLOCK_LOCKED, how do we restore the state of the VMAs we have already processed, and possibly merged/split?

> Or did you mean both types of lock, like:
>	mlock(addr, len, MLOCK_ONFAULT) | mmap(MAP_LOCKONFAULT)
>	mlock(addr, len, MLOCK_LOCKED)
>	munlock(addr, len, MLOCK_LOCKED)
> and that should keep MLOCK_ONFAULT? This sounds even more weird to me because that means that the vma in question would be locked by two different mechanisms. MLOCK_LOCKED with the "always populate" semantic would rule out MLOCK_ONFAULT so what would be the meaning of the other flag then? Also what should regular munlock(addr, len) without flags unlock? Both?

This is indeed confusing and not what I was trying to illustrate, but since you bring it up: mlockall() currently clears all flags and then sets the new flags with each subsequent call. mlock2 would use that same behavior; if LOCKED was specified for an ONFAULT region, that region would become LOCKED and vice versa. I have the new system call set ready, I am waiting to post for rc1 so I can run the benchmarks again on a base more stable than the middle of a merge window. We should wait to hash out implementations until the code is up rather than talk past each other here.

> > If we use VM_FAULTPOPULATE, the same pair of calls would clear VM_LOCKED, but leave VM_FAULTPOPULATE. It may not matter in the end, but I am concerned about the subtleties here.
>
> This sounds like the proper behavior to me. munlock should simply always drop VM_LOCKED and VM_FAULTPOPULATE can live its separate life.
>
> Btw. could you be more specific about the semantic of m{un}lock(addr, len, flags) you want to propose? The more I think about it the more unclear I am about it, especially munlock behavior and possible flags.
> --
> Michal Hocko
> SUSE Labs
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 0086b47..15e96e1 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -30,6 +30,7 @@
 #define MAP_NONBLOCK	0x40000		/* do not block on IO */
 #define MAP_STACK	0x80000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x100000	/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x200000	/* Lock pages after they are faulted in, do not prefault */
 
 #define MS_ASYNC	1		/* sync memory asynchronously */
 #define MS_SYNC	2		/* synchronous memory sync */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index cfcb876..47846a5 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -48,6 +48,7 @@
 #define MAP_NONBLOCK	0x20000		/* do not block on IO */
 #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x100000	/* Lock pages after they are faulted in, do not prefault */
 
 /*
  * Flags for msync
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 294d251..1514cd7 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -24,6 +24,7 @@
 #define MAP_NONBLOCK	0x20000		/* do not block on IO */
 #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x100000	/* Lock pages after they are faulted in, do not prefault */
 
 #define MS_SYNC		1		/* synchronous memory sync */
 #define MS_ASYNC	2		/* sync memory asynchronously */
diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
index 6ea26df..fce74fe 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -27,5 +27,6 @@
 #define MAP_NONBLOCK	0x10000		/* do not block on IO */
 #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x80000		/* Lock pages after they are faulted in, do not prefault */
 
 #endif /* _UAPI_ASM_POWERPC_MMAN_H */
diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
index 0b14df3..12425d8 100644
--- a/arch/sparc/include/uapi/asm/mman.h
+++ b/arch/sparc/include/uapi/asm/mman.h
@@ -22,6 +22,7 @@
 #define MAP_NONBLOCK	0x10000		/* do not block on IO */
 #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x80000		/* Lock pages after they are faulted in, do not prefault */
 
 #endif /* _UAPI__SPARC_MMAN_H__ */
diff --git a/arch/tile/include/uapi/asm/mman.h b/arch/tile/include/uapi/asm/mman.h
index 81b8fc3..ec04eaf 100644
--- a/arch/tile/include/uapi/asm/mman.h
+++ b/arch/tile/include/uapi/asm/mman.h
@@ -29,6 +29,7 @@
 #define MAP_DENYWRITE	0x0800		/* ETXTBSY */
 #define MAP_EXECUTABLE	0x1000		/* mark it as an executable */
 #define MAP_HUGETLB	0x4000		/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x8000		/* Lock pages after they are faulted in, do not prefault */
 
 /*
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 201aec0..42d43cc 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -55,6 +55,7 @@
 #define MAP_NONBLOCK	0x20000		/* do not block on IO */
 #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x100000	/* Lock pages after they are faulted in, do not prefault */
 #ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
 # define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
 					 * uninitialized */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0755b9f..3e31457 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -126,6 +126,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
 #define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
+#define VM_LOCKONFAULT	0x00001000	/* Lock the pages covered when they are faulted in */
 #define VM_LOCKED	0x00002000
 #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 16373c8..437264b 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -86,7 +86,8 @@ calc_vm_flag_bits(unsigned long flags)
 {
 	return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
 	       _calc_vm_trans(flags, MAP_DENYWRITE,  VM_DENYWRITE ) |
-	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    );
+	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
+	       _calc_vm_trans(flags, MAP_LOCKONFAULT,VM_LOCKONFAULT);
 }
 
 unsigned long vm_commit_limit(void);
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index e9fe6fd..fc4e586 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -12,6 +12,7 @@
 #define MAP_NONBLOCK	0x10000		/* do not block on IO */
 #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x80000		/* Lock pages after they are faulted in, do not prefault */
 
 /* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
diff --git a/mm/mmap.c b/mm/mmap.c
index bb50cac..ba1a6bf 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1233,7 +1233,7 @@ static inline int mlock_future_check(struct mm_struct *mm,
 	unsigned long locked, lock_limit;
 
 	/*  mlock MCL_FUTURE? */
-	if (flags & VM_LOCKED) {
+	if (flags & (VM_LOCKED | VM_LOCKONFAULT)) {
 		locked = len >> PAGE_SHIFT;
 		locked += mm->locked_vm;
 		lock_limit = rlimit(RLIMIT_MEMLOCK);
@@ -1301,7 +1301,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
 	vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
-	if (flags & MAP_LOCKED)
+	if (flags & (MAP_LOCKED | MAP_LOCKONFAULT))
 		if (!can_do_mlock())
 			return -EPERM;
diff --git a/mm/swap.c b/mm/swap.c
index a7251a8..07c905e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -711,7 +711,8 @@ void lru_cache_add_active_or_unevictable(struct page *page,
 {
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
-	if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
+	if (likely((vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) == 0) ||
+	    (vma->vm_flags & VM_SPECIAL)) {
 		SetPageActive(page);
 		lru_cache_add(page);
 		return;
The cost of faulting in all memory to be locked can be very high when
working with large mappings. If only portions of the mapping will be
used, this can incur a high penalty for locking.

For the example of a large file, this is the usage pattern for a large
statistical language model (probably applies to other statistical or
graphical models as well). For the security example, consider any
application transacting in data that cannot be swapped out (credit card
data, medical records, etc).

This patch introduces the ability to request that pages are not
pre-faulted, but are placed on the unevictable LRU when they are finally
faulted in.

To keep accounting checks out of the page fault path, users are billed
for the entire mapping lock as if MAP_LOCKED was used.

Signed-off-by: Eric B Munson <emunson@akamai.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: linux-alpha@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linux-mm@kvack.org
Cc: linux-arch@vger.kernel.org
Cc: linux-api@vger.kernel.org
---
 arch/alpha/include/uapi/asm/mman.h   | 1 +
 arch/mips/include/uapi/asm/mman.h    | 1 +
 arch/parisc/include/uapi/asm/mman.h  | 1 +
 arch/powerpc/include/uapi/asm/mman.h | 1 +
 arch/sparc/include/uapi/asm/mman.h   | 1 +
 arch/tile/include/uapi/asm/mman.h    | 1 +
 arch/xtensa/include/uapi/asm/mman.h  | 1 +
 include/linux/mm.h                   | 1 +
 include/linux/mman.h                 | 3 ++-
 include/uapi/asm-generic/mman.h      | 1 +
 mm/mmap.c                            | 4 ++--
 mm/swap.c                            | 3 ++-
 12 files changed, 15 insertions(+), 4 deletions(-)