diff mbox

mm/madvise: allow MADV_DONTNEED to free memory that is MLOCK_ONFAULT

Message ID 1528484212-7199-1-git-send-email-jbaron@akamai.com (mailing list archive)
State New, archived
Headers show

Commit Message

Jason Baron June 8, 2018, 6:56 p.m. UTC
In order to free memory that is marked MLOCK_ONFAULT, the memory region
needs to be first unlocked, before calling MADV_DONTNEED. And if the region
is to be reused as MLOCK_ONFAULT, we require another call to mlock2() with
the MLOCK_ONFAULT flag.

Let's simplify freeing memory that is set MLOCK_ONFAULT, by allowing
MADV_DONTNEED to work directly for memory that is set MLOCK_ONFAULT. The
locked memory limits, tracked by mm->locked_vm do not need to be adjusted
in this case, since they were charged to the entire region when
MLOCK_ONFAULT was initially set.

Further, I don't think allowing MADV_FREE for MLOCK_ONFAULT regions makes
sense, since the point of MLOCK_ONFAULT is for userspace to know when pages
are locked in memory and thus to know when page faults will occur.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/internal.h | 18 ++++++++++++++++++
 mm/madvise.c  |  4 ++--
 mm/oom_kill.c |  2 +-
 3 files changed, 21 insertions(+), 3 deletions(-)

Comments

Andrew Morton June 8, 2018, 7:57 p.m. UTC | #1
On Fri,  8 Jun 2018 14:56:52 -0400 Jason Baron <jbaron@akamai.com> wrote:

> In order to free memory that is marked MLOCK_ONFAULT, the memory region
> needs to be first unlocked, before calling MADV_DONTNEED. And if the region
> is to be reused as MLOCK_ONFAULT, we require another call to mlock2() with
> the MLOCK_ONFAULT flag.
> 
> Let's simplify freeing memory that is set MLOCK_ONFAULT, by allowing
> MADV_DONTNEED to work directly for memory that is set MLOCK_ONFAULT. The
> locked memory limits, tracked by mm->locked_vm do not need to be adjusted
> in this case, since they were charged to the entire region when
> MLOCK_ONFAULT was initially set.

Seems useful.

Is a manpage update planned?

Various updates to tools/testing/selftests/vm/* seem appropriate.

> Further, I don't think allowing MADV_FREE for MLOCK_ONFAULT regions makes
> sense, since the point of MLOCK_ONFAULT is for userspace to know when pages
> are locked in memory and thus to know when page faults will occur.

This sounds non-backward-compatible?
Jason Baron June 8, 2018, 8:55 p.m. UTC | #2
On 06/08/2018 03:57 PM, Andrew Morton wrote:
> On Fri,  8 Jun 2018 14:56:52 -0400 Jason Baron <jbaron@akamai.com> wrote:
> 
>> In order to free memory that is marked MLOCK_ONFAULT, the memory region
>> needs to be first unlocked, before calling MADV_DONTNEED. And if the region
>> is to be reused as MLOCK_ONFAULT, we require another call to mlock2() with
>> the MLOCK_ONFAULT flag.
>>
>> Let's simplify freeing memory that is set MLOCK_ONFAULT, by allowing
>> MADV_DONTNEED to work directly for memory that is set MLOCK_ONFAULT. The
>> locked memory limits, tracked by mm->locked_vm do not need to be adjusted
>> in this case, since they were charged to the entire region when
>> MLOCK_ONFAULT was initially set.
> 
> Seems useful.
> 
> Is a manpage update planned?
> 

Yes, I will add a manpage update. I sort of wanted to see first if
people thought this patch was a reasonable thing to do.

> Various updates to tools/testing/selftests/vm/* seem appropriate.
> 

Indeed, I started updating tootls/testing/selftests/vm/mlock2-tests.c
with this new interface, but then I realized that that test is failing
before I made any changes. So I will go back and sort that out, and add
additional testing for this new interface.

>> Further, I don't think allowing MADV_FREE for MLOCK_ONFAULT regions makes
>> sense, since the point of MLOCK_ONFAULT is for userspace to know when pages
>> are locked in memory and thus to know when page faults will occur.
> 
> This sounds non-backward-compatible?
> 

I was making the point of why I think allowing 'MADV_DONTNEED' for
MLOCK_ONFAULT regions makes sense, while allowing 'MADV_FREE' for
MLOCK_ONFAULT regions really does not.

Thanks,

-Jason
kernel test robot June 9, 2018, 11:51 a.m. UTC | #3
Hi Jason,

I love your patch! Perhaps something to improve:

[auto build test WARNING on mmotm/master]
[also build test WARNING on v4.17 next-20180608]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jason-Baron/mm-madvise-allow-MADV_DONTNEED-to-free-memory-that-is-MLOCK_ONFAULT/20180609-185549
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: alpha-allmodconfig (attached as .config)
compiler: alpha-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.2.0 make.cross ARCH=alpha 

All warnings (new ones prefixed by >>):

   In file included from mm//internal.h:18:0,
                    from mm//swap.c:39:
   include/uapi/asm-generic/mman-common.h:22:0: warning: "MAP_FIXED" redefined
    #define MAP_FIXED 0x10  /* Interpret addr exactly */
    
   In file included from include/uapi/linux/mman.h:5:0,
                    from include/linux/mman.h:9,
                    from mm//swap.c:20:
   arch/alpha/include/uapi/asm/mman.h:17:0: note: this is the location of the previous definition
    #define MAP_FIXED 0x100  /* Interpret addr exactly */
    
   In file included from mm//internal.h:18:0,
                    from mm//swap.c:39:
   include/uapi/asm-generic/mman-common.h:23:0: warning: "MAP_ANONYMOUS" redefined
    #define MAP_ANONYMOUS 0x20  /* don't use a file */
    
   In file included from include/uapi/linux/mman.h:5:0,
                    from include/linux/mman.h:9,
                    from mm//swap.c:20:
   arch/alpha/include/uapi/asm/mman.h:18:0: note: this is the location of the previous definition
    #define MAP_ANONYMOUS 0x10  /* don't use a file */
    
   In file included from mm//internal.h:18:0,
                    from mm//swap.c:39:
   include/uapi/asm-generic/mman-common.h:27:0: warning: "MAP_UNINITIALIZED" redefined
    # define MAP_UNINITIALIZED 0x0  /* Don't support this flag */
    
   In file included from mm//swap.c:20:0:
   include/linux/mman.h:25:0: note: this is the location of the previous definition
    #define MAP_UNINITIALIZED 0
    
   In file included from mm//internal.h:18:0,
                    from mm//swap.c:39:
>> include/uapi/asm-generic/mman-common.h:31:0: warning: "MAP_FIXED_NOREPLACE" redefined
    #define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */
    
   In file included from include/uapi/linux/mman.h:5:0,
                    from include/linux/mman.h:9,
                    from mm//swap.c:20:
   arch/alpha/include/uapi/asm/mman.h:35:0: note: this is the location of the previous definition
    #define MAP_FIXED_NOREPLACE 0x200000/* MAP_FIXED which doesn't unmap underlying mapping */
    
   In file included from mm//internal.h:18:0,
                    from mm//swap.c:39:
   include/uapi/asm-generic/mman-common.h:39:0: warning: "MS_INVALIDATE" redefined
    #define MS_INVALIDATE 2  /* invalidate the caches */
    
   In file included from include/uapi/linux/mman.h:5:0,
                    from include/linux/mman.h:9,
                    from mm//swap.c:20:
   arch/alpha/include/uapi/asm/mman.h:39:0: note: this is the location of the previous definition
    #define MS_INVALIDATE 4  /* invalidate the caches */
    
   In file included from mm//internal.h:18:0,
                    from mm//swap.c:39:
   include/uapi/asm-generic/mman-common.h:40:0: warning: "MS_SYNC" redefined
    #define MS_SYNC  4  /* synchronous memory sync */
    
   In file included from include/uapi/linux/mman.h:5:0,
                    from include/linux/mman.h:9,
                    from mm//swap.c:20:
   arch/alpha/include/uapi/asm/mman.h:38:0: note: this is the location of the previous definition
    #define MS_SYNC  2  /* synchronous memory sync */
    
   In file included from mm//internal.h:18:0,
                    from mm//swap.c:39:
>> include/uapi/asm-generic/mman-common.h:46:0: warning: "MADV_DONTNEED" redefined
    #define MADV_DONTNEED 4  /* don't need these pages */
    
   In file included from include/uapi/linux/mman.h:5:0,
                    from include/linux/mman.h:9,
                    from mm//swap.c:20:
   arch/alpha/include/uapi/asm/mman.h:52:0: note: this is the location of the previous definition
    #define MADV_DONTNEED 6  /* don't need these pages */
    

vim +/MAP_FIXED_NOREPLACE +31 include/uapi/asm-generic/mman-common.h

5f6164f3 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  17  
5f6164f3 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  18  #define MAP_SHARED	0x01		/* Share changes */
5f6164f3 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  19  #define MAP_PRIVATE	0x02		/* Changes are private */
1c972597 include/uapi/asm-generic/mman-common.h Dan Williams       2017-11-01  20  #define MAP_SHARED_VALIDATE 0x03	/* share + validate extension flags */
5f6164f3 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  21  #define MAP_TYPE	0x0f		/* Mask for type of mapping */
5f6164f3 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  22  #define MAP_FIXED	0x10		/* Interpret addr exactly */
5f6164f3 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  23  #define MAP_ANONYMOUS	0x20		/* don't use a file */
ea637639 include/asm-generic/mman-common.h      Jie Zhang          2009-12-14  24  #ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
ea637639 include/asm-generic/mman-common.h      Jie Zhang          2009-12-14  25  # define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be uninitialized */
ea637639 include/asm-generic/mman-common.h      Jie Zhang          2009-12-14  26  #else
ea637639 include/asm-generic/mman-common.h      Jie Zhang          2009-12-14 @27  # define MAP_UNINITIALIZED 0x0		/* Don't support this flag */
ea637639 include/asm-generic/mman-common.h      Jie Zhang          2009-12-14  28  #endif
4ed28639 include/uapi/asm-generic/mman-common.h Michal Hocko       2018-04-10  29  
4ed28639 include/uapi/asm-generic/mman-common.h Michal Hocko       2018-04-10  30  /* 0x0100 - 0x80000 flags are defined in asm-generic/mman.h */
4ed28639 include/uapi/asm-generic/mman-common.h Michal Hocko       2018-04-10 @31  #define MAP_FIXED_NOREPLACE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
5f6164f3 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  32  
b0f205c2 include/uapi/asm-generic/mman-common.h Eric B Munson      2015-11-05  33  /*
b0f205c2 include/uapi/asm-generic/mman-common.h Eric B Munson      2015-11-05  34   * Flags for mlock
b0f205c2 include/uapi/asm-generic/mman-common.h Eric B Munson      2015-11-05  35   */
b0f205c2 include/uapi/asm-generic/mman-common.h Eric B Munson      2015-11-05  36  #define MLOCK_ONFAULT	0x01		/* Lock pages in range after they are faulted in, do not prefault */
b0f205c2 include/uapi/asm-generic/mman-common.h Eric B Munson      2015-11-05  37  
5f6164f3 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  38  #define MS_ASYNC	1		/* sync memory asynchronously */
5f6164f3 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  39  #define MS_INVALIDATE	2		/* invalidate the caches */
5f6164f3 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  40  #define MS_SYNC		4		/* synchronous memory sync */
5f6164f3 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  41  
5f6164f3 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  42  #define MADV_NORMAL	0		/* no further special treatment */
5f6164f3 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  43  #define MADV_RANDOM	1		/* expect random page references */
5f6164f3 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  44  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
5f6164f3 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  45  #define MADV_WILLNEED	3		/* will need these pages */
5f6164f3 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15 @46  #define MADV_DONTNEED	4		/* don't need these pages */
5f6164f3 include/asm-generic/mman.h             Michael S. Tsirkin 2006-02-15  47  

:::::: The code at line 31 was first introduced by commit
:::::: 4ed28639519c7bad5f518e70b3284c6e0763e650 fs, elf: drop MAP_FIXED usage from elf_map

:::::: TO: Michal Hocko <mhocko@suse.com>
:::::: CC: Linus Torvalds <torvalds@linux-foundation.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Michal Hocko June 11, 2018, 7:20 a.m. UTC | #4
[CCing linux-api - please make sure to CC this mailing list anytime you
 are touching user visible apis]

On Fri 08-06-18 14:56:52, Jason Baron wrote:
> In order to free memory that is marked MLOCK_ONFAULT, the memory region
> needs to be first unlocked, before calling MADV_DONTNEED. And if the region
> is to be reused as MLOCK_ONFAULT, we require another call to mlock2() with
> the MLOCK_ONFAULT flag.
> 
> Let's simplify freeing memory that is set MLOCK_ONFAULT, by allowing
> MADV_DONTNEED to work directly for memory that is set MLOCK_ONFAULT.

I do not understand the point here. How is MLOCK_ONFAULT any different
from the regular mlock here? If you want to free mlocked memory then
fine but the behavior should be consistent. MLOCK_ONFAULT is just a way
to say that we do not want to pre-populate the mlocked area and do that
lazily on the page fault time. madvise should make any difference here.

That being said we do not allow MADV_DONTNEED on VM_LOCKED since ever. I
do not really see why but this would be a user visible change. Can we do
that? What was the original motivation for exclusion?

[keeping the rest of email for linux-api]

> The
> locked memory limits, tracked by mm->locked_vm do not need to be adjusted
> in this case, since they were charged to the entire region when
> MLOCK_ONFAULT was initially set.
> 
> Further, I don't think allowing MADV_FREE for MLOCK_ONFAULT regions makes
> sense, since the point of MLOCK_ONFAULT is for userspace to know when pages
> are locked in memory and thus to know when page faults will occur.
> 
> Signed-off-by: Jason Baron <jbaron@akamai.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/internal.h | 18 ++++++++++++++++++
>  mm/madvise.c  |  4 ++--
>  mm/oom_kill.c |  2 +-
>  3 files changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 9e3654d..16c0041 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -15,6 +15,7 @@
>  #include <linux/mm.h>
>  #include <linux/pagemap.h>
>  #include <linux/tracepoint-defs.h>
> +#include <uapi/asm-generic/mman-common.h>
>  
>  /*
>   * The set of flags that only affect watermark checking and reclaim
> @@ -45,9 +46,26 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>  
>  static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
>  {
> +	return !(((vma->vm_flags & (VM_LOCKED|VM_LOCKONFAULT)) == VM_LOCKED) ||
> +		 (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP)));
> +}
> +
> +static inline bool can_madv_free_vma(struct vm_area_struct *vma)
> +{
>  	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
>  }
>  
> +static inline bool can_madv_dontneed_or_free_vma(struct vm_area_struct *vma,
> +						 int behavior)
> +{
> +	if (behavior == MADV_DONTNEED)
> +		return can_madv_dontneed_vma(vma);
> +	else if (behavior == MADV_FREE)
> +		return can_madv_free_vma(vma);
> +	else
> +		return 0;
> +}
> +
>  void unmap_page_range(struct mmu_gather *tlb,
>  			     struct vm_area_struct *vma,
>  			     unsigned long addr, unsigned long end,
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 4d3c922..61ff306 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -517,7 +517,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>  				  int behavior)
>  {
>  	*prev = vma;
> -	if (!can_madv_dontneed_vma(vma))
> +	if (!can_madv_dontneed_or_free_vma(vma, behavior))
>  		return -EINVAL;
>  
>  	if (!userfaultfd_remove(vma, start, end)) {
> @@ -539,7 +539,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>  			 */
>  			return -ENOMEM;
>  		}
> -		if (!can_madv_dontneed_vma(vma))
> +		if (!can_madv_dontneed_or_free_vma(vma, behavior))
>  			return -EINVAL;
>  		if (end > vma->vm_end) {
>  			/*
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8ba6cb8..9817d15 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -492,7 +492,7 @@ void __oom_reap_task_mm(struct mm_struct *mm)
>  	set_bit(MMF_UNSTABLE, &mm->flags);
>  
>  	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> -		if (!can_madv_dontneed_vma(vma))
> +		if (!can_madv_free_vma(vma))
>  			continue;
>  
>  		/*
> -- 
> 2.7.4
>
Jason Baron June 11, 2018, 2:51 p.m. UTC | #5
On 06/11/2018 03:20 AM, Michal Hocko wrote:
> [CCing linux-api - please make sure to CC this mailing list anytime you
>  are touching user visible apis]
> 
> On Fri 08-06-18 14:56:52, Jason Baron wrote:
>> In order to free memory that is marked MLOCK_ONFAULT, the memory region
>> needs to be first unlocked, before calling MADV_DONTNEED. And if the region
>> is to be reused as MLOCK_ONFAULT, we require another call to mlock2() with
>> the MLOCK_ONFAULT flag.
>>
>> Let's simplify freeing memory that is set MLOCK_ONFAULT, by allowing
>> MADV_DONTNEED to work directly for memory that is set MLOCK_ONFAULT.
> 
> I do not understand the point here. How is MLOCK_ONFAULT any different
> from the regular mlock here? If you want to free mlocked memory then
> fine but the behavior should be consistent. MLOCK_ONFAULT is just a way
> to say that we do not want to pre-populate the mlocked area and do that
> lazily on the page fault time. madvise should make any difference here.
>

The difference for me is after the page has been freed, MLOCK_ONFAULT
will re-populate the range if its accessed again. Whereas with regular
mlock I don't think it will because its normally done at mlock() or
mmap() time. In any case, the state of a region being locked with
regular mlock and pages not present does not currently exist, whereas it
does for MLOCK_ONFAULT, so it seems more natural to do it only for
MLOCK_ONFAULT. Finally, the use-case we had for this, didn't need
regular mlock().

> That being said we do not allow MADV_DONTNEED on VM_LOCKED since ever. I
> do not really see why but this would be a user visible change. Can we do
> that? What was the original motivation for exclusion?
> 

I'm not sure precisely for regular mlock. But for MLOCK_ONFAULT I did
ask the original author, Eric Munson (added to the 'cc) about allowing
MADV_DONTNEED, and iirc, he thought it made sense for MLOCK_ONFAULT.

Thanks,

-Jason


> [keeping the rest of email for linux-api]
> 
>> The
>> locked memory limits, tracked by mm->locked_vm do not need to be adjusted
>> in this case, since they were charged to the entire region when
>> MLOCK_ONFAULT was initially set.
>>
>> Further, I don't think allowing MADV_FREE for MLOCK_ONFAULT regions makes
>> sense, since the point of MLOCK_ONFAULT is for userspace to know when pages
>> are locked in memory and thus to know when page faults will occur.
>>
>> Signed-off-by: Jason Baron <jbaron@akamai.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> ---
>>  mm/internal.h | 18 ++++++++++++++++++
>>  mm/madvise.c  |  4 ++--
>>  mm/oom_kill.c |  2 +-
>>  3 files changed, 21 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 9e3654d..16c0041 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -15,6 +15,7 @@
>>  #include <linux/mm.h>
>>  #include <linux/pagemap.h>
>>  #include <linux/tracepoint-defs.h>
>> +#include <uapi/asm-generic/mman-common.h>
>>  
>>  /*
>>   * The set of flags that only affect watermark checking and reclaim
>> @@ -45,9 +46,26 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>>  
>>  static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
>>  {
>> +	return !(((vma->vm_flags & (VM_LOCKED|VM_LOCKONFAULT)) == VM_LOCKED) ||
>> +		 (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP)));
>> +}
>> +
>> +static inline bool can_madv_free_vma(struct vm_area_struct *vma)
>> +{
>>  	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
>>  }
>>  
>> +static inline bool can_madv_dontneed_or_free_vma(struct vm_area_struct *vma,
>> +						 int behavior)
>> +{
>> +	if (behavior == MADV_DONTNEED)
>> +		return can_madv_dontneed_vma(vma);
>> +	else if (behavior == MADV_FREE)
>> +		return can_madv_free_vma(vma);
>> +	else
>> +		return 0;
>> +}
>> +
>>  void unmap_page_range(struct mmu_gather *tlb,
>>  			     struct vm_area_struct *vma,
>>  			     unsigned long addr, unsigned long end,
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 4d3c922..61ff306 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -517,7 +517,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>>  				  int behavior)
>>  {
>>  	*prev = vma;
>> -	if (!can_madv_dontneed_vma(vma))
>> +	if (!can_madv_dontneed_or_free_vma(vma, behavior))
>>  		return -EINVAL;
>>  
>>  	if (!userfaultfd_remove(vma, start, end)) {
>> @@ -539,7 +539,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>>  			 */
>>  			return -ENOMEM;
>>  		}
>> -		if (!can_madv_dontneed_vma(vma))
>> +		if (!can_madv_dontneed_or_free_vma(vma, behavior))
>>  			return -EINVAL;
>>  		if (end > vma->vm_end) {
>>  			/*
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 8ba6cb8..9817d15 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -492,7 +492,7 @@ void __oom_reap_task_mm(struct mm_struct *mm)
>>  	set_bit(MMF_UNSTABLE, &mm->flags);
>>  
>>  	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
>> -		if (!can_madv_dontneed_vma(vma))
>> +		if (!can_madv_free_vma(vma))
>>  			continue;
>>  
>>  		/*
>> -- 
>> 2.7.4
>>
>
Michal Hocko June 11, 2018, 3:03 p.m. UTC | #6
On Mon 11-06-18 10:51:44, Jason Baron wrote:
> On 06/11/2018 03:20 AM, Michal Hocko wrote:
> > [CCing linux-api - please make sure to CC this mailing list anytime you
> >  are touching user visible apis]
> > 
> > On Fri 08-06-18 14:56:52, Jason Baron wrote:
> >> In order to free memory that is marked MLOCK_ONFAULT, the memory region
> >> needs to be first unlocked, before calling MADV_DONTNEED. And if the region
> >> is to be reused as MLOCK_ONFAULT, we require another call to mlock2() with
> >> the MLOCK_ONFAULT flag.
> >>
> >> Let's simplify freeing memory that is set MLOCK_ONFAULT, by allowing
> >> MADV_DONTNEED to work directly for memory that is set MLOCK_ONFAULT.
> > 
> > I do not understand the point here. How is MLOCK_ONFAULT any different
> > from the regular mlock here? If you want to free mlocked memory then
> > fine but the behavior should be consistent. MLOCK_ONFAULT is just a way
> > to say that we do not want to pre-populate the mlocked area and do that
> > lazily on the page fault time. madvise should make any difference here.
> >
> 
> The difference for me is after the page has been freed, MLOCK_ONFAULT
> will re-populate the range if its accessed again. Whereas with regular
> mlock I don't think it will because its normally done at mlock() or
> mmap() time.

The vma would still be locked so we would effectively turn it into
ONFAULT IIRC.

> In any case, the state of a region being locked with
> regular mlock and pages not present does not currently exist, whereas it
> does for MLOCK_ONFAULT, so it seems more natural to do it only for
> MLOCK_ONFAULT. Finally, the use-case we had for this, didn't need
> regular mlock().

So can we start discussing whether we want to allow MADV_DONTNEED on
mlocked areas and what downsides it might have? Sure it would turn the
strong mlock guarantee to have the whole vma resident but is this
acceptable for something that is an explicit request from the owner of
the memory?
Jason Baron June 11, 2018, 4:23 p.m. UTC | #7
On 06/11/2018 11:03 AM, Michal Hocko wrote:
> On Mon 11-06-18 10:51:44, Jason Baron wrote:
>> On 06/11/2018 03:20 AM, Michal Hocko wrote:
>>> [CCing linux-api - please make sure to CC this mailing list anytime you
>>>  are touching user visible apis]
>>>
>>> On Fri 08-06-18 14:56:52, Jason Baron wrote:
>>>> In order to free memory that is marked MLOCK_ONFAULT, the memory region
>>>> needs to be first unlocked, before calling MADV_DONTNEED. And if the region
>>>> is to be reused as MLOCK_ONFAULT, we require another call to mlock2() with
>>>> the MLOCK_ONFAULT flag.
>>>>
>>>> Let's simplify freeing memory that is set MLOCK_ONFAULT, by allowing
>>>> MADV_DONTNEED to work directly for memory that is set MLOCK_ONFAULT.
>>>
>>> I do not understand the point here. How is MLOCK_ONFAULT any different
>>> from the regular mlock here? If you want to free mlocked memory then
>>> fine but the behavior should be consistent. MLOCK_ONFAULT is just a way
>>> to say that we do not want to pre-populate the mlocked area and do that
>>> lazily on the page fault time. madvise should make any difference here.
>>>
>>
>> The difference for me is after the page has been freed, MLOCK_ONFAULT
>> will re-populate the range if its accessed again. Whereas with regular
>> mlock I don't think it will because its normally done at mlock() or
>> mmap() time.
> 
> The vma would still be locked so we would effectively turn it into
> ONFAULT IIRC.
>

Indeed. I just tried allowing MADV_DONTNEED against regular mlock() and
in my brief testing it seemed to work as expected against both anonymous
and file back pages. I am certainly not against allowing it for regular
mlock() as well, if you think that makes it more consistent.


>> In any case, the state of a region being locked with
>> regular mlock and pages not present does not currently exist, whereas it
>> does for MLOCK_ONFAULT, so it seems more natural to do it only for
>> MLOCK_ONFAULT. Finally, the use-case we had for this, didn't need
>> regular mlock().
> 
> So can we start discussing whether we want to allow MADV_DONTNEED on
> mlocked areas and what downsides it might have? Sure it would turn the
> strong mlock guarantee to have the whole vma resident but is this
> acceptable for something that is an explicit request from the owner of
> the memory?
> 

If its being explicity requested by the owner it makes sense to me. I
guess there could be a concern about this breaking some userspace that
relied on MADV_DONTNEED not freeing locked memory?

Thanks,

-Jason
Michal Hocko June 12, 2018, 7:46 a.m. UTC | #8
On Mon 11-06-18 12:23:58, Jason Baron wrote:
> On 06/11/2018 11:03 AM, Michal Hocko wrote:
> > So can we start discussing whether we want to allow MADV_DONTNEED on
> > mlocked areas and what downsides it might have? Sure it would turn the
> > strong mlock guarantee to have the whole vma resident but is this
> > acceptable for something that is an explicit request from the owner of
> > the memory?
> > 
> 
> If its being explicity requested by the owner it makes sense to me. I
> guess there could be a concern about this breaking some userspace that
> relied on MADV_DONTNEED not freeing locked memory?

Yes, this is always the fear when changing user visible behavior.  I can
imagine that a userspace allocator calling MADV_DONTNEED on free could
break. The same would apply to MLOCK_ONFAULT/MCL_ONFAULT though. We
have the new flag much shorter so the probability is smaller but the
problem is very same. So I _think_ we should treat both the same because
semantically they are indistinguishable from the MADV_DONTNEED POV. Both
remove faulted and mlocked pages. Mlock, once applied, should guarantee
no later major fault and MADV_DONTNEED breaks that obviously.

So the more I think about it the more I am worried about this but I am
more and more convinced that making ONFAULT special is just a wrong way
around this.
Jason Baron June 12, 2018, 2:11 p.m. UTC | #9
On 06/12/2018 03:46 AM, Michal Hocko wrote:
> On Mon 11-06-18 12:23:58, Jason Baron wrote:
>> On 06/11/2018 11:03 AM, Michal Hocko wrote:
>>> So can we start discussing whether we want to allow MADV_DONTNEED on
>>> mlocked areas and what downsides it might have? Sure it would turn the
>>> strong mlock guarantee to have the whole vma resident but is this
>>> acceptable for something that is an explicit request from the owner of
>>> the memory?
>>>
>>
>> If its being explicity requested by the owner it makes sense to me. I
>> guess there could be a concern about this breaking some userspace that
>> relied on MADV_DONTNEED not freeing locked memory?
> 
> Yes, this is always the fear when changing user visible behavior.  I can
> imagine that a userspace allocator calling MADV_DONTNEED on free could
> break. The same would apply to MLOCK_ONFAULT/MCL_ONFAULT though. We
> have the new flag much shorter so the probability is smaller but the
> problem is very same. So I _think_ we should treat both the same because
> semantically they are indistinguishable from the MADV_DONTNEED POV. Both
> remove faulted and mlocked pages. Mlock, once applied, should guarantee
> no later major fault and MADV_DONTNEED breaks that obviously.
> 
> So the more I think about it the more I am worried about this but I am
> more and more convinced that making ONFAULT special is just a wrong way
> around this.
> 

Ok, I share the concern that there is a chance that userspace is relying
on MADV_DONTNEED not free'ing locked memory. In that case, what if we
introduce a MADV_DONTNEED_FORCE, which does everything that
MADV_DONTNEED currently does but in addition will also free mlock areas.
That way there is no concern about breaking something.

Thanks,

-Jason
Vlastimil Babka June 13, 2018, 6:32 a.m. UTC | #10
On 06/12/2018 04:11 PM, Jason Baron wrote:
> 
> 
> On 06/12/2018 03:46 AM, Michal Hocko wrote:
>> On Mon 11-06-18 12:23:58, Jason Baron wrote:
>>> On 06/11/2018 11:03 AM, Michal Hocko wrote:
>>>> So can we start discussing whether we want to allow MADV_DONTNEED on
>>>> mlocked areas and what downsides it might have? Sure it would turn the
>>>> strong mlock guarantee to have the whole vma resident but is this
>>>> acceptable for something that is an explicit request from the owner of
>>>> the memory?
>>>>
>>>
>>> If its being explicity requested by the owner it makes sense to me. I
>>> guess there could be a concern about this breaking some userspace that
>>> relied on MADV_DONTNEED not freeing locked memory?
>>
>> Yes, this is always the fear when changing user visible behavior.  I can
>> imagine that a userspace allocator calling MADV_DONTNEED on free could
>> break. The same would apply to MLOCK_ONFAULT/MCL_ONFAULT though. We
>> have the new flag much shorter so the probability is smaller but the
>> problem is very same. So I _think_ we should treat both the same because
>> semantically they are indistinguishable from the MADV_DONTNEED POV. Both
>> remove faulted and mlocked pages. Mlock, once applied, should guarantee
>> no later major fault and MADV_DONTNEED breaks that obviously.

I think more concerning than guaranteeing no later major fault is
possible data loss, e.g. replacing data with zero-filled pages.

The madvise manpage is also quite specific about not allowing
MADV_DONTNEED and MADV_FREE for locked pages.

So I don't think we should risk changing that for all mlocked pages.
Maybe we can risk MCL_ONFAULT, since it's relatively new and has few users?

>> So the more I think about it the more I am worried about this but I am
>> more and more convinced that making ONFAULT special is just a wrong way
>> around this.
>>
> 
> Ok, I share the concern that there is a chance that userspace is relying
> on MADV_DONTNEED not free'ing locked memory. In that case, what if we
> introduce a MADV_DONTNEED_FORCE, which does everything that
> MADV_DONTNEED currently does but in addition will also free mlock areas.
> That way there is no concern about breaking something.

A new niche case flag? Sad :(

BTW I didn't get why we should allow this for MADV_DONTNEED but not
MADV_FREE. Can you expand on that?

> Thanks,
> 
> -Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Michal Hocko June 13, 2018, 7:15 a.m. UTC | #11
On Wed 13-06-18 08:32:19, Vlastimil Babka wrote:
> On 06/12/2018 04:11 PM, Jason Baron wrote:
> > 
> > 
> > On 06/12/2018 03:46 AM, Michal Hocko wrote:
> >> On Mon 11-06-18 12:23:58, Jason Baron wrote:
> >>> On 06/11/2018 11:03 AM, Michal Hocko wrote:
> >>>> So can we start discussing whether we want to allow MADV_DONTNEED on
> >>>> mlocked areas and what downsides it might have? Sure it would turn the
> >>>> strong mlock guarantee to have the whole vma resident but is this
> >>>> acceptable for something that is an explicit request from the owner of
> >>>> the memory?
> >>>>
> >>>
> >>> If its being explicity requested by the owner it makes sense to me. I
> >>> guess there could be a concern about this breaking some userspace that
> >>> relied on MADV_DONTNEED not freeing locked memory?
> >>
> >> Yes, this is always the fear when changing user visible behavior.  I can
> >> imagine that a userspace allocator calling MADV_DONTNEED on free could
> >> break. The same would apply to MLOCK_ONFAULT/MCL_ONFAULT though. We
> >> have the new flag much shorter so the probability is smaller but the
> >> problem is very same. So I _think_ we should treat both the same because
> >> semantically they are indistinguishable from the MADV_DONTNEED POV. Both
> >> remove faulted and mlocked pages. Mlock, once applied, should guarantee
> >> no later major fault and MADV_DONTNEED breaks that obviously.
> 
> I think more concerning than guaranteeing no later major fault is
> possible data loss, e.g. replacing data with zero-filled pages.

But MADV_DONTNEED is an explicit call for data loss. Or do I miss your
point?

> The madvise manpage is also quite specific about not allowing
> MADV_DONTNEED and MADV_FREE for locked pages.

Yeah, but that seems to describe the state of the art rather than
explain why.

> So I don't think we should risk changing that for all mlocked pages.
> Maybe we can risk MCL_ONFAULT, since it's relatively new and has few users?

That is what Jason wanted but I argued that the two are the same from
MADV_DONTNEED point of view. I do not see how treating them differently
would be less confusing or error prone. It's new so we can make it
behave differently is certainly not an argument.

> >> So the more I think about it the more I am worried about this but I am
> >> more and more convinced that making ONFAULT special is just a wrong way
> >> around this.
> >>
> > 
> > Ok, I share the concern that there is a chance that userspace is relying
> > on MADV_DONTNEED not free'ing locked memory. In that case, what if we
> > introduce a MADV_DONTNEED_FORCE, which does everything that
> > MADV_DONTNEED currently does but in addition will also free mlock areas.
> > That way there is no concern about breaking something.
> 
> A new niche case flag? Sad :(
> 
> BTW I didn't get why we should allow this for MADV_DONTNEED but not
> MADV_FREE. Can you expand on that?

Well, I wanted to bring this up as well. I guess this would require some
more hacks to handle the reclaim path correctly because we do rely on
VM_LOCK at many places for the lazy mlock pages culling.
Vlastimil Babka June 13, 2018, 7:51 a.m. UTC | #12
On 06/13/2018 09:15 AM, Michal Hocko wrote:
> On Wed 13-06-18 08:32:19, Vlastimil Babka wrote:
>> On 06/12/2018 04:11 PM, Jason Baron wrote:
>>>
>>>
>>> On 06/12/2018 03:46 AM, Michal Hocko wrote:
>>>> On Mon 11-06-18 12:23:58, Jason Baron wrote:
>>>>> On 06/11/2018 11:03 AM, Michal Hocko wrote:
>>>>>> So can we start discussing whether we want to allow MADV_DONTNEED on
>>>>>> mlocked areas and what downsides it might have? Sure it would turn the
>>>>>> strong mlock guarantee to have the whole vma resident but is this
>>>>>> acceptable for something that is an explicit request from the owner of
>>>>>> the memory?
>>>>>>
>>>>>
>>>>> If its being explicity requested by the owner it makes sense to me. I
>>>>> guess there could be a concern about this breaking some userspace that
>>>>> relied on MADV_DONTNEED not freeing locked memory?
>>>>
>>>> Yes, this is always the fear when changing user visible behavior.  I can
>>>> imagine that a userspace allocator calling MADV_DONTNEED on free could
>>>> break. The same would apply to MLOCK_ONFAULT/MCL_ONFAULT though. We
>>>> have the new flag much shorter so the probability is smaller but the
>>>> problem is very same. So I _think_ we should treat both the same because
>>>> semantically they are indistinguishable from the MADV_DONTNEED POV. Both
>>>> remove faulted and mlocked pages. Mlock, once applied, should guarantee
>>>> no later major fault and MADV_DONTNEED breaks that obviously.
>>
>> I think more concerning than guaranteeing no later major fault is
>> possible data loss, e.g. replacing data with zero-filled pages.
> 
> But MADV_DONTNEED is an explicit call for data loss. Or do I miss your
> point?

My point is that if somebody is relying on MADV_DONTNEED not affecting
mlocked pages, the consequences will be unexpected data loss, not just
extra page faults.

>> The madvise manpage is also quite specific about not allowing
>> MADV_DONTNEED and MADV_FREE for locked pages.
> 
> Yeah, but that seems to describe the state of the art rather than
> explain why.

Right, but as it's explicitly described there, it makes it more likely
that somebody is relying on it.

>> So I don't think we should risk changing that for all mlocked pages.
>> Maybe we can risk MCL_ONFAULT, since it's relatively new and has few users?
> 
> That is what Jason wanted but I argued that the two are the same from
> MADV_DONTNEED point of view. I do not see how treating them differently
> would be less confusing or error prone. It's new so we can make it
> behave differently is certainly not an argument.

Right. Might be either this inconsistency, or a new flag.
Michal Hocko June 13, 2018, 8:37 a.m. UTC | #13
On Wed 13-06-18 09:51:23, Vlastimil Babka wrote:
> On 06/13/2018 09:15 AM, Michal Hocko wrote:
> > On Wed 13-06-18 08:32:19, Vlastimil Babka wrote:
[...]
> >> I think more concerning than guaranteeing no later major fault is
> >> possible data loss, e.g. replacing data with zero-filled pages.
> > 
> > But MADV_DONTNEED is an explicit call for data loss. Or do I miss your
> > point?
> 
> My point is that if somebody is relying on MADV_DONTNEED not affecting
> mlocked pages, the consequences will be unexpected data loss, not just
> extra page faults.

OK, I see your point now. I would consider this an application bug
though. Calling MADV_DONTNEED and wondering that the content is gone is,
ehm, questionable at best. Why would anybody do that in the first place?

Anyway, I think that we cannot change the behavior because of mlockall
semantic as mentioned earlier.
Michal Hocko June 13, 2018, 9:13 a.m. UTC | #14
On Tue 12-06-18 10:11:33, Jason Baron wrote:
[...]
> Ok, I share the concern that there is a chance that userspace is relying
> on MADV_DONTNEED not free'ing locked memory. In that case, what if we
> introduce a MADV_DONTNEED_FORCE, which does everything that
> MADV_DONTNEED currently does but in addition will also free mlock areas.

What about other types of vmas that are not allowed to MADV_DONTNEED?
_FORCE suggests that the user of the API know what he is doing so why
shouldn't we allow unmapping hugetlb pages or special PFNMAPS? Or do we
want to add MADV_DONTNEED_FORCE_FOR_REAL when somebody comes with
another usecase?

I agree with Vlastimil that adding new madvise mode for niche case
sounds like a bad idea so we should better be sure that a new flag has
a reasonable semantic. Just allow mlocked pages is more of a tweak than
a proper semantic. So making it force for real requires to analyze what
that would mean for other vmas which are excluded now.
Jason Baron June 15, 2018, 7:28 p.m. UTC | #15
On 06/13/2018 05:13 AM, Michal Hocko wrote:
> On Tue 12-06-18 10:11:33, Jason Baron wrote:
> [...]
>> Ok, I share the concern that there is a chance that userspace is relying
>> on MADV_DONTNEED not free'ing locked memory. In that case, what if we
>> introduce a MADV_DONTNEED_FORCE, which does everything that
>> MADV_DONTNEED currently does but in addition will also free mlock areas.
> 
> What about other types of vmas that are not allowed to MADV_DONTNEED?
> _FORCE suggests that the user of the API know what he is doing so why
> shouldn't we allow unmapping hugetlb pages or special PFNMAPS? Or do we
> want to add MADV_DONTNEED_FORCE_FOR_REAL when somebody comes with
> another usecase?
> 
> I agree with Vlastimil that adding new madvise mode for niche case
> sounds like a bad idea so we should better be sure that a new flag has
> a reasonable semantic. Just allow mlocked pages is more of a tweak than
> a proper semantic. So making it force for real requires to analyze what
> that would mean for other vmas which are excluded now.
> 

If its a new flag, I agree it makes sense to look at hugetlb and
pfnmaps. pfnmaps might be more work, since it may require callbacks to
do driver specific actions...

I was able to do something very close to the original requirement of
free'ing mlock'd pages, via using a tmpfs mmap that is mlock'd. And then
using fallocate(FALLOC_FL_PUNCH_HOLE) to free the locked pages. I think
the tmpfs is sufficient for my needs, I wonder if there is any other
interest in this feature?

Thanks,

-Jason
Jason Baron June 15, 2018, 7:36 p.m. UTC | #16
On 06/13/2018 03:15 AM, Michal Hocko wrote:
> On Wed 13-06-18 08:32:19, Vlastimil Babka wrote:
>> On 06/12/2018 04:11 PM, Jason Baron wrote:
>>>
>>>
>>> On 06/12/2018 03:46 AM, Michal Hocko wrote:
>>>> On Mon 11-06-18 12:23:58, Jason Baron wrote:
>>>>> On 06/11/2018 11:03 AM, Michal Hocko wrote:
>>>>>> So can we start discussing whether we want to allow MADV_DONTNEED on
>>>>>> mlocked areas and what downsides it might have? Sure it would turn the
>>>>>> strong mlock guarantee to have the whole vma resident but is this
>>>>>> acceptable for something that is an explicit request from the owner of
>>>>>> the memory?
>>>>>>
>>>>>
>>>>> If its being explicity requested by the owner it makes sense to me. I
>>>>> guess there could be a concern about this breaking some userspace that
>>>>> relied on MADV_DONTNEED not freeing locked memory?
>>>>
>>>> Yes, this is always the fear when changing user visible behavior.  I can
>>>> imagine that a userspace allocator calling MADV_DONTNEED on free could
>>>> break. The same would apply to MLOCK_ONFAULT/MCL_ONFAULT though. We
>>>> have the new flag much shorter so the probability is smaller but the
>>>> problem is very same. So I _think_ we should treat both the same because
>>>> semantically they are indistinguishable from the MADV_DONTNEED POV. Both
>>>> remove faulted and mlocked pages. Mlock, once applied, should guarantee
>>>> no later major fault and MADV_DONTNEED breaks that obviously.
>>
>> I think more concerning than guaranteeing no later major fault is
>> possible data loss, e.g. replacing data with zero-filled pages.
> 
> But MADV_DONTNEED is an explicit call for data loss. Or do I miss your
> point?
> 
>> The madvise manpage is also quite specific about not allowing
>> MADV_DONTNEED and MADV_FREE for locked pages.
> 
> Yeah, but that seems to describe the state of the art rather than
> explain why.
> 
>> So I don't think we should risk changing that for all mlocked pages.
>> Maybe we can risk MCL_ONFAULT, since it's relatively new and has few users?
> 
> That is what Jason wanted but I argued that the two are the same from
> MADV_DONTNEED point of view. I do not see how treating them differently
> would be less confusing or error prone. It's new so we can make it
> behave differently is certainly not an argument.
> 
>>>> So the more I think about it the more I am worried about this but I am
>>>> more and more convinced that making ONFAULT special is just a wrong way
>>>> around this.
>>>>
>>>
>>> Ok, I share the concern that there is a chance that userspace is relying
>>> on MADV_DONTNEED not free'ing locked memory. In that case, what if we
>>> introduce a MADV_DONTNEED_FORCE, which does everything that
>>> MADV_DONTNEED currently does but in addition will also free mlock areas.
>>> That way there is no concern about breaking something.
>>
>> A new niche case flag? Sad :(
>>
>> BTW I didn't get why we should allow this for MADV_DONTNEED but not
>> MADV_FREE. Can you expand on that?
> 
> Well, I wanted to bring this up as well. I guess this would require some
> more hacks to handle the reclaim path correctly because we do rely on
> VM_LOCK at many places for the lazy mlock pages culling.
> 

The point of not allowing MADV_FREE on mlock'd pages for me was that
with mlock and even MLOCK_ON_FAULT, one can always can always determine
if a page is present or not (and thus avoid the major fault). Allowing
MADV_FREE on lock'd pages breaks that assumption.

Thanks,

-Jason
Michal Hocko June 20, 2018, 11 a.m. UTC | #17
On Fri 15-06-18 15:36:07, Jason Baron wrote:
> 
> 
> On 06/13/2018 03:15 AM, Michal Hocko wrote:
> > On Wed 13-06-18 08:32:19, Vlastimil Babka wrote:
[...]
> >> BTW I didn't get why we should allow this for MADV_DONTNEED but not
> >> MADV_FREE. Can you expand on that?
> > 
> > Well, I wanted to bring this up as well. I guess this would require some
> > more hacks to handle the reclaim path correctly because we do rely on
> > VM_LOCK at many places for the lazy mlock pages culling.
> > 
> 
> The point of not allowing MADV_FREE on mlock'd pages for me was that
> with mlock and even MLOCK_ON_FAULT, one can always can always determine
> if a page is present or not (and thus avoid the major fault). Allowing
> MADV_FREE on lock'd pages breaks that assumption.

But once you have called MADV_FREE you cannot assume anything about the
content until you touch the memory again. So you can safely assume a
major fault for the worst case. Btw. why knowing whether you major fault
is important in the first place? What is an application going to do
about that information?
Jason Baron June 28, 2018, 8:20 p.m. UTC | #18
On 06/20/2018 07:00 AM, Michal Hocko wrote:
> On Fri 15-06-18 15:36:07, Jason Baron wrote:
>>
>>
>> On 06/13/2018 03:15 AM, Michal Hocko wrote:
>>> On Wed 13-06-18 08:32:19, Vlastimil Babka wrote:
> [...]
>>>> BTW I didn't get why we should allow this for MADV_DONTNEED but not
>>>> MADV_FREE. Can you expand on that?
>>>
>>> Well, I wanted to bring this up as well. I guess this would require some
>>> more hacks to handle the reclaim path correctly because we do rely on
>>> VM_LOCK at many places for the lazy mlock pages culling.
>>>
>>
>> The point of not allowing MADV_FREE on mlock'd pages for me was that
>> with mlock and even MLOCK_ON_FAULT, one can always can always determine
>> if a page is present or not (and thus avoid the major fault). Allowing
>> MADV_FREE on lock'd pages breaks that assumption.
> 
> But once you have called MADV_FREE you cannot assume anything about the
> content until you touch the memory again. So you can safely assume a
> major fault for the worst case. Btw. why knowing whether you major fault
> is important in the first place? What is an application going to do
> about that information?
> 

Fair enough, I think that means you end up with a MADV_FREE_FORCE to
support that case? As I said I worked around this by using tmpfs and
fallocate(FALLOC_FL_PUNCH_HOLE). However, I still think there is a
use-case for doing this for anonymous memory, to avoid the unlock() calls.

The use-case I had in mind was simply an application that has a fast
path for when it knows that the requested item is locked in memory.

Thanks,

-Jason
diff mbox

Patch

diff --git a/mm/internal.h b/mm/internal.h
index 9e3654d..16c0041 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -15,6 +15,7 @@ 
 #include <linux/mm.h>
 #include <linux/pagemap.h>
 #include <linux/tracepoint-defs.h>
+#include <uapi/asm-generic/mman-common.h>
 
 /*
  * The set of flags that only affect watermark checking and reclaim
@@ -45,9 +46,26 @@  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 
 static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
 {
+	return !(((vma->vm_flags & (VM_LOCKED|VM_LOCKONFAULT)) == VM_LOCKED) ||
+		 (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP)));
+}
+
+static inline bool can_madv_free_vma(struct vm_area_struct *vma)
+{
 	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
 }
 
+static inline bool can_madv_dontneed_or_free_vma(struct vm_area_struct *vma,
+						 int behavior)
+{
+	if (behavior == MADV_DONTNEED)
+		return can_madv_dontneed_vma(vma);
+	else if (behavior == MADV_FREE)
+		return can_madv_free_vma(vma);
+	else
+		return 0;
+}
+
 void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
diff --git a/mm/madvise.c b/mm/madvise.c
index 4d3c922..61ff306 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -517,7 +517,7 @@  static long madvise_dontneed_free(struct vm_area_struct *vma,
 				  int behavior)
 {
 	*prev = vma;
-	if (!can_madv_dontneed_vma(vma))
+	if (!can_madv_dontneed_or_free_vma(vma, behavior))
 		return -EINVAL;
 
 	if (!userfaultfd_remove(vma, start, end)) {
@@ -539,7 +539,7 @@  static long madvise_dontneed_free(struct vm_area_struct *vma,
 			 */
 			return -ENOMEM;
 		}
-		if (!can_madv_dontneed_vma(vma))
+		if (!can_madv_dontneed_or_free_vma(vma, behavior))
 			return -EINVAL;
 		if (end > vma->vm_end) {
 			/*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8ba6cb8..9817d15 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -492,7 +492,7 @@  void __oom_reap_task_mm(struct mm_struct *mm)
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
-		if (!can_madv_dontneed_vma(vma))
+		if (!can_madv_free_vma(vma))
 			continue;
 
 		/*