Message ID | 23066.59196.909026.689706@gargle.gargle.HOWL (mailing list archive) |
---|---|
State | New, archived |
On Sun 26-11-17 17:09:32, Mikael Pettersson wrote:
> The `vm.max_map_count' sysctl limit is IMO useless and confusing, so
> this patch disables it.
>
> - Old ELF had a limit of 64K segments, making core dumps from processes
>   with more mappings than that problematic, but that was fixed in March
>   2010 ("elf coredump: add extended numbering support").
>
> - There are no internal data structures sized by this limit, making it
>   entirely artificial.

Each mapping has its vma structure, and that in turn can be tracked by
other data structures, so this is not entirely true.

> - When viewed as a limit on memory consumption, it is ineffective since
>   the number of mappings does not correspond directly to the amount of
>   memory consumed, since each mapping is variable-length.
>
> - Reaching the limit causes various memory management system calls to
>   fail with ENOMEM, which is a lie. Combined with the unpredictability
>   of the number of mappings in a process, especially when non-trivial
>   memory management or heavy file mapping is used, it can be difficult
>   to reproduce these events and debug them. It's also confusing to get
>   ENOMEM when you know you have lots of free RAM.
>
> This limit was apparently introduced in the 2.1.80 kernel (first as a
> compile-time constant, later changed to a sysctl), but I haven't been
> able to find any description for it in Git or the LKML archives, so I
> don't know what the original motivation was.
>
> I've kept the kernel tunable to not break the API towards user-space,
> but it's a no-op now. Also the distinction between split_vma() and
> __split_vma() disappears, so they are merged.

Could you be more explicit about _why_ we need to remove this tunable?
I am not saying I disagree, the removal simplifies the code, but I do not
really see any justification here.

> Tested on x86_64 with Fedora 26 user-space. Also built an ARM NOMMU
> kernel to make sure NOMMU compiles and links cleanly.
>
> Signed-off-by: Mikael Pettersson <mikpelinux@gmail.com>
> ---
>  Documentation/sysctl/vm.txt           | 17 +-------------
>  Documentation/vm/ksm.txt              |  4 ----
>  Documentation/vm/remap_file_pages.txt |  4 ----
>  fs/binfmt_elf.c                       |  4 ----
>  include/linux/mm.h                    | 23 -------------------
>  kernel/sysctl.c                       |  3 +++
>  mm/madvise.c                          | 12 ++--------
>  mm/mmap.c                             | 42 ++++++-----------------------------
>  mm/mremap.c                           |  7 ------
>  mm/nommu.c                            |  3 ---
>  mm/util.c                             |  1 -
>  11 files changed, 13 insertions(+), 107 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index b920423f88cb..0fcb511d07e6 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -35,7 +35,7 @@ Currently, these files are in /proc/sys/vm:
>  - laptop_mode
>  - legacy_va_layout
>  - lowmem_reserve_ratio
> -- max_map_count
> +- max_map_count (unused, kept for backwards compatibility)
>  - memory_failure_early_kill
>  - memory_failure_recovery
>  - min_free_kbytes
> @@ -400,21 +400,6 @@ The minimum value is 1 (1/1 -> 100%).
>
>  ==============================================================
>
> -max_map_count:
> -
> -This file contains the maximum number of memory map areas a process
> -may have. Memory map areas are used as a side-effect of calling
> -malloc, directly by mmap, mprotect, and madvise, and also when loading
> -shared libraries.
> -
> -While most applications need less than a thousand maps, certain
> -programs, particularly malloc debuggers, may consume lots of them,
> -e.g., up to one or two maps per allocation.
> -
> -The default value is 65536.
> -
> -=============================================================
> -
>  memory_failure_early_kill:
>
>  Control how to kill processes when uncorrected memory error (typically
> diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
> index 6686bd267dc9..4a917f88cb11 100644
> --- a/Documentation/vm/ksm.txt
> +++ b/Documentation/vm/ksm.txt
> @@ -38,10 +38,6 @@ the range for whenever the KSM daemon is started; even if the range
>  cannot contain any pages which KSM could actually merge; even if
>  MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
>
> -If a region of memory must be split into at least one new MADV_MERGEABLE
> -or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process
> -will exceed vm.max_map_count (see Documentation/sysctl/vm.txt).
> -
>  Like other madvise calls, they are intended for use on mapped areas of
>  the user address space: they will report ENOMEM if the specified range
>  includes unmapped gaps (though working on the intervening mapped areas),
> diff --git a/Documentation/vm/remap_file_pages.txt b/Documentation/vm/remap_file_pages.txt
> index f609142f406a..85985a89f05d 100644
> --- a/Documentation/vm/remap_file_pages.txt
> +++ b/Documentation/vm/remap_file_pages.txt
> @@ -21,7 +21,3 @@ systems are widely available.
>  The syscall is deprecated and replaced it with an emulation now. The
>  emulation creates new VMAs instead of nonlinear mappings. It's going to
>  work slower for rare users of remap_file_pages() but ABI is preserved.
> -
> -One side effect of emulation (apart from performance) is that user can hit
> -vm.max_map_count limit more easily due to additional VMAs. See comment for
> -DEFAULT_MAX_MAP_COUNT for more details on the limit.
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 83732fef510d..8e870b6e4ad9 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -2227,10 +2227,6 @@ static int elf_core_dump(struct coredump_params *cprm)
>  	elf = kmalloc(sizeof(*elf), GFP_KERNEL);
>  	if (!elf)
>  		goto out;
> -	/*
> -	 * The number of segs are recored into ELF header as 16bit value.
> -	 * Please check DEFAULT_MAX_MAP_COUNT definition when you modify here.
> -	 */
>  	segs = current->mm->map_count;
>  	segs += elf_core_extra_phdrs();
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ee073146aaa7..cf545264eb8b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -104,27 +104,6 @@ extern int mmap_rnd_compat_bits __read_mostly;
>  #define mm_zero_struct_page(pp) ((void)memset((pp), 0, sizeof(struct page)))
>  #endif
>
> -/*
> - * Default maximum number of active map areas, this limits the number of vmas
> - * per mm struct. Users can overwrite this number by sysctl but there is a
> - * problem.
> - *
> - * When a program's coredump is generated as ELF format, a section is created
> - * per a vma. In ELF, the number of sections is represented in unsigned short.
> - * This means the number of sections should be smaller than 65535 at coredump.
> - * Because the kernel adds some informative sections to a image of program at
> - * generating coredump, we need some margin. The number of extra sections is
> - * 1-3 now and depends on arch. We use "5" as safe margin, here.
> - *
> - * ELF extended numbering allows more than 65535 sections, so 16-bit bound is
> - * not a hard limit any more. Although some userspace tools can be surprised by
> - * that.
> - */
> -#define MAPCOUNT_ELF_CORE_MARGIN (5)
> -#define DEFAULT_MAX_MAP_COUNT (USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
> -
> -extern int sysctl_max_map_count;
> -
>  extern unsigned long sysctl_user_reserve_kbytes;
>  extern unsigned long sysctl_admin_reserve_kbytes;
>
> @@ -2134,8 +2113,6 @@ extern struct vm_area_struct *vma_merge(struct mm_struct *,
>  	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
>  	struct mempolicy *, struct vm_userfaultfd_ctx);
>  extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
> -extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
> -	unsigned long addr, int new_below);
>  extern int split_vma(struct mm_struct *, struct vm_area_struct *,
>  	unsigned long addr, int new_below);
>  extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 557d46728577..caced68ff0d0 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -110,6 +110,9 @@ extern int pid_max_min, pid_max_max;
>  extern int percpu_pagelist_fraction;
>  extern int latencytop_enabled;
>  extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
> +#ifdef CONFIG_MMU
> +static int sysctl_max_map_count = 65530; /* obsolete, kept for backwards compatibility */
> +#endif
>  #ifndef CONFIG_MMU
>  extern int sysctl_nr_trim_pages;
>  #endif
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 375cf32087e4..f63834f59ca7 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -147,11 +147,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
>  	*prev = vma;
>
>  	if (start != vma->vm_start) {
> -		if (unlikely(mm->map_count >= sysctl_max_map_count)) {
> -			error = -ENOMEM;
> -			goto out;
> -		}
> -		error = __split_vma(mm, vma, start, 1);
> +		error = split_vma(mm, vma, start, 1);
>  		if (error) {
>  			/*
>  			 * madvise() returns EAGAIN if kernel resources, such as
> @@ -164,11 +160,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
>  	}
>
>  	if (end != vma->vm_end) {
> -		if (unlikely(mm->map_count >= sysctl_max_map_count)) {
> -			error = -ENOMEM;
> -			goto out;
> -		}
> -		error = __split_vma(mm, vma, end, 0);
> +		error = split_vma(mm, vma, end, 0);
>  		if (error) {
>  			/*
>  			 * madvise() returns EAGAIN if kernel resources, such as
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 924839fac0e6..e821d9c4395d 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1354,10 +1354,6 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  	if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
>  		return -EOVERFLOW;
>
> -	/* Too many mappings? */
> -	if (mm->map_count > sysctl_max_map_count)
> -		return -ENOMEM;
> -
>  	/* Obtain the address to map to. we verify (or select) it and ensure
>  	 * that it represents a valid section of the address space.
>  	 */
> @@ -2546,11 +2542,11 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
>  }
>
>  /*
> - * __split_vma() bypasses sysctl_max_map_count checking.  We use this where it
> - * has already been checked or doesn't make sense to fail.
> + * Split a vma into two pieces at address 'addr', a new vma is allocated
> + * either for the first part or the tail.
>   */
> -int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
> -	      unsigned long addr, int new_below)
> +int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
> +	      unsigned long addr, int new_below)
>  {
>  	struct vm_area_struct *new;
>  	int err;
> @@ -2612,19 +2608,6 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
>  	return err;
>  }
>
> -/*
> - * Split a vma into two pieces at address 'addr', a new vma is allocated
> - * either for the first part or the tail.
> - */
> -int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
> -	      unsigned long addr, int new_below)
> -{
> -	if (mm->map_count >= sysctl_max_map_count)
> -		return -ENOMEM;
> -
> -	return __split_vma(mm, vma, addr, new_below);
> -}
> -
>  /* Munmap is split into 2 main parts -- this part which finds
>   * what needs doing, and the areas themselves, which do the
>   * work.  This now handles partial unmappings.
> @@ -2665,15 +2648,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>  	if (start > vma->vm_start) {
>  		int error;
>
> -		/*
> -		 * Make sure that map_count on return from munmap() will
> -		 * not exceed its limit; but let map_count go just above
> -		 * its limit temporarily, to help free resources as expected.
> -		 */
> -		if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
> -			return -ENOMEM;
> -
> -		error = __split_vma(mm, vma, start, 0);
> +		error = split_vma(mm, vma, start, 0);
>  		if (error)
>  			return error;
>  		prev = vma;
> @@ -2682,7 +2657,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>  	/* Does it split the last one? */
>  	last = find_vma(mm, end);
>  	if (last && end > last->vm_start) {
> -		int error = __split_vma(mm, last, end, 1);
> +		int error = split_vma(mm, last, end, 1);
>  		if (error)
>  			return error;
>  	}
> @@ -2694,7 +2669,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>  	 * will remain splitted, but userland will get a
>  	 * highly unexpected error anyway. This is no
>  	 * different than the case where the first of the two
> -	 * __split_vma fails, but we don't undo the first
> +	 * split_vma fails, but we don't undo the first
>  	 * split, despite we could. This is unlikely enough
>  	 * failure that it's not worth optimizing it for.
>  	 */
> @@ -2915,9 +2890,6 @@ static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long
>  	if (!may_expand_vm(mm, flags, len >> PAGE_SHIFT))
>  		return -ENOMEM;
>
> -	if (mm->map_count > sysctl_max_map_count)
> -		return -ENOMEM;
> -
>  	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
>  		return -ENOMEM;
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 049470aa1e3e..5544dd3e6e10 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -278,13 +278,6 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>  	bool need_rmap_locks;
>
>  	/*
> -	 * We'd prefer to avoid failure later on in do_munmap:
> -	 * which may split one vma into three before unmapping.
> -	 */
> -	if (mm->map_count >= sysctl_max_map_count - 3)
> -		return -ENOMEM;
> -
> -	/*
>  	 * Advise KSM to break any KSM pages in the area to be moved:
>  	 * it would be confusing if they were to turn up at the new
>  	 * location, where they happen to coincide with different KSM
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 17c00d93de2e..0f6d37be4797 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -1487,9 +1487,6 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (vma->vm_file)
>  		return -ENOMEM;
>
> -	if (mm->map_count >= sysctl_max_map_count)
> -		return -ENOMEM;
> -
>  	region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL);
>  	if (!region)
>  		return -ENOMEM;
> diff --git a/mm/util.c b/mm/util.c
> index 34e57fae959d..7e757686f186 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -516,7 +516,6 @@ EXPORT_SYMBOL_GPL(__page_mapcount);
>  int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
>  int sysctl_overcommit_ratio __read_mostly = 50;
>  unsigned long sysctl_overcommit_kbytes __read_mostly;
> -int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
>  unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
>  unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
>
> --
> 2.13.6
On Mon, Nov 27, 2017 at 11:12:32AM +0100, Michal Hocko wrote:
> On Sun 26-11-17 17:09:32, Mikael Pettersson wrote:
> > - Reaching the limit causes various memory management system calls to
> >   fail with ENOMEM, which is a lie. Combined with the unpredictability
> >   of the number of mappings in a process, especially when non-trivial
> >   memory management or heavy file mapping is used, it can be difficult
> >   to reproduce these events and debug them. It's also confusing to get
> >   ENOMEM when you know you have lots of free RAM.
[snip]
> Could you be more explicit about _why_ we need to remove this tunable?
> I am not saying I disagree, the removal simplifies the code but I do not
> really see any justification here.

I imagine he started seeing random syscalls failing with ENOMEM and
eventually tracked it down to this stupid limit we used to need.
Michal Hocko <mhocko@kernel.org> writes:
>
> Could you be more explicit about _why_ we need to remove this tunable?
> I am not saying I disagree, the removal simplifies the code but I do not
> really see any justification here.

It's an arbitrary scaling limit on how many mappings the process
has.  The more memory you have, the bigger a problem it is.  We've
run into this problem too on larger systems.

The reason the limit was there originally is that it prevents a DoS
attack against the kernel by filling all unswappable memory up with VMAs.

The old limit was designed for much smaller systems than we have
today.

There needs to be some limit, but it should be on the amount of memory
pinned by the VMAs, and it needs to scale with the available memory,
so that large systems are not penalized.

Unfortunately just making it part of the existing mlock limit could
break some existing setups which max out the mlock limit with something
else.  Maybe we need a new rlimit for this?

-Andi
On Mon 27-11-17 09:25:16, Andi Kleen wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> >
> > Could you be more explicit about _why_ we need to remove this tunable?
> > I am not saying I disagree, the removal simplifies the code but I do not
> > really see any justification here.
>
> It's an arbitrary scaling limit on how many mappings the process
> has.  The more memory you have, the bigger a problem it is.  We've
> run into this problem too on larger systems.

Why can't you increase the limit?

> The reason the limit was there originally is that it prevents a DoS
> attack against the kernel by filling all unswappable memory up with VMAs.

We can reduce the effect by accounting vmas to memory cgroups.
On Mon, Nov 27, 2017 at 11:12 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > I've kept the kernel tunable to not break the API towards user-space,
> > but it's a no-op now. Also the distinction between split_vma() and
> > __split_vma() disappears, so they are merged.
>
> Could you be more explicit about _why_ we need to remove this tunable?
> I am not saying I disagree, the removal simplifies the code but I do not
> really see any justification here.

In principle you don't "need" to, as those that know about it can bump it
to some insanely high value and get on with life.  Meanwhile those that
don't (and I was one of them until fairly recently, and I'm no newcomer
to Unix or Linux) get to scratch their heads and wonder why the kernel
says ENOMEM when one has loads of free RAM.

But what _is_ the justification for having this arbitrary limit?
There might have been historical reasons, but at least ELF core dumps
are no longer a problem.

/Mikael
On Mon, Nov 27, 2017 at 5:22 PM, Matthew Wilcox <willy@infradead.org> wrote:
>> Could you be more explicit about _why_ we need to remove this tunable?
>> I am not saying I disagree, the removal simplifies the code but I do not
>> really see any justification here.
>
> I imagine he started seeing random syscalls failing with ENOMEM and
> eventually tracked it down to this stupid limit we used to need.

Exactly, except the origin (mmap() failing) was hidden behind layers upon
layers of user-space memory management code (not ours), which just said
"failed to allocate N bytes" (with N about 0.001% of the free RAM).
And it wasn't reproducible.
On Mon, Nov 27, 2017 at 6:25 PM, Andi Kleen <ak@linux.intel.com> wrote:
> It's an arbitrary scaling limit on how many mappings the process
> has.  The more memory you have, the bigger a problem it is.  We've
> run into this problem too on larger systems.
>
> The reason the limit was there originally is that it prevents a DoS
> attack against the kernel by filling all unswappable memory up with VMAs.
>
> The old limit was designed for much smaller systems than we have
> today.
>
> There needs to be some limit, but it should be on the amount of memory
> pinned by the VMAs, and it needs to scale with the available memory,
> so that large systems are not penalized.

Fully agreed.  One problem with the current limit is that the number of
VMAs is only weakly related to the amount of memory one has mapped, and
is also prone to grow due to memory fragmentation.  I've seen processes
differ by 3X in their number of VMAs, even though they ran the same code
and had similar memory sizes; they only differed in how long they had
been running and which servers they ran on (and how long those had been
up).

> Unfortunately just making it part of the existing mlock limit could
> break some existing setups which max out the mlock limit with something
> else.  Maybe we need a new rlimit for this?
>
> -Andi
On Mon 27-11-17 20:18:00, Mikael Pettersson wrote:
> On Mon, Nov 27, 2017 at 11:12 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > I've kept the kernel tunable to not break the API towards user-space,
> > > but it's a no-op now. Also the distinction between split_vma() and
> > > __split_vma() disappears, so they are merged.
> >
> > Could you be more explicit about _why_ we need to remove this tunable?
> > I am not saying I disagree, the removal simplifies the code but I do not
> > really see any justification here.
>
> In principle you don't "need" to, as those that know about it can bump it
> to some insanely high value and get on with life. Meanwhile those that don't
> (and I was one of them until fairly recently, and I'm no newcomer to Unix or
> Linux) get to scratch their heads and wonder why the kernel says ENOMEM
> when one has loads of free RAM.

I agree that our error reporting is more than suboptimal in this regard.
These are all historical mistakes and we have many more of those.  The
thing is that we have means to debug these issues (check
/proc/<pid>/maps e.g.).

> But what _is_ the justification for having this arbitrary limit?
> There might have been historical reasons, but at least ELF core dumps
> are no longer a problem.

Andi has already mentioned the resource consumption.  You can create a
lot of unreclaimable memory and there should be some cap.  Whether our
default is good is questionable.  Whether we can remove it altogether
is a different thing.

As I've said, I am not a great fan of the limit, but "I've just noticed
it breaks on me" doesn't sound like a very good justification.  You
still have the option to increase it.  That we do not have too many
reports suggests that this is not such a big deal for most users.
On Mon 27-11-17 19:32:18, Michal Hocko wrote:
> On Mon 27-11-17 09:25:16, Andi Kleen wrote:
[...]
> > The reason the limit was there originally is that it prevents a DoS
> > attack against the kernel by filling all unswappable memory up with VMAs.
>
> We can reduce the effect by accounting vmas to memory cgroups.

As it turned out, we already do:

	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
On Mon, Nov 27, 2017 at 08:57:32PM +0100, Michal Hocko wrote:
> On Mon 27-11-17 19:32:18, Michal Hocko wrote:
> > On Mon 27-11-17 09:25:16, Andi Kleen wrote:
> [...]
> > > The reason the limit was there originally is that it prevents a DoS
> > > attack against the kernel by filling all unswappable memory up with VMAs.
> >
> > We can reduce the effect by accounting vmas to memory cgroups.
>
> As it turned out, we already do:
> 	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);

That only helps if you have memory cgroups enabled.  It would be a
regression to break the accounting on all the systems that don't.

-Andi
On Mon 27-11-17 12:21:21, Andi Kleen wrote:
> On Mon, Nov 27, 2017 at 08:57:32PM +0100, Michal Hocko wrote:
> > On Mon 27-11-17 19:32:18, Michal Hocko wrote:
> > > On Mon 27-11-17 09:25:16, Andi Kleen wrote:
> > [...]
> > > > The reason the limit was there originally is that it prevents a DoS
> > > > attack against the kernel by filling all unswappable memory up with VMAs.
> > >
> > > We can reduce the effect by accounting vmas to memory cgroups.
> >
> > As it turned out, we already do:
> > 	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
>
> That only helps if you have memory cgroups enabled. It would be a
> regression to break the accounting on all the systems that don't.

I agree.  And I didn't say we should remove the existing limit.  I am
just saying that we can reduce existing problems by increasing the limit
and relying on memcg accounting where possible.
On 11/27/2017 11:52 AM, Michal Hocko wrote:
> On Mon 27-11-17 20:18:00, Mikael Pettersson wrote:
>> On Mon, Nov 27, 2017 at 11:12 AM, Michal Hocko <mhocko@kernel.org> wrote:
>>>> I've kept the kernel tunable to not break the API towards user-space,
>>>> but it's a no-op now. Also the distinction between split_vma() and
>>>> __split_vma() disappears, so they are merged.
>>>
>>> Could you be more explicit about _why_ we need to remove this tunable?
>>> I am not saying I disagree, the removal simplifies the code but I do not
>>> really see any justification here.
>>
>> In principle you don't "need" to, as those that know about it can bump it
>> to some insanely high value and get on with life. Meanwhile those that don't
>> (and I was one of them until fairly recently, and I'm no newcomer to Unix or
>> Linux) get to scratch their heads and wonder why the kernel says ENOMEM
>> when one has loads of free RAM.
>
> I agree that our error reporting is more than suboptimal in this regard.
> These are all historical mistakes and we have many more of those. The
> thing is that we have means to debug these issues (check
> /proc/<pid>/maps e.g.).
>
>> But what _is_ the justification for having this arbitrary limit?
>> There might have been historical reasons, but at least ELF core dumps
>> are no longer a problem.
>
> Andi has already mentioned the resource consumption. You can create
> a lot of unreclaimable memory and there should be some cap. Whether our
> default is good is questionable. Whether we can remove it altogether is
> a different thing.
>
> As I've said, I am not a great fan of the limit, but "I've just noticed it
> breaks on me" doesn't sound like a very good justification. You still
> have the option to increase it. That we do not have too many
> reports suggests that this is not such a big deal for most users.

Let me add a belated report, then: we ran into this limit while
implementing an early version of Unified Memory[1], back in 2013.  The
implementation at the time depended on tracking that assumed "one
allocation == one vma".  So, with only 64K vmas, we quickly ran out, and
changed the design to work around that.  (And later, the design was
*completely* changed to use a separate tracking system altogether.)

The existing limit seems rather too low, at least from my perspective.
Maybe it would be better if expressed as a function of RAM size?

[1] https://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/
This is a way to automatically (via page faulting) migrate memory
between CPUs and devices (GPUs, here).  This is before HMM, of course.

thanks,
John Hubbard
On Mon 27-11-17 15:26:27, John Hubbard wrote:
[...]
> Let me add a belated report, then: we ran into this limit while implementing
> an early version of Unified Memory[1], back in 2013. The implementation
> at the time depended on tracking that assumed "one allocation == one vma".

And you tried hard to make those VMAs really separate?  E.g. with
prot_none gaps?

> So, with only 64K vmas, we quickly ran out, and changed the design to work
> around that. (And later, the design was *completely* changed to use a separate
> tracking system altogether).
>
> The existing limit seems rather too low, at least from my perspective. Maybe
> it would be better, if expressed as a function of RAM size?

Dunno.  Whenever we tried to do RAM scaling it turned out to be a bad
idea years later, when memory had grown much more than the code author
expected.  Just look at how we scaled hash table sizes...  But maybe you
can come up with something clever.  In any case, tuning this from
userspace is a trivial thing to do, and I am somewhat skeptical that any
early boot code would trip over the limit.
On 11/28/2017 12:12 AM, Michal Hocko wrote:
> On Mon 27-11-17 15:26:27, John Hubbard wrote:
> [...]
>> Let me add a belated report, then: we ran into this limit while implementing
>> an early version of Unified Memory[1], back in 2013. The implementation
>> at the time depended on tracking that assumed "one allocation == one vma".
>
> And you tried hard to make those VMAs really separate? E.g. with
> prot_none gaps?

We didn't do that, and in fact I'm probably failing to grasp the
underlying design idea that you have in mind there... hints welcome...

What we did was to hook into the mmap callbacks in the kernel driver,
after userspace mmap'd a region (via a custom allocator API).  And we
had an ioctl in there, to connect up other allocation attributes that
couldn't be passed through via mmap.  Again, this was for regions of
memory that were to be migrated between CPU and device (GPU).

>> So, with only 64K vmas, we quickly ran out, and changed the design to work
>> around that. (And later, the design was *completely* changed to use a separate
>> tracking system altogether).
>>
>> The existing limit seems rather too low, at least from my perspective. Maybe
>> it would be better, if expressed as a function of RAM size?
>
> Dunno. Whenever we tried to do RAM scaling it turned out to be a bad idea
> years later, when memory had grown much more than the code author expected.
> Just look at how we scaled hash table sizes... But maybe you can come up
> with something clever. In any case, tuning this from userspace is a
> trivial thing to do, and I am somewhat skeptical that any early boot code
> would trip over the limit.

I agree that this is not a limit that boot code is likely to hit.  And
maybe tuning from userspace really is the right approach here,
considering that there is a real cost to going too large.

Just philosophically here, hard limits like this seem a little awkward
if they are set once in, say, 1999 (gross exaggeration here, for effect)
and then not updated to stay with the times, right?  In other words, one
should not routinely need to tune most things.  That's why I was
wondering if something crude and silly would work, such as just a ratio
of RAM to vma count.  (I'm more just trying to understand the "rules"
here, than to debate--I don't have a strong opinion on this.)

The fact that this apparently failed with hash tables is interesting,
I'd love to read more if you have any notes or links.  I spotted a 2014
LWN article ( https://lwn.net/Articles/612100 ) about hash table
resizing, and some commits that fixed resizing bugs, such as

12311959ecf8a ("rhashtable: fix shift by 64 when shrinking")

...was it just a storm of bugs that showed up?

thanks,
John Hubbard
On Tue 28-11-17 21:14:23, John Hubbard wrote: > On 11/28/2017 12:12 AM, Michal Hocko wrote: > > On Mon 27-11-17 15:26:27, John Hubbard wrote: > > [...] > >> Let me add a belated report, then: we ran into this limit while implementing > >> an early version of Unified Memory[1], back in 2013. The implementation > >> at the time depended on tracking that assumed "one allocation == one vma". > > > > And you tried hard to make those VMAs really separate? E.g. with > > prot_none gaps? > > We didn't do that, and in fact I'm probably failing to grasp the underlying > design idea that you have in mind there...hints welcome... mmap code tries to merge vmas very aggressively so you have to try to make too many vmas. One way to separate different vmas is to mprotect holes to trap potential {over,under}flows. > What we did was to hook into the mmap callbacks in the kernel driver, after > userspace mmap'd a region (via a custom allocator API). And we had an ioctl > in there, to connect up other allocation attributes that couldn't be passed > through via mmap. Again, this was for regions of memory that were to be > migrated between CPU and device (GPU). Or maybe your driver made the vma merging impossible by requesting explicit ranges which are not adjacent. > >> So, with only 64K vmas, we quickly ran out, and changed the design to work > >> around that. (And later, the design was *completely* changed to use a separate > >> tracking system altogether). exag > >> > >> The existing limit seems rather too low, at least from my perspective. Maybe > >> it would be better, if expressed as a function of RAM size? > > > > Dunno. Whenever we tried to do RAM scaling it turned out a bad idea > > after years when memory grown much more than the code author expected. > > Just look how we scaled hash table sizes... But maybe you can come up > > with something clever. 
> > In any case tuning this from the userspace is a
> > trivial thing to do and I am somehow skeptical that any early boot code
> > would trip over the limit.
>
> I agree that this is not a limit that boot code is likely to hit. And maybe
> tuning from userspace really is the right approach here, considering that
> there is a real cost to going too large.
>
> Just philosophically here, hard limits like this seem a little awkward if they
> are set once in, say, 1999 (gross exaggeration here, for effect) and then not
> updated to stay with the times, right? In other words, one should not routinely
> need to tune most things. That's why I was wondering if something crude and silly
> would work, such as just a ratio of RAM to vma count. (I'm more just trying to
> understand the "rules" here, than to debate--I don't have a strong opinion
> on this.)

Well, rlimits are in general not very usable. Yet I do not think we should
simply wipe them out.

> The fact that this apparently failed with hash tables is interesting, I'd
> love to read more if you have any notes or links. I spotted a 2014 LWN article
> ( https://lwn.net/Articles/612100 ) about hash table resizing, and some commits
> that fixed resizing bugs, such as
>
> 12311959ecf8a ("rhashtable: fix shift by 64 when shrinking")
>
> ...was it just a storm of bugs that showed up?

No, it was just that large (TB) machines allocated insanely large hash tables
for things which will never have any way to fill them up. See 9017217b6f45
("mm: adaptive hash table scaling").
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index b920423f88cb..0fcb511d07e6 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -35,7 +35,7 @@ Currently, these files are in /proc/sys/vm:
 - laptop_mode
 - legacy_va_layout
 - lowmem_reserve_ratio
-- max_map_count
+- max_map_count (unused, kept for backwards compatibility)
 - memory_failure_early_kill
 - memory_failure_recovery
 - min_free_kbytes
@@ -400,21 +400,6 @@ The minimum value is 1 (1/1 -> 100%).
 
 ==============================================================
 
-max_map_count:
-
-This file contains the maximum number of memory map areas a process
-may have. Memory map areas are used as a side-effect of calling
-malloc, directly by mmap, mprotect, and madvise, and also when loading
-shared libraries.
-
-While most applications need less than a thousand maps, certain
-programs, particularly malloc debuggers, may consume lots of them,
-e.g., up to one or two maps per allocation.
-
-The default value is 65536.
-
-=============================================================
-
 memory_failure_early_kill:
 
 Control how to kill processes when uncorrected memory error (typically
diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
index 6686bd267dc9..4a917f88cb11 100644
--- a/Documentation/vm/ksm.txt
+++ b/Documentation/vm/ksm.txt
@@ -38,10 +38,6 @@ the range for whenever the KSM daemon is started; even if the range
 cannot contain any pages which KSM could actually merge; even if
 MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
 
-If a region of memory must be split into at least one new MADV_MERGEABLE
-or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process
-will exceed vm.max_map_count (see Documentation/sysctl/vm.txt).
-
 Like other madvise calls, they are intended for use on mapped areas of
 the user address space: they will report ENOMEM if the specified range
 includes unmapped gaps (though working on the intervening mapped areas),
diff --git a/Documentation/vm/remap_file_pages.txt b/Documentation/vm/remap_file_pages.txt
index f609142f406a..85985a89f05d 100644
--- a/Documentation/vm/remap_file_pages.txt
+++ b/Documentation/vm/remap_file_pages.txt
@@ -21,7 +21,3 @@ systems are widely available.
 The syscall is deprecated and replaced it with an emulation now. The
 emulation creates new VMAs instead of nonlinear mappings. It's going to
 work slower for rare users of remap_file_pages() but ABI is preserved.
-
-One side effect of emulation (apart from performance) is that user can hit
-vm.max_map_count limit more easily due to additional VMAs. See comment for
-DEFAULT_MAX_MAP_COUNT for more details on the limit.
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 83732fef510d..8e870b6e4ad9 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -2227,10 +2227,6 @@ static int elf_core_dump(struct coredump_params *cprm)
 	elf = kmalloc(sizeof(*elf), GFP_KERNEL);
 	if (!elf)
 		goto out;
-	/*
-	 * The number of segs are recored into ELF header as 16bit value.
-	 * Please check DEFAULT_MAX_MAP_COUNT definition when you modify here.
-	 */
 	segs = current->mm->map_count;
 	segs += elf_core_extra_phdrs();
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ee073146aaa7..cf545264eb8b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -104,27 +104,6 @@ extern int mmap_rnd_compat_bits __read_mostly;
 #define mm_zero_struct_page(pp) ((void)memset((pp), 0, sizeof(struct page)))
 #endif
 
-/*
- * Default maximum number of active map areas, this limits the number of vmas
- * per mm struct. Users can overwrite this number by sysctl but there is a
- * problem.
- *
- * When a program's coredump is generated as ELF format, a section is created
- * per a vma. In ELF, the number of sections is represented in unsigned short.
- * This means the number of sections should be smaller than 65535 at coredump.
- * Because the kernel adds some informative sections to a image of program at
- * generating coredump, we need some margin. The number of extra sections is
- * 1-3 now and depends on arch. We use "5" as safe margin, here.
- *
- * ELF extended numbering allows more than 65535 sections, so 16-bit bound is
- * not a hard limit any more. Although some userspace tools can be surprised by
- * that.
- */
-#define MAPCOUNT_ELF_CORE_MARGIN	(5)
-#define DEFAULT_MAX_MAP_COUNT	(USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
-
-extern int sysctl_max_map_count;
-
 extern unsigned long sysctl_user_reserve_kbytes;
 extern unsigned long sysctl_admin_reserve_kbytes;
 
@@ -2134,8 +2113,6 @@ extern struct vm_area_struct *vma_merge(struct mm_struct *,
 	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
 	struct mempolicy *, struct vm_userfaultfd_ctx);
 extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
-extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
-	unsigned long addr, int new_below);
 extern int split_vma(struct mm_struct *, struct vm_area_struct *,
 	unsigned long addr, int new_below);
 extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 557d46728577..caced68ff0d0 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -110,6 +110,9 @@ extern int pid_max_min, pid_max_max;
 extern int percpu_pagelist_fraction;
 extern int latencytop_enabled;
 extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
+#ifdef CONFIG_MMU
+static int sysctl_max_map_count = 65530; /* obsolete, kept for backwards compatibility */
+#endif
 #ifndef CONFIG_MMU
 extern int sysctl_nr_trim_pages;
 #endif
diff --git a/mm/madvise.c b/mm/madvise.c
index 375cf32087e4..f63834f59ca7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -147,11 +147,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	*prev = vma;
 
 	if (start != vma->vm_start) {
-		if (unlikely(mm->map_count >= sysctl_max_map_count)) {
-			error = -ENOMEM;
-			goto out;
-		}
-		error = __split_vma(mm, vma, start, 1);
+		error = split_vma(mm, vma, start, 1);
 		if (error) {
 			/*
 			 * madvise() returns EAGAIN if kernel resources, such as
@@ -164,11 +160,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	}
 
 	if (end != vma->vm_end) {
-		if (unlikely(mm->map_count >= sysctl_max_map_count)) {
-			error = -ENOMEM;
-			goto out;
-		}
-		error = __split_vma(mm, vma, end, 0);
+		error = split_vma(mm, vma, end, 0);
 		if (error) {
 			/*
 			 * madvise() returns EAGAIN if kernel resources, such as
diff --git a/mm/mmap.c b/mm/mmap.c
index 924839fac0e6..e821d9c4395d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1354,10 +1354,6 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
 		return -EOVERFLOW;
 
-	/* Too many mappings? */
-	if (mm->map_count > sysctl_max_map_count)
-		return -ENOMEM;
-
 	/* Obtain the address to map to. we verify (or select) it and ensure
 	 * that it represents a valid section of the address space.
 	 */
@@ -2546,11 +2542,11 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 /*
- * __split_vma() bypasses sysctl_max_map_count checking. We use this where it
- * has already been checked or doesn't make sense to fail.
+ * Split a vma into two pieces at address 'addr', a new vma is allocated
+ * either for the first part or the tail.
  */
-int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long addr, int new_below)
+int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
+	      unsigned long addr, int new_below)
 {
 	struct vm_area_struct *new;
 	int err;
@@ -2612,19 +2608,6 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	return err;
 }
 
-/*
- * Split a vma into two pieces at address 'addr', a new vma is allocated
- * either for the first part or the tail.
- */
-int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
-	      unsigned long addr, int new_below)
-{
-	if (mm->map_count >= sysctl_max_map_count)
-		return -ENOMEM;
-
-	return __split_vma(mm, vma, addr, new_below);
-}
-
 /* Munmap is split into 2 main parts -- this part which finds
  * what needs doing, and the areas themselves, which do the
  * work. This now handles partial unmappings.
@@ -2665,15 +2648,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	if (start > vma->vm_start) {
 		int error;
 
-		/*
-		 * Make sure that map_count on return from munmap() will
-		 * not exceed its limit; but let map_count go just above
-		 * its limit temporarily, to help free resources as expected.
-		 */
-		if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
-			return -ENOMEM;
-
-		error = __split_vma(mm, vma, start, 0);
+		error = split_vma(mm, vma, start, 0);
 		if (error)
 			return error;
 		prev = vma;
@@ -2682,7 +2657,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	/* Does it split the last one? */
 	last = find_vma(mm, end);
 	if (last && end > last->vm_start) {
-		int error = __split_vma(mm, last, end, 1);
+		int error = split_vma(mm, last, end, 1);
 		if (error)
 			return error;
 	}
@@ -2694,7 +2669,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	 * will remain splitted, but userland will get a
 	 * highly unexpected error anyway. This is no
 	 * different than the case where the first of the two
-	 * __split_vma fails, but we don't undo the first
+	 * split_vma fails, but we don't undo the first
 	 * split, despite we could. This is unlikely enough
 	 * failure that it's not worth optimizing it for.
 	 */
@@ -2915,9 +2890,6 @@ static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long
 	if (!may_expand_vm(mm, flags, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
-	if (mm->map_count > sysctl_max_map_count)
-		return -ENOMEM;
-
 	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
diff --git a/mm/mremap.c b/mm/mremap.c
index 049470aa1e3e..5544dd3e6e10 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -278,13 +278,6 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	bool need_rmap_locks;
 
 	/*
-	 * We'd prefer to avoid failure later on in do_munmap:
-	 * which may split one vma into three before unmapping.
-	 */
-	if (mm->map_count >= sysctl_max_map_count - 3)
-		return -ENOMEM;
-
-	/*
 	 * Advise KSM to break any KSM pages in the area to be moved:
 	 * it would be confusing if they were to turn up at the new
 	 * location, where they happen to coincide with different KSM
diff --git a/mm/nommu.c b/mm/nommu.c
index 17c00d93de2e..0f6d37be4797 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1487,9 +1487,6 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (vma->vm_file)
 		return -ENOMEM;
 
-	if (mm->map_count >= sysctl_max_map_count)
-		return -ENOMEM;
-
 	region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL);
 	if (!region)
 		return -ENOMEM;
diff --git a/mm/util.c b/mm/util.c
index 34e57fae959d..7e757686f186 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -516,7 +516,6 @@ EXPORT_SYMBOL_GPL(__page_mapcount);
 int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
 int sysctl_overcommit_ratio __read_mostly = 50;
 unsigned long sysctl_overcommit_kbytes __read_mostly;
-int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
 unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17;	/* 128MB */
 unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13;	/* 8MB */
The `vm.max_map_count' sysctl limit is IMO useless and confusing, so
this patch disables it.

- Old ELF had a limit of 64K segments, making core dumps from processes
  with more mappings than that problematic, but that was fixed in March
  2010 ("elf coredump: add extended numbering support").

- There are no internal data structures sized by this limit, making it
  entirely artificial.

- When viewed as a limit on memory consumption, it is ineffective since
  the number of mappings does not correspond directly to the amount of
  memory consumed, since each mapping is variable-length.

- Reaching the limit causes various memory management system calls to
  fail with ENOMEM, which is a lie. Combined with the unpredictability
  of the number of mappings in a process, especially when non-trivial
  memory management or heavy file mapping is used, it can be difficult
  to reproduce these events and debug them. It's also confusing to get
  ENOMEM when you know you have lots of free RAM.

This limit was apparently introduced in the 2.1.80 kernel (first as a
compile-time constant, later changed to a sysctl), but I haven't been
able to find any description for it in Git or the LKML archives, so I
don't know what the original motivation was.

I've kept the kernel tunable to not break the API towards user-space,
but it's a no-op now. Also the distinction between split_vma() and
__split_vma() disappears, so they are merged.

Tested on x86_64 with Fedora 26 user-space. Also built an ARM NOMMU
kernel to make sure NOMMU compiles and links cleanly.
Signed-off-by: Mikael Pettersson <mikpelinux@gmail.com>
---
 Documentation/sysctl/vm.txt           | 17 +-------------
 Documentation/vm/ksm.txt              |  4 ----
 Documentation/vm/remap_file_pages.txt |  4 ----
 fs/binfmt_elf.c                       |  4 ----
 include/linux/mm.h                    | 23 -------------------
 kernel/sysctl.c                       |  3 +++
 mm/madvise.c                          | 12 ++--------
 mm/mmap.c                             | 42 ++++++-----------------------------
 mm/mremap.c                           |  7 ------
 mm/nommu.c                            |  3 ---
 mm/util.c                             |  1 -
 11 files changed, 13 insertions(+), 107 deletions(-)