
[v2,2/5] mm: move per-vma lock into vm_area_struct

Message ID 20241112194635.444146-3-surenb@google.com (mailing list archive)
State New
Series move per-vma lock into vm_area_struct

Commit Message

Suren Baghdasaryan Nov. 12, 2024, 7:46 p.m. UTC
Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of the performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to a rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].
This patchset moves vm_lock back into vm_area_struct, aligning it at the
cacheline boundary and changing the cache to be cache-aligned as well.
This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
(vm_lock) bytes to 256 bytes:

    slabinfo before:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vma_lock         ...     40  102    1 : ...
     vm_area_struct   ...    160   51    2 : ...

    slabinfo after moving vm_lock:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vm_area_struct   ...    256   32    2 : ...

Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
which is 5.5MB per 100000 VMAs. This memory consumption growth can be
addressed later by optimizing the vm_lock.
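
For reference, the page counts follow (approximately, assuming 4KB pages)
from the objects-per-slab and pages-per-slab figures above:

    before: 1000 / 51 * 2 pages  ~= 39 pages (vm_area_struct)
            1000 / 102 * 1 page  ~= 10 pages (vma_lock)    => ~50 pages
    after:  1000 / 32 * 2 pages  ~= 63 pages               => ~64 pages

    extra:  ~14 pages * 4KB * 100 ~= 5.5MB per 100000 VMAs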

[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h       | 28 +++++++++++++----------
 include/linux/mm_types.h |  6 +++--
 kernel/fork.c            | 49 ++++------------------------------------
 3 files changed, 25 insertions(+), 58 deletions(-)

Comments

Lorenzo Stoakes Nov. 13, 2024, 2:28 p.m. UTC | #1
On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
> Back when per-vma locks were introduced, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regression is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].

I don't see a motivating reason as to why we want to do this? We increase
memory usage here which is not good, but later lock optimisation mitigates
it, but why wouldn't we just do the lock optimisations and use less memory
overall?

> This patchset moves vm_lock back into vm_area_struct, aligning it at the
> cacheline boundary and changing the cache to be cache-aligned as well.
> This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> (vm_lock) bytes to 256 bytes:
>
>     slabinfo before:
>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>      vma_lock         ...     40  102    1 : ...
>      vm_area_struct   ...    160   51    2 : ...

Pedantry, but it might be worth mentioning how much this can vary by config.

For instance, on my machine:

vm_area_struct    125238 138820    184   44

>
>     slabinfo after moving vm_lock:
>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>      vm_area_struct   ...    256   32    2 : ...
>
> Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> which is 5.5MB per 100000 VMAs. This memory consumption growth can be
> addressed later by optimizing the vm_lock.

Yes grabbing this back is of critical importance I'd say! :)

Functionally it looks ok to me but would like to see a stronger
justification in the commit msg! :)

>
> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/mm.h       | 28 +++++++++++++----------
>  include/linux/mm_types.h |  6 +++--
>  kernel/fork.c            | 49 ++++------------------------------------
>  3 files changed, 25 insertions(+), 58 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 01ce619f3d17..a5eb0be3e351 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -684,6 +684,12 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
>  #endif /* CONFIG_NUMA_BALANCING */
>
>  #ifdef CONFIG_PER_VMA_LOCK
> +static inline void vma_lock_init(struct vm_area_struct *vma)
> +{
> +	init_rwsem(&vma->vm_lock.lock);
> +	vma->vm_lock_seq = UINT_MAX;
> +}
> +
>  /*
>   * Try to read-lock a vma. The function is allowed to occasionally yield false
>   * locked result to avoid performance overhead, in which case we fall back to
> @@ -701,7 +707,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>  	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
>  		return false;
>
> -	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> +	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
>  		return false;
>
>  	/*
> @@ -716,7 +722,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>  	 * This pairs with RELEASE semantics in vma_end_write_all().
>  	 */
>  	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> -		up_read(&vma->vm_lock->lock);
> +		up_read(&vma->vm_lock.lock);
>  		return false;
>  	}
>  	return true;
> @@ -729,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>  static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
>  {
>  	mmap_assert_locked(vma->vm_mm);
> -	down_read_nested(&vma->vm_lock->lock, subclass);
> +	down_read_nested(&vma->vm_lock.lock, subclass);
>  }
>
>  /*
> @@ -739,13 +745,13 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
>  static inline void vma_start_read_locked(struct vm_area_struct *vma)
>  {
>  	mmap_assert_locked(vma->vm_mm);
> -	down_read(&vma->vm_lock->lock);
> +	down_read(&vma->vm_lock.lock);
>  }
>
>  static inline void vma_end_read(struct vm_area_struct *vma)
>  {
>  	rcu_read_lock(); /* keeps vma alive till the end of up_read */
> -	up_read(&vma->vm_lock->lock);
> +	up_read(&vma->vm_lock.lock);
>  	rcu_read_unlock();
>  }
>
> @@ -774,7 +780,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
>  	if (__is_vma_write_locked(vma, &mm_lock_seq))
>  		return;
>
> -	down_write(&vma->vm_lock->lock);
> +	down_write(&vma->vm_lock.lock);
>  	/*
>  	 * We should use WRITE_ONCE() here because we can have concurrent reads
>  	 * from the early lockless pessimistic check in vma_start_read().
> @@ -782,7 +788,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
>  	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
>  	 */
>  	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> -	up_write(&vma->vm_lock->lock);
> +	up_write(&vma->vm_lock.lock);
>  }
>
>  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> @@ -794,7 +800,7 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
>
>  static inline void vma_assert_locked(struct vm_area_struct *vma)
>  {
> -	if (!rwsem_is_locked(&vma->vm_lock->lock))
> +	if (!rwsem_is_locked(&vma->vm_lock.lock))
>  		vma_assert_write_locked(vma);
>  }
>
> @@ -827,6 +833,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>
>  #else /* CONFIG_PER_VMA_LOCK */
>
> +static inline void vma_lock_init(struct vm_area_struct *vma) {}
>  static inline bool vma_start_read(struct vm_area_struct *vma)
>  		{ return false; }
>  static inline void vma_end_read(struct vm_area_struct *vma) {}
> @@ -861,10 +868,6 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
>
>  extern const struct vm_operations_struct vma_dummy_vm_ops;
>
> -/*
> - * WARNING: vma_init does not initialize vma->vm_lock.
> - * Use vm_area_alloc()/vm_area_free() if vma needs locking.
> - */
>  static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>  {
>  	memset(vma, 0, sizeof(*vma));
> @@ -873,6 +876,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>  	INIT_LIST_HEAD(&vma->anon_vma_chain);
>  	vma_mark_detached(vma, false);
>  	vma_numab_state_init(vma);
> +	vma_lock_init(vma);
>  }
>
>  /* Use when VMA is not part of the VMA tree and needs no locking */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 80fef38d9d64..5c4bfdcfac72 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -716,8 +716,6 @@ struct vm_area_struct {
>  	 * slowpath.
>  	 */
>  	unsigned int vm_lock_seq;
> -	/* Unstable RCU readers are allowed to read this. */
> -	struct vma_lock *vm_lock;
>  #endif
>
>  	/*
> @@ -770,6 +768,10 @@ struct vm_area_struct {
>  	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
>  #endif
>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +#ifdef CONFIG_PER_VMA_LOCK
> +	/* Unstable RCU readers are allowed to read this. */
> +	struct vma_lock vm_lock ____cacheline_aligned_in_smp;
> +#endif
>  } __randomize_layout;
>
>  #ifdef CONFIG_NUMA
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0061cf2450ef..7823797e31d2 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -436,35 +436,6 @@ static struct kmem_cache *vm_area_cachep;
>  /* SLAB cache for mm_struct structures (tsk->mm) */
>  static struct kmem_cache *mm_cachep;
>
> -#ifdef CONFIG_PER_VMA_LOCK
> -
> -/* SLAB cache for vm_area_struct.lock */
> -static struct kmem_cache *vma_lock_cachep;
> -
> -static bool vma_lock_alloc(struct vm_area_struct *vma)
> -{
> -	vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
> -	if (!vma->vm_lock)
> -		return false;
> -
> -	init_rwsem(&vma->vm_lock->lock);
> -	vma->vm_lock_seq = UINT_MAX;
> -
> -	return true;
> -}
> -
> -static inline void vma_lock_free(struct vm_area_struct *vma)
> -{
> -	kmem_cache_free(vma_lock_cachep, vma->vm_lock);
> -}
> -
> -#else /* CONFIG_PER_VMA_LOCK */
> -
> -static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
> -static inline void vma_lock_free(struct vm_area_struct *vma) {}
> -
> -#endif /* CONFIG_PER_VMA_LOCK */
> -
>  struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
>  {
>  	struct vm_area_struct *vma;
> @@ -474,10 +445,6 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
>  		return NULL;
>
>  	vma_init(vma, mm);
> -	if (!vma_lock_alloc(vma)) {
> -		kmem_cache_free(vm_area_cachep, vma);
> -		return NULL;
> -	}
>
>  	return vma;
>  }
> @@ -496,10 +463,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  	 * will be reinitialized.
>  	 */
>  	data_race(memcpy(new, orig, sizeof(*new)));
> -	if (!vma_lock_alloc(new)) {
> -		kmem_cache_free(vm_area_cachep, new);
> -		return NULL;
> -	}
> +	vma_lock_init(new);
>  	INIT_LIST_HEAD(&new->anon_vma_chain);
>  	vma_numab_state_init(new);
>  	dup_anon_vma_name(orig, new);
> @@ -511,7 +475,6 @@ void __vm_area_free(struct vm_area_struct *vma)
>  {
>  	vma_numab_state_free(vma);
>  	free_anon_vma_name(vma);
> -	vma_lock_free(vma);
>  	kmem_cache_free(vm_area_cachep, vma);
>  }
>
> @@ -522,7 +485,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
>  						  vm_rcu);
>
>  	/* The vma should not be locked while being destroyed. */
> -	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
> +	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
>  	__vm_area_free(vma);
>  }
>  #endif
> @@ -3168,11 +3131,9 @@ void __init proc_caches_init(void)
>  			sizeof(struct fs_struct), 0,
>  			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
>  			NULL);
> -
> -	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
> -#ifdef CONFIG_PER_VMA_LOCK
> -	vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
> -#endif
> +	vm_area_cachep = KMEM_CACHE(vm_area_struct,
> +			SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
> +			SLAB_ACCOUNT);

Why the SLAB_NO_MERGE?

>  	mmap_init();
>  	nsproxy_cache_init();
>  }
> --
> 2.47.0.277.g8800431eea-goog
>
Vlastimil Babka Nov. 13, 2024, 2:45 p.m. UTC | #2
On 11/13/24 15:28, Lorenzo Stoakes wrote:
> On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
>> Back when per-vma locks were introduced, vm_lock was moved out of
>> vm_area_struct in [1] because of the performance regression caused by
>> false cacheline sharing. Recent investigation [2] revealed that the
>> regression is limited to a rather old Broadwell microarchitecture and
>> even there it can be mitigated by disabling adjacent cacheline
>> prefetching, see [3].
> 
> I don't see a motivating reason as to why we want to do this? We increase
> memory usage here which is not good, but later lock optimisation mitigates

I'd say we don't normally split logically single structures into multiple
structures just because they might pack better in multiple slabs vs single
slab. Because that means more complicated management, extra pointer
dereferences etc. And if that split away part is a lock, it even complicates
things further. So the only motivation for doing that split was that it
appeared to perform better, but that was found to be misleading.

But sure it can be described better, and include the new
SLAB_TYPESAFE_BY_RCU conversion part as the motivation - that would be
likely impossible to do with a split away lock.
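
Roughly, that lockless lookup only works if the lock is guaranteed to stay
valid for as long as the vma memory itself, which SLAB_TYPESAFE_BY_RCU gives
you for an embedded field but not for a separately allocated object.
Hypothetical sketch, modelled on lock_vma_under_rcu()/vma_start_read() from
this series:

	rcu_read_lock();
	vma = mas_walk(&mas);	/* vma may be freed and reused at any point */
	if (vma && vma_start_read(vma)) {
		/*
		 * Fine with an embedded lock: the vma memory stays type-stable,
		 * so &vma->vm_lock.lock is always a valid rwsem to trylock and
		 * a reused vma can be detected afterwards by revalidating it.
		 * With a split-out lock, vma->vm_lock could already have been
		 * freed (and reused as something else) before the trylock.
		 */
	}
	rcu_read_unlock();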

> it, but why wouldn't we just do the lock optimisations and use less memory
> overall?

If the lock is made much smaller then the packing benefit by split might
disappear, as is the case here.

>> This patchset moves vm_lock back into vm_area_struct, aligning it at the
>> cacheline boundary and changing the cache to be cache-aligned as well.
>> This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
>> (vm_lock) bytes to 256 bytes:
>>
>>     slabinfo before:
>>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>>      vma_lock         ...     40  102    1 : ...
>>      vm_area_struct   ...    160   51    2 : ...
> 
> Pedantry, but it might be worth mentioning how much this can vary by config.
> 
> For instance, on my machine:
> 
> vm_area_struct    125238 138820    184   44
> 
>>
>>     slabinfo after moving vm_lock:
>>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>>      vm_area_struct   ...    256   32    2 : ...
>>
>> Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
>> which is 5.5MB per 100000 VMAs. This memory consumption growth can be
>> addressed later by optimizing the vm_lock.
> 
> Yes grabbing this back is of critical importance I'd say! :)

I doubt it's that critical. We'll have to weigh that against introducing
another non-standard locking primitive.

> Functionally it looks ok to me but would like to see a stronger
> justification in the commit msg! :)
> 
>>
>> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
>> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
>> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
>>
>> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>> ---
>>  include/linux/mm.h       | 28 +++++++++++++----------
>>  include/linux/mm_types.h |  6 +++--
>>  kernel/fork.c            | 49 ++++------------------------------------
>>  3 files changed, 25 insertions(+), 58 deletions(-)
>>
Mateusz Guzik Nov. 13, 2024, 2:53 p.m. UTC | #3
On Wed, Nov 13, 2024 at 02:28:16PM +0000, Lorenzo Stoakes wrote:
> On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
> > Back when per-vma locks were introduced, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regression is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
> 
> I don't see a motivating reason as to why we want to do this? We increase
> memory usage here which is not good, but later lock optimisation mitigates
> it, but why wouldn't we just do the lock optimisations and use less memory
> overall?
> 

Where would you put the lock in that case though?

With the patchset it sticks with the affected vma, so no false-sharing
woes concerning other instances of the same struct.

If you make them separately allocated and packed, they false-share
between different vmas using them (in fact this is currently happening).
If you make sure to pad them, that's 64 bytes per obj, majority of which
is empty space.
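
Something like this, roughly (the current vma_lock only wraps the rwsem,
~40 bytes going by the slabinfo in the changelog):

	/* hypothetical padded variant of the split-out lock */
	struct vma_lock {
		struct rw_semaphore lock;
	} ____cacheline_aligned_in_smp;	/* each object now occupies a full 64-byte line */

and the smaller the lock is eventually made, the larger the fraction of that
line that is pure padding.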
Lorenzo Stoakes Nov. 13, 2024, 2:58 p.m. UTC | #4
On Wed, Nov 13, 2024 at 03:45:24PM +0100, Vlastimil Babka wrote:
> On 11/13/24 15:28, Lorenzo Stoakes wrote:
> > On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
> >> Back when per-vma locks were introduced, vm_lock was moved out of
> >> vm_area_struct in [1] because of the performance regression caused by
> >> false cacheline sharing. Recent investigation [2] revealed that the
> >> regression is limited to a rather old Broadwell microarchitecture and
> >> even there it can be mitigated by disabling adjacent cacheline
> >> prefetching, see [3].
> >
> > I don't see a motivating reason as to why we want to do this? We increase
> > memory usage here which is not good, but later lock optimisation mitigates
>
> I'd say we don't normally split logically single structures into multiple
> structures just because they might pack better in multiple slabs vs single
> slab. Because that means more complicated management, extra pointer
> dereferences etc. And if that split away part is a lock, it even complicates
> things further. So the only motivation for doing that split was that it
> appeared to perform better, but that was found to be misleading.
>
> But sure it can be described better, and include the new
> SLAB_TYPESAFE_BY_RCU conversion part as the motivation - that would be
> likely impossible to do with a split away lock.

Right, my point is that there is no justification given here at all, and we
should give one. I understand the provenance of why we split the lock, but
there has to be a motivating reason if everything is working fine right
now.

The SLAB_TYPESAFE_BY_RCU one seems to be the key one, but also something
along the lines of complicated management, concern about ordering of
allocating/freeing things, etc.

Just needs some extra explanation here.

>
> > it, but why wouldn't we just do the lock optimisations and use less memory
> > overall?
>
> If the lock is made much smaller then the packing benefit by split might
> disappear, as is the case here.
>

Yeah, but the lock would be smaller so we'd save space still right?


> >> This patchset moves vm_lock back into vm_area_struct, aligning it at the
> >> cacheline boundary and changing the cache to be cache-aligned as well.
> >> This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> >> (vm_lock) bytes to 256 bytes:
> >>
> >>     slabinfo before:
> >>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> >>      vma_lock         ...     40  102    1 : ...
> >>      vm_area_struct   ...    160   51    2 : ...
> >
> > Pedantry, but it might be worth mentioning how much this can vary by config.
> >
> > For instance, on my machine:
> >
> > vm_area_struct    125238 138820    184   44
> >
> >>
> >>     slabinfo after moving vm_lock:
> >>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> >>      vm_area_struct   ...    256   32    2 : ...
> >>
> >> Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> >> which is 5.5MB per 100000 VMAs. This memory consumption growth can be
> >> addressed later by optimizing the vm_lock.
> >
> > Yes grabbing this back is of critical importance I'd say! :)
>
> I doubt it's that critical. We'll have to weigh that against introducing
> another non-standard locking primitive.

Avoiding unnecessary VMA overhead is important, and a strong part of the
motivation for the guard pages series for instance, so I don't think we
should be unconcerned about unnecessary extra memory usage.

I'm guessing from what you say you're not in favour of the subsequent
'non-standard' locking changes...

>
> > Functionally it looks ok to me but would like to see a stronger
> > justification in the commit msg! :)
> >
> >>
> >> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> >> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> >> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> >>
> >> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> >> ---
> >>  include/linux/mm.h       | 28 +++++++++++++----------
> >>  include/linux/mm_types.h |  6 +++--
> >>  kernel/fork.c            | 49 ++++------------------------------------
> >>  3 files changed, 25 insertions(+), 58 deletions(-)
> >>
Lorenzo Stoakes Nov. 13, 2024, 2:59 p.m. UTC | #5
On Wed, Nov 13, 2024 at 03:53:54PM +0100, Mateusz Guzik wrote:
> On Wed, Nov 13, 2024 at 02:28:16PM +0000, Lorenzo Stoakes wrote:
> > On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
> > > Back when per-vma locks were introduced, vm_lock was moved out of
> > > vm_area_struct in [1] because of the performance regression caused by
> > > false cacheline sharing. Recent investigation [2] revealed that the
> > > regression is limited to a rather old Broadwell microarchitecture and
> > > even there it can be mitigated by disabling adjacent cacheline
> > > prefetching, see [3].
> >
> > I don't see a motivating reason as to why we want to do this? We increase
> > memory usage here which is not good, but later lock optimisation mitigates
> > it, but why wouldn't we just do the lock optimisations and use less memory
> > overall?
> >
>
> Where would you put the lock in that case though?
>
> With the patchset it sticks with the affected vma, so no false-sharing
> woes concerning other instances of the same struct.
>
> If you make them separately allocated and packed, they false-share
> between different vmas using them (in fact this is currently happening).
> If you make sure to pad them, that's 64 bytes per obj, majority of which
> is empty space.

'I don't see a motivating reason' = I don't see it in the commit message.

I'm saying put motivating reasons, like the above, in the commit message.
Lorenzo Stoakes Nov. 13, 2024, 3:01 p.m. UTC | #6
On Wed, Nov 13, 2024 at 02:28:16PM +0000, Lorenzo Stoakes wrote:
> On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
> > Back when per-vma locks were introduced, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regression is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
>
> I don't see a motivating reason as to why we want to do this? We increase
> memory usage here which is not good, but later lock optimisation mitigates
> it, but why wouldn't we just do the lock optimisations and use less memory
> overall?

I worded this badly. To clarify:

I don't see a motivating reason _in the commit message_ as to why we want
to do this.

I am certain there are, in fact Mateusz and Vlastimil have provided them.

So my review is - let's just put these there :)
Vlastimil Babka Nov. 13, 2024, 3:09 p.m. UTC | #7
On 11/13/24 15:58, Lorenzo Stoakes wrote:
> On Wed, Nov 13, 2024 at 03:45:24PM +0100, Vlastimil Babka wrote:
>> On 11/13/24 15:28, Lorenzo Stoakes wrote:
>> > On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
>> >> Back when per-vma locks were introduced, vm_lock was moved out of
>> >> vm_area_struct in [1] because of the performance regression caused by
>> >> false cacheline sharing. Recent investigation [2] revealed that the
>> >> regression is limited to a rather old Broadwell microarchitecture and
>> >> even there it can be mitigated by disabling adjacent cacheline
>> >> prefetching, see [3].
>> >
>> > I don't see a motivating reason as to why we want to do this? We increase
>> > memory usage here which is not good, but later lock optimisation mitigates
>>
>> I'd say we don't normally split logically single structures into multiple
>> structures just because they might pack better in multiple slabs vs single
>> slab. Because that means more complicated management, extra pointer
>> dereferences etc. And if that split away part is a lock, it even complicates
>> things further. So the only motivation for doing that split was that it
>> appeared to perform better, but that was found to be misleading.
>>
>> But sure it can be described better, and include the new
>> SLAB_TYPESAFE_BY_RCU conversion part as the motivation - that would be
>> likely impossible to do with a split away lock.
> 
> Right, my point is that there is no justification given here at all, and we
> should give one. I understand the provenance of why we split the lock, but
> there has to be a motivating reason if everything is working fine right
> now.
> 
> The SLAB_TYPESAFE_BY_RCU one seems to be the key one, but also something
> along the lines of complicated management, concern about ordering of
> allocating/freeing things, etc.
> 
> Just needs some extra explanation here.

Right.

>>
>> > it, but why wouldn't we just do the lock optimisations and use less memory
>> > overall?
>>
>> If the lock is made much smaller then the packing benefit by split might
>> disappear, as is the case here.
>>
> 
> Yeah, but the lock would be smaller so we'd save space still right?

If it becomes small enough that it makes the vma including the lock fit in
192 bytes, then we're no longer saving space by keeping it in a separate
cache. Yes it would make the motivation even more obvious by shrinking it
first and then saying "look, we can even save more space by not having it
separate anymore". But there's the downside of nonstandard locking so I
think the order of changes that first unsplit the lock and only then
consider the pros and cons of shrinking makes more sense.
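
(Back-of-the-envelope, with 64-byte cachelines and the cacheline-aligned
cache rounding objects up to a multiple of 64:

     split today:      160 (vm_area_struct) + 40 (vma_lock) ~= 200 bytes per VMA
     this series:      256 bytes per VMA = 4 cachelines
     shrunk + unsplit: <= 192 bytes per VMA = 3 cachelines, i.e. no bigger
                       than the split layout today.)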
>> >> This patchset moves vm_lock back into vm_area_struct, aligning it at the
>> >> cacheline boundary and changing the cache to be cache-aligned as well.
>> >> This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
>> >> (vm_lock) bytes to 256 bytes:
>> >>
>> >>     slabinfo before:
>> >>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>> >>      vma_lock         ...     40  102    1 : ...
>> >>      vm_area_struct   ...    160   51    2 : ...
>> >
>> > Pedantry, but it might be worth mentioning how much this can vary by config.
>> >
>> > For instance, on my machine:
>> >
>> > vm_area_struct    125238 138820    184   44
>> >
>> >>
>> >>     slabinfo after moving vm_lock:
>> >>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>> >>      vm_area_struct   ...    256   32    2 : ...
>> >>
>> >> Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
>> >> which is 5.5MB per 100000 VMAs. This memory consumption growth can be
>> >> addressed later by optimizing the vm_lock.
>> >
>> > Yes grabbing this back is of critical importance I'd say! :)
>>
>> I doubt it's that critical. We'll have to weigh that against introducing
>> another non-standard locking primitive.
> 
> Avoiding unnecessary VMA overhead is important, and a strong part of the
> motivation for the guard pages series for instance, so I don't think we
> should be unconcerned about unnecessary extra memory usage.

Yeah guard pages eliminate whole VMAs from existence so they are great :)

> I'm guessing from what you say you're not in favour of the subsequent
> 'non-standard' locking changes...

True.

>>
>> > Functionally it looks ok to me but would like to see a stronger
>> > justification in the commit msg! :)
>> >
>> >>
>> >> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
>> >> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
>> >> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
>> >>
>> >> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>> >> ---
>> >>  include/linux/mm.h       | 28 +++++++++++++----------
>> >>  include/linux/mm_types.h |  6 +++--
>> >>  kernel/fork.c            | 49 ++++------------------------------------
>> >>  3 files changed, 25 insertions(+), 58 deletions(-)
>> >>
Suren Baghdasaryan Nov. 13, 2024, 3:42 p.m. UTC | #8
On Wed, Nov 13, 2024 at 6:28 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
> > Back when per-vma locks were introduced, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regression is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
>
> I don't see a motivating reason as to why we want to do this? We increase
> memory usage here which is not good, but later lock optimisation mitigates
> it, but why wouldn't we just do the lock optimisations and use less memory
> overall?
>
> > This patchset moves vm_lock back into vm_area_struct, aligning it at the
> > cacheline boundary and changing the cache to be cache-aligned as well.
> > This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> > (vm_lock) bytes to 256 bytes:
> >
> >     slabinfo before:
> >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> >      vma_lock         ...     40  102    1 : ...
> >      vm_area_struct   ...    160   51    2 : ...
>
> Pedantry, but it might be worth mentioning how much this can vary by config.
>
> For instance, on my machine:
>
> vm_area_struct    125238 138820    184   44

Ack.

>
> >
> >     slabinfo after moving vm_lock:
> >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> >      vm_area_struct   ...    256   32    2 : ...
> >
> > Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> > which is 5.5MB per 100000 VMAs. This memory consumption growth can be
> > addressed later by optimizing the vm_lock.
>
> Yes grabbing this back is of critical importance I'd say! :)
>
> Functionally it looks ok to me but would like to see a stronger
> justification in the commit msg! :)
>
> >
> > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  include/linux/mm.h       | 28 +++++++++++++----------
> >  include/linux/mm_types.h |  6 +++--
> >  kernel/fork.c            | 49 ++++------------------------------------
> >  3 files changed, 25 insertions(+), 58 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 01ce619f3d17..a5eb0be3e351 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -684,6 +684,12 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
> >  #endif /* CONFIG_NUMA_BALANCING */
> >
> >  #ifdef CONFIG_PER_VMA_LOCK
> > +static inline void vma_lock_init(struct vm_area_struct *vma)
> > +{
> > +     init_rwsem(&vma->vm_lock.lock);
> > +     vma->vm_lock_seq = UINT_MAX;
> > +}
> > +
> >  /*
> >   * Try to read-lock a vma. The function is allowed to occasionally yield false
> >   * locked result to avoid performance overhead, in which case we fall back to
> > @@ -701,7 +707,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >       if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> >               return false;
> >
> > -     if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > +     if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> >               return false;
> >
> >       /*
> > @@ -716,7 +722,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >        * This pairs with RELEASE semantics in vma_end_write_all().
> >        */
> >       if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> > -             up_read(&vma->vm_lock->lock);
> > +             up_read(&vma->vm_lock.lock);
> >               return false;
> >       }
> >       return true;
> > @@ -729,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >  static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
> >  {
> >       mmap_assert_locked(vma->vm_mm);
> > -     down_read_nested(&vma->vm_lock->lock, subclass);
> > +     down_read_nested(&vma->vm_lock.lock, subclass);
> >  }
> >
> >  /*
> > @@ -739,13 +745,13 @@ static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
> >  static inline void vma_start_read_locked(struct vm_area_struct *vma)
> >  {
> >       mmap_assert_locked(vma->vm_mm);
> > -     down_read(&vma->vm_lock->lock);
> > +     down_read(&vma->vm_lock.lock);
> >  }
> >
> >  static inline void vma_end_read(struct vm_area_struct *vma)
> >  {
> >       rcu_read_lock(); /* keeps vma alive till the end of up_read */
> > -     up_read(&vma->vm_lock->lock);
> > +     up_read(&vma->vm_lock.lock);
> >       rcu_read_unlock();
> >  }
> >
> > @@ -774,7 +780,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
> >       if (__is_vma_write_locked(vma, &mm_lock_seq))
> >               return;
> >
> > -     down_write(&vma->vm_lock->lock);
> > +     down_write(&vma->vm_lock.lock);
> >       /*
> >        * We should use WRITE_ONCE() here because we can have concurrent reads
> >        * from the early lockless pessimistic check in vma_start_read().
> > @@ -782,7 +788,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
> >        * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> >        */
> >       WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> > -     up_write(&vma->vm_lock->lock);
> > +     up_write(&vma->vm_lock.lock);
> >  }
> >
> >  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > @@ -794,7 +800,7 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> >
> >  static inline void vma_assert_locked(struct vm_area_struct *vma)
> >  {
> > -     if (!rwsem_is_locked(&vma->vm_lock->lock))
> > +     if (!rwsem_is_locked(&vma->vm_lock.lock))
> >               vma_assert_write_locked(vma);
> >  }
> >
> > @@ -827,6 +833,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >
> >  #else /* CONFIG_PER_VMA_LOCK */
> >
> > +static inline void vma_lock_init(struct vm_area_struct *vma) {}
> >  static inline bool vma_start_read(struct vm_area_struct *vma)
> >               { return false; }
> >  static inline void vma_end_read(struct vm_area_struct *vma) {}
> > @@ -861,10 +868,6 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
> >
> >  extern const struct vm_operations_struct vma_dummy_vm_ops;
> >
> > -/*
> > - * WARNING: vma_init does not initialize vma->vm_lock.
> > - * Use vm_area_alloc()/vm_area_free() if vma needs locking.
> > - */
> >  static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> >  {
> >       memset(vma, 0, sizeof(*vma));
> > @@ -873,6 +876,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> >       INIT_LIST_HEAD(&vma->anon_vma_chain);
> >       vma_mark_detached(vma, false);
> >       vma_numab_state_init(vma);
> > +     vma_lock_init(vma);
> >  }
> >
> >  /* Use when VMA is not part of the VMA tree and needs no locking */
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 80fef38d9d64..5c4bfdcfac72 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -716,8 +716,6 @@ struct vm_area_struct {
> >        * slowpath.
> >        */
> >       unsigned int vm_lock_seq;
> > -     /* Unstable RCU readers are allowed to read this. */
> > -     struct vma_lock *vm_lock;
> >  #endif
> >
> >       /*
> > @@ -770,6 +768,10 @@ struct vm_area_struct {
> >       struct vma_numab_state *numab_state;    /* NUMA Balancing state */
> >  #endif
> >       struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     /* Unstable RCU readers are allowed to read this. */
> > +     struct vma_lock vm_lock ____cacheline_aligned_in_smp;
> > +#endif
> >  } __randomize_layout;
> >
> >  #ifdef CONFIG_NUMA
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 0061cf2450ef..7823797e31d2 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -436,35 +436,6 @@ static struct kmem_cache *vm_area_cachep;
> >  /* SLAB cache for mm_struct structures (tsk->mm) */
> >  static struct kmem_cache *mm_cachep;
> >
> > -#ifdef CONFIG_PER_VMA_LOCK
> > -
> > -/* SLAB cache for vm_area_struct.lock */
> > -static struct kmem_cache *vma_lock_cachep;
> > -
> > -static bool vma_lock_alloc(struct vm_area_struct *vma)
> > -{
> > -     vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
> > -     if (!vma->vm_lock)
> > -             return false;
> > -
> > -     init_rwsem(&vma->vm_lock->lock);
> > -     vma->vm_lock_seq = UINT_MAX;
> > -
> > -     return true;
> > -}
> > -
> > -static inline void vma_lock_free(struct vm_area_struct *vma)
> > -{
> > -     kmem_cache_free(vma_lock_cachep, vma->vm_lock);
> > -}
> > -
> > -#else /* CONFIG_PER_VMA_LOCK */
> > -
> > -static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
> > -static inline void vma_lock_free(struct vm_area_struct *vma) {}
> > -
> > -#endif /* CONFIG_PER_VMA_LOCK */
> > -
> >  struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> >  {
> >       struct vm_area_struct *vma;
> > @@ -474,10 +445,6 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> >               return NULL;
> >
> >       vma_init(vma, mm);
> > -     if (!vma_lock_alloc(vma)) {
> > -             kmem_cache_free(vm_area_cachep, vma);
> > -             return NULL;
> > -     }
> >
> >       return vma;
> >  }
> > @@ -496,10 +463,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >        * will be reinitialized.
> >        */
> >       data_race(memcpy(new, orig, sizeof(*new)));
> > -     if (!vma_lock_alloc(new)) {
> > -             kmem_cache_free(vm_area_cachep, new);
> > -             return NULL;
> > -     }
> > +     vma_lock_init(new);
> >       INIT_LIST_HEAD(&new->anon_vma_chain);
> >       vma_numab_state_init(new);
> >       dup_anon_vma_name(orig, new);
> > @@ -511,7 +475,6 @@ void __vm_area_free(struct vm_area_struct *vma)
> >  {
> >       vma_numab_state_free(vma);
> >       free_anon_vma_name(vma);
> > -     vma_lock_free(vma);
> >       kmem_cache_free(vm_area_cachep, vma);
> >  }
> >
> > @@ -522,7 +485,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
> >                                                 vm_rcu);
> >
> >       /* The vma should not be locked while being destroyed. */
> > -     VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
> > +     VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
> >       __vm_area_free(vma);
> >  }
> >  #endif
> > @@ -3168,11 +3131,9 @@ void __init proc_caches_init(void)
> >                       sizeof(struct fs_struct), 0,
> >                       SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
> >                       NULL);
> > -
> > -     vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
> > -#ifdef CONFIG_PER_VMA_LOCK
> > -     vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
> > -#endif
> > +     vm_area_cachep = KMEM_CACHE(vm_area_struct,
> > +                     SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
> > +                     SLAB_ACCOUNT);
>
> Why the SLAB_NO_MERGE?

Ah, I had it there for convenience to be able to see them separately
in /proc/slabinfo. With SLAB_HWCACHE_ALIGN I don't think it matters
for cacheline sharing...
Once we add SLAB_TYPESAFE_BY_RCU this flag won't matter anyway because
it will prevent slab merging.

>
> >       mmap_init();
> >       nsproxy_cache_init();
> >  }
> > --
> > 2.47.0.277.g8800431eea-goog
> >
Suren Baghdasaryan Nov. 13, 2024, 3:45 p.m. UTC | #9
On Wed, Nov 13, 2024 at 7:02 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Nov 13, 2024 at 02:28:16PM +0000, Lorenzo Stoakes wrote:
> > On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
> > > Back when per-vma locks were introduced, vm_lock was moved out of
> > > vm_area_struct in [1] because of the performance regression caused by
> > > false cacheline sharing. Recent investigation [2] revealed that the
> > > regression is limited to a rather old Broadwell microarchitecture and
> > > even there it can be mitigated by disabling adjacent cacheline
> > > prefetching, see [3].
> >
> > I don't see a motivating reason as to why we want to do this? We increase
> > memory usage here which is not good, but later lock optimisation mitigates
> > it, but why wouldn't we just do the lock optimisations and use less memory
> > overall?
>
> I worded this badly. To clarify:
>
> I don't see a motivating reason _in the commit message_ as to why we want
> to do this.
>
> I am certain there are, in fact Mateusz and Vlastimil have provided them.
>
> So my review is - let's just put these there :)

Yeah, I had trouble wording all the reasons because in my head it was
simply "the right thing to do". Now with all your input my job has
become much easier :) Thanks folks!

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 01ce619f3d17..a5eb0be3e351 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -684,6 +684,12 @@  static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_lock_init(struct vm_area_struct *vma)
+{
+	init_rwsem(&vma->vm_lock.lock);
+	vma->vm_lock_seq = UINT_MAX;
+}
+
 /*
  * Try to read-lock a vma. The function is allowed to occasionally yield false
  * locked result to avoid performance overhead, in which case we fall back to
@@ -701,7 +707,7 @@  static inline bool vma_start_read(struct vm_area_struct *vma)
 	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
 		return false;
 
-	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
+	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
 		return false;
 
 	/*
@@ -716,7 +722,7 @@  static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * This pairs with RELEASE semantics in vma_end_write_all().
 	 */
 	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
-		up_read(&vma->vm_lock->lock);
+		up_read(&vma->vm_lock.lock);
 		return false;
 	}
 	return true;
@@ -729,7 +735,7 @@  static inline bool vma_start_read(struct vm_area_struct *vma)
 static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
 {
 	mmap_assert_locked(vma->vm_mm);
-	down_read_nested(&vma->vm_lock->lock, subclass);
+	down_read_nested(&vma->vm_lock.lock, subclass);
 }
 
 /*
@@ -739,13 +745,13 @@  static inline void vma_start_read_locked_nested(struct vm_area_struct *vma, int
 static inline void vma_start_read_locked(struct vm_area_struct *vma)
 {
 	mmap_assert_locked(vma->vm_mm);
-	down_read(&vma->vm_lock->lock);
+	down_read(&vma->vm_lock.lock);
 }
 
 static inline void vma_end_read(struct vm_area_struct *vma)
 {
 	rcu_read_lock(); /* keeps vma alive till the end of up_read */
-	up_read(&vma->vm_lock->lock);
+	up_read(&vma->vm_lock.lock);
 	rcu_read_unlock();
 }
 
@@ -774,7 +780,7 @@  static inline void vma_start_write(struct vm_area_struct *vma)
 	if (__is_vma_write_locked(vma, &mm_lock_seq))
 		return;
 
-	down_write(&vma->vm_lock->lock);
+	down_write(&vma->vm_lock.lock);
 	/*
 	 * We should use WRITE_ONCE() here because we can have concurrent reads
 	 * from the early lockless pessimistic check in vma_start_read().
@@ -782,7 +788,7 @@  static inline void vma_start_write(struct vm_area_struct *vma)
 	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
 	 */
 	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
-	up_write(&vma->vm_lock->lock);
+	up_write(&vma->vm_lock.lock);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -794,7 +800,7 @@  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 
 static inline void vma_assert_locked(struct vm_area_struct *vma)
 {
-	if (!rwsem_is_locked(&vma->vm_lock->lock))
+	if (!rwsem_is_locked(&vma->vm_lock.lock))
 		vma_assert_write_locked(vma);
 }
 
@@ -827,6 +833,7 @@  struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 
 #else /* CONFIG_PER_VMA_LOCK */
 
+static inline void vma_lock_init(struct vm_area_struct *vma) {}
 static inline bool vma_start_read(struct vm_area_struct *vma)
 		{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
@@ -861,10 +868,6 @@  static inline void assert_fault_locked(struct vm_fault *vmf)
 
 extern const struct vm_operations_struct vma_dummy_vm_ops;
 
-/*
- * WARNING: vma_init does not initialize vma->vm_lock.
- * Use vm_area_alloc()/vm_area_free() if vma needs locking.
- */
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
 	memset(vma, 0, sizeof(*vma));
@@ -873,6 +876,7 @@  static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
 	vma_mark_detached(vma, false);
 	vma_numab_state_init(vma);
+	vma_lock_init(vma);
 }
 
 /* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 80fef38d9d64..5c4bfdcfac72 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -716,8 +716,6 @@  struct vm_area_struct {
 	 * slowpath.
 	 */
 	unsigned int vm_lock_seq;
-	/* Unstable RCU readers are allowed to read this. */
-	struct vma_lock *vm_lock;
 #endif
 
 	/*
@@ -770,6 +768,10 @@  struct vm_area_struct {
 	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_PER_VMA_LOCK
+	/* Unstable RCU readers are allowed to read this. */
+	struct vma_lock vm_lock ____cacheline_aligned_in_smp;
+#endif
 } __randomize_layout;
 
 #ifdef CONFIG_NUMA
diff --git a/kernel/fork.c b/kernel/fork.c
index 0061cf2450ef..7823797e31d2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -436,35 +436,6 @@  static struct kmem_cache *vm_area_cachep;
 /* SLAB cache for mm_struct structures (tsk->mm) */
 static struct kmem_cache *mm_cachep;
 
-#ifdef CONFIG_PER_VMA_LOCK
-
-/* SLAB cache for vm_area_struct.lock */
-static struct kmem_cache *vma_lock_cachep;
-
-static bool vma_lock_alloc(struct vm_area_struct *vma)
-{
-	vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
-	if (!vma->vm_lock)
-		return false;
-
-	init_rwsem(&vma->vm_lock->lock);
-	vma->vm_lock_seq = UINT_MAX;
-
-	return true;
-}
-
-static inline void vma_lock_free(struct vm_area_struct *vma)
-{
-	kmem_cache_free(vma_lock_cachep, vma->vm_lock);
-}
-
-#else /* CONFIG_PER_VMA_LOCK */
-
-static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
-static inline void vma_lock_free(struct vm_area_struct *vma) {}
-
-#endif /* CONFIG_PER_VMA_LOCK */
-
 struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
@@ -474,10 +445,6 @@  struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 		return NULL;
 
 	vma_init(vma, mm);
-	if (!vma_lock_alloc(vma)) {
-		kmem_cache_free(vm_area_cachep, vma);
-		return NULL;
-	}
 
 	return vma;
 }
@@ -496,10 +463,7 @@  struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	 * will be reinitialized.
 	 */
 	data_race(memcpy(new, orig, sizeof(*new)));
-	if (!vma_lock_alloc(new)) {
-		kmem_cache_free(vm_area_cachep, new);
-		return NULL;
-	}
+	vma_lock_init(new);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
 	vma_numab_state_init(new);
 	dup_anon_vma_name(orig, new);
@@ -511,7 +475,6 @@  void __vm_area_free(struct vm_area_struct *vma)
 {
 	vma_numab_state_free(vma);
 	free_anon_vma_name(vma);
-	vma_lock_free(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
@@ -522,7 +485,7 @@  static void vm_area_free_rcu_cb(struct rcu_head *head)
 						  vm_rcu);
 
 	/* The vma should not be locked while being destroyed. */
-	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
+	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
 	__vm_area_free(vma);
 }
 #endif
@@ -3168,11 +3131,9 @@  void __init proc_caches_init(void)
 			sizeof(struct fs_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
-
-	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
-#ifdef CONFIG_PER_VMA_LOCK
-	vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
-#endif
+	vm_area_cachep = KMEM_CACHE(vm_area_struct,
+			SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
+			SLAB_ACCOUNT);
 	mmap_init();
 	nsproxy_cache_init();
 }