[0/4] move per-vma lock into vm_area_struct

Message ID 20241111205506.3404479-1-surenb@google.com (mailing list archive)

Message

Suren Baghdasaryan Nov. 11, 2024, 8:55 p.m. UTC
Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of a performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to the rather old Broadwell microarchitecture,
and even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].
This patchset moves vm_lock back into vm_area_struct, aligning it at
a cacheline boundary and making the slab cache cacheline-aligned as
well.
This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
(vm_lock) bytes to 256 bytes:

    slabinfo before:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vma_lock         ...     40  102    1 : ...
     vm_area_struct   ...    160   51    2 : ...

    slabinfo after moving vm_lock:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vm_area_struct   ...    256   32    2 : ...

Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64
pages, an increase of roughly 5.5MB per 100000 VMAs.
To minimize memory overhead, vm_lock implementation is changed from
using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
vm_area_struct members are moved into the last cacheline, resulting
in a less fragmented structure:

struct vm_area_struct {
	union {
		struct {
			long unsigned int vm_start;      /*     0     8 */
			long unsigned int vm_end;        /*     8     8 */
		};                                       /*     0    16 */
		struct callback_head vm_rcu ;            /*     0    16 */
	} __attribute__((__aligned__(8)));               /*     0    16 */
	struct mm_struct *         vm_mm;                /*    16     8 */
	pgprot_t                   vm_page_prot;         /*    24     8 */
	union {
		const vm_flags_t   vm_flags;             /*    32     8 */
		vm_flags_t         __vm_flags;           /*    32     8 */
	};                                               /*    32     8 */
	bool                       detached;             /*    40     1 */

	/* XXX 3 bytes hole, try to pack */

	unsigned int               vm_lock_seq;          /*    44     4 */
	struct list_head           anon_vma_chain;       /*    48    16 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct anon_vma *          anon_vma;             /*    64     8 */
	const struct vm_operations_struct  * vm_ops;     /*    72     8 */
	long unsigned int          vm_pgoff;             /*    80     8 */
	struct file *              vm_file;              /*    88     8 */
	void *                     vm_private_data;      /*    96     8 */
	atomic_long_t              swap_readahead_info;  /*   104     8 */
	struct mempolicy *         vm_policy;            /*   112     8 */

	/* XXX 8 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	struct vma_lock       vm_lock (__aligned__(64)); /*   128     4 */

	/* XXX 4 bytes hole, try to pack */

	struct {
		struct rb_node     rb (__aligned__(8));  /*   136    24 */
		long unsigned int  rb_subtree_last;      /*   160     8 */
	} __attribute__((__aligned__(8))) shared;        /*   136    32 */
	struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   168     0 */

	/* size: 192, cachelines: 3, members: 17 */
	/* sum members: 153, holes: 3, sum holes: 15 */
	/* padding: 24 */
	/* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
} __attribute__((__aligned__(64)));

Memory consumption per 1000 VMAs becomes 48 pages, saving 2 pages compared
to the 50 pages in the baseline:

    slabinfo after vm_area_struct changes:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vm_area_struct   ...    192   42    2 : ...

Performance measurements using the pft test on x86 do not show a
considerable difference; on a Pixel 6 running Android, the change
results in a 3-5% improvement in faults per second.

[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/

Suren Baghdasaryan (4):
  mm: introduce vma_start_read_locked{_nested} helpers
  mm: move per-vma lock into vm_area_struct
  mm: replace rw_semaphore with atomic_t in vma_lock
  mm: move lesser used vma_area_struct members into the last cacheline

 include/linux/mm.h        | 163 +++++++++++++++++++++++++++++++++++---
 include/linux/mm_types.h  |  59 +++++++++-----
 include/linux/mmap_lock.h |   3 +
 kernel/fork.c             |  50 ++----------
 mm/init-mm.c              |   2 +
 mm/userfaultfd.c          |  14 ++--
 6 files changed, 205 insertions(+), 86 deletions(-)


base-commit: 931086f2a88086319afb57cd3925607e8cda0a9f

Comments

Suren Baghdasaryan Nov. 11, 2024, 9:41 p.m. UTC | #1
On Mon, Nov 11, 2024 at 12:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Back when per-vma locks were introduced, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regression is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].
> This patchset moves vm_lock back into vm_area_struct, aligning it at the
> cacheline boundary and changing the cache to be cache-aligned as well.
> This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> (vm_lock) bytes to 256 bytes:
>
>     slabinfo before:
>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>      vma_lock         ...     40  102    1 : ...
>      vm_area_struct   ...    160   51    2 : ...
>
>     slabinfo after moving vm_lock:
>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>      vm_area_struct   ...    256   32    2 : ...
>
> Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> which is 5.5MB per 100000 VMAs.
> To minimize memory overhead, vm_lock implementation is changed from
> using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
> vm_area_struct members are moved into the last cacheline, resulting
> in a less fragmented structure:
>
> struct vm_area_struct {
>         union {
>                 struct {
>                         long unsigned int vm_start;      /*     0     8 */
>                         long unsigned int vm_end;        /*     8     8 */
>                 };                                       /*     0    16 */
>                 struct callback_head vm_rcu ;            /*     0    16 */
>         } __attribute__((__aligned__(8)));               /*     0    16 */
>         struct mm_struct *         vm_mm;                /*    16     8 */
>         pgprot_t                   vm_page_prot;         /*    24     8 */
>         union {
>                 const vm_flags_t   vm_flags;             /*    32     8 */
>                 vm_flags_t         __vm_flags;           /*    32     8 */
>         };                                               /*    32     8 */
>         bool                       detached;             /*    40     1 */
>
>         /* XXX 3 bytes hole, try to pack */
>
>         unsigned int               vm_lock_seq;          /*    44     4 */
>         struct list_head           anon_vma_chain;       /*    48    16 */
>         /* --- cacheline 1 boundary (64 bytes) --- */
>         struct anon_vma *          anon_vma;             /*    64     8 */
>         const struct vm_operations_struct  * vm_ops;     /*    72     8 */
>         long unsigned int          vm_pgoff;             /*    80     8 */
>         struct file *              vm_file;              /*    88     8 */
>         void *                     vm_private_data;      /*    96     8 */
>         atomic_long_t              swap_readahead_info;  /*   104     8 */
>         struct mempolicy *         vm_policy;            /*   112     8 */
>
>         /* XXX 8 bytes hole, try to pack */
>
>         /* --- cacheline 2 boundary (128 bytes) --- */
>         struct vma_lock       vm_lock (__aligned__(64)); /*   128     4 */
>
>         /* XXX 4 bytes hole, try to pack */
>
>         struct {
>                 struct rb_node     rb (__aligned__(8));  /*   136    24 */
>                 long unsigned int  rb_subtree_last;      /*   160     8 */
>         } __attribute__((__aligned__(8))) shared;        /*   136    32 */
>         struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   168     0 */
>
>         /* size: 192, cachelines: 3, members: 17 */
>         /* sum members: 153, holes: 3, sum holes: 15 */
>         /* padding: 24 */
>         /* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
> } __attribute__((__aligned__(64)));
>
> Memory consumption per 1000 VMAs becomes 48 pages, saving 2 pages compared
> to the 50 pages in the baseline:
>
>     slabinfo after vm_area_struct changes:
>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>      vm_area_struct   ...    192   42    2 : ...
>
> Performance measurements using pft test on x86 do not show considerable
> difference, on Pixel 6 running Android it results in 3-5% improvement in
> faults per second.
>
> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/

And of course I forgot to update Lorenzo's new locking documentation :/
Will add that in the next version.

>
> Suren Baghdasaryan (4):
>   mm: introduce vma_start_read_locked{_nested} helpers
>   mm: move per-vma lock into vm_area_struct
>   mm: replace rw_semaphore with atomic_t in vma_lock
>   mm: move lesser used vma_area_struct members into the last cacheline
>
>  include/linux/mm.h        | 163 +++++++++++++++++++++++++++++++++++---
>  include/linux/mm_types.h  |  59 +++++++++-----
>  include/linux/mmap_lock.h |   3 +
>  kernel/fork.c             |  50 ++----------
>  mm/init-mm.c              |   2 +
>  mm/userfaultfd.c          |  14 ++--
>  6 files changed, 205 insertions(+), 86 deletions(-)
>
>
> base-commit: 931086f2a88086319afb57cd3925607e8cda0a9f
> --
> 2.47.0.277.g8800431eea-goog
>
Davidlohr Bueso Nov. 11, 2024, 10:18 p.m. UTC | #2
On Mon, 11 Nov 2024, Suren Baghdasaryan wrote:

>To minimize memory overhead, vm_lock implementation is changed from
>using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
>vm_area_struct members are moved into the last cacheline, resulting
>in a less fragmented structure:

I am not a fan of building a custom lock, replacing a standard one.
How much do we really care about this? rwsems are quite optimized
and are known to heavily affect mm performance altogether.

...

>Performance measurements using pft test on x86 do not show considerable
>difference, on Pixel 6 running Android it results in 3-5% improvement in
>faults per second.

pft is a very micro benchmark, these results do not justify this change, imo.

Thanks,
Davidlohr
Suren Baghdasaryan Nov. 11, 2024, 11:19 p.m. UTC | #3
On Mon, Nov 11, 2024 at 2:18 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> On Mon, 11 Nov 2024, Suren Baghdasaryan wrote:
>
> >To minimize memory overhead, vm_lock implementation is changed from
> >using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
> >vm_area_struct members are moved into the last cacheline, resulting
> >in a less fragmented structure:
>
> I am not a fan of building a custom lock, replacing a standard one.

Understandable.

> How much do we really care about this?

In the Android world I got complaints after introducing per-vma locks.
More specifically, moving from 5.15 to 6.1 kernel, where we first
started using these locks, the memory usage increased by 10MB on
average. Not a huge deal but if we can trim it without too much
complexity, that would be definitely appreciated.

>  rwsems are quite optimized and are known to heavily affect mm performance altogether.

I know, that's why I spent an additional week profiling the new
implementation. I asked Oliver (CC'ed) to rerun the tests he used in
https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/ to
confirm no regressions.

>
> ...
>
> >Performance measurements using pft test on x86 do not show considerable
> >difference, on Pixel 6 running Android it results in 3-5% improvement in
> >faults per second.
>
> pft is a very micro benchmark, these results do not justify this change, imo.

I'm not really trying to claim performance gains here. I just want to
make sure there are no regressions.
Thanks,
Suren.

>
> Thanks,
> Davidlohr
Davidlohr Bueso Nov. 12, 2024, 12:03 a.m. UTC | #4
On Mon, 11 Nov 2024, Suren Baghdasaryan wrote:

>I'm not really trying to claim performance gains here. I just want to
>make sure there are no regressions.

You might also fine-tune the atomics with standard acquire/release
locking semantics; you will probably see better numbers in Android than
what you currently have in patch 3 with full barriers, and it is not
particularly risky as callers expect that behaviour already.

Thanks,
Davidlohr
Andrew Morton Nov. 12, 2024, 12:11 a.m. UTC | #5
On Mon, 11 Nov 2024 15:19:22 -0800 Suren Baghdasaryan <surenb@google.com> wrote:

> On Mon, Nov 11, 2024 at 2:18 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
> >
> > On Mon, 11 Nov 2024, Suren Baghdasaryan wrote:
> >
> > >To minimize memory overhead, vm_lock implementation is changed from
> > >using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
> > >vm_area_struct members are moved into the last cacheline, resulting
> > >in a less fragmented structure:
> >
> > I am not a fan of building a custom lock, replacing a standard one.
> 
> Understandable.

If we're going to invent a new lock type, I'm thinking we should do
that - make it a standalone thing, add full lockdep support, etc.

I wonder if we could remove the lock from the vma altogether and use an
old-fashioned hashed lock.  An array of locks indexed by the vma
address.  It might work well enough, although sizing the array would be
difficult.
Suren Baghdasaryan Nov. 12, 2024, 12:43 a.m. UTC | #6
On Mon, Nov 11, 2024 at 4:03 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> On Mon, 11 Nov 2024, Suren Baghdasaryan wrote:
>
> >I'm not really trying to claim performance gains here. I just want to
> >make sure there are no regressions.
>
> You might also fine tune the atomics with acquire/release standard locking
> semantics, you will probably see better numbers in Android than what you
> currently have in patch 3 with full barriers - and not particularly risky
> as callers expect that behaviour already.

Ack. Will try that. Thanks!

>
> Thanks,
> Davidlohr
Suren Baghdasaryan Nov. 12, 2024, 12:48 a.m. UTC | #7
On Mon, Nov 11, 2024 at 4:11 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 11 Nov 2024 15:19:22 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > On Mon, Nov 11, 2024 at 2:18 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
> > >
> > > On Mon, 11 Nov 2024, Suren Baghdasaryan wrote:
> > >
> > > >To minimize memory overhead, vm_lock implementation is changed from
> > > >using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
> > > >vm_area_struct members are moved into the last cacheline, resulting
> > > >in a less fragmented structure:
> > >
> > > I am not a fan of building a custom lock, replacing a standard one.
> >
> > Understandable.
>
> If we're going to invent a new lock type, I'm thinking we should do
> that - make it a standaline thing, add full lockdep support, etc.

Yeah, that will make it easy to experiment and replace it with a
different lock type if needed.

>
> I wonder if we could remove the lock from the vma altogeher and use an
> old-fashioned hashed lock.  An array of locks indexed by the vma
> address.  It might work well enough, although sizing the array would be
> difficult.

Ok, sounds like I'll need to experiment a bit with different lock
implementations.
I'll post a new version without the last two patches, keeping
rw_semaphore for now.
Thanks!

>
Liam R. Howlett Nov. 12, 2024, 2:35 a.m. UTC | #8
* Suren Baghdasaryan <surenb@google.com> [241111 15:55]:
> Back when per-vma locks were introduced, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regression is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].
> This patchset moves vm_lock back into vm_area_struct, aligning it at the
> cacheline boundary and changing the cache to be cache-aligned as well.
> This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> (vm_lock) bytes to 256 bytes:
> 
>     slabinfo before:
>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>      vma_lock         ...     40  102    1 : ...
>      vm_area_struct   ...    160   51    2 : ...
> 
>     slabinfo after moving vm_lock:
>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>      vm_area_struct   ...    256   32    2 : ...
> 
> Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> which is 5.5MB per 100000 VMAs.
> To minimize memory overhead, vm_lock implementation is changed from
> using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
> vm_area_struct members are moved into the last cacheline, resulting
> in a less fragmented structure:
> 
> struct vm_area_struct {
> 	union {
> 		struct {
> 			long unsigned int vm_start;      /*     0     8 */
> 			long unsigned int vm_end;        /*     8     8 */
> 		};                                       /*     0    16 */
> 		struct callback_head vm_rcu ;            /*     0    16 */
> 	} __attribute__((__aligned__(8)));               /*     0    16 */
> 	struct mm_struct *         vm_mm;                /*    16     8 */
> 	pgprot_t                   vm_page_prot;         /*    24     8 */
> 	union {
> 		const vm_flags_t   vm_flags;             /*    32     8 */
> 		vm_flags_t         __vm_flags;           /*    32     8 */
> 	};                                               /*    32     8 */
> 	bool                       detached;             /*    40     1 */
> 
> 	/* XXX 3 bytes hole, try to pack */
> 
> 	unsigned int               vm_lock_seq;          /*    44     4 */ 
> 	struct list_head           anon_vma_chain;       /*    48    16 */ 40 + 16
> 	/* --- cacheline 1 boundary (64 bytes) --- */
> 	struct anon_vma *          anon_vma;             /*    64     8 */ 56 + 8
> 	const struct vm_operations_struct  * vm_ops;     /*    72     8 */
> 	long unsigned int          vm_pgoff;             /*    80     8 */
> 	struct file *              vm_file;              /*    88     8 */
> 	void *                     vm_private_data;      /*    96     8 */
> 	atomic_long_t              swap_readahead_info;  /*   104     8 */
> 	struct mempolicy *         vm_policy;            /*   112     8 */
> 
> 	/* XXX 8 bytes hole, try to pack */
> 
> 	/* --- cacheline 2 boundary (128 bytes) --- */
> 	struct vma_lock       vm_lock (__aligned__(64)); /*   128     4 */
> 
> 	/* XXX 4 bytes hole, try to pack */
> 
> 	struct {
> 		struct rb_node     rb (__aligned__(8));  /*   136    24 */
> 		long unsigned int  rb_subtree_last;      /*   160     8 */
> 	} __attribute__((__aligned__(8))) shared;        /*   136    32 */
> 	struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   168     0 */

This is 8 bytes in my build; I guess you have userfaultfd disabled?

Considering this will end up being 256 anyway, these changes may not
matter, but we can pack this better:
1. move vm_lock to after anon_vma (ends up at 64B in the end)
2. move vm_lock_seq to after vm_lock
3. move the shared struct to the new 112 offset (which is 8B aligned at 112)
4. move detached to the end of the structure

This should limit the holes and maintain the alignments.

The down side is detached is now in the last used cache line and away
from what would probably be used with it, but maybe it doesn't matter
considering prefetch.

But maybe you have other reasons for the placement?

> 
> 	/* size: 192, cachelines: 3, members: 17 */
> 	/* sum members: 153, holes: 3, sum holes: 15 */
> 	/* padding: 24 */
> 	/* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
> } __attribute__((__aligned__(64)));
> 
> Memory consumption per 1000 VMAs becomes 48 pages, saving 2 pages compared
> to the 50 pages in the baseline:
> 
>     slabinfo after vm_area_struct changes:
>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>      vm_area_struct   ...    192   42    2 : ...
> 
> Performance measurements using pft test on x86 do not show considerable
> difference, on Pixel 6 running Android it results in 3-5% improvement in
> faults per second.
> 
> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> 
> Suren Baghdasaryan (4):
>   mm: introduce vma_start_read_locked{_nested} helpers
>   mm: move per-vma lock into vm_area_struct
>   mm: replace rw_semaphore with atomic_t in vma_lock
>   mm: move lesser used vma_area_struct members into the last cacheline
> 
>  include/linux/mm.h        | 163 +++++++++++++++++++++++++++++++++++---
>  include/linux/mm_types.h  |  59 +++++++++-----
>  include/linux/mmap_lock.h |   3 +
>  kernel/fork.c             |  50 ++----------
>  mm/init-mm.c              |   2 +
>  mm/userfaultfd.c          |  14 ++--
>  6 files changed, 205 insertions(+), 86 deletions(-)
> 
> 
> base-commit: 931086f2a88086319afb57cd3925607e8cda0a9f
> -- 
> 2.47.0.277.g8800431eea-goog
>
Liam R. Howlett Nov. 12, 2024, 2:48 a.m. UTC | #9
* Suren Baghdasaryan <surenb@google.com> [241111 16:41]:
> On Mon, Nov 11, 2024 at 12:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > Back when per-vma locks were introduced, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regression is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
> > This patchset moves vm_lock back into vm_area_struct, aligning it at the
> > cacheline boundary and changing the cache to be cache-aligned as well.
> > This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> > (vm_lock) bytes to 256 bytes:
> >
> >     slabinfo before:
> >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> >      vma_lock         ...     40  102    1 : ...
> >      vm_area_struct   ...    160   51    2 : ...
> >
> >     slabinfo after moving vm_lock:
> >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> >      vm_area_struct   ...    256   32    2 : ...
> >
> > Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> > which is 5.5MB per 100000 VMAs.
> > To minimize memory overhead, vm_lock implementation is changed from
> > using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
> > vm_area_struct members are moved into the last cacheline, resulting
> > in a less fragmented structure:

Wait a second, this is taking 40B down to 8B, but the alignment of the
vma will surely absorb that 32B difference?  The struct sum is 153B
according to what you have below so we won't go over 192B.  What am I
missing?

> >
> > struct vm_area_struct {
> >         union {
> >                 struct {
> >                         long unsigned int vm_start;      /*     0     8 */
> >                         long unsigned int vm_end;        /*     8     8 */
> >                 };                                       /*     0    16 */
> >                 struct callback_head vm_rcu ;            /*     0    16 */
> >         } __attribute__((__aligned__(8)));               /*     0    16 */
> >         struct mm_struct *         vm_mm;                /*    16     8 */
> >         pgprot_t                   vm_page_prot;         /*    24     8 */
> >         union {
> >                 const vm_flags_t   vm_flags;             /*    32     8 */
> >                 vm_flags_t         __vm_flags;           /*    32     8 */
> >         };                                               /*    32     8 */
> >         bool                       detached;             /*    40     1 */
> >
> >         /* XXX 3 bytes hole, try to pack */
> >
> >         unsigned int               vm_lock_seq;          /*    44     4 */
> >         struct list_head           anon_vma_chain;       /*    48    16 */
> >         /* --- cacheline 1 boundary (64 bytes) --- */
> >         struct anon_vma *          anon_vma;             /*    64     8 */
> >         const struct vm_operations_struct  * vm_ops;     /*    72     8 */
> >         long unsigned int          vm_pgoff;             /*    80     8 */
> >         struct file *              vm_file;              /*    88     8 */
> >         void *                     vm_private_data;      /*    96     8 */
> >         atomic_long_t              swap_readahead_info;  /*   104     8 */
> >         struct mempolicy *         vm_policy;            /*   112     8 */
> >
> >         /* XXX 8 bytes hole, try to pack */
> >
> >         /* --- cacheline 2 boundary (128 bytes) --- */
> >         struct vma_lock       vm_lock (__aligned__(64)); /*   128     4 */
> >
> >         /* XXX 4 bytes hole, try to pack */
> >
> >         struct {
> >                 struct rb_node     rb (__aligned__(8));  /*   136    24 */
> >                 long unsigned int  rb_subtree_last;      /*   160     8 */
> >         } __attribute__((__aligned__(8))) shared;        /*   136    32 */
> >         struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   168     0 */
> >
> >         /* size: 192, cachelines: 3, members: 17 */
> >         /* sum members: 153, holes: 3, sum holes: 15 */
> >         /* padding: 24 */
> >         /* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
> > } __attribute__((__aligned__(64)));
> >
> > Memory consumption per 1000 VMAs becomes 48 pages, saving 2 pages compared
> > to the 50 pages in the baseline:
> >
> >     slabinfo after vm_area_struct changes:
> >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> >      vm_area_struct   ...    192   42    2 : ...
> >
> > Performance measurements using pft test on x86 do not show considerable
> > difference, on Pixel 6 running Android it results in 3-5% improvement in
> > faults per second.
> >
> > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> 
> And of course I forgot to update Lorenzo's new locking documentation :/
> Will add that in the next version.
> 
> >
> > Suren Baghdasaryan (4):
> >   mm: introduce vma_start_read_locked{_nested} helpers
> >   mm: move per-vma lock into vm_area_struct
> >   mm: replace rw_semaphore with atomic_t in vma_lock
> >   mm: move lesser used vma_area_struct members into the last cacheline
> >
> >  include/linux/mm.h        | 163 +++++++++++++++++++++++++++++++++++---
> >  include/linux/mm_types.h  |  59 +++++++++-----
> >  include/linux/mmap_lock.h |   3 +
> >  kernel/fork.c             |  50 ++----------
> >  mm/init-mm.c              |   2 +
> >  mm/userfaultfd.c          |  14 ++--
> >  6 files changed, 205 insertions(+), 86 deletions(-)
> >
> >
> > base-commit: 931086f2a88086319afb57cd3925607e8cda0a9f
> > --
> > 2.47.0.277.g8800431eea-goog
> >
Suren Baghdasaryan Nov. 12, 2024, 3:23 a.m. UTC | #10
On Mon, Nov 11, 2024 at 6:35 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [241111 15:55]:
> > Back when per-vma locks were introduced, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regression is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
> > This patchset moves vm_lock back into vm_area_struct, aligning it at the
> > cacheline boundary and changing the cache to be cache-aligned as well.
> > This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> > (vm_lock) bytes to 256 bytes:
> >
> >     slabinfo before:
> >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> >      vma_lock         ...     40  102    1 : ...
> >      vm_area_struct   ...    160   51    2 : ...
> >
> >     slabinfo after moving vm_lock:
> >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> >      vm_area_struct   ...    256   32    2 : ...
> >
> > Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> > which is 5.5MB per 100000 VMAs.
> > To minimize memory overhead, vm_lock implementation is changed from
> > using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
> > vm_area_struct members are moved into the last cacheline, resulting
> > in a less fragmented structure:
> >
> > struct vm_area_struct {
> >       union {
> >               struct {
> >                       long unsigned int vm_start;      /*     0     8 */
> >                       long unsigned int vm_end;        /*     8     8 */
> >               };                                       /*     0    16 */
> >               struct callback_head vm_rcu ;            /*     0    16 */
> >       } __attribute__((__aligned__(8)));               /*     0    16 */
> >       struct mm_struct *         vm_mm;                /*    16     8 */
> >       pgprot_t                   vm_page_prot;         /*    24     8 */
> >       union {
> >               const vm_flags_t   vm_flags;             /*    32     8 */
> >               vm_flags_t         __vm_flags;           /*    32     8 */
> >       };                                               /*    32     8 */
> >       bool                       detached;             /*    40     1 */
> >
> >       /* XXX 3 bytes hole, try to pack */
> >
> >       unsigned int               vm_lock_seq;          /*    44     4 */
> >       struct list_head           anon_vma_chain;       /*    48    16 */ 40 + 16
> >       /* --- cacheline 1 boundary (64 bytes) --- */
> >       struct anon_vma *          anon_vma;             /*    64     8 */ 56 + 8
> >       const struct vm_operations_struct  * vm_ops;     /*    72     8 */
> >       long unsigned int          vm_pgoff;             /*    80     8 */
> >       struct file *              vm_file;              /*    88     8 */
> >       void *                     vm_private_data;      /*    96     8 */
> >       atomic_long_t              swap_readahead_info;  /*   104     8 */
> >       struct mempolicy *         vm_policy;            /*   112     8 */
> >
> >       /* XXX 8 bytes hole, try to pack */
> >
> >       /* --- cacheline 2 boundary (128 bytes) --- */
> >       struct vma_lock       vm_lock (__aligned__(64)); /*   128     4 */
> >
> >       /* XXX 4 bytes hole, try to pack */
> >
> >       struct {
> >               struct rb_node     rb (__aligned__(8));  /*   136    24 */
> >               long unsigned int  rb_subtree_last;      /*   160     8 */
> >       } __attribute__((__aligned__(8))) shared;        /*   136    32 */
> >       struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   168     0 */
>
> This is 8 bytes on my compile, I guess you have userfaultfd disabled?

Yeah, the results I show here are with the defconfig. After I move
things around at the end, we will have enough space to keep everything
under 3 cachelines.

>
> Considering this will end up being 256 anyways these changes may not
> matter, but we can pack this better.
> 1. move vm_lock to after anon_vma (ends up at 64B in the end)
> 2. move vm_lock_seq to after vm_lock

Nope, for performance reasons it's important to keep vm_lock_seq in
the first cacheline. It's used very extensively when read-locking the
VMA.

> 4. move struct to the new 112 offset (which is 8B aligned at 112)
> 3. move detached to the end of the structure

detached also should stay in the first cacheline, otherwise we will
get performance regression. I spent a week experimenting with what we
can and can't move :)

>
> This should limit the holes and maintain the alignments.
>
> The downside is that detached is now in the last used cacheline and away
> from what would probably be used with it, but maybe it doesn't matter
> considering prefetch.

It very much does matter :)

>
> But maybe you have other reasons for the placement?
>
> >
> >       /* size: 192, cachelines: 3, members: 17 */
> >       /* sum members: 153, holes: 3, sum holes: 15 */
> >       /* padding: 24 */
> >       /* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
> > } __attribute__((__aligned__(64)));
> >
> > Memory consumption per 1000 VMAs becomes 48 pages, saving 2 pages compared
> > to the 50 pages in the baseline:
> >
> >     slabinfo after vm_area_struct changes:
> >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> >      vm_area_struct   ...    192   42    2 : ...
> >
> > Performance measurements using the pft test on x86 do not show a
> > considerable difference; on Pixel 6 running Android it results in a
> > 3-5% improvement in faults per second.
> >
> > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> >
> > Suren Baghdasaryan (4):
> >   mm: introduce vma_start_read_locked{_nested} helpers
> >   mm: move per-vma lock into vm_area_struct
> >   mm: replace rw_semaphore with atomic_t in vma_lock
> >   mm: move lesser used vma_area_struct members into the last cacheline
> >
> >  include/linux/mm.h        | 163 +++++++++++++++++++++++++++++++++++---
> >  include/linux/mm_types.h  |  59 +++++++++-----
> >  include/linux/mmap_lock.h |   3 +
> >  kernel/fork.c             |  50 ++----------
> >  mm/init-mm.c              |   2 +
> >  mm/userfaultfd.c          |  14 ++--
> >  6 files changed, 205 insertions(+), 86 deletions(-)
> >
> >
> > base-commit: 931086f2a88086319afb57cd3925607e8cda0a9f
> > --
> > 2.47.0.277.g8800431eea-goog
> >
Suren Baghdasaryan Nov. 12, 2024, 3:27 a.m. UTC | #11
On Mon, Nov 11, 2024 at 6:48 PM 'Liam R. Howlett' via kernel-team
<kernel-team@android.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [241111 16:41]:
> > On Mon, Nov 11, 2024 at 12:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > Back when per-vma locks were introduced, vm_lock was moved out of
> > > vm_area_struct in [1] because of the performance regression caused by
> > > false cacheline sharing. Recent investigation [2] revealed that the
> > > regression is limited to a rather old Broadwell microarchitecture and
> > > even there it can be mitigated by disabling adjacent cacheline
> > > prefetching, see [3].
> > > This patchset moves vm_lock back into vm_area_struct, aligning it at the
> > > cacheline boundary and changing the slab cache to be cacheline-aligned as well.
> > > This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> > > (vm_lock) bytes to 256 bytes:
> > >
> > >     slabinfo before:
> > >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> > >      vma_lock         ...     40  102    1 : ...
> > >      vm_area_struct   ...    160   51    2 : ...
> > >
> > >     slabinfo after moving vm_lock:
> > >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> > >      vm_area_struct   ...    256   32    2 : ...
> > >
> > > Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> > > which is 5.5MB per 100000 VMAs.
> > > To minimize memory overhead, vm_lock implementation is changed from
> > > using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
> > > vm_area_struct members are moved into the last cacheline, resulting
> > > in a less fragmented structure:
>
> Wait a second, this is taking 40B down to 8B, but the alignment of the
> vma will surely absorb that 32B difference?  The struct sum is 153B
> according to what you have below so we won't go over 192B.  What am I
> missing?

Take a look at the last patch in the series called "[PATCH 4/4] mm:
move lesser used vma_area_struct members into the last cacheline". I
move some struct members from the earlier cachelines into cacheline #4
where the vm_lock is staying.
>
> > >
> > > struct vm_area_struct {
> > >         union {
> > >                 struct {
> > >                         long unsigned int vm_start;      /*     0     8 */
> > >                         long unsigned int vm_end;        /*     8     8 */
> > >                 };                                       /*     0    16 */
> > >                 struct callback_head vm_rcu ;            /*     0    16 */
> > >         } __attribute__((__aligned__(8)));               /*     0    16 */
> > >         struct mm_struct *         vm_mm;                /*    16     8 */
> > >         pgprot_t                   vm_page_prot;         /*    24     8 */
> > >         union {
> > >                 const vm_flags_t   vm_flags;             /*    32     8 */
> > >                 vm_flags_t         __vm_flags;           /*    32     8 */
> > >         };                                               /*    32     8 */
> > >         bool                       detached;             /*    40     1 */
> > >
> > >         /* XXX 3 bytes hole, try to pack */
> > >
> > >         unsigned int               vm_lock_seq;          /*    44     4 */
> > >         struct list_head           anon_vma_chain;       /*    48    16 */
> > >         /* --- cacheline 1 boundary (64 bytes) --- */
> > >         struct anon_vma *          anon_vma;             /*    64     8 */
> > >         const struct vm_operations_struct  * vm_ops;     /*    72     8 */
> > >         long unsigned int          vm_pgoff;             /*    80     8 */
> > >         struct file *              vm_file;              /*    88     8 */
> > >         void *                     vm_private_data;      /*    96     8 */
> > >         atomic_long_t              swap_readahead_info;  /*   104     8 */
> > >         struct mempolicy *         vm_policy;            /*   112     8 */
> > >
> > >         /* XXX 8 bytes hole, try to pack */
> > >
> > >         /* --- cacheline 2 boundary (128 bytes) --- */
> > >         struct vma_lock       vm_lock (__aligned__(64)); /*   128     4 */
> > >
> > >         /* XXX 4 bytes hole, try to pack */
> > >
> > >         struct {
> > >                 struct rb_node     rb (__aligned__(8));  /*   136    24 */
> > >                 long unsigned int  rb_subtree_last;      /*   160     8 */
> > >         } __attribute__((__aligned__(8))) shared;        /*   136    32 */
> > >         struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   168     0 */
> > >
> > >         /* size: 192, cachelines: 3, members: 17 */
> > >         /* sum members: 153, holes: 3, sum holes: 15 */
> > >         /* padding: 24 */
> > >         /* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
> > > } __attribute__((__aligned__(64)));
> > >
> > > Memory consumption per 1000 VMAs becomes 48 pages, saving 2 pages compared
> > > to the 50 pages in the baseline:
> > >
> > >     slabinfo after vm_area_struct changes:
> > >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> > >      vm_area_struct   ...    192   42    2 : ...
> > >
> > > Performance measurements using the pft test on x86 do not show a
> > > considerable difference; on Pixel 6 running Android it results in a
> > > 3-5% improvement in faults per second.
> > >
> > > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> >
> > And of course I forgot to update Lorenzo's new locking documentation :/
> > Will add that in the next version.
> >
> > >
> > > Suren Baghdasaryan (4):
> > >   mm: introduce vma_start_read_locked{_nested} helpers
> > >   mm: move per-vma lock into vm_area_struct
> > >   mm: replace rw_semaphore with atomic_t in vma_lock
> > >   mm: move lesser used vma_area_struct members into the last cacheline
> > >
> > >  include/linux/mm.h        | 163 +++++++++++++++++++++++++++++++++++---
> > >  include/linux/mm_types.h  |  59 +++++++++-----
> > >  include/linux/mmap_lock.h |   3 +
> > >  kernel/fork.c             |  50 ++----------
> > >  mm/init-mm.c              |   2 +
> > >  mm/userfaultfd.c          |  14 ++--
> > >  6 files changed, 205 insertions(+), 86 deletions(-)
> > >
> > >
> > > base-commit: 931086f2a88086319afb57cd3925607e8cda0a9f
> > > --
> > > 2.47.0.277.g8800431eea-goog
> > >
>
Lorenzo Stoakes Nov. 12, 2024, 9:39 a.m. UTC | #12
On Mon, Nov 11, 2024 at 12:55:02PM -0800, Suren Baghdasaryan wrote:
> Back when per-vma locks were introduced, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regression is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].

Sorry to nag + add workload, but could you also add changes to the
Documentation/mm/process_addrs.rst file to reflect this as part of the
series? It'd be really cool to have the change as part of your series, so
when it lands we update the documentation with it.

This change fundamentally changes internals that are documented there so it
makes sense to :)

Thanks!
Suren Baghdasaryan Nov. 12, 2024, 3:38 p.m. UTC | #13
On Tue, Nov 12, 2024 at 1:39 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Nov 11, 2024 at 12:55:02PM -0800, Suren Baghdasaryan wrote:
> > Back when per-vma locks were introduced, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regression is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
>
> Sorry to nag + add workload, but could you also add changes to the
> Documentation/mm/process_addrs.rst file to reflect this as part of the
> series? It'd be really cool to have the change as part of your series, so
> when it lands we update the documentation with it.
>
> This change fundamentally changes internals that are documented there so it
> makes sense to :)

Yep, I already prepared the documentation patch to be included in v2.
Sorry for missing it the first time around and thanks for the
reminder.

>
> Thanks!