diff mbox series

[v3] percpu-internal/pcpu_chunk: Re-layout pcpu_chunk structure to reduce false sharing

Message ID 20230610030730.110074-1-yu.ma@intel.com (mailing list archive)
State New
Headers show
Series [v3] percpu-internal/pcpu_chunk: Re-layout pcpu_chunk structure to reduce false sharing | expand

Commit Message

Yu Ma June 10, 2023, 3:07 a.m. UTC
When running UnixBench/Execl throughput case, false sharing is observed
due to frequent read on base_addr and write on free_bytes, chunk_md.

UnixBench/Execl represents a class of workload where bash scripts
are spawned frequently to do some short jobs. It will do system call on
execl frequently, and execl will call mm_init to initialize mm_struct
of the process. mm_init will call __percpu_counter_init for
percpu_counters initialization. Then pcpu_alloc is called to read
the base_addr of pcpu_chunk for memory allocation. Inside pcpu_alloc,
it will call pcpu_alloc_area  to allocate memory from a specified chunk.
This function will update "free_bytes" and "chunk_md" to record the
rest free bytes and other meta data for this chunk. Correspondingly,
pcpu_free_area will also update these 2 members when free memory.
Call trace from perf is as below:
+   57.15%  0.01%  execl   [kernel.kallsyms] [k] __percpu_counter_init
+   57.13%  0.91%  execl   [kernel.kallsyms] [k] pcpu_alloc
-   55.27% 54.51%  execl   [kernel.kallsyms] [k] osq_lock
   - 53.54% 0x654278696e552f34
        main
        __execve
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        __x64_sys_execve
        do_execveat_common.isra.47
        alloc_bprm
        mm_init
        __percpu_counter_init
        pcpu_alloc
      - __mutex_lock.isra.17

In current pcpu_chunk layout, ‘base_addr’ is in the same cache line
with ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the
last 8 bytes. This patch moves ‘bound_map’ up to ‘base_addr’,
to let ‘base_addr’ locate in a new cacheline.

With this change, on Intel Sapphire Rapids 112C/224T platform,
based on v6.4-rc4, the 160 parallel score improves by 24%.

Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
 mm/percpu-internal.h | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

Comments

Andrew Morton June 12, 2023, 9:43 p.m. UTC | #1
On Fri,  9 Jun 2023 23:07:30 -0400 Yu Ma <yu.ma@intel.com> wrote:

> When running UnixBench/Execl throughput case, false sharing is observed
> due to frequent read on base_addr and write on free_bytes, chunk_md.
> 
> UnixBench/Execl represents a class of workload where bash scripts
> are spawned frequently to do some short jobs. It will do system call on
> execl frequently, and execl will call mm_init to initialize mm_struct
> of the process. mm_init will call __percpu_counter_init for
> percpu_counters initialization. Then pcpu_alloc is called to read
> the base_addr of pcpu_chunk for memory allocation. Inside pcpu_alloc,
> it will call pcpu_alloc_area  to allocate memory from a specified chunk.
> This function will update "free_bytes" and "chunk_md" to record the
> rest free bytes and other meta data for this chunk. Correspondingly,
> pcpu_free_area will also update these 2 members when free memory.
> Call trace from perf is as below:
> +   57.15%  0.01%  execl   [kernel.kallsyms] [k] __percpu_counter_init
> +   57.13%  0.91%  execl   [kernel.kallsyms] [k] pcpu_alloc
> -   55.27% 54.51%  execl   [kernel.kallsyms] [k] osq_lock
>    - 53.54% 0x654278696e552f34
>         main
>         __execve
>         entry_SYSCALL_64_after_hwframe
>         do_syscall_64
>         __x64_sys_execve
>         do_execveat_common.isra.47
>         alloc_bprm
>         mm_init
>         __percpu_counter_init
>         pcpu_alloc
>       - __mutex_lock.isra.17
> 
> In current pcpu_chunk layout, ‘base_addr’ is in the same cache line
> with ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the
> last 8 bytes. This patch moves ‘bound_map’ up to ‘base_addr’,
> to let ‘base_addr’ locate in a new cacheline.
> 
> With this change, on Intel Sapphire Rapids 112C/224T platform,
> based on v6.4-rc4, the 160 parallel score improves by 24%.

Well that's nice.

>
> ...
>
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -41,10 +41,17 @@ struct pcpu_chunk {
>  	struct list_head	list;		/* linked to pcpu_slot lists */
>  	int			free_bytes;	/* free bytes in the chunk */
>  	struct pcpu_block_md	chunk_md;
> -	void			*base_addr;	/* base address of this chunk */
> +	unsigned long		*bound_map;	/* boundary map */
> +	
> +	/* 
> +	 * base_addr is the base address of this chunk.
> +	 * To reduce false sharing, current layout is optimized to make sure
> +	 * base_addr locate in the different cacheline with free_bytes and
> +	 * chunk_md.
> +	 */
> +	void			*base_addr ____cacheline_aligned_in_smp;
>  
>  	unsigned long		*alloc_map;	/* allocation map */
> -	unsigned long		*bound_map;	/* boundary map */
>  	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
>  
>  	void			*data;		/* chunk data */

This will of course consume more memory.  Do we have a feel for the
worst-case impact of this?
Dennis Zhou June 12, 2023, 9:55 p.m. UTC | #2
Hi Andrew,

On Mon, Jun 12, 2023 at 02:43:31PM -0700, Andrew Morton wrote:
> On Fri,  9 Jun 2023 23:07:30 -0400 Yu Ma <yu.ma@intel.com> wrote:
> 
> > When running UnixBench/Execl throughput case, false sharing is observed
> > due to frequent read on base_addr and write on free_bytes, chunk_md.
> > 
> > UnixBench/Execl represents a class of workload where bash scripts
> > are spawned frequently to do some short jobs. It will do system call on
> > execl frequently, and execl will call mm_init to initialize mm_struct
> > of the process. mm_init will call __percpu_counter_init for
> > percpu_counters initialization. Then pcpu_alloc is called to read
> > the base_addr of pcpu_chunk for memory allocation. Inside pcpu_alloc,
> > it will call pcpu_alloc_area  to allocate memory from a specified chunk.
> > This function will update "free_bytes" and "chunk_md" to record the
> > rest free bytes and other meta data for this chunk. Correspondingly,
> > pcpu_free_area will also update these 2 members when free memory.
> > Call trace from perf is as below:
> > +   57.15%  0.01%  execl   [kernel.kallsyms] [k] __percpu_counter_init
> > +   57.13%  0.91%  execl   [kernel.kallsyms] [k] pcpu_alloc
> > -   55.27% 54.51%  execl   [kernel.kallsyms] [k] osq_lock
> >    - 53.54% 0x654278696e552f34
> >         main
> >         __execve
> >         entry_SYSCALL_64_after_hwframe
> >         do_syscall_64
> >         __x64_sys_execve
> >         do_execveat_common.isra.47
> >         alloc_bprm
> >         mm_init
> >         __percpu_counter_init
> >         pcpu_alloc
> >       - __mutex_lock.isra.17
> > 
> > In current pcpu_chunk layout, ‘base_addr’ is in the same cache line
> > with ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the
> > last 8 bytes. This patch moves ‘bound_map’ up to ‘base_addr’,
> > to let ‘base_addr’ locate in a new cacheline.
> > 
> > With this change, on Intel Sapphire Rapids 112C/224T platform,
> > based on v6.4-rc4, the 160 parallel score improves by 24%.
> 
> Well that's nice.
> 
> >
> > ...
> >
> > --- a/mm/percpu-internal.h
> > +++ b/mm/percpu-internal.h
> > @@ -41,10 +41,17 @@ struct pcpu_chunk {
> >  	struct list_head	list;		/* linked to pcpu_slot lists */
> >  	int			free_bytes;	/* free bytes in the chunk */
> >  	struct pcpu_block_md	chunk_md;
> > -	void			*base_addr;	/* base address of this chunk */
> > +	unsigned long		*bound_map;	/* boundary map */
> > +	
> > +	/* 
> > +	 * base_addr is the base address of this chunk.
> > +	 * To reduce false sharing, current layout is optimized to make sure
> > +	 * base_addr locate in the different cacheline with free_bytes and
> > +	 * chunk_md.
> > +	 */
> > +	void			*base_addr ____cacheline_aligned_in_smp;
> >  
> >  	unsigned long		*alloc_map;	/* allocation map */
> > -	unsigned long		*bound_map;	/* boundary map */
> >  	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
> >  
> >  	void			*data;		/* chunk data */
> 
> This will of course consume more memory.  Do we have a feel for the
> worst-case impact of this?
> 

The pcpu_chunk struct is a backing data structure per chunk, so the
additional memory should not be dramatic. A chunk covers ballpark
between 64kb and 512kb memory depending on some config and boot time
stuff, so I believe the additional memory used here is nominal at best.

Working the #s on my desktop:
Percpu:            58624 kB
28 cores -> ~2.1MB of percpu memory.
At say ~128KB per chunk -> 33 chunks, generously 40 chunks.
Adding alignment might bump the chunk size ~64 bytes, so in total ~2KB
of overhead?

I believe we can do a little better to avoid eating that full padding,
so likely less than that.

Acked-by: Dennis Zhou <dennis@kernel.org>

Thanks,
Dennis
Yu Ma June 13, 2023, 5:41 p.m. UTC | #3
> Hi Andrew,
> 
> On Mon, Jun 12, 2023 at 02:43:31PM -0700, Andrew Morton wrote:
> > On Fri,  9 Jun 2023 23:07:30 -0400 Yu Ma <yu.ma@intel.com> wrote:
> >
> > > When running UnixBench/Execl throughput case, false sharing is
> > > observed due to frequent read on base_addr and write on free_bytes,
> chunk_md.
> > >
> > > UnixBench/Execl represents a class of workload where bash scripts
> > > are spawned frequently to do some short jobs. It will do system call
> > > on execl frequently, and execl will call mm_init to initialize
> > > mm_struct of the process. mm_init will call __percpu_counter_init
> > > for percpu_counters initialization. Then pcpu_alloc is called to
> > > read the base_addr of pcpu_chunk for memory allocation. Inside
> > > pcpu_alloc, it will call pcpu_alloc_area  to allocate memory from a
> specified chunk.
> > > This function will update "free_bytes" and "chunk_md" to record the
> > > rest free bytes and other meta data for this chunk. Correspondingly,
> > > pcpu_free_area will also update these 2 members when free memory.
> > > Call trace from perf is as below:
> > > +   57.15%  0.01%  execl   [kernel.kallsyms] [k] __percpu_counter_init
> > > +   57.13%  0.91%  execl   [kernel.kallsyms] [k] pcpu_alloc
> > > -   55.27% 54.51%  execl   [kernel.kallsyms] [k] osq_lock
> > >    - 53.54% 0x654278696e552f34
> > >         main
> > >         __execve
> > >         entry_SYSCALL_64_after_hwframe
> > >         do_syscall_64
> > >         __x64_sys_execve
> > >         do_execveat_common.isra.47
> > >         alloc_bprm
> > >         mm_init
> > >         __percpu_counter_init
> > >         pcpu_alloc
> > >       - __mutex_lock.isra.17
> > >
> > > In current pcpu_chunk layout, ‘base_addr’ is in the same cache line
> > > with ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the last 8
> > > bytes. This patch moves ‘bound_map’ up to ‘base_addr’, to let
> > > ‘base_addr’ locate in a new cacheline.
> > >
> > > With this change, on Intel Sapphire Rapids 112C/224T platform, based
> > > on v6.4-rc4, the 160 parallel score improves by 24%.
> >
> > Well that's nice.
> >
> > >
> > > ...
> > >
> > > --- a/mm/percpu-internal.h
> > > +++ b/mm/percpu-internal.h
> > > @@ -41,10 +41,17 @@ struct pcpu_chunk {
> > >  	struct list_head	list;		/* linked to pcpu_slot lists */
> > >  	int			free_bytes;	/* free bytes in the chunk */
> > >  	struct pcpu_block_md	chunk_md;
> > > -	void			*base_addr;	/* base address of this chunk
> */
> > > +	unsigned long		*bound_map;	/* boundary map */
> > > +
> > > +	/*
> > > +	 * base_addr is the base address of this chunk.
> > > +	 * To reduce false sharing, current layout is optimized to make sure
> > > +	 * base_addr locate in the different cacheline with free_bytes and
> > > +	 * chunk_md.
> > > +	 */
> > > +	void			*base_addr ____cacheline_aligned_in_smp;
> > >
> > >  	unsigned long		*alloc_map;	/* allocation map */
> > > -	unsigned long		*bound_map;	/* boundary map */
> > >  	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
> > >
> > >  	void			*data;		/* chunk data */
> >
> > This will of course consume more memory.  Do we have a feel for the
> > worst-case impact of this?
> >
> 
> The pcpu_chunk struct is a backing data structure per chunk, so the
> additional memory should not be dramatic. A chunk covers ballpark between
> 64kb and 512kb memory depending on some config and boot time stuff, so I
> believe the additional memory used here is nominal at best.
> 
> Working the #s on my desktop:
> Percpu:            58624 kB
> 28 cores -> ~2.1MB of percpu memory.
> At say ~128KB per chunk -> 33 chunks, generously 40 chunks.
> Adding alignment might bump the chunk size ~64 bytes, so in total ~2KB of
> overhead?
> 
> I believe we can do a little better to avoid eating that full padding, so likely
> less than that.
> 
> Acked-by: Dennis Zhou <dennis@kernel.org>
>

Thanks Andrew and Dennis for agreement on the patch. 
The layout of this structure (printed by pahole) before and after this patch is as below for reference. 
The default size is 136 Bytes with 3 cachelines. With patch v3, it is 192 Bytes with 56 extra padding.
For "____cacheline_aligned_in_smp ", initially it was not added with the same concern on memory, 
as it can obtain the same performance gain with reshuffle on base_addr only. Thanks to Dennis with
expertise on the overall usage, it is added to be more clear and bring convenience for future changes.

--default v6.4-rc4--
struct pcpu_chunk {
        struct list_head           list;                 /*     0    16 */
        int                        free_bytes;           /*    16     4 */
        struct pcpu_block_md       chunk_md;             /*    20    32 */

        /* XXX 4 bytes hole, try to pack */

        void *                     base_addr;            /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        long unsigned int *        alloc_map;            /*    64     8 */
        long unsigned int *        bound_map;            /*    72     8 */
        struct pcpu_block_md *     md_blocks;            /*    80     8 */
        void *                     data;                 /*    88     8 */
        bool                       immutable;            /*    96     1 */
        bool                       isolated;             /*    97     1 */

        /* XXX 2 bytes hole, try to pack */

        int                        start_offset;         /*   100     4 */
        int                        end_offset;           /*   104     4 */

        /* XXX 4 bytes hole, try to pack */

        struct obj_cgroup * *      obj_cgroups;          /*   112     8 */
        int                        nr_pages;             /*   120     4 */
        int                        nr_populated;         /*   124     4 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        int                        nr_empty_pop_pages;   /*   128     4 */

        /* XXX 4 bytes hole, try to pack */

        long unsigned int          populated[];          /*   136     0 */

        /* size: 136, cachelines: 3, members: 17 */
        /* sum members: 122, holes: 4, sum holes: 14 */
        /* last cacheline: 8 bytes */
};

--with patch v3--
struct pcpu_chunk {
        struct list_head           list;                 /*     0    16 */
        int                        free_bytes;           /*    16     4 */
        struct pcpu_block_md       chunk_md;             /*    20    32 */

        /* XXX 4 bytes hole, try to pack */

        long unsigned int *        bound_map;            /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        void *                     base_addr;            /*    64     8 */
        long unsigned int *        alloc_map;            /*    72     8 */
        struct pcpu_block_md *     md_blocks;            /*    80     8 */
        void *                     data;                 /*    88     8 */
        bool                       immutable;            /*    96     1 */
        bool                       isolated;             /*    97     1 */

        /* XXX 2 bytes hole, try to pack */

        int                        start_offset;         /*   100     4 */
        int                        end_offset;           /*   104     4 */

        /* XXX 4 bytes hole, try to pack */

        struct obj_cgroup * *      obj_cgroups;          /*   112     8 */
        int                        nr_pages;             /*   120     4 */
        int                        nr_populated;         /*   124     4 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        int                        nr_empty_pop_pages;   /*   128     4 */

        /* XXX 4 bytes hole, try to pack */

        long unsigned int          populated[];          /*   136     0 */

        /* size: 192, cachelines: 3, members: 17 */
        /* sum members: 122, holes: 4, sum holes: 14 */
        /* padding: 56 */
};


Regards
Yu

> Thanks,
> Dennis
diff mbox series

Patch

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index f9847c131998..7f108b25bb93 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -41,10 +41,17 @@  struct pcpu_chunk {
 	struct list_head	list;		/* linked to pcpu_slot lists */
 	int			free_bytes;	/* free bytes in the chunk */
 	struct pcpu_block_md	chunk_md;
-	void			*base_addr;	/* base address of this chunk */
+	unsigned long		*bound_map;	/* boundary map */
+	
+	/* 
+	 * base_addr is the base address of this chunk.
+	 * To reduce false sharing, current layout is optimized to make sure
+	 * base_addr locate in the different cacheline with free_bytes and
+	 * chunk_md.
+	 */
+	void			*base_addr ____cacheline_aligned_in_smp;
 
 	unsigned long		*alloc_map;	/* allocation map */
-	unsigned long		*bound_map;	/* boundary map */
 	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
 
 	void			*data;		/* chunk data */