
[1/2] mm: page_counter: relayout structure to reduce false sharing

Message ID 1609252514-27795-1-git-send-email-feng.tang@intel.com (mailing list archive)
State New, archived

Commit Message

Feng Tang Dec. 29, 2020, 2:35 p.m. UTC
While checking a memory-cgroup-related performance regression [1],
the perf c2c profiling data showed heavy false sharing between
accesses to 'usage' and 'parent'.

On 64-bit systems, 'usage' and 'parent' sit close to each other and
easily end up in the same cacheline (for cacheline sizes of 64 bytes
or more). 'usage' is mostly written, while 'parent' is mostly read,
due to the cgroup's hierarchical counting nature.

So move 'parent' to the end of the structure to make sure the two
fields are in different cache lines.

Following are some performance results with the patch, against
v5.11-rc1, on several generations of Xeon platforms. Most of the
results are improvements, with only one malloc case on one platform
showing a -4.0% regression. Each category below has several subcases
run on different platforms, and only the worst and best scores are
listed:

fio:				 +1.8% ~  +8.3%
will-it-scale/malloc1:		 -4.0% ~  +8.9%
will-it-scale/page_fault1:	 no change
will-it-scale/page_fault2:	 +2.4% ~  +20.2%

[1] https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/page_counter.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

Roman Gushchin Dec. 29, 2020, 4:56 p.m. UTC | #1
On Tue, Dec 29, 2020 at 10:35:13PM +0800, Feng Tang wrote:
> [...]
> 
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 85bd413..6795913 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -12,7 +12,6 @@ struct page_counter {
>  	unsigned long low;
>  	unsigned long high;
>  	unsigned long max;
> -	struct page_counter *parent;
>  
>  	/* effective memory.min and memory.min usage tracking */
>  	unsigned long emin;
> @@ -27,6 +26,14 @@ struct page_counter {
>  	/* legacy */
>  	unsigned long watermark;
>  	unsigned long failcnt;
> +
> +	/*
> +	 * 'parent' is placed here to be far from 'usage' to reduce
> +	 * cache false sharing, as 'usage' is written mostly while
> +	 * parent is frequently read for cgroup's hierarchical
> +	 * counting nature.
> +	 */
> +	struct page_counter *parent;
>  };

LGTM!

Reviewed-by: Roman Gushchin <guro@fb.com>

I wonder if we have the same problem with min/low/high/max?
Maybe try to group all mostly-read-only fields (min, low, high,
max and parent) and separate them with some padding?

Thank you!
Feng Tang Dec. 30, 2020, 2:19 p.m. UTC | #2
On Tue, Dec 29, 2020 at 08:56:42AM -0800, Roman Gushchin wrote:
> On Tue, Dec 29, 2020 at 10:35:13PM +0800, Feng Tang wrote:
> > [...]
> 
> LGTM!
> 
> Reviewed-by: Roman Gushchin <guro@fb.com>

Thanks for the review!

> I wonder if we have the same problem with min/low/high/max?
> Maybe try to group all mostly-read-only fields (min, low, high,
> max and parent) and separate them with some padding?

Yep, we thought about that too. In the current perf c2c profiling
data, I haven't noticed obvious false-sharing hot spots for
min/low/high/max (which are read-mostly).

As for padding, we had a proposal before: the current page_counter
on 64-bit platforms is 112 bytes, so padding it to 2 cachelines
will only cost 16 more bytes. If that is acceptable, I can send
another patch or fold it into this one.

Thanks,
Feng

> Thank you!
Michal Hocko Jan. 4, 2021, 1:03 p.m. UTC | #3
On Tue 29-12-20 22:35:13, Feng Tang wrote:
> When checking a memory cgroup related performance regression [1],
> from the perf c2c profiling data, we found high false sharing for
> accessing 'usage' and 'parent'.
> 
> On 64 bit system, the 'usage' and 'parent' are close to each other,
> and easy to be in one cacheline (for cacheline size == 64+ B). 'usage'
> is usally written, while 'parent' is usually read as the cgroup's
> hierarchical counting nature.
> 
> So move the 'parent' to the end of the structure to make sure they
> are in different cache lines.

Yes, 'parent' is a write-once field, so keeping it away from other
heavily read/written fields makes sense to me.
 
> Following are some performance data with the patch, against
> v5.11-rc1, on several generations of Xeon platforms. Most of the
> results are improvements, with only one malloc case on one platform
> shows a -4.0% regression. Each category below has several subcases
> run on different platform, and only the worst and best scores are
> listed:
> 
> fio:				 +1.8% ~  +8.3%
> will-it-scale/malloc1:		 -4.0% ~  +8.9%
> will-it-scale/page_fault1:	 no change
> will-it-scale/page_fault2:	 +2.4% ~  +20.2%

What is the second number? Std?

> [...]
Feng Tang Jan. 4, 2021, 1:34 p.m. UTC | #4
Hi Michal,

On Mon, Jan 04, 2021 at 02:03:57PM +0100, Michal Hocko wrote:
> On Tue 29-12-20 22:35:13, Feng Tang wrote:
> > [...]
> 
> Yes, parent is write-once field so having it away from other heavy RW
> fields makes sense to me.
>  
> > Following are some performance data with the patch, against
> > v5.11-rc1, on several generations of Xeon platforms. Most of the
> > results are improvements, with only one malloc case on one platform
> > shows a -4.0% regression. Each category below has several subcases
> > run on different platform, and only the worst and best scores are
> > listed:
> > 
> > fio:				 +1.8% ~  +8.3%
> > will-it-scale/malloc1:		 -4.0% ~  +8.9%
> > will-it-scale/page_fault1:	 no change
> > will-it-scale/page_fault2:	 +2.4% ~  +20.2%
> 
> What is the second number? Std?

For each case like 'page_fault2', I ran several subcases on
different generations of Xeon, and listed only the lowest (first
number) and highest (second number) scores.

There were 5 runs, with results +3.6%, +2.4%, +10.4%, +20.2% and
+4.7%; hence +2.4% and +20.2% are listed.

Thanks,
Feng
Michal Hocko Jan. 4, 2021, 2:11 p.m. UTC | #5
On Mon 04-01-21 21:34:45, Feng Tang wrote:
> Hi Michal,
> 
> On Mon, Jan 04, 2021 at 02:03:57PM +0100, Michal Hocko wrote:
> > On Tue 29-12-20 22:35:13, Feng Tang wrote:
> > > [...]
> > 
> > What is the second number? Std?
> 
> For each case like 'page_fault2', I run several subcases on different
> generations of Xeon, and only listed the lowest (first number) and
> highest (second number) scores.
> 
> There are 5 runs and the result are: +3.6%, +2.4%, +10.4%, +20.2%,
> +4.7%, and +2.4% and +20.2% are listed.

This should really be explained in the changelog, ideally mentioning
the CPU model as well. Seeing a std deviation would be appreciated
too.
Feng Tang Jan. 4, 2021, 2:44 p.m. UTC | #6
On Mon, Jan 04, 2021 at 03:11:40PM +0100, Michal Hocko wrote:
> On Mon 04-01-21 21:34:45, Feng Tang wrote:
> > Hi Michal,
> > 
> > On Mon, Jan 04, 2021 at 02:03:57PM +0100, Michal Hocko wrote:
> > > On Tue 29-12-20 22:35:13, Feng Tang wrote:
> > > > [...]
> > > 
> > > What is the second number? Std?
> > 
> > For each case like 'page_fault2', I run several subcases on different
> > generations of Xeon, and only listed the lowest (first number) and
> > highest (second number) scores.
> > 
> > There are 5 runs and the result are: +3.6%, +2.4%, +10.4%, +20.2%,
> > +4.7%, and +2.4% and +20.2% are listed.
> 
> This should be really explained in the changelog and ideally mention the
> model as well. Seeing a std would be appreciated as well.

I guess I haven't made it clear (due to my poor English :))

The five scores are for different parameters on different HW:

Cascadelake (100%)    77844    +3.6%    80667   will-it-scale.per_process_ops
Cascadelake  (50%)   182475    +2.4%   186866   will-it-scale.per_process_ops
Haswell     (100%)    84870   +10.4%    93671   will-it-scale.per_process_ops
Haswell      (50%)   197684   +20.2%   237585   will-it-scale.per_process_ops
Newer Xeon   (50%)   268569    +4.7%   281320   will-it-scale.per_process_ops

+2.4% is the lowest improvement, while +20.2% is the highest. 

100% means the number of forked test processes equals the number of
CPUs, while 50% means half of that. Each line has been run several
times, and the scores are consistent without big deviations.

Thanks,
Feng
Michal Hocko Jan. 4, 2021, 3:34 p.m. UTC | #7
On Mon 04-01-21 22:44:02, Feng Tang wrote:
> On Mon, Jan 04, 2021 at 03:11:40PM +0100, Michal Hocko wrote:
> > On Mon 04-01-21 21:34:45, Feng Tang wrote:
> > > Hi Michal,
> > > 
> > > On Mon, Jan 04, 2021 at 02:03:57PM +0100, Michal Hocko wrote:
> > > > [...]
> > 
> > This should be really explained in the changelog and ideally mention the
> > model as well. Seeing a std would be appreciated as well.
> 
> I guess I haven't made it clear (due to my poor English :))
> 
> The five scores are for different parameters on different HW:
> 
> Cascadelake (100%)    77844    +3.6%    80667   will-it-scale.per_process_ops
> Cascadelake  (50%)   182475    +2.4%   186866   will-it-scale.per_process_ops
> Haswell     (100%)    84870   +10.4%    93671   will-it-scale.per_process_ops
> Haswell      (50%)   197684   +20.2%   237585   will-it-scale.per_process_ops
> Newer Xeon   (50%)   268569    +4.7%   281320   will-it-scale.per_process_ops
> 
> +2.4% is the lowest improvement, while +20.2% is the highest. 

Please make sure to document these results in the changelog.

> 100% means the number of forked test processes eqauls to CPU number,
> while 50% is the half. Each line has been runed several times, whose score
> are consistent without big deviations.

It is still good practice to mention the number of runs and the std
deviation.

Patch

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 85bd413..6795913 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -12,7 +12,6 @@  struct page_counter {
 	unsigned long low;
 	unsigned long high;
 	unsigned long max;
-	struct page_counter *parent;
 
 	/* effective memory.min and memory.min usage tracking */
 	unsigned long emin;
@@ -27,6 +26,14 @@  struct page_counter {
 	/* legacy */
 	unsigned long watermark;
 	unsigned long failcnt;
+
+	/*
+	 * 'parent' is placed here to be far from 'usage' to reduce
+	 * cache false sharing, as 'usage' is written mostly while
+	 * parent is frequently read for cgroup's hierarchical
+	 * counting nature.
+	 */
+	struct page_counter *parent;
 };
 
 #if BITS_PER_LONG == 32