| Message ID | 1609252514-27795-1-git-send-email-feng.tang@intel.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | [1/2] mm: page_counter: relayout structure to reduce false sharing |
On Tue, Dec 29, 2020 at 10:35:13PM +0800, Feng Tang wrote:
> When checking a memory cgroup related performance regression [1],
> from the perf c2c profiling data, we found high false sharing for
> accessing 'usage' and 'parent'.
>
> On 64 bit systems, 'usage' and 'parent' are close to each other and
> can easily end up in the same cacheline (for cacheline size == 64+ B).
> 'usage' is usually written, while 'parent' is usually read, due to the
> hierarchical counting nature of cgroups.
>
> So move 'parent' to the end of the structure to make sure they are in
> different cache lines.
>
> Following are some performance data with the patch, against v5.11-rc1,
> on several generations of Xeon platforms. Most of the results are
> improvements, with only one malloc case on one platform showing a
> -4.0% regression. Each category below has several subcases run on
> different platforms, and only the worst and best scores are listed:
>
> fio:                       +1.8% ~ +8.3%
> will-it-scale/malloc1:     -4.0% ~ +8.9%
> will-it-scale/page_fault1: no change
> will-it-scale/page_fault2: +2.4% ~ +20.2%
>
> [1] https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
>
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
>
> [...]

LGTM!

Reviewed-by: Roman Gushchin <guro@fb.com>

I wonder if we have the same problem with min/low/high/max?
Maybe try to group all mostly-read-only fields (min, low, high,
max and parent) and separate them with some padding?

Thank you!
On Tue, Dec 29, 2020 at 08:56:42AM -0800, Roman Gushchin wrote:
> On Tue, Dec 29, 2020 at 10:35:13PM +0800, Feng Tang wrote:
> > When checking a memory cgroup related performance regression [1],
> > from the perf c2c profiling data, we found high false sharing for
> > accessing 'usage' and 'parent'.
> >
> > [...]
>
> LGTM!
>
> Reviewed-by: Roman Gushchin <guro@fb.com>

Thanks for the review!

> I wonder if we have the same problem with min/low/high/max?
> Maybe try to group all mostly-read-only fields (min, low, high,
> max and parent) and separate them with some padding?

Yep, we thought about it too. From the current perf c2c profiling data,
I haven't noticed obvious false-sharing hot spots for min/low/high/max
(which are read-mostly).

As for padding, we had a proposal before: page_counter is currently
112 bytes on 64 bit platforms, so padding it to 2 cachelines would only
cost 16 more bytes. If that is fine, I can send another patch or fold
it into this one.

Thanks,
Feng

> Thank you!
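For illustration, a rough sketch of the grouping Roman suggests might look like the code below. It is not a tested or proposed patch: the split between "hot" and "read-mostly" fields and the use of `____cacheline_aligned_in_smp` are assumptions made here for the example, and the actual padding cost depends on which fields end up in each group.

```c
/*
 * Illustrative sketch only (not a tested patch): group the read-mostly
 * limits and 'parent' into their own cacheline, away from the
 * frequently written counters.  ____cacheline_aligned_in_smp (from
 * <linux/cache.h>) inserts the padding; the byte cost depends on which
 * fields are treated as hot.
 */
#include <linux/atomic.h>
#include <linux/cache.h>

struct page_counter {
	/* frequently written on charge/uncharge */
	atomic_long_t usage;

	/* effective memory.min and memory.min usage tracking */
	unsigned long emin;
	atomic_long_t min_usage;
	atomic_long_t children_min_usage;

	/* effective memory.low and memory.low usage tracking */
	unsigned long elow;
	atomic_long_t low_usage;
	atomic_long_t children_low_usage;

	/* legacy */
	unsigned long watermark;
	unsigned long failcnt;

	/* read-mostly: configured limits and the hierarchy pointer */
	unsigned long min ____cacheline_aligned_in_smp;
	unsigned long low;
	unsigned long high;
	unsigned long max;
	struct page_counter *parent;
};
```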
On Tue 29-12-20 22:35:13, Feng Tang wrote:
> When checking a memory cgroup related performance regression [1],
> from the perf c2c profiling data, we found high false sharing for
> accessing 'usage' and 'parent'.
>
> On 64 bit systems, 'usage' and 'parent' are close to each other and
> can easily end up in the same cacheline (for cacheline size == 64+ B).
> 'usage' is usually written, while 'parent' is usually read, due to the
> hierarchical counting nature of cgroups.
>
> So move 'parent' to the end of the structure to make sure they are in
> different cache lines.

Yes, parent is a write-once field, so having it away from other heavy
RW fields makes sense to me.

> Following are some performance data with the patch, against v5.11-rc1,
> on several generations of Xeon platforms. Most of the results are
> improvements, with only one malloc case on one platform showing a
> -4.0% regression. Each category below has several subcases run on
> different platforms, and only the worst and best scores are listed:
>
> fio:                       +1.8% ~ +8.3%
> will-it-scale/malloc1:     -4.0% ~ +8.9%
> will-it-scale/page_fault1: no change
> will-it-scale/page_fault2: +2.4% ~ +20.2%

What is the second number? Std?

> [...]
Hi Michal,

On Mon, Jan 04, 2021 at 02:03:57PM +0100, Michal Hocko wrote:
> On Tue 29-12-20 22:35:13, Feng Tang wrote:
> > [...]
> >
> > fio:                       +1.8% ~ +8.3%
> > will-it-scale/malloc1:     -4.0% ~ +8.9%
> > will-it-scale/page_fault1: no change
> > will-it-scale/page_fault2: +2.4% ~ +20.2%
>
> What is the second number? Std?

For each case like 'page_fault2', I run several subcases on different
generations of Xeon, and only list the lowest (first number) and
highest (second number) scores.

For 'page_fault2' there are 5 results: +3.6%, +2.4%, +10.4%, +20.2%
and +4.7%, of which +2.4% and +20.2% are listed.

Thanks,
Feng
On Mon 04-01-21 21:34:45, Feng Tang wrote:
> Hi Michal,
>
> On Mon, Jan 04, 2021 at 02:03:57PM +0100, Michal Hocko wrote:
> > On Tue 29-12-20 22:35:13, Feng Tang wrote:
> > > [...]
> > >
> > > fio:                       +1.8% ~ +8.3%
> > > will-it-scale/malloc1:     -4.0% ~ +8.9%
> > > will-it-scale/page_fault1: no change
> > > will-it-scale/page_fault2: +2.4% ~ +20.2%
> >
> > What is the second number? Std?
>
> For each case like 'page_fault2', I run several subcases on different
> generations of Xeon, and only list the lowest (first number) and
> highest (second number) scores.
>
> For 'page_fault2' there are 5 results: +3.6%, +2.4%, +10.4%, +20.2%
> and +4.7%, of which +2.4% and +20.2% are listed.

This should really be explained in the changelog, and ideally mention
the model as well. Seeing a std would be appreciated too.
On Mon, Jan 04, 2021 at 03:11:40PM +0100, Michal Hocko wrote:
> On Mon 04-01-21 21:34:45, Feng Tang wrote:
> > For each case like 'page_fault2', I run several subcases on different
> > generations of Xeon, and only list the lowest (first number) and
> > highest (second number) scores.
> >
> > For 'page_fault2' there are 5 results: +3.6%, +2.4%, +10.4%, +20.2%
> > and +4.7%, of which +2.4% and +20.2% are listed.
>
> This should really be explained in the changelog, and ideally mention
> the model as well. Seeing a std would be appreciated too.

I guess I haven't made it clear (due to my poor English :))

The five scores are for different parameters on different HW:

Cascadelake (100%)   77844   +3.6%   80667  will-it-scale.per_process_ops
Cascadelake (50%)   182475   +2.4%  186866  will-it-scale.per_process_ops
Haswell (100%)       84870  +10.4%   93671  will-it-scale.per_process_ops
Haswell (50%)       197684  +20.2%  237585  will-it-scale.per_process_ops
Newer Xeon (50%)    268569   +4.7%  281320  will-it-scale.per_process_ops

+2.4% is the lowest improvement, while +20.2% is the highest.

100% means the number of forked test processes equals the CPU count,
while 50% means half of it. Each configuration has been run several
times, and the scores are consistent without big deviations.

Thanks,
Feng
On Mon 04-01-21 22:44:02, Feng Tang wrote:
> On Mon, Jan 04, 2021 at 03:11:40PM +0100, Michal Hocko wrote:
> > [...]
> >
> > This should really be explained in the changelog, and ideally mention
> > the model as well. Seeing a std would be appreciated too.
>
> I guess I haven't made it clear (due to my poor English :))
>
> The five scores are for different parameters on different HW:
>
> Cascadelake (100%)   77844   +3.6%   80667  will-it-scale.per_process_ops
> Cascadelake (50%)   182475   +2.4%  186866  will-it-scale.per_process_ops
> Haswell (100%)       84870  +10.4%   93671  will-it-scale.per_process_ops
> Haswell (50%)       197684  +20.2%  237585  will-it-scale.per_process_ops
> Newer Xeon (50%)    268569   +4.7%  281320  will-it-scale.per_process_ops
>
> +2.4% is the lowest improvement, while +20.2% is the highest.

Please make sure to document these results in the changelog.

> 100% means the number of forked test processes equals the CPU count,
> while 50% means half of it. Each configuration has been run several
> times, and the scores are consistent without big deviations.

It is still good practice to mention the number of runs and the std.
```diff
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 85bd413..6795913 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -12,7 +12,6 @@ struct page_counter {
 	unsigned long low;
 	unsigned long high;
 	unsigned long max;
-	struct page_counter *parent;
 
 	/* effective memory.min and memory.min usage tracking */
 	unsigned long emin;
@@ -27,6 +26,14 @@ struct page_counter {
 	/* legacy */
 	unsigned long watermark;
 	unsigned long failcnt;
+
+	/*
+	 * 'parent' is placed here to be far from 'usage' to reduce
+	 * cache false sharing, as 'usage' is written mostly while
+	 * parent is frequently read for cgroup's hierarchical
+	 * counting nature.
+	 */
+	struct page_counter *parent;
 };
 
 #if BITS_PER_LONG == 32
```
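The property the relayout relies on could be spelled out with a hypothetical compile-time check along the lines of the sketch below. It is not part of the patch; it assumes a 64-bit build with 64-byte cachelines and an embedding object that does not straddle a cacheline boundary (which is exactly what the padding discussion above is about), and the helper name is made up for illustration.

```c
/*
 * Hypothetical sanity check, not part of the patch: after the move,
 * 'parent' sits at least one 64-byte cacheline away from 'usage'
 * (offset 0), so the two can only share a line if struct page_counter
 * itself straddles a cacheline boundary.
 */
#include <linux/build_bug.h>
#include <linux/page_counter.h>
#include <linux/stddef.h>

static inline void page_counter_layout_check(void)
{
#ifdef CONFIG_64BIT
	BUILD_BUG_ON(offsetof(struct page_counter, parent) -
		     offsetof(struct page_counter, usage) < 64);
#endif
}
```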
When checking a memory cgroup related performance regression [1],
from the perf c2c profiling data, we found high false sharing for
accessing 'usage' and 'parent'.

On 64 bit systems, 'usage' and 'parent' are close to each other and
can easily end up in the same cacheline (for cacheline size == 64+ B).
'usage' is usually written, while 'parent' is usually read, due to the
hierarchical counting nature of cgroups.

So move 'parent' to the end of the structure to make sure they are in
different cache lines.

Following are some performance data with the patch, against v5.11-rc1,
on several generations of Xeon platforms. Most of the results are
improvements, with only one malloc case on one platform showing a
-4.0% regression. Each category below has several subcases run on
different platforms, and only the worst and best scores are listed:

fio:                       +1.8% ~ +8.3%
will-it-scale/malloc1:     -4.0% ~ +8.9%
will-it-scale/page_fault1: no change
will-it-scale/page_fault2: +2.4% ~ +20.2%

[1] https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/

Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/page_counter.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
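The effect the changelog describes can be mimicked in a few lines of userspace code. The sketch below is not kernel code: it only reproduces the access pattern (one thread hammering a 'usage'-like counter while another repeatedly reads an adjacent 'parent'-like pointer), assuming 64-byte cachelines; rebuilding with `-DPAD=64` moves the two fields onto separate cachelines and typically makes the run noticeably faster on x86.

```c
/*
 * Userspace sketch of the false-sharing pattern described above; it is
 * not kernel code and only mimics the page_counter access pattern.
 * Build:  gcc -O2 -pthread fs_demo.c            (fields share a cacheline)
 *         gcc -O2 -pthread -DPAD=64 fs_demo.c   (fields are split apart)
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#ifndef PAD
#define PAD 1	/* bytes between the two fields */
#endif

struct counter {
	atomic_long usage;		/* mostly written, like page_counter::usage */
	char pad[PAD];
	struct counter *_Atomic parent;	/* mostly read, like page_counter::parent */
};

#define ITERS 200000000UL

static void *writer(void *arg)
{
	struct counter *c = arg;

	for (unsigned long i = 0; i < ITERS; i++)
		atomic_fetch_add_explicit(&c->usage, 1, memory_order_relaxed);
	return NULL;
}

static void *reader(void *arg)
{
	struct counter *c = arg;
	uintptr_t sum = 0;

	for (unsigned long i = 0; i < ITERS; i++)
		sum += (uintptr_t)atomic_load_explicit(&c->parent,
						       memory_order_relaxed);
	return (void *)sum;	/* keep the loads from being optimized out */
}

int main(void)
{
	/* 128 bytes, cacheline aligned, so PAD=64 really moves 'parent' */
	struct counter *c = aligned_alloc(64, 128);
	pthread_t w, r;
	struct timespec t0, t1;

	if (!c)
		return 1;
	atomic_init(&c->usage, 0);
	atomic_init(&c->parent, c);	/* any non-NULL value will do */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	pthread_create(&w, NULL, writer, c);
	pthread_create(&r, NULL, reader, c);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("pad=%d: %.2fs\n", PAD,
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	free(c);
	return 0;
}
```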