[v4,1/4] mm/pageblock: mitigate cmpxchg false sharing in pageblock flags

Message ID 1599116482-7410-1-git-send-email-alex.shi@linux.alibaba.com (mailing list archive)
State New, archived

Commit Message

Alex Shi Sept. 3, 2020, 7:01 a.m. UTC
pageblock_flags is accessed as a long. Since each pageblock's flags are
only 4 bits, a long packs the flags of 8 (on a 32-bit machine) or 16
pageblocks, so setting one pageblock's flags has to cmpxchg against 7 or
15 other pageblocks' flags as well. That can mean long waits for the
compare-exchange to succeed.

If we change the pageblock_flags storage to char, we can use a char-sized
cmpxchg, which covers only 2 pageblocks' flags. This relieves the false
sharing in the cmpxchg.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/mmzone.h          |  6 +++---
 include/linux/pageblock-flags.h |  2 +-
 mm/page_alloc.c                 | 38 +++++++++++++++++++-------------------
 3 files changed, 23 insertions(+), 23 deletions(-)
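
For illustration, here is a minimal user-space sketch of the byte-wide
update loop the patch introduces (GCC __atomic builtins; set_flags_byte
and its layout constants are illustrative assumptions, not the kernel API):

#include <stdint.h>

/* Each pageblock owns 4 flag bits, so one byte holds 2 pageblocks' flags
 * and a byte-wide CAS contends with a single neighbouring pageblock,
 * instead of the 7 or 15 neighbours packed into a long. `flags` must fit
 * in 4 bits. */
static void set_flags_byte(uint8_t *bitmap, unsigned long pb, uint8_t flags)
{
	unsigned long shift = (pb * 4) % 8;	/* bit offset inside the byte */
	uint8_t *byte = &bitmap[pb / 2];	/* 2 pageblocks per byte */
	uint8_t mask = 0xf << shift;
	uint8_t old = __atomic_load_n(byte, __ATOMIC_RELAXED);

	/* Retry until no concurrent writer changed this byte under us;
	 * a failed CAS reloads `old` with the byte's current value. */
	while (!__atomic_compare_exchange_n(byte, &old,
			(uint8_t)((old & ~mask) | (flags << shift)),
			0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
		;
}

A concurrent update can still force a retry here, but only if it lands in
the same byte; with long-wide storage, any of the 8 or 16 co-resident
pageblocks could do so.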

Comments

Mel Gorman Sept. 3, 2020, 7:24 a.m. UTC | #1
On Thu, Sep 03, 2020 at 03:01:20PM +0800, Alex Shi wrote:
> pageblock_flags is accessed as a long. Since each pageblock's flags are
> only 4 bits, a long packs the flags of 8 (on a 32-bit machine) or 16
> pageblocks, so setting one pageblock's flags has to cmpxchg against 7 or
> 15 other pageblocks' flags as well. That can mean long waits for the
> compare-exchange to succeed.
> 
> If we change the pageblock_flags storage to char, we can use a char-sized
> cmpxchg, which covers only 2 pageblocks' flags. This relieves the false
> sharing in the cmpxchg.
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

Pageblock types were not known to change at a frequency high enough to
cause a measurable performance drop. If anything, the performance cost of
pageblocks is in the lookup paths, which are far more frequent.

What was the workload you were running that altered pageblocks at a high
enough frequency for collisions to occur when updating adjacent
pageblocks?
David Hildenbrand Sept. 3, 2020, 8:19 a.m. UTC | #2
On 03.09.20 09:01, Alex Shi wrote:
> pageblock_flags is accessed as a long. Since each pageblock's flags are
> only 4 bits, a long packs the flags of 8 (on a 32-bit machine) or 16
> pageblocks, so setting one pageblock's flags has to cmpxchg against 7 or
> 15 other pageblocks' flags as well. That can mean long waits for the
> compare-exchange to succeed.
> 
> If we change the pageblock_flags storage to char, we can use a char-sized
> cmpxchg, which covers only 2 pageblocks' flags. This relieves the false
> sharing in the cmpxchg.
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Could you please

1. Send a cover letter and indicate the changes between versions. I
cannot find any in my mailbox or on -mm - is there any? (Also, is there
a patch 4?)

2. Report proper performance numbers as requested, especially over
multiple runs. This should go into patch 1/2. Are they buried somewhere?

3. Clarify what patch 3 does: Do we actually waste 8*sizeof(long) where
we only need 4 bits?

Also, breaking stuff in patch 1 and fixing it in patch 3 is not
acceptable. This breaks git bisect. Skimming over the patches I think
this is the case.

I am not convinced yet that we need and want this. As Alex mentioned, we
touch pageblock flags while holding the zone lock already in most cases -
and as Mel mentions, updates should be rare.
Alex Shi Sept. 3, 2020, 8:32 a.m. UTC | #3
On 2020/9/3 3:24 PM, Mel Gorman wrote:
> On Thu, Sep 03, 2020 at 03:01:20PM +0800, Alex Shi wrote:
>> pageblock_flags is accessed as a long. Since each pageblock's flags are
>> only 4 bits, a long packs the flags of 8 (on a 32-bit machine) or 16
>> pageblocks, so setting one pageblock's flags has to cmpxchg against 7 or
>> 15 other pageblocks' flags as well. That can mean long waits for the
>> compare-exchange to succeed.
>>
>> If we change the pageblock_flags storage to char, we can use a char-sized
>> cmpxchg, which covers only 2 pageblocks' flags. This relieves the false
>> sharing in the cmpxchg.
>>
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> 
> Pageblock types were not known to change at a frequency high enough to
> cause a measurable performance drop. If anything, the performance cost of
> pageblocks is in the lookup paths, which are far more frequent.

Yes, it is not a hot path. But it is still meaningful to reduce
cmpxchg-level false sharing, which is logically wrong in any case.


> 
> What was the workload you were running that altered pageblocks at a high
> enough frequency for collisions to occur when updating adjacent
> pageblocks?
> 

I have run thpscale with THP's 'always' defrag setting. The Amean stddev is
much larger than the very small reduction in average run time.

But patch 4 below can show the cmpxchg retries dropping from thousands to
hundreds or fewer.

Subject: [PATCH v4 4/4] add cmpxchg tracing

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
---
 include/trace/events/pageblock.h | 30 ++++++++++++++++++++++++++++++
 mm/page_alloc.c                  |  4 ++++
 2 files changed, 34 insertions(+)
 create mode 100644 include/trace/events/pageblock.h

diff --git a/include/trace/events/pageblock.h b/include/trace/events/pageblock.h
new file mode 100644
index 000000000000..003c2d716f82
--- /dev/null
+++ b/include/trace/events/pageblock.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM pageblock
+
+#if !defined(_TRACE_PAGEBLOCK_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PAGEBLOCK_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(hit_cmpxchg,
+
+	TP_PROTO(char byte),
+
+	TP_ARGS(byte),
+
+	TP_STRUCT__entry(
+		__field(char, byte)
+	),
+
+	TP_fast_assign(
+		__entry->byte = byte;
+	),
+
+	TP_printk("%d", __entry->byte)
+);
+
+#endif /* _TRACE_PAGEBLOCK_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8b65d83d8be6..a6d7159295bc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -509,6 +509,9 @@ static __always_inline int get_pfnblock_migratetype(struct page *page, unsigned
  * @pfn: The target page frame number
  * @mask: mask of bits that the caller is interested in
  */
+#define CREATE_TRACE_POINTS
+#include <trace/events/pageblock.h>
+
 void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
 					unsigned long pfn,
 					unsigned long mask)
@@ -532,6 +535,7 @@ void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
 		if (byte == old_byte)
 			break;
 		byte = old_byte;
+		trace_hit_cmpxchg(byte);
 	}
 }

--
1.8.3.1
Alex Shi Sept. 3, 2020, 8:40 a.m. UTC | #4
On 2020/9/3 4:32 PM, Alex Shi wrote:
>>
> I have run thpscale with THP's 'always' defrag setting. The Amean stddev is
> much larger than the very small reduction in average run time.
> 
> But patch 4 below can show the cmpxchg retries dropping from thousands to
> hundreds or fewer.
> 
> Subject: [PATCH v4 4/4] add cmpxchg tracing


Here is a typical result with the patchset:

 Performance counter stats for './run-mmtests.sh -c configs/config-workload-thpscale pageblock-c':

             9,564      compaction:mm_compaction_isolate_migratepages
             6,430      compaction:mm_compaction_isolate_freepages
             5,287      compaction:mm_compaction_migratepages
            45,299      compaction:mm_compaction_begin
            45,299      compaction:mm_compaction_end
            30,557      compaction:mm_compaction_try_to_compact_pages
            95,540      compaction:mm_compaction_finished
           149,379      compaction:mm_compaction_suitable
                 0      compaction:mm_compaction_deferred
                 0      compaction:mm_compaction_defer_compaction
             3,949      compaction:mm_compaction_defer_reset
                 0      compaction:mm_compaction_kcompactd_sleep
                 0      compaction:mm_compaction_wakeup_kcompactd
                 0      compaction:mm_compaction_kcompactd_wake
                68      pageblock:hit_cmpxchg

     113.570974583 seconds time elapsed

      14.664451000 seconds user
      96.847116000 seconds sys

And here is the 5.9-rc2 base kernel result:

 Performance counter stats for './run-mmtests.sh -c configs/config-workload-thpscale rc2-e':

            15,920      compaction:mm_compaction_isolate_migratepages
            20,523      compaction:mm_compaction_isolate_freepages
             9,752      compaction:mm_compaction_migratepages
            27,773      compaction:mm_compaction_begin
            27,773      compaction:mm_compaction_end
            16,391      compaction:mm_compaction_try_to_compact_pages
            62,809      compaction:mm_compaction_finished
            69,821      compaction:mm_compaction_suitable
                 0      compaction:mm_compaction_deferred
                 0      compaction:mm_compaction_defer_compaction
             7,875      compaction:mm_compaction_defer_reset
                 0      compaction:mm_compaction_kcompactd_sleep
                 0      compaction:mm_compaction_wakeup_kcompactd
                 0      compaction:mm_compaction_kcompactd_wake
             1,208      pageblock:hit_cmpxchg

     116.440414591 seconds time elapsed

      15.326913000 seconds user
     103.752758000 seconds sys
Vlastimil Babka Sept. 3, 2020, 9 a.m. UTC | #5
On 9/3/20 10:40 AM, Alex Shi wrote:
> 
> 
On 2020/9/3 4:32 PM, Alex Shi wrote:
>>>
>> I have run thpscale with THP's 'always' defrag setting. The Amean stddev is
>> much larger than the very small reduction in average run time.
>> 
>> But patch 4 below can show the cmpxchg retries dropping from thousands to
>> hundreds or fewer.
>> 
>> Subject: [PATCH v4 4/4] add cmpxchg tracing
> 
> 
> Here is a typical result with the patchset:
> 
>  Performance counter stats for './run-mmtests.sh -c configs/config-workload-thpscale pageblock-c':
> 
>              9,564      compaction:mm_compaction_isolate_migratepages
>              6,430      compaction:mm_compaction_isolate_freepages
>              5,287      compaction:mm_compaction_migratepages
>             45,299      compaction:mm_compaction_begin
>             45,299      compaction:mm_compaction_end
>             30,557      compaction:mm_compaction_try_to_compact_pages
>             95,540      compaction:mm_compaction_finished
>            149,379      compaction:mm_compaction_suitable
>                  0      compaction:mm_compaction_deferred
>                  0      compaction:mm_compaction_defer_compaction
>              3,949      compaction:mm_compaction_defer_reset
>                  0      compaction:mm_compaction_kcompactd_sleep
>                  0      compaction:mm_compaction_wakeup_kcompactd
>                  0      compaction:mm_compaction_kcompactd_wake
>                 68      pageblock:hit_cmpxchg
> 
>      113.570974583 seconds time elapsed
> 
>       14.664451000 seconds user
>       96.847116000 seconds sys
> 
> And here is the 5.9-rc2 base kernel result:
> 
>  Performance counter stats for './run-mmtests.sh -c configs/config-workload-thpscale rc2-e':
> 
>             15,920      compaction:mm_compaction_isolate_migratepages
>             20,523      compaction:mm_compaction_isolate_freepages
>              9,752      compaction:mm_compaction_migratepages
>             27,773      compaction:mm_compaction_begin
>             27,773      compaction:mm_compaction_end
>             16,391      compaction:mm_compaction_try_to_compact_pages
>             62,809      compaction:mm_compaction_finished
>             69,821      compaction:mm_compaction_suitable
>                  0      compaction:mm_compaction_deferred
>                  0      compaction:mm_compaction_defer_compaction
>              7,875      compaction:mm_compaction_defer_reset
>                  0      compaction:mm_compaction_kcompactd_sleep
>                  0      compaction:mm_compaction_wakeup_kcompactd
>                  0      compaction:mm_compaction_kcompactd_wake
>              1,208      pageblock:hit_cmpxchg
> 
>      116.440414591 seconds time elapsed
> 
>       15.326913000 seconds user
>      103.752758000 seconds sys

The runs differ wildly in many of the other stats, so I'm not sure they are
really comparable. I guess you could show the fraction of hit_cmpxchg out of
all cmpxchg attempts. But there's also the danger of the tracepoints widening
the race window.

In the end what matters is how these 1208 retries contribute to runtime. I
doubt they could really be visible in a 100+ second run, though.
Alex Shi Sept. 3, 2020, 9:14 a.m. UTC | #6
On 2020/9/3 4:19 PM, David Hildenbrand wrote:
> On 03.09.20 09:01, Alex Shi wrote:
>> pageblock_flags is accessed as a long. Since each pageblock's flags are
>> only 4 bits, a long packs the flags of 8 (on a 32-bit machine) or 16
>> pageblocks, so setting one pageblock's flags has to cmpxchg against 7 or
>> 15 other pageblocks' flags as well. That can mean long waits for the
>> compare-exchange to succeed.
>>
>> If we change the pageblock_flags storage to char, we can use a char-sized
>> cmpxchg, which covers only 2 pageblocks' flags. This relieves the false
>> sharing in the cmpxchg.
>>
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Mel Gorman <mgorman@techsingularity.net>
>> Cc: linux-mm@kvack.org
>> Cc: linux-kernel@vger.kernel.org
> 
> Could you please
> 
> 1. Send a cover letter and indicate the changes between versions. I
> cannot find any in my mailbox or on -mm - is there any? (Also, is there
> a patch 4?)

Hi David,

Thanks for the comments!
I thought a short patchset didn't need a cover letter. Here it is:

cmpxchg is a lockless way to update data: read the target, compute a new
value, then compare-and-exchange, retrying if the target data was changed
by someone else in the meantime.

So we should compare exactly the size of the target data. If the data is
only a byte but the cmpxchg compares a whole long word, that leads to
cmpxchg-level false sharing, causing more retries, each of which locks
memory again. That is a clear logical error and should be fixed (see the
sketch after the changelog below).

v1, the initial version
v2, fix the missing byte-sized cmpxchg (cmpxchgb) on armv6.
v3, fix missing byte-sized cmpxchg on more arches: armv6, sh2, sparc32, xtensa.
v4, redo the cmpxchgb fix via NO_CMPXCHG_BYTE in arch/Kconfig
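
Here is that sketch - the old long-wide update loop in user-space C (GCC
__atomic builtins; set_flags_word is an illustrative name, 64-bit longs
are assumed, and this is not the kernel code itself):

#include <stdint.h>

/* Every retry re-reads the whole long, so a concurrent update to ANY of
 * the 15 neighbouring pageblocks sharing that long forces another
 * iteration, even though the 4-bit fields never overlap. */
static void set_flags_word(uint64_t *bitmap, unsigned long pb, uint64_t flags)
{
	unsigned long shift = (pb * 4) % 64;	/* bit offset inside the long */
	uint64_t *word = &bitmap[pb / 16];	/* 16 pageblocks per long */
	uint64_t mask = 0xfULL << shift;
	uint64_t old = __atomic_load_n(word, __ATOMIC_RELAXED);

	while (!__atomic_compare_exchange_n(word, &old,
			(old & ~mask) | (flags << shift),
			0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
		;	/* a failed CAS reloads `old` with the current word */
}

Two threads updating, say, pageblock 0 and pageblock 1 both CAS on
bitmap[0] here, so each can force the other to retry.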

> 
> 2. Report proper performance numbers as requested, especially over
> multiple runs. This should go into patch 1/2. Are they buried somewhere?

Some results were posted at
https://lkml.org/lkml/2020/8/30/95

> 
> 3. Clarify what patch 3 does: Do we actually waste 8*sizeof(long) where
> we only need 4 bits?

No, there is no waste: the current kernel packs the 4-bit flags into longs,
which makes cmpxchg compare 8 pageblocks' flags at once, 7 of which are not
needed.
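
To make the arithmetic concrete, a small sketch of the index math (pb is a
hypothetical pageblock number; NR_PAGEBLOCK_BITS == 4 and 64-bit longs are
assumed):

#include <stdio.h>

int main(void)
{
	unsigned long pb = 5;			/* a hypothetical pageblock  */
	unsigned long bitidx = pb * 4;		/* 4 flag bits per pageblock */

	/* old scheme: 16 pageblocks share each long */
	printf("long %lu, shift %lu\n", bitidx / 64, bitidx % 64);
	/* new scheme: only 2 pageblocks share each byte */
	printf("byte %lu, shift %lu\n", bitidx / 8, bitidx % 8);
	return 0;
}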

> 
> Also, breaking stuff in patch 1 and fixing it in patch 3 is not
> acceptable. This breaks git bisect. Skimming over the patches I think
> this is the case.

Got it, thanks!
> 
> I am not convinced yet that we need and want this. As Alex mentioned, we
> touch pageblock flags while holding the zone lock already in most cases -
> and as Mel mentions, updates should be rare.
> 

Yes, not too many, but there are still a few. And every cmpxchg retry locks
memory, which I believe the less of, the better.

Thanks
Alex
Mel Gorman Sept. 3, 2020, 9:31 a.m. UTC | #7
On Thu, Sep 03, 2020 at 04:32:54PM +0800, Alex Shi wrote:
> 
> 
> ??? 2020/9/3 ??????3:24, Mel Gorman ??????:
> > On Thu, Sep 03, 2020 at 03:01:20PM +0800, Alex Shi wrote:
> >> pageblock_flags is accessed as a long. Since each pageblock's flags are
> >> only 4 bits, a long packs the flags of 8 (on a 32-bit machine) or 16
> >> pageblocks, so setting one pageblock's flags has to cmpxchg against 7 or
> >> 15 other pageblocks' flags as well. That can mean long waits for the
> >> compare-exchange to succeed.
> >>
> >> If we change the pageblock_flags storage to char, we can use a char-sized
> >> cmpxchg, which covers only 2 pageblocks' flags. This relieves the false
> >> sharing in the cmpxchg.
> >>
> >> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> > 
> > Pageblock types were not known to change at a frequency high enough to
> > cause a measurable performance drop. If anything, the performance cost of
> > pageblocks is in the lookup paths, which are far more frequent.
> 
> Yes, it is not a hot path. But it is still meaningful to reduce
> cmpxchg-level false sharing, which is logically wrong in any case.
> 

Except there is no guarantee that false sharing was reduced. cmpxchg is
still used, just with the byte as the comparison for the old value, and in
some cases the width of the exchange will still be 32-bit. It would be up
to each architecture to see whether that can be translated to a better
instruction, but it may not even matter. As the instruction will be
prefixed with the lock instruction, the bus will be locked, and bus locking
is probably on the cache line boundary, so there is a collision anyway
while the atomic update takes place.
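
(For illustration: one way a byte-wide cmpxchg still becomes a 32-bit
exchange is emulation on the containing aligned word. A hypothetical
user-space helper sketch, not actual arch code; little-endian layout
assumed:)

#include <stdint.h>

/* Returns the byte's previous value, like cmpxchg(): the caller compares
 * it against `old` to see whether the exchange happened. */
static uint8_t cmpxchg_u8_emulated(uint8_t *addr, uint8_t old, uint8_t new_val)
{
	uint32_t *word = (uint32_t *)((uintptr_t)addr & ~(uintptr_t)3);
	unsigned int shift = ((uintptr_t)addr & 3) * 8;
	uint32_t mask = 0xffU << shift;
	uint32_t w = __atomic_load_n(word, __ATOMIC_RELAXED);

	for (;;) {
		uint8_t cur = (w & mask) >> shift;

		if (cur != old)
			return cur;	/* comparison failed, no store */
		if (__atomic_compare_exchange_n(word, &w,
				(w & ~mask) | ((uint32_t)new_val << shift),
				0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
			return old;	/* success */
		/* A store to ANY byte of the word reloads `w` and restarts
		 * the loop - the 32-bit-wide collision described above. */
	}
}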

End result -- reducing false sharing in this case is not guaranteed to help
performance and may not be detectable for such a low-frequency operation,
but the code will behave differently depending on the architecture and
CPU family.

Your justification path measured the number of times a cmpxchg was retried
but it did not track how many pageblock updates there were or how many
potentially collided. As the workload is uncontrolled with respect to
pageblock updates, you do not know if the difference in retries is due to
a real reduction in collisions or a difference in the number of pageblock
updates that potentially collide. Either way, because the frequency of
the operation was so low relative to your target load, any difference
in performance would be indistinguishable from noise.

I don't think it's worth the churn in this case for an impact that will
be very difficult to detect and variable across architectures and CPU
families.

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8379432f4f2f..be676e659fb7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -437,7 +437,7 @@  struct zone {
 	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
 	 * In SPARSEMEM, this map is stored in struct mem_section
 	 */
-	unsigned long		*pageblock_flags;
+	unsigned char		*pageblock_flags;
 #endif /* CONFIG_SPARSEMEM */
 
 	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
@@ -1159,7 +1159,7 @@  struct mem_section_usage {
 	DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
 #endif
 	/* See declaration of similar field in struct zone */
-	unsigned long pageblock_flags[0];
+	unsigned char	pageblock_flags[0];
 };
 
 void subsection_map_init(unsigned long pfn, unsigned long nr_pages);
@@ -1212,7 +1212,7 @@  struct mem_section {
 extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
 #endif
 
-static inline unsigned long *section_to_usemap(struct mem_section *ms)
+static inline unsigned char *section_to_usemap(struct mem_section *ms)
 {
 	return ms->usage->pageblock_flags;
 }
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index fff52ad370c1..d189441568eb 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -54,7 +54,7 @@  enum pageblock_bits {
 /* Forward declaration */
 struct page;
 
-unsigned long get_pfnblock_flags_mask(struct page *page,
+unsigned char get_pfnblock_flags_mask(struct page *page,
 				unsigned long pfn,
 				unsigned long mask);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fab5e97dc9ca..81e96d4d9c42 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -445,7 +445,7 @@  static inline bool defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
 #endif
 
 /* Return a pointer to the bitmap storing bits affecting a block of pages */
-static inline unsigned long *get_pageblock_bitmap(struct page *page,
+static inline unsigned char *get_pageblock_bitmap(struct page *page,
 							unsigned long pfn)
 {
 #ifdef CONFIG_SPARSEMEM
@@ -474,24 +474,24 @@  static inline int pfn_to_bitidx(struct page *page, unsigned long pfn)
  * Return: pageblock_bits flags
  */
 static __always_inline
-unsigned long __get_pfnblock_flags_mask(struct page *page,
+unsigned char __get_pfnblock_flags_mask(struct page *page,
 					unsigned long pfn,
 					unsigned long mask)
 {
-	unsigned long *bitmap;
-	unsigned long bitidx, word_bitidx;
-	unsigned long word;
+	unsigned char *bitmap;
+	unsigned long bitidx, byte_bitidx;
+	unsigned char byte;
 
 	bitmap = get_pageblock_bitmap(page, pfn);
 	bitidx = pfn_to_bitidx(page, pfn);
-	word_bitidx = bitidx / BITS_PER_LONG;
-	bitidx &= (BITS_PER_LONG-1);
+	byte_bitidx = bitidx / BITS_PER_BYTE;
+	bitidx &= (BITS_PER_BYTE-1);
 
-	word = bitmap[word_bitidx];
-	return (word >> bitidx) & mask;
+	byte = bitmap[byte_bitidx];
+	return (byte >> bitidx) & mask;
 }
 
-unsigned long get_pfnblock_flags_mask(struct page *page, unsigned long pfn,
+unsigned char get_pfnblock_flags_mask(struct page *page, unsigned long pfn,
 					unsigned long mask)
 {
 	return __get_pfnblock_flags_mask(page, pfn, mask);
@@ -513,29 +513,29 @@  void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
 					unsigned long pfn,
 					unsigned long mask)
 {
-	unsigned long *bitmap;
-	unsigned long bitidx, word_bitidx;
-	unsigned long old_word, word;
+	unsigned char *bitmap;
+	unsigned long bitidx, byte_bitidx;
+	unsigned char old_byte, byte;
 
 	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
 	BUILD_BUG_ON(MIGRATE_TYPES > (1 << PB_migratetype_bits));
 
 	bitmap = get_pageblock_bitmap(page, pfn);
 	bitidx = pfn_to_bitidx(page, pfn);
-	word_bitidx = bitidx / BITS_PER_LONG;
-	bitidx &= (BITS_PER_LONG-1);
+	byte_bitidx = bitidx / BITS_PER_BYTE;
+	bitidx &= (BITS_PER_BYTE-1);
 
 	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
 
 	mask <<= bitidx;
 	flags <<= bitidx;
 
-	word = READ_ONCE(bitmap[word_bitidx]);
+	byte = READ_ONCE(bitmap[byte_bitidx]);
 	for (;;) {
-		old_word = cmpxchg(&bitmap[word_bitidx], word, (word & ~mask) | flags);
-		if (word == old_word)
+		old_byte = cmpxchg(&bitmap[byte_bitidx], byte, (byte & ~mask) | flags);
+		if (byte == old_byte)
 			break;
-		word = old_word;
+		byte = old_byte;
 	}
 }