
[v2,2/2] mm/pageblock: remove false sharing in pageblock_flags

Message ID 1597816075-61091-2-git-send-email-alex.shi@linux.alibaba.com (mailing list archive)
State New, archived
Series [v2,1/2] mm/pageblock: mitigation cmpxchg false sharing in pageblock flags

Commit Message

Alex Shi Aug. 19, 2020, 5:47 a.m. UTC
The current pageblock_flags is only 4 bits per pageblock, so pageblocks
have to share a byte when the flags are get/set with cmpxchg, and the
false sharing causes a performance drop.

If we increase the field to 8 bits, the false sharing in cmpxchg is gone,
and the only cost is half a byte per pageblock, which is half a byte per
128MB on x86, or 4 bytes per 1GB.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/pageblock-flags.h | 2 +-
 mm/page_alloc.c                 | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Comments

Anshuman Khandual Aug. 19, 2020, 7:57 a.m. UTC | #1
On 08/19/2020 11:17 AM, Alex Shi wrote:
> The current pageblock_flags is only 4 bits per pageblock, so pageblocks
> have to share a byte when the flags are get/set with cmpxchg, and the
> false sharing causes a performance drop.
> 
> If we increase the field to 8 bits, the false sharing in cmpxchg is gone,
> and the only cost is half a byte per pageblock, which is half a byte per
> 128MB on x86, or 4 bytes per 1GB.

Agreed that the increase in memory utilization is negligible here, but does
this really improve performance?
Alex Shi Aug. 19, 2020, 8:09 a.m. UTC | #2
On 2020/8/19 3:57 PM, Anshuman Khandual wrote:
> 
> 
> On 08/19/2020 11:17 AM, Alex Shi wrote:
>> The current pageblock_flags is only 4 bits per pageblock, so pageblocks
>> have to share a byte when the flags are get/set with cmpxchg, and the
>> false sharing causes a performance drop.
>>
>> If we increase the field to 8 bits, the false sharing in cmpxchg is gone,
>> and the only cost is half a byte per pageblock, which is half a byte per
>> 128MB on x86, or 4 bytes per 1GB.
> 
> Agreed that the increase in memory utilization is negligible here, but does
> this really improve performance?
> 

There's no doubt in theory. And it did have a bad impact, according to
commit e380bebe4771548  mm, compaction: keep migration source private to a single

but I do have some trouble running thpscale/mmtests. I'd like to see if anyone
could give it a try.

BTW, I naturally hate false sharing, even when it's only theoretical. Who doesn't? :)

Thanks
Alex
David Hildenbrand Aug. 19, 2020, 8:13 a.m. UTC | #3
On 19.08.20 10:09, Alex Shi wrote:
> 
> 
> On 2020/8/19 3:57 PM, Anshuman Khandual wrote:
>>
>>
>> On 08/19/2020 11:17 AM, Alex Shi wrote:
>>> The current pageblock_flags is only 4 bits per pageblock, so pageblocks
>>> have to share a byte when the flags are get/set with cmpxchg, and the
>>> false sharing causes a performance drop.
>>>
>>> If we increase the field to 8 bits, the false sharing in cmpxchg is gone,
>>> and the only cost is half a byte per pageblock, which is half a byte per
>>> 128MB on x86, or 4 bytes per 1GB.
>>
>> Agreed that the increase in memory utilization is negligible here, but does
>> this really improve performance?
>>
> 
> There's no doubt in theory. And it did have a bad impact, according to
> commit e380bebe4771548  mm, compaction: keep migration source private to a single
> 
> but I do have some trouble running thpscale/mmtests. I'd like to see if anyone
> could give it a try.
> 
> BTW, I naturally hate false sharing, even when it's only theoretical. Who doesn't? :)

I hate wasting memory, even if it's just a little bit, anyone who
doesn't? ;)
Alex Shi Aug. 19, 2020, 8:36 a.m. UTC | #4
On 2020/8/19 4:13 PM, David Hildenbrand wrote:
> On 19.08.20 10:09, Alex Shi wrote:
>>
>>
>> On 2020/8/19 3:57 PM, Anshuman Khandual wrote:
>>>
>>>
>>> On 08/19/2020 11:17 AM, Alex Shi wrote:
>>>> The current pageblock_flags is only 4 bits per pageblock, so pageblocks
>>>> have to share a byte when the flags are get/set with cmpxchg, and the
>>>> false sharing causes a performance drop.
>>>>
>>>> If we increase the field to 8 bits, the false sharing in cmpxchg is gone,
>>>> and the only cost is half a byte per pageblock, which is half a byte per
>>>> 128MB on x86, or 4 bytes per 1GB.
>>>
>>> Agreed that the increase in memory utilization is negligible here, but does
>>> this really improve performance?
>>>
>>
>> There's no doubt in theory. And it did have a bad impact, according to
>> commit e380bebe4771548  mm, compaction: keep migration source private to a single

The above commit is good evidence: the false sharing caused a performance
issue, and a lock had to be taken to protect the cmpxchg, an instruction
designed for lockless use. The false sharing was trouble there, and it will
be trouble again sometime.

Thanks
Alex

>>
>> but I do have some trouble running thpscale/mmtests. I'd like to see if anyone
>> could give it a try.
>>
>> BTW, I naturally hate false sharing, even when it's only theoretical. Who doesn't? :)
> 
> I hate wasting memory, even if it's just a little bit, anyone who
> doesn't? ;)
> 
>
Alexander Duyck Aug. 19, 2020, 4:50 p.m. UTC | #5
On Wed, Aug 19, 2020 at 1:11 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> On 2020/8/19 3:57 PM, Anshuman Khandual wrote:
> >
> >
> > On 08/19/2020 11:17 AM, Alex Shi wrote:
> >> The current pageblock_flags is only 4 bits per pageblock, so pageblocks
> >> have to share a byte when the flags are get/set with cmpxchg, and the
> >> false sharing causes a performance drop.
> >>
> >> If we increase the field to 8 bits, the false sharing in cmpxchg is gone,
> >> and the only cost is half a byte per pageblock, which is half a byte per
> >> 128MB on x86, or 4 bytes per 1GB.
> >
> > Agreed that the increase in memory utilization is negligible here, but does
> > this really improve performance?
> >
>
> There's no doubt in theory. And it did have a bad impact, according to
> commit e380bebe4771548  mm, compaction: keep migration source private to a single
>
> but I do have some trouble running thpscale/mmtests. I'd like to see if anyone
> could give it a try.
>
> BTW, I naturally hate false sharing, even when it's only theoretical. Who doesn't? :)

You keep bringing up false sharing but you don't fix the false sharing
by doing this. You are still allowing the flags for multiple
pageblocks per cacheline so you still have false sharing even after
this.

What I believe you are attempting to address is the fact that multiple
pageblocks share a single long value and that long is being used with
a cmpxchg so you end up with multiple threads potentially all banging
on the same value and watching it change. However the field currently
consists of only 4 bits, 3 of them for migratetype and 1 for the skip
bit. In the case of the 3 bit portion a cmpxchg makes sense and is
usually protected by the zone lock so you would only have one thread
accessing it in most cases with the possible exception of a section
that spans multiple zones.

For the case such as the skip bit and MIGRATE_UNMOVABLE (0x0) where we
would be clearing or setting the entire mask maybe it would make more
sense to simply use an atomic_or or atomic_and depending on if you are
setting or clearing the flag? It would allow you to avoid the spinning
or having to read the word before performing the operation since you
would just be directly applying an AND or OR via a mask value.
Alex Shi Aug. 30, 2020, 10 a.m. UTC | #6
On 2020/8/20 12:50 AM, Alexander Duyck wrote:
> On Wed, Aug 19, 2020 at 1:11 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2020/8/19 3:57 PM, Anshuman Khandual wrote:
>>>
>>>
>>> On 08/19/2020 11:17 AM, Alex Shi wrote:
> >>>> The current pageblock_flags is only 4 bits per pageblock, so pageblocks
> >>>> have to share a byte when the flags are get/set with cmpxchg, and the
> >>>> false sharing causes a performance drop.
> >>>>
> >>>> If we increase the field to 8 bits, the false sharing in cmpxchg is gone,
> >>>> and the only cost is half a byte per pageblock, which is half a byte per
> >>>> 128MB on x86, or 4 bytes per 1GB.
> >>>
> >>> Agreed that the increase in memory utilization is negligible here, but does
> >>> this really improve performance?
>>>
>>
>> There's no doubt in theory. And it did have a bad impact, according to
>> commit e380bebe4771548  mm, compaction: keep migration source private to a single
>>
>> but I do have some trouble running thpscale/mmtests. I'd like to see if anyone
>> could give it a try.
>>
>> BTW, I naturally hate false sharing, even when it's only theoretical. Who doesn't? :)
> 
> You keep bringing up false sharing but you don't fix the false sharing
> by doing this. You are still allowing the flags for multiple
> pageblocks per cacheline so you still have false sharing even after
> this.

Yes, the cacheline-level false sharing is still there. But as you pointed out,
the cmpxchg-level false sharing is largely addressed by this patchset.


> 
> What I believe you are attempting to address is the fact that multiple
> pageblocks share a single long value and that long is being used with
> a cmpxchg so you end up with multiple threads potentially all banging
> on the same value and watching it change. However the field currently
> consists of only 4 bits, 3 of them for migratetype and 1 for the skip
> bit. In the case of the 3 bit portion a cmpxchg makes sense and is
> usually protected by the zone lock so you would only have one thread
> accessing it in most cases with the possible exception of a section
> that spans multiple zones.
> 
> For the case such as the skip bit and MIGRATE_UNMOVABLE (0x0) where we
> would be clearing or setting the entire mask maybe it would make more
> sense to simply use an atomic_or or atomic_and depending on if you are
> setting or clearing the flag? It would allow you to avoid the spinning
> or having to read the word before performing the operation since you
> would just be directly applying an AND or OR via a mask value.

Right, that's fixing the problem at a different level, but narrowing the
cmpxchg comparison is still needed and helpful.

Thanks
Alex
>
Alexander Duyck Aug. 30, 2020, 8:32 p.m. UTC | #7
On Sun, Aug 30, 2020 at 3:00 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> On 2020/8/20 12:50 AM, Alexander Duyck wrote:
> > On Wed, Aug 19, 2020 at 1:11 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 2020/8/19 3:57 PM, Anshuman Khandual wrote:
> >>>
> >>>
> >>> On 08/19/2020 11:17 AM, Alex Shi wrote:
> > >>>> The current pageblock_flags is only 4 bits per pageblock, so pageblocks
> > >>>> have to share a byte when the flags are get/set with cmpxchg, and the
> > >>>> false sharing causes a performance drop.
> > >>>>
> > >>>> If we increase the field to 8 bits, the false sharing in cmpxchg is gone,
> > >>>> and the only cost is half a byte per pageblock, which is half a byte per
> > >>>> 128MB on x86, or 4 bytes per 1GB.
> > >>>
> > >>> Agreed that the increase in memory utilization is negligible here, but does
> > >>> this really improve performance?
> >>>
> >>
> >> There's no doubt in theory. And it did have a bad impact, according to
> >> commit e380bebe4771548  mm, compaction: keep migration source private to a single
> >>
> >> but I do have some trouble running thpscale/mmtests. I'd like to see if anyone
> >> could give it a try.
> >>
> >> BTW, I naturally hate false sharing, even when it's only theoretical. Who doesn't? :)
> >
> > You keep bringing up false sharing but you don't fix the false sharing
> > by doing this. You are still allowing the flags for multiple
> > pageblocks per cacheline so you still have false sharing even after
> > this.
>
> Yes, the cacheline-level false sharing is still there. But as you pointed out,
> the cmpxchg-level false sharing is largely addressed by this patchset.
>
>
> >
> > What I believe you are attempting to address is the fact that multiple
> > pageblocks share a single long value and that long is being used with
> > a cmpxchg so you end up with multiple threads potentially all banging
> > on the same value and watching it change. However the field currently
> > consists of only 4 bits, 3 of them for migratetype and 1 for the skip
> > bit. In the case of the 3 bit portion a cmpxchg makes sense and is
> > usually protected by the zone lock so you would only have one thread
> > accessing it in most cases with the possible exception of a section
> > that spans multiple zones.
> >
> > For the case such as the skip bit and MIGRATE_UNMOVABLE (0x0) where we
> > would be clearing or setting the entire mask maybe it would make more
> > sense to simply use an atomic_or or atomic_and depending on if you are
> > setting or clearing the flag? It would allow you to avoid the spinning
> > or having to read the word before performing the operation since you
> > would just be directly applying an AND or OR via a mask value.
>
> Right, that's fixing the problem at a different level, but narrowing the
> cmpxchg comparison is still needed and helpful.

What I was getting at though is that I am not sure that is the case.
Normally I believe we are always holding the zone lock when updating
the migrate type. The skip flag is a one-off operation that could
easily be addressed by changing the logic to use atomic_and or
atomic_or for the cases where we are updating single bit flags and
setting the mask value to all 1's or all 0's. So adding this extra
complexity which only really applies to the skip bit may not provide
much value, especially as there are a number of possible paths that
don't use the skip bit anyway.
Vlastimil Babka Sept. 1, 2020, 5:04 p.m. UTC | #8
On 8/19/20 10:09 AM, Alex Shi wrote:
> 
> 
> On 2020/8/19 3:57 PM, Anshuman Khandual wrote:
>> 
>> 
>> On 08/19/2020 11:17 AM, Alex Shi wrote:
>>> The current pageblock_flags is only 4 bits per pageblock, so pageblocks
>>> have to share a byte when the flags are get/set with cmpxchg, and the
>>> false sharing causes a performance drop.
>>>
>>> If we increase the field to 8 bits, the false sharing in cmpxchg is gone,
>>> and the only cost is half a byte per pageblock, which is half a byte per
>>> 128MB on x86, or 4 bytes per 1GB.
>> 
>> Agreed that the increase in memory utilization is negligible here, but does
>> this really improve performance?
>> 
> 
> There's no doubt in theory. And it did have a bad impact, according to
> commit e380bebe4771548  mm, compaction: keep migration source private to a single

I don't think that commit does the test_and_set_skip() under lock to avoid
false sharing. I think it's done simply to make the test-and-set protected
against races without relying on e.g. a truly atomic test_and_set_bit(). It's
still noted that it's just a hint, so it's not protected against others calling
set_pageblock_skip() from other contexts not under the lock.

> but I do have some trouble running thpscale/mmtests. I'd like to see if anyone
> could give it a try.
> 
> BTW, I naturally hate false sharing, even when it's only theoretical. Who doesn't? :)
> 
> Thanks
> Alex 
>
Alex Shi Sept. 3, 2020, 5:35 a.m. UTC | #9
On 2020/8/31 4:32 AM, Alexander Duyck wrote:
>> Right, that's fixing the problem at a different level, but narrowing the
>> cmpxchg comparison is still needed and helpful.
> What I was getting at though is that I am not sure that is the case.
> Normally I believe we are always holding the zone lock when updating
> the migrate type. The skip flag is a one-off operation that could
> easily be addressed by changing the logic to use atomic_and or
> atomic_or for the cases where we are updating single bit flags and
> setting the mask value to all 1's or all 0's. So adding this extra
> complexity which only really applies to the skip bit may not provide
> much value, especially as there are a number of possible paths that
> don't use the skip bit anyway.


Using an atomic bit set only helps the skip bit; the migratetype bits would
still suffer a lot of false-sharing synchronization.

And actually, this cmpxchg issue is a cross-architecture problem that is
better addressed in the kernel itself. I see cmpxchg used widely in the
kernel, but no one has mentioned this shortcoming. On that point, the
problem should be resolved in the kernel.

Thanks
Alex

Patch

diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index d189441568eb..f785c9d6d68c 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -25,7 +25,7 @@  enum pageblock_bits {
 	 * Assume the bits will always align on a word. If this assumption
 	 * changes then get/set pageblock needs updating.
 	 */
-	NR_PAGEBLOCK_BITS
+	NR_PAGEBLOCK_BITS = BITS_PER_BYTE
 };
 
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f60071e8a4e1..65f692c762d5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -517,7 +517,7 @@  void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
 	unsigned long bitidx, byte_bitidx;
 	unsigned char old_byte, byte;
 
-	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
+	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != BITS_PER_BYTE);
 	BUILD_BUG_ON(MIGRATE_TYPES > (1 << PB_migratetype_bits));
 
 	bitmap = get_pageblock_bitmap(page, pfn);