
[RFC,0/2] sbitmap: NUMA node spreading

Message ID: 1652181274-136198-1-git-send-email-john.garry@huawei.com

Message

John Garry May 10, 2022, 11:14 a.m. UTC
Hi Jens, guys,

I am sending this as an RFC to see whether there is any future in it, or to
gather ideas on how to make it better. I also need to improve some items (as
mentioned in the 2/2 commit message) and test a lot more.

The general idea is that we change from allocating a single array of
sbitmap words to allocating a sub-array per NUMA node. Each CPU in a
given node is then hinted to use that node's sub-array.
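
Roughly, the allocation ends up looking something like the below (a sketch
only, with a made-up helper name; it assumes the double-pointer sb->map from
patch 1/2 - the real changes are in lib/sbitmap.c):

static int sbitmap_alloc_per_node(struct sbitmap *sb, unsigned int map_nr,
				  gfp_t flags)
{
	unsigned int per_node = DIV_ROUND_UP(map_nr, nr_node_ids);
	int node;

	/* top-level array: one sub-array pointer per NUMA node */
	sb->map = kcalloc(nr_node_ids, sizeof(*sb->map), flags);
	if (!sb->map)
		return -ENOMEM;

	for_each_node(node) {
		/* place each sub-array on the node whose CPUs will use it */
		sb->map[node] = kcalloc_node(per_node, sizeof(**sb->map),
					     flags, node);
		if (!sb->map[node])
			return -ENOMEM;	/* caller unwinds the partial alloc */
	}

	return 0;
}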

Initial performance looks decent.

Some figures:
System: 4 nodes (with memory on all nodes), 128 CPUs

null_blk config:
20 devs, submit_queues=NR_CPUS, shared_tags, shared_tag_bitmap,
hw_queue_depth=256

fio config:
bs=4096, iodepth=128, numjobs=10, cpus_allowed_policy=split, rw=read,
ioscheduler=none

Before:
7130K

After:
7630K

So a +7% IOPS gain.

Any comments welcome, thanks!

Based on v5.18-rc6.

John Garry (2):
  sbitmap: Make sbitmap.map a double pointer
  sbitmap: Spread sbitmap word allocation over NUMA nodes

 include/linux/sbitmap.h | 16 +++++---
 lib/sbitmap.c           | 83 +++++++++++++++++++++++++++++++++--------
 2 files changed, 79 insertions(+), 20 deletions(-)

Comments

Jens Axboe May 10, 2022, 12:50 p.m. UTC | #1
On 5/10/22 5:14 AM, John Garry wrote:
> Hi Jens, guys,
> 
> I am sending this as an RFC to see if there is any future in it or ideas
> on how to make better. I also need to improve some items (as mentioned in
> 2/2 commit message) and test a lot more.
> 
> The general idea is that we change from allocating a single array of
> sbitmap words to allocating an sub-array per NUMA node. And then each CPU
> in that node is hinted to use that sub-array
> 
> Initial performance looks decent.
> 
> Some figures:
> System: 4-nodes (with memory on all nodes), 128 CPUs
> 
> null blk config block:
> 20 devs, submit_queues=NR_CPUS, shared_tags, shared_tag_bitmap,
> hw_queue_depth=256
> 
> fio config:
> bs=4096, iodepth=128, numjobs=10, cpus_allowed_policy=split, rw=read,
> ioscheduler=none
> 
> Before:
> 7130K
> 
> After:
> 7630K
> 
> So a +7% IOPS gain.

What does the comparison run on a non-NUMA non-shared queue look like?
Because I bet it'd be slower.

To be honest, I don't like this approach at all. It makes the normal
case quite a bit slower by having an extra layer of indirection for the
word, which is quite a bit of extra cost. It doesn't seem like a good
approach for the issue, as it pessimizes the normal fast case.

Spreading the memory out does probably make sense, but we need to retain
the fast normal case. Making sbitmap support both, selected at init
time, would be far more likely to be acceptable imho.
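
Something along these lines, purely as a sketch (the struct is abbreviated
and the flag, field, and helper names below are made up, not existing
sbitmap API):

struct sbitmap {
	unsigned int depth;
	unsigned int shift;
	unsigned int map_nr;
	bool round_robin;
	/* hypothetical init-time switch, off by default */
	bool numa_spread;
	/* flat array stays as-is, so the normal fast path keeps one load */
	struct sbitmap_word *map;
	/* per-node sub-arrays, only allocated when numa_spread is set */
	struct sbitmap_word **numa_map;
};

static inline struct sbitmap_word *sbitmap_word_ptr(struct sbitmap *sb,
						    unsigned int index)
{
	if (!sb->numa_spread)
		return &sb->map[index];
	/* index here would be relative to the node's sub-array */
	return &sb->numa_map[numa_node_id()][index];
}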
John Garry May 10, 2022, 1:44 p.m. UTC | #2
On 10/05/2022 13:50, Jens Axboe wrote:
>> fio config:
>> bs=4096, iodepth=128, numjobs=10, cpus_allowed_policy=split, rw=read,
>> ioscheduler=none
>>
>> Before:
>> 7130K
>>
>> After:
>> 7630K
>>
>> So a +7% IOPS gain.

Thanks for having a look.

> What does the comparison run on a non-NUMA non-shared queue look like?
> Because I bet it'd be slower.

I could test more to get a solid result for that.

> 
> To be honest, I don't like this approach at all. It makes the normal
> case quite a bit slower by having an extra layer of indirection for the
> word, that's quite a bit of extra cost.

Yes, there is the extra load. I would hope that the cost would be low,
but I agree that we still want to avoid it. So probably no point in
testing this more.

> It doesn't seem like a good
> approach for the issue, as it pessimizes the normal fast case.
> 
> Spreading the memory out does probably make sense, but we need to retain
> the fast normal case. Making sbitmap support both, selected at init
> time, would be far more likely to be acceptable imho.

I wanted to keep the code changes minimal for an initial RFC to test the 
water.

My original approach did not introduce the extra load for the normal path
and had some init-time selection for a normal word map vs a NUMA word map,
but the code grew and became somewhat unmanageable. I'll revisit it to 
see how to improve that.

Cheers,
John
Jens Axboe May 10, 2022, 2:34 p.m. UTC | #3
On 5/10/22 7:44 AM, John Garry wrote:
> On 10/05/2022 13:50, Jens Axboe wrote:
>>> fio config:
>>> bs=4096, iodepth=128, numjobs=10, cpus_allowed_policy=split, rw=read,
>>> ioscheduler=none
>>>
>>> Before:
>>> 7130K
>>>
>>> After:
>>> 7630K
>>>
>>> So a +7% IOPS gain.
> 
> Thanks for having a look.
> 
>> What does the comparison run on a non-NUMA non-shared queue look like?
>> Because I bet it'd be slower.
> 
> I could test more to get a solid result for that.
> 
>>
>> To be honest, I don't like this approach at all. It makes the normal
>> case quite a bit slower by having an extra layer of indirection for the
>> word, that's quite a bit of extra cost.
> 
> Yes, there is the extra load. I would hope that there would be a low
> cost, but I agree that we still want to avoid it. So prob no point in
> testing this more.

I don't think that's low cost at all. It's the very hot path, and you're
now not only doing an extra load, it's a dependent load - you need to
load both to make any progress. On top of that, it's not like it's two
loads from the same cacheline or even page. The most important thing for
performance these days is having good cache utilization, and the patch as
it stands makes that a lot worse.
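
To spell out the before/after access pattern in rough terms:

	/* current flat layout: load the word straight out of sb->map */
	word = sb->map[index].word;

	/*
	 * patched layout: sb->map[node] has to be loaded before the word
	 * load can even start, and the sub-array is a separate allocation,
	 * so a different cacheline and quite possibly a different page.
	 */
	word = sb->map[node][index].word;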

Besides, for any kind of performance work like that, it's customary not
only to showcase the situation that is supposedly fixed or improved by
the change, but also to test that it didn't regress the existing
common/fast case.

>> It doesn't seem like a good
>> approach for the issue, as it pessimizes the normal fast case.
>>
>> Spreading the memory out does probably make sense, but we need to retain
>> the fast normal case. Making sbitmap support both, selected at init
>> time, would be far more likely to be acceptable imho.
> 
> I wanted to keep the code changes minimal for an initial RFC to test
> the water.
>
> My original approach did not introduce the extra load for normal path
> and had some init time selection for a normal word map vs numa word
> map, but the code grew and became somewhat unmanageable. I'll revisit
> it to see how to improve that.

Probably just needs some clean refactoring first, so that the actual
change can be pretty small.
John Garry May 10, 2022, 3:03 p.m. UTC | #4
On 10/05/2022 15:34, Jens Axboe wrote:
>> Yes, there is the extra load. I would hope that there would be a low
>> cost, but I agree that we still want to avoid it. So prob no point in
>> testing this more.
> I don't think that's low cost at all. It's the very hot path, and you're
> now not only doing an extra load, it's a dependent load - you need to
> load both to make any progress. On top of that, it's not like it's two
> loads from the same cacheline or even page. The most important thing for
> performance these days is having good cache utilization, the patch as it
> stands very much makes that a lot worse.

Understood. Essentially patch #1/2 points in the wrong direction.

I have to admit that I was a bit blinkered by seeing how much I could 
improve the NUMA case.

> 
> Besides, for any kind of performance work like that, it's customary to
> showcase both the situation that is supposedly fixed or improved with
> the change, but also to test that it didn't regress the existing
> common/fast case.

Right, I should have done that.

> 
>>> It doesn't seem like a good
>>> approach for the issue, as it pessimizes the normal fast case.
>>>
>>> Spreading the memory out does probably make sense, but we need to retain
>>> the fast normal case. Making sbitmap support both, selected at init
>>> time, would be far more likely to be acceptable imho.
>> I wanted to keep the code changes minimal for an initial RFC to test
>> the water.
>>
>> My original approach did not introduce the extra load for normal path
>> and had some init time selection for a normal word map vs numa word
>> map, but the code grew and became somewhat unmanageable. I'll revisit
>> it to see how to improve that.
> Probably just needs some clean refactoring first, so that the actual
> change can be pretty small.

I think that it may just be a case of separating out the handling of
evaluating the sbitmap_word pointer, as that is the common struct which we
deal with.
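
i.e. route every word access through one small helper, something like the
below (name made up), so that a later per-node layout only has to change
this one function:

static inline struct sbitmap_word *__sbitmap_word(struct sbitmap *sb,
						  unsigned int index)
{
	/* flat layout today; a NUMA-spread layout would change only this */
	return &sb->map[index];
}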

Thanks,
John
Ming Lei May 11, 2022, 2:07 a.m. UTC | #5
On Tue, May 10, 2022 at 02:44:50PM +0100, John Garry wrote:
> On 10/05/2022 13:50, Jens Axboe wrote:
> > > fio config:
> > > bs=4096, iodepth=128, numjobs=10, cpus_allowed_policy=split, rw=read,
> > > ioscheduler=none
> > > 
> > > Before:
> > > 7130K
> > > 
> > > After:
> > > 7630K
> > > 
> > > So a +7% IOPS gain.
> 
> Thanks for having a look.
> 
> > What does the comparison run on a non-NUMA non-shared queue look like?
> > Because I bet it'd be slower.
> 
> I could test more to get a solid result for that.
> 
> > 
> > To be honest, I don't like this approach at all. It makes the normal
> > case quite a bit slower by having an extra layer of indirection for the
> > word, that's quite a bit of extra cost.
> 
> Yes, there is the extra load. I would hope that there would be a low cost,
> but I agree that we still want to avoid it. So prob no point in testing this
> more.
> 
> > It doesn't seem like a good
> > approach for the issue, as it pessimizes the normal fast case.
> > 
> > Spreading the memory out does probably make sense, but we need to retain
> > the fast normal case. Making sbitmap support both, selected at init
> > time, would be far more likely to be acceptable imho.
> 
> I wanted to keep the code changes minimal for an initial RFC to test the
> water.
> 
> My original approach did not introduce the extra load for normal path and
> had some init time selection for a normal word map vs numa word map, but the
> code grew and became somewhat unmanageable. I'll revisit it to see how to
> improve that.

I understand this approach just splits the shared sbitmap into per-NUMA-node
parts, but what if all IOs come from CPUs in the same NUMA node? Wouldn't
that cause tag starvation and waste?


Thanks,
Ming
John Garry May 11, 2022, 9:57 a.m. UTC | #6
On 11/05/2022 03:07, Ming Lei wrote:

Hi Ming,

>>> Spreading the memory out does probably make sense, but we need to retain
>>> the fast normal case. Making sbitmap support both, selected at init
>>> time, would be far more likely to be acceptable imho.
>> I wanted to keep the code changes minimal for an initial RFC to test the
>> water.
>>
>> My original approach did not introduce the extra load for normal path and
>> had some init time selection for a normal word map vs numa word map, but the
>> code grew and became somewhat unmanageable. I'll revisit it to see how to
>> improve that.
> I understand this approach just splits shared sbitmap into per-numa-node
> part, but what if all IOs are just from CPUs in one same numa node? Doesn't
> this way cause tag starvation and waste?
> 

We would not do this. If we can't find a free bit in one node's sub-array,
then we need to check the other nodes before giving up. This is some of the
added complexity which I hinted at. And things like batch get or round-robin
support become more complex.
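
Roughly something like this sketch (with a made-up per-node helper):

static int sbitmap_get_numa(struct sbitmap *sb)
{
	unsigned int i, start = numa_node_id();
	int nr;

	/* try the local node's sub-array first, then fall back to the rest */
	for (i = 0; i < nr_node_ids; i++) {
		unsigned int node = (start + i) % nr_node_ids;

		nr = __sbitmap_get_from_node(sb, node); /* hypothetical */
		if (nr >= 0)
			return nr;
	}

	return -1;
}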

Alternatively, we could use the double pointer for NUMA spreading only,
which would keep things simpler. I need to check which is better overall.
Adding the complexity of dealing with NUMA node sub-arrays may affect
performance as well.

Thanks,
John