[0/5] device mapper percpu patches

Message ID 20181106213458.308968760@debian-a64.vm (mailing list archive)

Message

Mikulas Patocka Nov. 6, 2018, 9:34 p.m. UTC
Hi

These are the device mapper percpu patches.

Note that I didn't test request-based device mapper because I don't have
hardware for it (the patches don't convert request-based targets to percpu
values, but there are a few inevitable changes in dm-rq.c).

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

Mike Snitzer Nov. 7, 2018, 10:29 p.m. UTC | #1
On Tue, Nov 06 2018 at  4:34pm -0500,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> Hi
> 
> These are the device mapper percpu patches.
> 
> Note that I didn't test request-based device mapper because I don't have
> hardware for it (the patches don't convert request-based targets to percpu
> values, but there are a few inevitable changes in dm-rq.c).

Patches 1 - 3 make sense.  But the use of percpu inflight counters isn't
something I can get upstream.  Any more scalable counter still needs to
be wired up to the block stats interfaces (the one you did in patch 5 is
only for the "inflight" sysfs file; there is also the generic diskstats
callout to part_in_flight(), etc).  Wiring up both part_in_flight() and
part_in_flight_rw() to optionally call out to a new callback isn't going
to fly, especially if that callout has to loop over all CPUs to sum
percpu counters.
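
For the sake of concreteness, a minimal sketch of what that summing looks
like (illustrative names, not the actual patch):

#include <linux/cpumask.h>
#include <linux/percpu.h>

/*
 * Illustrative only: reading a plain percpu counter touches one cache
 * line per possible CPU, which is what makes it too expensive to call
 * from anything close to a hot path.
 */
static unsigned long percpu_inflight_sum(unsigned long __percpu *counter)
{
	unsigned long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += *per_cpu_ptr(counter, cpu);

	return sum;
}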

I checked with Jens and now that in 4.21 all of the old request-based IO
path is gone (and given that blk-mq bypasses use of ->in_flight[]): the
only consumer of the existing ->in_flight[] is the bio-based IO path.

Given that now only bio-based is consuming it, and your work was focused
on making bio-based DM's "pending" IO accounting more scalable, it is
best to just change block core's ->in_flight[] directly.

But Jens is against switching to using percpu counters because they are
really slow when summing the counts.  And diskstats does that
frequently.  Jens said at least 2 other attempts were made and rejected
to switch over to percpu counters.

Jens' suggestion is to implement a new generic rolling per-node
counter.  Would you be open to trying that?
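
Setting the "rolling" part aside, the per-node idea would be roughly one
counter per NUMA node, so that a sum walks nr_node_ids entries instead of
every possible CPU.  A sketch with made-up names, nothing that exists today:

#include <linux/atomic.h>
#include <linux/cache.h>
#include <linux/nodemask.h>
#include <linux/topology.h>

/*
 * Hypothetical per-node inflight counter; the caller would allocate an
 * array of nr_node_ids of these.
 */
struct pernode_inflight {
	atomic_t count ____cacheline_aligned_in_smp;
};

static inline void pernode_inflight_inc(struct pernode_inflight *c)
{
	atomic_inc(&c[numa_node_id()].count);
}

static inline void pernode_inflight_dec(struct pernode_inflight *c)
{
	/*
	 * Completion may run on a different node than submission, so one
	 * node's count can dip below zero; only the sum is meaningful.
	 */
	atomic_dec(&c[numa_node_id()].count);
}

static inline int pernode_inflight_sum(struct pernode_inflight *c)
{
	int node, sum = 0;

	for_each_node(node)
		sum += atomic_read(&c[node].count);

	return sum;
}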

Mike

Mike Snitzer Nov. 7, 2018, 10:42 p.m. UTC | #2
On Wed, Nov 07 2018 at  5:29pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Nov 06 2018 at  4:34pm -0500,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > Hi
> > 
> > These are the device mapper percpu patches.
> > 
> > Note that I didn't test request-based device mapper because I don't have
> > hardware for it (the patches don't convert request-based targets to percpu
> > values, but there are a few inevitable changes in dm-rq.c).
> 
> Patches 1 - 3 make sense.  But the use of percpu inflight counters isn't
> something I can get upstream.  Any more scalable counter still needs to
> be wired up to the block stats interfaces (the one you did in patch 5 is
> only for the "inflight" sysfs file; there is also the generic diskstats
> callout to part_in_flight(), etc).  Wiring up both part_in_flight() and
> part_in_flight_rw() to optionally call out to a new callback isn't going
> to fly, especially if that callout has to loop over all CPUs to sum
> percpu counters.
> 
> I checked with Jens and now that in 4.21 all of the old request-based IO
> path is gone (and given that blk-mq bypasses use of ->in_flight[]): the
> only consumer of the existing ->in_flight[] is the bio-based IO path.
> 
> Given that now only bio-based is consuming it, and your work was focused
> on making bio-based DM's "pending" IO accounting more scalable, it is
> best to just change block core's ->in_flight[] directly.
> 
> But Jens is against switching to using percpu counters because they are
> really slow when summing the counts.  And diskstats does that
> frequently.  Jens said at least 2 other attempts were made and rejected
> to switch over to percpu counters.
> 
> Jens' suggestion is to implement a new generic rolling per-node
> counter.  Would you be open to trying that?

I've based my dm-4.21 branch on Jens' for-4.21/block branch and staged
your first 3 patches.  If you're interested in exploring rolling
per-node inflight counters please base your work on this branch:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.21

("rolling per-node counters" aside, any new DM development should be
based on the dm-4.21 branch)

Mike

Mikulas Patocka Nov. 7, 2018, 10:47 p.m. UTC | #3
On Wed, 7 Nov 2018, Mike Snitzer wrote:

> On Tue, Nov 06 2018 at  4:34pm -0500,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > Hi
> > 
> > These are the device mapper percpu patches.
> > 
> > Note that I didn't test request-based device mapper because I don't have
> > hardware for it (the patches don't convert request-based targets to percpu
> > values, but there are a few inevitable changes in dm-rq.c).
> 
> Patches 1 - 3 make sense.  But the use of percpu inflight counters isn't
> something I can get upstream.  Any more scalable counter still needs to
> be wired up to the block stats interfaces (the one you did in patch 5 is
> only for the "inflight" sysfs file; there is also the generic diskstats
> callout to part_in_flight(), etc).  Wiring up both part_in_flight() and
> part_in_flight_rw() to optionally call out to a new callback isn't going
> to fly, especially if that callout has to loop over all CPUs to sum
> percpu counters.
> 
> I checked with Jens and now that in 4.21 all of the old request-based IO
> path is gone (and given that blk-mq bypasses use of ->in_flight[]): the
> only consumer of the existing ->in_flight[] is the bio-based IO path.
> 
> Given that now only bio-based is consuming it, and your work was focused
> on making bio-based DM's "pending" IO accounting more scalable, it is
> best to just change block core's ->in_flight[] directly.
> 
> But Jens is against switching to using percpu counters because they are
> really slow when summing the counts.  And diskstats does that
> frequently.  Jens said at least 2 other attempts were made and rejected
> to switch over to percpu counters.

I'd like to know - which kernel part needs to sum the percpu IO counters 
frequently?

My impression was that the counters need to be summed only when the user 
is reading the files in sysfs and that is not frequent at all.

I forgot about "/proc/diskstats" - but is reading it really frequent?
Do we want to sacrifice IOPS throughput for the sake of improving the time 
of reading "/proc/diskstats"?

Mikulas

> Jens' suggestion is to implement a new generic rolling per-node
> counter.  Would you be open to trying that?
> 
> Mike
> 

Jens Axboe Nov. 7, 2018, 10:59 p.m. UTC | #4
On 11/7/18 3:47 PM, Mikulas Patocka wrote:
> 
> 
> On Wed, 7 Nov 2018, Mike Snitzer wrote:
> 
>> On Tue, Nov 06 2018 at  4:34pm -0500,
>> Mikulas Patocka <mpatocka@redhat.com> wrote:
>>
>>> Hi
>>>
>>> These are the device mapper percpu patches.
>>>
>>> Note that I didn't test request-based device mapper because I don't have
>>> hardware for it (the patches don't convert request-based targets to percpu
>>> values, but there are a few inevitable changes in dm-rq.c).
>>
>> Patches 1 - 3 make sense.  But the use of percpu inflight counters isn't
>> something I can get upstream.  Any more scalable counter still needs to
>> be wired up to the block stats interfaces (the one you did in patch 5 is
>> only for the "inflight" sysfs file; there is also the generic diskstats
>> callout to part_in_flight(), etc).  Wiring up both part_in_flight() and
>> part_in_flight_rw() to optionally call out to a new callback isn't going
>> to fly, especially if that callout has to loop over all CPUs to sum
>> percpu counters.
>>
>> I checked with Jens and now that in 4.21 all of the old request-based IO
>> path is gone (and given that blk-mq bypasses use of ->in_flight[]): the
>> only consumer of the existing ->in_flight[] is the bio-based IO path.
>>
>> Given that now only bio-based is consuming it, and your work was focused
>> on making bio-based DM's "pending" IO accounting more scalable, it is
>> best to just change block core's ->in_flight[] directly.
>>
>> But Jens is against switching to using percpu counters because they are
>> really slow when summing the counts.  And diskstats does that
>> frequently.  Jens said at least 2 other attempts were made and rejected
>> to switch over to percpu counters.
> 
> I'd like to know - which kernel part needs to sum the percpu IO counters 
> frequently?
> 
> My impression was that the counters need to be summed only when the user 
> is reading the files in sysfs and that is not frequent at all.

part_round_stats() does it on IO completion - only every jiffy, but it's
enough that previous attempts at percpu inflight counters only worked
for some cases, and were worse for others.
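
For context, the rounding that runs there has to do roughly the following
(a simplified sketch with its own struct, not the actual block core code):
both counters are advanced in proportion to how many requests are in flight
right now, which is why the completion path needs a cheap inflight count.

#include <linux/jiffies.h>

struct rounding_sketch {
	unsigned long stamp;		/* jiffy of the last rounding */
	unsigned long io_ticks;		/* jiffies with at least one I/O in flight */
	unsigned long time_in_queue;	/* in-flight count summed over jiffies */
};

static void round_stats_sketch(struct rounding_sketch *s, unsigned int inflight)
{
	unsigned long now = jiffies;

	if (now == s->stamp)
		return;

	if (inflight) {
		s->io_ticks += now - s->stamp;
		s->time_in_queue += inflight * (now - s->stamp);
	}
	s->stamp = now;
}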
Mikulas Patocka Nov. 16, 2018, midnight UTC | #5
On Wed, 7 Nov 2018, Jens Axboe wrote:

> On 11/7/18 3:47 PM, Mikulas Patocka wrote:
> > 
> > I'd like to know - which kernel part needs to sum the percpu IO counters 
> > frequently?
> > 
> > My impression was that the counters need to be summed only when the user 
> > is reading the files in sysfs and that is not frequent at all.
> 
> part_round_stats() does it on IO completion - only every jiffy, but it's
> enough that previous attempts at percpu inflight counters only worked
> for some cases, and were worse for others.

I see.

I thought about it - part_round_stats() is used to calculate two counters 
- time_in_queue and io_ticks.

time_in_queue can be calculated by adding the duration of each I/O when
the I/O ends (the value will be the same except for in-progress I/Os).

io_ticks could be approximated - if an I/O is started or finished and the 
"jiffies" value changes, we add 1. This is approximation, but if the I/Os 
take less than a jiffy, the value will be the same.
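
In code, the proposed approximation would amount to something like this (a
sketch with illustrative names; locking and percpu details omitted):

#include <linux/jiffies.h>

struct approx_stats {
	unsigned long io_ticks;		/* jiffies during which I/O was observed */
	unsigned long time_in_queue;	/* summed per-I/O durations, in jiffies */
	unsigned long last_event;	/* jiffy of the last start/end event */
};

/* Called on both I/O start and I/O end: at most one tick per jiffy. */
static void approx_account_event(struct approx_stats *s)
{
	unsigned long now = jiffies;

	if (now != s->last_event) {
		s->io_ticks++;
		s->last_event = now;
	}
}

/* Called on I/O end only: charge the whole duration at completion. */
static void approx_account_end(struct approx_stats *s, unsigned long start_time)
{
	s->time_in_queue += jiffies - start_time;
	approx_account_event(s);
}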


These are the benchmarks for the patches for IOPS on ramdisk (request 
size 512 bytes) and ramdisk with dm-linear attached:

fio --ioengine=psync --iodepth=1 --rw=read --bs=512 --direct=1 --numjobs=12 --time_based --runtime=10 --group_reporting --name=/dev/ram0

a system with 2 6-core processors:
/dev/ram0					6656445 IOPS
/dev/mapper/linear				2061914 IOPS
/dev/mapper/linear with percpu counters		5500976 IOPS

a system with 1 6-core processor:
/dev/ram0					4019921 IOPS
/dev/mapper/linear				2104687 IOPS
/dev/mapper/linear with percpu counters		3050195 IOPS

a virtual machine (12 virtual cores and 12 physical cores):
/dev/ram0					5304687 IOPS
/dev/mapper/linear				2115234 IOPS
/dev/mapper/linear with percpu counters		4457031 IOPS


My point of view is that we shouldn't degrade I/O throughput just to keep 
the counters accurate, so I suggest changing the counters to a
less-accurate mode. I'll send patches for that.

Device mapper has a separate dm-stats functionality that can provide 
accurate I/O counters for the whole device and for any range - it is off 
by default, so it doesn't degrade performance if the user doesn't need it.

Mikulas
