Message ID | 20181106213458.308968760@debian-a64.vm
---|---
Series | device mapper percpu patches
On Tue, Nov 06 2018 at  4:34pm -0500,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> Hi
>
> These are the device mapper percpu patches.
>
> Note that I didn't test request-based device mapper because I don't have
> hardware for it (the patches don't convert request-based targets to percpu
> values, but there are a few inevitable changes in dm-rq.c).

Patches 1 - 3 make sense.  But the use of percpu inflight counters isn't
something I can get upstream.  Any more scalable counter still needs to
be wired up to the block stats interfaces (the one you did in patch 5 is
only for the "inflight" sysfs file; there is also the generic diskstats
callout to part_in_flight(), etc).  Wiring up both part_in_flight() and
part_in_flight_rw() to optionally call out to a new callback isn't going
to fly, especially if that callout loops over all the percpu counters to
sum them.

I checked with Jens, and now that in 4.21 all of the old request-based IO
path is gone (and given that blk-mq bypasses use of ->in_flight[]), the
only consumer of the existing ->in_flight[] is the bio-based IO path.

Given that now only bio-based is consuming it, and your work was focused
on making bio-based DM's "pending" IO accounting more scalable, it is
best to just change block core's ->in_flight[] directly.

But Jens is against switching to percpu counters because they are really
slow when summing the counts, and diskstats does that frequently.  Jens
said at least 2 other attempts to switch over to percpu counters were
made and rejected.

Jens' suggestion is to implement a new generic rolling per-node counter.
Would you be open to trying that?

Mike
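To make the per-node suggestion concrete, here is a rough user-space sketch of the sharding idea only (it does not attempt the "rolling" part): counters live per NUMA node instead of per CPU, so a read sums a handful of entries rather than one per CPU. The node count, the explicit node argument, and all names below are illustrative assumptions, not the kernel's actual interfaces.

```c
/*
 * Illustrative user-space sketch (not kernel code): per-node sharded
 * in-flight counter.  Writers touch only their node's cache line; the
 * reader sums NR_NODES entries instead of nr_cpu_ids percpu slots.
 */
#include <stdatomic.h>
#include <stdio.h>

#define NR_NODES 4                              /* hypothetical NUMA node count */

struct node_counter {
        _Atomic long count;
        char pad[64 - sizeof(long)];            /* keep nodes on separate cache lines */
};

static struct node_counter inflight[NR_NODES];

/* Increment on the submitting CPU's node (node id passed in for the sketch). */
static void inflight_inc(int node)
{
        atomic_fetch_add_explicit(&inflight[node].count, 1, memory_order_relaxed);
}

static void inflight_dec(int node)
{
        atomic_fetch_sub_explicit(&inflight[node].count, 1, memory_order_relaxed);
}

/* Reader: sums only NR_NODES entries - cheap compared to a per-CPU walk. */
static long inflight_sum(void)
{
        long sum = 0;
        for (int n = 0; n < NR_NODES; n++)
                sum += atomic_load_explicit(&inflight[n].count, memory_order_relaxed);
        return sum;
}

int main(void)
{
        inflight_inc(0);
        inflight_inc(1);
        inflight_dec(0);
        printf("in flight: %ld\n", inflight_sum());     /* prints 1 */
        return 0;
}
```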
On Wed, Nov 07 2018 at  5:29pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Nov 06 2018 at  4:34pm -0500,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> > Hi
> >
> > These are the device mapper percpu patches.
> >
> > Note that I didn't test request-based device mapper because I don't have
> > hardware for it (the patches don't convert request-based targets to percpu
> > values, but there are a few inevitable changes in dm-rq.c).
>
> Patches 1 - 3 make sense.  But the use of percpu inflight counters isn't
> something I can get upstream.  Any more scalable counter still needs to
> be wired up to the block stats interfaces (the one you did in patch 5 is
> only for the "inflight" sysfs file; there is also the generic diskstats
> callout to part_in_flight(), etc).  Wiring up both part_in_flight() and
> part_in_flight_rw() to optionally call out to a new callback isn't going
> to fly, especially if that callout loops over all the percpu counters to
> sum them.
>
> I checked with Jens, and now that in 4.21 all of the old request-based IO
> path is gone (and given that blk-mq bypasses use of ->in_flight[]), the
> only consumer of the existing ->in_flight[] is the bio-based IO path.
>
> Given that now only bio-based is consuming it, and your work was focused
> on making bio-based DM's "pending" IO accounting more scalable, it is
> best to just change block core's ->in_flight[] directly.
>
> But Jens is against switching to percpu counters because they are really
> slow when summing the counts, and diskstats does that frequently.  Jens
> said at least 2 other attempts to switch over to percpu counters were
> made and rejected.
>
> Jens' suggestion is to implement a new generic rolling per-node counter.
> Would you be open to trying that?

I've based my dm-4.21 branch on Jens' for-4.21/block branch and staged
your first 3 patches.

If you're interested in exploring rolling per-node inflight counters
please base your work on this branch:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.21

("rolling per-node counters" aside, any new DM development should be
based on the dm-4.21 branch)

Mike
On Wed, 7 Nov 2018, Mike Snitzer wrote:

> On Tue, Nov 06 2018 at  4:34pm -0500,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> > Hi
> >
> > These are the device mapper percpu patches.
> >
> > Note that I didn't test request-based device mapper because I don't have
> > hardware for it (the patches don't convert request-based targets to percpu
> > values, but there are a few inevitable changes in dm-rq.c).
>
> Patches 1 - 3 make sense.  But the use of percpu inflight counters isn't
> something I can get upstream.  Any more scalable counter still needs to
> be wired up to the block stats interfaces (the one you did in patch 5 is
> only for the "inflight" sysfs file; there is also the generic diskstats
> callout to part_in_flight(), etc).  Wiring up both part_in_flight() and
> part_in_flight_rw() to optionally call out to a new callback isn't going
> to fly, especially if that callout loops over all the percpu counters to
> sum them.
>
> I checked with Jens, and now that in 4.21 all of the old request-based IO
> path is gone (and given that blk-mq bypasses use of ->in_flight[]), the
> only consumer of the existing ->in_flight[] is the bio-based IO path.
>
> Given that now only bio-based is consuming it, and your work was focused
> on making bio-based DM's "pending" IO accounting more scalable, it is
> best to just change block core's ->in_flight[] directly.
>
> But Jens is against switching to percpu counters because they are really
> slow when summing the counts, and diskstats does that frequently.  Jens
> said at least 2 other attempts to switch over to percpu counters were
> made and rejected.

I'd like to know - which kernel part needs to sum the percpu IO counters
frequently?

My impression was that the counters need to be summed only when the user
is reading the files in sysfs, and that is not frequent at all.

I forgot about "/proc/diskstats" - but is reading it really frequent?  Do
we want to sacrifice IOPS throughput for the sake of improving the time
it takes to read "/proc/diskstats"?

Mikulas

> Jens' suggestion is to implement a new generic rolling per-node counter.
> Would you be open to trying that?
>
> Mike
On 11/7/18 3:47 PM, Mikulas Patocka wrote:
>
> On Wed, 7 Nov 2018, Mike Snitzer wrote:
>
>> On Tue, Nov 06 2018 at  4:34pm -0500,
>> Mikulas Patocka <mpatocka@redhat.com> wrote:
>>
>>> Hi
>>>
>>> These are the device mapper percpu patches.
>>>
>>> Note that I didn't test request-based device mapper because I don't have
>>> hardware for it (the patches don't convert request-based targets to percpu
>>> values, but there are a few inevitable changes in dm-rq.c).
>>
>> Patches 1 - 3 make sense.  But the use of percpu inflight counters isn't
>> something I can get upstream.  Any more scalable counter still needs to
>> be wired up to the block stats interfaces (the one you did in patch 5 is
>> only for the "inflight" sysfs file; there is also the generic diskstats
>> callout to part_in_flight(), etc).  Wiring up both part_in_flight() and
>> part_in_flight_rw() to optionally call out to a new callback isn't going
>> to fly, especially if that callout loops over all the percpu counters to
>> sum them.
>>
>> I checked with Jens, and now that in 4.21 all of the old request-based IO
>> path is gone (and given that blk-mq bypasses use of ->in_flight[]), the
>> only consumer of the existing ->in_flight[] is the bio-based IO path.
>>
>> Given that now only bio-based is consuming it, and your work was focused
>> on making bio-based DM's "pending" IO accounting more scalable, it is
>> best to just change block core's ->in_flight[] directly.
>>
>> But Jens is against switching to percpu counters because they are really
>> slow when summing the counts, and diskstats does that frequently.  Jens
>> said at least 2 other attempts to switch over to percpu counters were
>> made and rejected.
>
> I'd like to know - which kernel part needs to sum the percpu IO counters
> frequently?
>
> My impression was that the counters need to be summed only when the user
> is reading the files in sysfs, and that is not frequent at all.

part_round_stats() does it on IO completion - only every jiffy, but it's
enough that previous attempts at percpu inflight counters only worked for
some cases, and were worse for others.
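For a sense of why that per-completion summing is the sticking point, here is a plain user-space sketch (not the block layer's actual code; NR_CPUS and the structures are stand-ins): writes are O(1) on the local CPU's cache line, but every read walks all CPUs, and doing that roughly once per jiffy from the completion path makes the cost grow with the size of the machine.

```c
/*
 * Illustrative sketch of the percpu-counter trade-off: cheap sharded
 * writes, expensive reads that touch one cache line per CPU.
 */
#include <stdio.h>

#define NR_CPUS 256                     /* hypothetical large machine */

struct cpu_inflight {
        long count;
        char pad[64 - sizeof(long)];    /* one cache line per CPU */
};

static struct cpu_inflight per_cpu_inflight[NR_CPUS];

/* Writer: O(1), touches only the submitting CPU's slot. */
static void inc_on_cpu(int cpu) { per_cpu_inflight[cpu].count++; }

/*
 * Reader: O(NR_CPUS) cache-line misses - too heavy to run on every
 * part_round_stats()-style accounting pass, which is the objection
 * raised above.
 */
static long sum_inflight(void)
{
        long sum = 0;
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                sum += per_cpu_inflight[cpu].count;
        return sum;
}

int main(void)
{
        inc_on_cpu(3);
        inc_on_cpu(17);
        printf("in flight: %ld\n", sum_inflight());     /* prints 2 */
        return 0;
}
```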
On Wed, 7 Nov 2018, Jens Axboe wrote:

> On 11/7/18 3:47 PM, Mikulas Patocka wrote:
> >
> > I'd like to know - which kernel part needs to sum the percpu IO counters
> > frequently?
> >
> > My impression was that the counters need to be summed only when the user
> > is reading the files in sysfs, and that is not frequent at all.
>
> part_round_stats() does it on IO completion - only every jiffy, but it's
> enough that previous attempts at percpu inflight counters only worked for
> some cases, and were worse for others.

I see.  I thought about it - part_round_stats() is used to calculate two
counters, time_in_queue and io_ticks.

time_in_queue can be calculated by adding the duration of the I/O when
the I/O ends (the value will be the same, except for in-progress I/Os).

io_ticks could be approximated: whenever an I/O starts or finishes and
the "jiffies" value has changed, we add 1.  This is an approximation, but
if the I/Os take less than a jiffy, the value will be the same.

These are the benchmarks for the patches - IOPS on a ramdisk (request
size 512 bytes) and on the same ramdisk with dm-linear attached:

fio --ioengine=psync --iodepth=1 --rw=read --bs=512 --direct=1 \
    --numjobs=12 --time_based --runtime=10 --group_reporting \
    --name=/dev/ram0

a system with 2 6-core processors:
/dev/ram0                                  6656445 IOPS
/dev/mapper/linear                         2061914 IOPS
/dev/mapper/linear with percpu counters    5500976 IOPS

a system with 1 6-core processor:
/dev/ram0                                  4019921 IOPS
/dev/mapper/linear                         2104687 IOPS
/dev/mapper/linear with percpu counters    3050195 IOPS

a virtual machine (12 virtual cores and 12 physical cores):
/dev/ram0                                  5304687 IOPS
/dev/mapper/linear                         2115234 IOPS
/dev/mapper/linear with percpu counters    4457031 IOPS

My point of view is that we shouldn't degrade I/O throughput just to keep
the counters accurate, so I suggest changing the counters to a
less-accurate mode.  I'll send patches for that.

Device mapper has a separate dm-stats functionality that can provide
accurate I/O counters for the whole device and for any range - it is off
by default, so it doesn't degrade performance if the user doesn't need
it.

Mikulas
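A minimal sketch of the approximation described above, in plain user-space C (the names and the "jiffies" variable are stand-ins, not the block layer's real code): io_ticks is bumped by at most one per observed jiffy boundary at IO start/end, and time_in_queue is accumulated only at completion. As the example shows, an IO spanning several jiffies under-counts io_ticks, which is exactly the accepted inaccuracy for IOs longer than a jiffy.

```c
/*
 * Sketch of the proposed less-accurate accounting: approximate io_ticks
 * at IO start/end, accumulate time_in_queue only at completion.
 */
#include <stdio.h>

static unsigned long jiffies;           /* stand-in for the kernel's jiffies */
static unsigned long io_ticks;          /* ticks during which IO was active */
static unsigned long time_in_queue;     /* sum of per-IO durations */
static unsigned long stamp;             /* last jiffy already accounted */

/* Called on both IO start and IO end: add at most 1 per new jiffy seen. */
static void update_io_ticks(void)
{
        if (stamp != jiffies) {
                io_ticks++;
                stamp = jiffies;
        }
}

static void io_start(unsigned long *start)
{
        update_io_ticks();
        *start = jiffies;               /* remember when this IO began */
}

static void io_end(unsigned long start)
{
        update_io_ticks();
        time_in_queue += jiffies - start;   /* full duration added at completion */
}

int main(void)
{
        unsigned long s;
        io_start(&s);
        jiffies += 3;                   /* pretend 3 ticks pass while the IO runs */
        io_end(s);
        /* prints io_ticks=1 time_in_queue=3: io_ticks is approximate */
        printf("io_ticks=%lu time_in_queue=%lu\n", io_ticks, time_in_queue);
        return 0;
}
```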