Message ID: 20200807160327.GA977@redhat.com (mailing list archive)
State: New, archived
Series: [git,pull] device mapper changes for 5.9
On Fri, Aug 07 2020 at 12:03pm -0400, Mike Snitzer <snitzer@redhat.com> wrote:

> Hi Linus,
>
> The following changes since commit 11ba468877bb23f28956a35e896356252d63c983:
>
>   Linux 5.8-rc5 (2020-07-12 16:34:50 -0700)
>
> are available in the Git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git tags/for-5.9/dm-changes
>
> for you to fetch changes up to a9cb9f4148ef6bb8fabbdaa85c42b2171fbd5a0d:
>
>   dm: don't call report zones for more than the user requested (2020-08-04 16:31:12 -0400)
>
> Please pull, thanks!
> Mike
>
> ----------------------------------------------------------------
> - DM multipath locking fixes around m->flags tests and improvements to
>   bio-based code so that it follows patterns established by
>   request-based code.
>
> - Request-based DM core improvement to eliminate unnecessary call to
>   blk_mq_queue_stopped().
>
> - Add "panic_on_corruption" error handling mode to DM verity target.
>
> - DM bufio fix to perform buffer cleanup from a workqueue rather
>   than wait for IO in reclaim context from shrinker.
>
> - DM crypt improvement to optionally avoid async processing via
>   workqueues for reads and/or writes -- via "no_read_workqueue" and
>   "no_write_workqueue" features. This more direct IO processing
>   improves latency and throughput with faster storage. Avoiding
>   workqueue IO submission for writes (DM_CRYPT_NO_WRITE_WORKQUEUE) is
>   a requirement for adding zoned block device support to DM crypt.

I forgot to note that you'll get a trivial merge conflict in dm-crypt.c
due to commit ed00aabd5eb9f ("block: rename generic_make_request to
submit_bio_noacct").

Mike
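For reference, the two new flags are passed as optional parameters on a
dm-crypt table line. A minimal sketch of enabling both, assuming a
placeholder device and key (only the flag names come from the pull
request above; the rest is standard dm-crypt table syntax):

    # Hypothetical setup; /dev/sdX and <hex_key> are placeholders.
    # The "2" is the count of optional parameters that follow.
    dmsetup create cryptdev --table "0 $(blockdev --getsz /dev/sdX) crypt \
        aes-xts-plain64 <hex_key> 0 /dev/sdX 0 \
        2 no_read_workqueue no_write_workqueue"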
On Fri, Aug 7, 2020 at 9:03 AM Mike Snitzer <snitzer@redhat.com> wrote:
>
> - DM crypt improvement to optionally avoid async processing via
>   workqueues for reads and/or writes -- via "no_read_workqueue" and
>   "no_write_workqueue" features. This more direct IO processing
>   improves latency and throughput with faster storage. Avoiding
>   workqueue IO submission for writes (DM_CRYPT_NO_WRITE_WORKQUEUE) is
>   a requirement for adding zoned block device support to DM crypt.

Is there a reason the async workqueue isn't removed entirely if it's not a win?

Or at least made to not be the default.

Now it seems to be just optionally disabled, which seems the wrong way
around to me.

I do not believe async crypto has ever worked or been worth it.
Off-loaded crypto in the IO path, yes. Async using a workqueue? Naah.

Or do I misunderstand?

              Linus
The pull request you sent on Fri, 7 Aug 2020 12:03:27 -0400:
> git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git tags/for-5.9/dm-changes
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/2f12d44085dabf5fa5779ff0bb0aaa1b2cc768cb
Thank you!
On Fri, Aug 07 2020 at 4:11pm -0400, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Aug 7, 2020 at 9:03 AM Mike Snitzer <snitzer@redhat.com> wrote:
> >
> > - DM crypt improvement to optionally avoid async processing via
> >   workqueues for reads and/or writes -- via "no_read_workqueue" and
> >   "no_write_workqueue" features. This more direct IO processing
> >   improves latency and throughput with faster storage. Avoiding
> >   workqueue IO submission for writes (DM_CRYPT_NO_WRITE_WORKQUEUE) is
> >   a requirement for adding zoned block device support to DM crypt.
>
> Is there a reason the async workqueue isn't removed entirely if it's not a win?
>
> Or at least made to not be the default.

I haven't assessed it yet. There are things like rbtree sorting that
are also hooked off async, but that is more meaningful for rotational
storage.

> Now it seems to be just optionally disabled, which seems the wrong way
> around to me.
>
> I do not believe async crypto has ever worked or been worth it.
> Off-loaded crypto in the IO path, yes. Async using a workqueue? Naah.
>
> Or do I misunderstand?

No, you've got it. My thinking was to introduce the ability to avoid
the workqueue code via opt-in and then we can evaluate the difference
it makes on different classes of storage. That is the more conservative
approach, and it also allows me to not know the end from the
beginning... This was work driven by others, so on this point I rolled
with what was proposed, since I personally don't have the answer (yet)
on whether the workqueue is actually helpful. Best to _really_ know
that before removing it.

I'll make a point to try to get more eyes on this to see if it makes
sense to just eliminate the workqueue code entirely (or invert the
default). Will keep you updated.

Mike
For what it's worth, I just ran two tests on a machine with dm-crypt
using the cipher_null:ecb cipher. Results are mixed; not offloading IO
submission can result in a -27% to +23% change in throughput across a
selection of IO patterns on HDDs and SSDs. (Note that the IO submission
thread also reorders IO to attempt to submit it in sector order, so
that is an additional difference between the two modes -- it's not just
"offload writes to another thread" vs "don't offload writes".)

The summary (for my FIO workloads focused on parallelism) is that
offloading is useful for high IO depth random writes on SSDs, and for
long sequential small writes on HDDs. Offloading reduced throughput for
immensely high IO depths on SSDs, where I would guess lock contention
is reducing effective IO depth to the disk; and for low IO depths of
sequential writes on HDDs, where I would guess (as it would for a zoned
device) preserving submission order is better than attempting to
reorder before submission.

Two test regimes: randwrite on 7x Samsung SSD 850 PRO 128G, somewhat
aged, behind an LSI MegaRAID card providing raid0; 6 processors
(Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz); 128G RAM. And seqwrite on
a software raid0 (512k chunk size) of 4 HDDs on the same machine.
Scheduler 'none' for both. LSI card in writethrough cache mode.

All data in MB/s.

depth  jobs  bs    dflt  no_wq  %chg  raw disk
----------------randwrite, SSD----------------
  128     1  4k     282    282     0       285
  256     4  4k     251    183   -27       283
 2048     4  4k     266    283    +6       284
    1     4  1m     433    414    -4       403
----------------seqwrite, HDD-----------------
  128     1  4k      87    107   +23        86
  256     4  4k     101     90   -11        91
 2048     4  4k     273    233   -15       249
    1     4  1m     144    146    +1       146
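As a rough illustration of the shape of these workloads, the first
randwrite row above corresponds to an fio invocation along the
following lines (the options are standard fio options, but the device
path and runtime are placeholders, and the job files actually used may
have differed):

    fio --name=randwrite-ssd --filename=/dev/mapper/cryptdev \
        --ioengine=libaio --direct=1 --rw=randwrite \
        --bs=4k --iodepth=128 --numjobs=1 \
        --time_based --runtime=60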
On Tue, Aug 18, 2020 at 1:40 PM John Dorminy <jdorminy@redhat.com> wrote:
>
> The summary (for my FIO workloads focused on
> parallelism) is that offloading is useful for high IO depth random
> writes on SSDs, and for long sequential small writes on HDDs.

Do we have any non-microbenchmarks that might be somewhat
representative of something, and might be used to at least set a
default?

Or can we perhaps - even better - dynamically notice whether to offload or not?

I suspect that offloading is horrible for any latency situation,
particularly with any modern setup where the SSD is fast enough that
doing scheduling to another thread is noticeable.

After all, some people are working on polling IO, because the result
comes back so fast that taking the interrupt is unnecessary extra
work. Those people admittedly have faster disks than most of us, but ..

At least from a latency angle, maybe we could have the fairly common
case of an IO depth of 1 (because of synchronous reads) not trigger it.

It looks like you only did throughput benchmarks (like pretty much
everybody always does, because latency benchmarks are a lot harder to
do well).

              Linus
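For what it's worth, fio can report completion-latency percentiles
alongside throughput; a sketch of the kind of synchronous-read,
queue-depth-1 latency run Linus describes (these are real fio options,
but the device path and parameters are placeholders):

    fio --name=sync-read-lat --filename=/dev/mapper/cryptdev \
        --ioengine=psync --direct=1 --rw=randread \
        --bs=4k --numjobs=1 --time_based --runtime=60 \
        --percentile_list=50:90:99:99.99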
On Tue, Aug 18, 2020 at 2:12 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
>
> Additionally if one cares about latency

I think everybody really deep down cares about latency, they just
don't always know it, and the benchmarks are very seldom about it
because it's so much harder to measure.

> they will not use HDDs for the workflow and HDDs have much higher IO latency than CPU scheduling.

I think by now we can just say that anybody who uses HDD's doesn't care
about performance as a primary issue.

I don't think they are really interesting as a benchmark target - at
least from the standpoint of what the kernel should optimize for.

People have HDD's for legacy reasons or because they care much more
about capacity than performance. Why should _we_ then worry about
performance that the user doesn't worry about?

I'm not saying we should penalize HDD's, but I don't think they are
things we should primarily care deeply about any more.

                 Linus
Your points are good. I don't know a good macrobenchmark at present,
but at least various latency numbers are easy to get out of fio.

I ran a similar set of tests on an Optane 900P with results below.
'clat' is, as fio reports, the completion latency, measured in usec.
'configuration' is [block size],[iodepth],[jobs]; picked to be a
varied selection that obtained excellent throughput from the drive.
The table reports average and 99th percentile latency as well as
throughput. It matches Ignat's report that large block sizes using the
new option can have worse latency and throughput on top-end drives,
although that result doesn't make any sense to me.

Happy to run some more there or elsewhere if there are suggestions.

devicetype     configuration  MB/s  clat mean  clat 99%
------------------------------------------------------------------
nvme base      1m,32,4        2259   59280      67634
crypt default  1m,32,4        2267   59050     182000
crypt no_w_wq  1m,32,4        1758   73954.54   84411
nvme base      64k,1024,1     2273   29500.92   30540
crypt default  64k,1024,1     2167   29518.89   50594
crypt no_w_wq  64k,1024,1     2056   31090.23   31327
nvme base      4k,128,4       2159     924.57    1106
crypt default  4k,128,4       1256    1663.67    3294
crypt no_w_wq  4k,128,4       1703    1165.69    1319

Ignat, how do these numbers match up to what you've been seeing?

-John

On Tue, Aug 18, 2020 at 5:23 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, Aug 18, 2020 at 2:12 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
> >
> > Additionally if one cares about latency
>
> I think everybody really deep down cares about latency, they just
> don't always know it, and the benchmarks are very seldom about it
> because it's so much harder to measure.
>
> > they will not use HDDs for the workflow and HDDs have much higher IO latency than CPU scheduling.
>
> I think by now we can just say that anybody who uses HDD's doesn't care
> about performance as a primary issue.
>
> I don't think they are really interesting as a benchmark target - at
> least from the standpoint of what the kernel should optimize for.
>
> People have HDD's for legacy reasons or because they care much more
> about capacity than performance. Why should _we_ then worry about
> performance that the user doesn't worry about?
>
> I'm not saying we should penalize HDD's, but I don't think they are
> things we should primarily care deeply about any more.
>
> Linus
Thank you John, Damien for the extensive measurements. I don't have
much to add, as my measurements are probably just a subset of Damien's
and repeat their results.

One interesting thing I noticed when experimenting with this: we mostly
talk about average throughput, but sometimes it is interesting to see
the instant values (measured over small time slices). For example, for
a 4kb block size, qd=1, 50/50 randrw job on a dm-crypt encrypted
ramdisk with the ecb(cipher_null) cipher, which I just continuously run
in the terminal, I can see the instant throughput having a somewhat
bimodal distribution: it reliably jumps between ~120 MiB/s and ~80
MiB/s medians (the overall average throughput being ~100 MiB/s of
course). This is for dm-crypt with workqueues. If I disable the
workqueues, the distribution of the instant throughput becomes
"normal".

Without looking into much detail I wonder if HT has some side-effects
on dm-crypt processing (I have it enabled), because it seems not all
"cores" are equal for dm-crypt even on the null cipher.

I might get my hands on an arm64 server soon and am curious to see how
dm-crypt and workqueues will compare there.

Regards,
Ignat

On Wed, Aug 19, 2020 at 8:10 AM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>
> John,
>
> On 2020/08/19 13:25, John Dorminy wrote:
> > Your points are good. I don't know a good macrobenchmark at present,
> > but at least various latency numbers are easy to get out of fio.
> >
> > I ran a similar set of tests on an Optane 900P with results below.
> > 'clat' is, as fio reports, the completion latency, measured in usec.
> > 'configuration' is [block size],[iodepth],[jobs]; picked to be a
> > varied selection that obtained excellent throughput from the drive.
> > The table reports average and 99th percentile latency as well as
> > throughput. It matches Ignat's report that large block sizes using the
> > new option can have worse latency and throughput on top-end drives,
> > although that result doesn't make any sense to me.
> >
> > Happy to run some more there or elsewhere if there are suggestions.
> >
> > devicetype     configuration  MB/s  clat mean  clat 99%
> > ------------------------------------------------------------------
> > nvme base      1m,32,4        2259   59280      67634
> > crypt default  1m,32,4        2267   59050     182000
> > crypt no_w_wq  1m,32,4        1758   73954.54   84411
> > nvme base      64k,1024,1     2273   29500.92   30540
> > crypt default  64k,1024,1     2167   29518.89   50594
> > crypt no_w_wq  64k,1024,1     2056   31090.23   31327
> > nvme base      4k,128,4       2159     924.57    1106
> > crypt default  4k,128,4       1256    1663.67    3294
> > crypt no_w_wq  4k,128,4       1703    1165.69    1319
>
> I have been doing a lot of testing recently on dm-crypt, mostly for
> zoned storage, that is, with the write workqueue disabled, but also
> with regular disks to have something to compare to and verify my
> results. I confirm that I see similar changes in throughput/latency in
> my tests: disabling workqueues improves throughput for small IO sizes
> thanks to the lower latency (removed context switch overhead), but the
> benefits of disabling the workqueues become dubious for large IO sizes
> and deep queue depths. See the heat-map attached for more results
> (nullblk device used for these measurements with 1 job per CPU).
>
> I also pushed things further in my tests, as I primarily focused on
> enterprise systems with a large number of storage devices being used
> with a single server.
> To flatten things out and avoid any performance limitations due to the
> storage devices, PCIe and/or HBA bus speed and memory bus speed, I
> ended up performing lots of tests using nullblk with different
> settings:
>
> 1) SSD-like: a multiqueue setting with the "none" scheduler, with
> irq_mode=0 (immediate completion in submission context) and irq_mode=1
> for softirq completion (different completion context than submission).
> 2) HDD-like: single queue with mq-deadline as the scheduler, and the
> different irq_mode settings.
>
> I also played with CPU assignments for the fio jobs and tried various
> things.
>
> My observations are as follows, in no particular order:
>
> 1) Maximum throughput clearly directly depends on the number of CPUs
> involved in the crypto work. The crypto acceleration is limited per
> core, and so the number of issuer contexts (for writes) and/or
> completion contexts (for reads) almost directly determines maximum
> performance with the workqueue disabled. I measured about 1.4GB/s at
> best on my system with a single writer at 128KB/QD=4.
>
> 2) For a multi-drive setup with IO issuers limited to a small set of
> CPUs, performance does not scale with the number of disks, as the
> crypto engine speed of the CPUs being used is the limiting factor:
> both write encryption and read decryption happen on that set of CPUs,
> regardless of the other CPUs' load.
>
> 3) For single queue devices, write performance scales badly with the
> number of CPUs used for IO issuing: the one CPU that runs the device
> queue to dispatch commands ends up doing a lot of crypto work for
> requests queued through other CPUs too.
>
> 4) On a very busy system with a very large number of disks and CPUs
> used for IOs, the total throughput I get is very close for all
> settings with workqueues enabled and disabled, about 50GB/s total on
> my dual socket Xeon system. There was a small advantage for the none
> scheduler/multiqueue setting, which gave up to 56GB/s with workqueues
> on and 47GB/s with workqueues off. The single queue/mq-deadline case
> gave 51GB/s and 48GB/s with workqueues on/off.
>
> 5) For the tests with the large number of drives and CPUs, things got
> interesting with the average latency: I saw about the same average
> with workqueues on and off. But the p99 latency was about 3 times
> lower with workqueues off than with workqueues on. When all CPUs are
> busy, reducing overhead by avoiding additional context switches
> clearly helps.
>
> 6) With an arguably more realistic workload of 66% reads and 34%
> writes (read size is 64KB/1MB with a 60%/40% ratio and write size is
> fixed at 1MB), I ended up with higher total throughput with workqueues
> disabled (44GB/s) vs enabled (38GB/s). Average write latency was also
> 30% lower with workqueues disabled, without any significant change to
> the average read latency.
>
> From all these tests, I am currently considering that for a large
> system with lots of devices, disabling workqueues is a win, as long as
> IO issuers are not limited to a small set of CPUs.
>
> The benefits of disabling workqueues for a desktop-like system or a
> server system with one (or very few) super fast drives are much less
> clear in my opinion. Average and p99 latency are generally better with
> workqueues off, but total throughput may significantly suffer if only
> a small number of IO contexts are involved, that is, a small number of
> CPUs participate in the crypto processing.
> Then crypto hardware speed dictates the results, and using workqueues
> to get parallelism between more CPU cores can give better throughput.
>
> That said, I am thinking that from all this, we can extract some hints
> to automate the decision for using workqueues or not:
> 1) Small IOs (e.g. 4K) would probably benefit from having the
> workqueue disabled, especially for 4Kn storage devices, as such a
> request would be processed as a single block with a single crypto API
> call.
> 2) It may be good to process any BIO marked with REQ_HIPRI (polling
> BIO) without any workqueue, to reduce latency, as intended by the
> caller.
> 3) We may want to have read-ahead reads use workqueues, especially for
> single queue devices (HDDs), to avoid increasing latency for other
> reads completing together with these read-ahead requests.
>
> In the end, I am still scratching my head trying to figure out what
> the best default setup may be.
>
> Best regards.
>
>
> --
> Damien Le Moal
> Western Digital Research
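For anyone wanting to reproduce Damien's nullblk configurations, a
rough sketch of the two device setups described above (the actual
null_blk module parameter is spelled "irqmode"; the sizes and queue
counts are assumptions):

    # SSD-like: multiqueue, "none" scheduler, completion in the
    # submission context (irqmode=0) or via softirq (irqmode=1)
    modprobe null_blk queue_mode=2 submit_queues=8 irqmode=0 gb=64
    echo none > /sys/block/nullb0/queue/scheduler

    # HDD-like: a single queue with the mq-deadline scheduler
    # (unload with "rmmod null_blk" before switching setups)
    modprobe null_blk queue_mode=2 submit_queues=1 irqmode=1 gb=64
    echo mq-deadline > /sys/block/nullb0/queue/scheduler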
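Damien's three hints could plausibly be folded into a per-bio decision
in dm-crypt. A minimal sketch, assuming a hypothetical helper called
from the map path (the function name and the size threshold are made
up; REQ_HIPRI and REQ_RAHEAD are the real bio flags of this era):

    #include <linux/bio.h>

    /*
     * Illustrative only -- not actual dm-crypt code.
     * Decide whether a bio should bypass the crypto workqueue.
     */
    static bool crypt_skip_workqueue(struct bio *bio)
    {
            /* Hint 2: polled bios want minimum latency; process inline. */
            if (bio->bi_opf & REQ_HIPRI)
                    return true;

            /*
             * Hint 3: keep read-ahead on the workqueue so it cannot delay
             * latency-sensitive reads completing alongside it.
             */
            if (bio->bi_opf & REQ_RAHEAD)
                    return false;

            /*
             * Hint 1: a small IO (e.g. a single 4K block) is one crypto
             * API call; the context switch would dominate, so go inline.
             */
            if (bio_sectors(bio) <= 8)
                    return true;

            return false;
    }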