
[git,pull] device mapper changes for 5.9

Message ID 20200807160327.GA977@redhat.com (mailing list archive)
State New, archived
Series [git,pull] device mapper changes for 5.9

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git tags/for-5.9/dm-changes

Message

Mike Snitzer Aug. 7, 2020, 4:03 p.m. UTC
Hi Linus,

The following changes since commit 11ba468877bb23f28956a35e896356252d63c983:

  Linux 5.8-rc5 (2020-07-12 16:34:50 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git tags/for-5.9/dm-changes

for you to fetch changes up to a9cb9f4148ef6bb8fabbdaa85c42b2171fbd5a0d:

  dm: don't call report zones for more than the user requested (2020-08-04 16:31:12 -0400)

Please pull, thanks!
Mike

----------------------------------------------------------------
- DM multipath locking fixes around m->flags tests and improvements to
  bio-based code so that it follows patterns established by
  request-based code.

- Request-based DM core improvement to eliminate unnecessary call to
  blk_mq_queue_stopped().

- Add "panic_on_corruption" error handling mode to DM verity target.

- DM bufio fix to perform buffer cleanup from a workqueue rather than
  waiting for IO in reclaim context from the shrinker.

- DM crypt improvement to optionally avoid async processing via
  workqueues for reads and/or writes -- via "no_read_workqueue" and
  "no_write_workqueue" features.  This more direct IO processing
  improves latency and throughput with faster storage.  Avoiding
  workqueue IO submission for writes (DM_CRYPT_NO_WRITE_WORKQUEUE) is
  a requirement for adding zoned block device support to DM crypt.

- Add zoned block device support to DM crypt.  Makes use of
  DM_CRYPT_NO_WRITE_WORKQUEUE and a new optional feature
  (DM_CRYPT_WRITE_INLINE) that allows write completion to wait for
  encryption to complete.  This allows write ordering to be preserved,
  which is needed for zoned block devices.

- Fix DM ebs target's check for REQ_OP_FLUSH.

- Fix DM core's report zones support to not report more zones than
  were requested.

- A few small compiler warning fixes.

- DM dust improvements to return output directly to the user rather
  than requiring them to scrape the system log for it (a usage sketch
  follows this list).
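
As a quick, hedged illustration of the new user-visible knobs above, the
sketches below show how the dm-verity error handling mode and the dm-dust
message interface might be exercised with dmsetup.  Device names, sizes,
digests and salts are placeholders rather than values taken from this
series; the table formats follow Documentation/admin-guide/device-mapper/.

  # dm-verity: hypothetical table using the new "panic_on_corruption"
  # optional argument (the trailing fields are <#opt_params> <opt_params>)
  dmsetup create vroot --readonly --table "0 2097152 verity 1 /dev/sda1 /dev/sda2 4096 4096 262144 1 sha256 <root_digest> <salt> 1 panic_on_corruption"

  # dm-dust: the results of these messages now come back through dmsetup
  # directly instead of only landing in the kernel log
  dmsetup create dust1 --table "0 $(blockdev --getsz /dev/sdc) dust /dev/sdc 0 512"
  dmsetup message dust1 0 addbadblock 60
  dmsetup message dust1 0 listbadblocks
  dmsetup message dust1 0 countbadblocks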

----------------------------------------------------------------
Damien Le Moal (5):
      dm crypt: Enable zoned block device support
      dm verity: Fix compilation warning
      dm raid: Remove empty if statement
      dm ioctl: Fix compilation warning
      dm init: Set file local variable static

Ignat Korchagin (1):
      dm crypt: add flags to optionally bypass kcryptd workqueues

JeongHyeon Lee (1):
      dm verity: add "panic_on_corruption" error handling mode

Johannes Thumshirn (1):
      dm: don't call report zones for more than the user requested

John Dorminy (1):
      dm ebs: Fix incorrect checking for REQ_OP_FLUSH

Mike Snitzer (7):
      dm mpath: changes from initial m->flags locking audit
      dm mpath: take m->lock spinlock when testing QUEUE_IF_NO_PATH
      dm mpath: push locking down to must_push_back_rq()
      dm mpath: factor out multipath_queue_bio
      dm mpath: rework __map_bio()
      dm mpath: rename current_pgpath to pgpath in multipath_prepare_ioctl
      dm mpath: use double checked locking in fast path

Mikulas Patocka (1):
      dm bufio: do buffer cleanup from a workqueue

Ming Lei (1):
      dm rq: don't call blk_mq_queue_stopped() in dm_stop_queue()

yangerkun (2):
      dm dust: report some message results directly back to user
      dm dust: add interface to list all badblocks

 .../admin-guide/device-mapper/dm-dust.rst          |  32 ++++-
 Documentation/admin-guide/device-mapper/verity.rst |   4 +
 drivers/md/dm-bufio.c                              |  60 ++++++---
 drivers/md/dm-crypt.c                              | 129 ++++++++++++++++--
 drivers/md/dm-dust.c                               |  58 ++++++--
 drivers/md/dm-ebs-target.c                         |   2 +-
 drivers/md/dm-init.c                               |   2 +-
 drivers/md/dm-ioctl.c                              |   2 +-
 drivers/md/dm-mpath.c                              | 146 ++++++++++++++-------
 drivers/md/dm-raid.c                               |   2 -
 drivers/md/dm-rq.c                                 |   3 -
 drivers/md/dm-verity-target.c                      |  13 +-
 drivers/md/dm-verity-verify-sig.h                  |  14 +-
 drivers/md/dm-verity.h                             |   3 +-
 drivers/md/dm.c                                    |   3 +-
 15 files changed, 355 insertions(+), 118 deletions(-)

Comments

Mike Snitzer Aug. 7, 2020, 4:20 p.m. UTC | #1
On Fri, Aug 07 2020 at 12:03pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> Hi Linus,
> 
> The following changes since commit 11ba468877bb23f28956a35e896356252d63c983:
> 
>   Linux 5.8-rc5 (2020-07-12 16:34:50 -0700)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git tags/for-5.9/dm-changes
> 
> for you to fetch changes up to a9cb9f4148ef6bb8fabbdaa85c42b2171fbd5a0d:
> 
>   dm: don't call report zones for more than the user requested (2020-08-04 16:31:12 -0400)
> 
> Please pull, thanks!
> Mike
> 
> ----------------------------------------------------------------
> - DM multipath locking fixes around m->flags tests and improvements to
>   bio-based code so that it follows patterns established by
>   request-based code.
> 
> - Request-based DM core improvement to eliminate unnecessary call to
>   blk_mq_queue_stopped().
> 
> - Add "panic_on_corruption" error handling mode to DM verity target.
> 
> - DM bufio fix to perform buffer cleanup from a workqueue rather
>   than waiting for IO in reclaim context from the shrinker.
> 
> - DM crypt improvement to optionally avoid async processing via
>   workqueues for reads and/or writes -- via "no_read_workqueue" and
>   "no_write_workqueue" features.  This more direct IO processing
>   improves latency and throughput with faster storage.  Avoiding
>   workqueue IO submission for writes (DM_CRYPT_NO_WRITE_WORKQUEUE) is
>   a requirement for adding zoned block device support to DM crypt.

I forgot to note that you'll get a trivial merge conflict in dm-crypt.c
due to commit ed00aabd5eb9f ("block: rename generic_make_request to
submit_bio_noacct").

Mike
Linus Torvalds Aug. 7, 2020, 8:11 p.m. UTC | #2
On Fri, Aug 7, 2020 at 9:03 AM Mike Snitzer <snitzer@redhat.com> wrote:
>
> - DM crypt improvement to optionally avoid async processing via
>   workqueues for reads and/or writes -- via "no_read_workqueue" and
>   "no_write_workqueue" features.  This more direct IO processing
>   improves latency and throughput with faster storage.  Avoiding
>   workqueue IO submission for writes (DM_CRYPT_NO_WRITE_WORKQUEUE) is
>   a requirement for adding zoned block device support to DM crypt.

Is there a reason the async workqueue isn't removed entirely if it's not a win?

Or at least made to not be the default.

Now it seems to be just optionally disabled, which seems the wrong way
around to me.

I do not believe async crypto has ever worked or been worth it.
Off-loaded crypto in the IO path, yes. Async using a workqueue? Naah.

Or do I misunderstand?

               Linus
pr-tracker-bot@kernel.org Aug. 7, 2020, 8:40 p.m. UTC | #3
The pull request you sent on Fri, 7 Aug 2020 12:03:27 -0400:

> git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git tags/for-5.9/dm-changes

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/2f12d44085dabf5fa5779ff0bb0aaa1b2cc768cb

Thank you!
Mike Snitzer Aug. 7, 2020, 8:40 p.m. UTC | #4
On Fri, Aug 07 2020 at  4:11pm -0400,
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Aug 7, 2020 at 9:03 AM Mike Snitzer <snitzer@redhat.com> wrote:
> >
> > - DM crypt improvement to optionally avoid async processing via
> >   workqueues for reads and/or writes -- via "no_read_workqueue" and
> >   "no_write_workqueue" features.  This more direct IO processing
> >   improves latency and throughput with faster storage.  Avoiding
> >   workqueue IO submission for writes (DM_CRYPT_NO_WRITE_WORKQUEUE) is
> >   a requirement for adding zoned block device support to DM crypt.
> 
> Is there a reason the async workqueue isn't removed entirely if it's not a win?
> 
> Or at least made to not be the default.

I haven't assessed it yet.

There are things like rbtree sorting that are also hooked off async, but
that is more meaningful for rotational storage.

> Now it seems to be just optionally disabled, which seems the wrong way
> around to me.
> 
> I do not believe async crypto has ever worked or been worth it.
> Off-loaded crypto in the IO path, yes. Async using a workqueue? Naah.
> 
> Or do I misunderstand?

No, you've got it.

My thinking was to introduce the ability to avoid the workqueue code via
opt-in and then we can evaluate the difference it has on different
classes of storage.

It's a more conservative approach that also allows me to not know the
end from the beginning... this was work driven by others, so on this
point I rolled with what was proposed, since I personally don't have the
answer (yet) on whether the workqueue is actually helpful.  Best to
_really_ know that before removing it.
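
For reference, the opt-in being discussed is expressed as two new
optional dm-crypt table parameters.  A minimal sketch, with a
hypothetical device and a placeholder key; only the parameter names come
from this series:

  # Hypothetical example: bypass both kcryptd workqueues on /dev/sdb
  # (table format: <start> <len> crypt <cipher> <key> <iv_offset>
  #  <device> <offset> <#opt_params> <opt_params...>)
  dmsetup create crypt0 --table "0 $(blockdev --getsz /dev/sdb) crypt aes-xts-plain64 <hex_key> 0 /dev/sdb 0 2 no_read_workqueue no_write_workqueue"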

I'll make a point to try to get more eyes on this to see if it makes
sense to just eliminate the workqueue code entirely (or invert the
default).  Will keep you updated.

Mike
John Dorminy Aug. 18, 2020, 8:39 p.m. UTC | #5
For what it's worth, I just ran two tests on a machine with dm-crypt
using the cipher_null:ecb cipher. Results are mixed; not offloading IO
submission can result in anywhere from a -27% to a +23% change in
throughput across a selection of IO patterns on HDDs and SSDs.

(Note that the IO submission thread also reorders IO to attempt to
submit it in sector order, so that is an additional difference between
the two modes -- it's not just "offload writes to another thread" vs
"don't offload writes".) The summary (for my FIO workloads focused on
parallelism) is that offloading is useful for high IO depth random
writes on SSDs, and for long sequential small writes on HDDs.
Offloading reduced throughput for immensely high IO depths on SSDs,
where I would guess lock contention is reducing effective IO depth to
the disk; and for low IO depths of sequential writes on HDDs, where I
would guess (as it would for a zoned device) preserving submission order
is better than attempting to reorder before submission.

Two test regimes: randwrite on 7x Samsung SSD 850 PRO 128G, somewhat
aged, behind an LSI MegaRAID card providing raid0, on 6 processors
(Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz) with 128G RAM; and seqwrite
on a software raid0 (512k chunk size) of 4 HDDs on the same machine.
Scheduler 'none' for both. LSI card in writethrough cache mode.
All data in MB/s.


depth  jobs  bs   dflt  no_wq  %chg  raw disk
--------------- randwrite, SSD ---------------
 128     1   4k    282    282     0    285
 256     4   4k    251    183   -27    283
2048     4   4k    266    283    +6    284
   1     4   1m    433    414    -4    403
--------------- seqwrite, HDD ----------------
 128     1   4k     87    107   +23     86
 256     4   4k    101     90   -11     91
2048     4   4k    273    233   -15    249
   1     4   1m    144    146    +1    146
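
For context, the rows above roughly correspond to fio invocations of the
following shape (a sketch under assumed settings; the device path and
runtime are made up, and the exact job files aren't shown in the
thread):

  # e.g. the 4k randwrite, iodepth=256, 4-job row
  fio --name=randwrite-4k --filename=/dev/mapper/crypt-test \
      --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
      --iodepth=256 --numjobs=4 --time_based --runtime=60 \
      --group_reporting

The same fio output also carries the completion latency averages and
percentiles ("clat") used further down the thread.
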
Linus Torvalds Aug. 18, 2020, 9:02 p.m. UTC | #6
On Tue, Aug 18, 2020 at 1:40 PM John Dorminy <jdorminy@redhat.com> wrote:
>
>    The summary (for my FIO workloads focused on
> parallelism) is that offloading is useful for high IO depth random
> writes on SSDs, and for long sequential small writes on HDDs.

Do we have any non-microbenchmarks that might be somewhat
representative of something, and might be used to at least set a
default?

Or can we perhaps - even better - dynamically notice whether to offload or not?

I suspect that offloading is horrible for any latency situation,
particularly with any modern setup where the SSD is fast enough that
doing scheduling to another thread is noticeable.

After all, some people are working on polling IO, because the result
comes back so fast that taking the interrupt is unnecessary extra
work. Those people admittedly have faster disks than most of us, but
..

At least from a latency angle, maybe we could have the fairly common
case of an IO depth of 1 (because synchronous reads) not trigger it.

It looks like you only did throughput benchmarks (like pretty much
everybody always does, because latency benchmarks are a lot harder to
do well).

                 Linus
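
To make the "dynamically notice" idea concrete: the decision point in
dm-crypt is small.  Below is a hedged sketch of what it could look like;
the flag, field and helper names follow the dm-crypt code added by this
series, but the size/REQ_HIPRI heuristic and the function name are
purely illustrative and not anything that was merged.

  /*
   * Illustrative only: choose between inline crypto processing and the
   * per-device workqueue.  Modeled on dm-crypt's kcryptd_queue_crypt().
   */
  static void kcryptd_queue_crypt_sketch(struct dm_crypt_io *io)
  {
          struct crypt_config *cc = io->cc;
          struct bio *bio = io->base_bio;
          bool bypass;

          /* Explicit opt-out via the new table flags... */
          bypass = (bio_data_dir(bio) == READ &&
                    test_bit(DM_CRYPT_NO_READ_WORKQUEUE, &cc->flags)) ||
                   (bio_data_dir(bio) == WRITE &&
                    test_bit(DM_CRYPT_NO_WRITE_WORKQUEUE, &cc->flags));

          /*
           * ...or a hypothetical heuristic: small or polled
           * (latency-sensitive) bios skip the workqueue.
           */
          if (!bypass)
                  bypass = bio_sectors(bio) <= 8 || (bio->bi_opf & REQ_HIPRI);

          if (bypass && !in_irq()) {
                  /*
                   * Encrypt/decrypt in the submitting context.  (The
                   * merged code instead uses a tasklet when called from
                   * hardirq context, since the Crypto API may sleep.)
                   */
                  kcryptd_crypt(&io->work);
                  return;
          }

          /* Default: defer to the per-device crypt workqueue. */
          INIT_WORK(&io->work, kcryptd_crypt);
          queue_work(cc->crypt_queue, &io->work);
  }
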
Linus Torvalds Aug. 18, 2020, 9:22 p.m. UTC | #7
On Tue, Aug 18, 2020 at 2:12 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
>
> Additionally if one cares about latency

I think everybody really deep down cares about latency, they just
don't always know it, and the benchmarks are very seldom about it
because it's so much harder to measure.

> they will not use HDDs for the workflow and HDDs have much higher IO latency than CPU scheduling.

I think by now we can just say that anybody who uses HDD's doesn't care
about performance as a primary issue.

I don't think they are really interesting as a benchmark target - at
least from the standpoint of what the kernel should optimize for.

People have HDD's for legacy reasons or because they care much more
about capacity than performance.  Why should _we_ then worry about
performance that the user doesn't worry about?

I'm not saying we should penalize HDD's, but I don't think they are
things we should primarily care deeply about any more.

               Linus
John Dorminy Aug. 19, 2020, 4:25 a.m. UTC | #8
Your points are good. I don't know a good macrobenchmark at present,
but at least various latency numbers are easy to get out of fio.

I ran a similar set of tests on an Optane 900P with results below.
'clat' is, as fio reports, the completion latency, measured in usec.
'configuration' is [block size], [iodepth], [jobs]; picked to be a
varied selection that obtained excellent throughput from the drive.
The table reports average and 99th-percentile latencies as well as
throughput. It matches Ignat's report that large block sizes using the
new option can have worse latency and throughput on top-end drives,
although that result doesn't make any sense to me.

Happy to run some more there or elsewhere if there are suggestions.

devicetype      configuration   MB/s    clat mean   clat 99%
------------------------------------------------------------------
nvme base       1m,32,4         2259    59280        67634
crypt default   1m,32,4         2267    59050       182000
crypt no_w_wq   1m,32,4         1758    73954.54     84411
nvme base       64k,1024,1      2273    29500.92     30540
crypt default   64k,1024,1      2167    29518.89     50594
crypt no_w_wq   64k,1024,1      2056    31090.23     31327
nvme base       4k,128,4        2159      924.57      1106
crypt default   4k,128,4        1256     1663.67      3294
crypt no_w_wq   4k,128,4        1703     1165.69      1319

Ignat, how do these numbers match up to what you've been seeing?

-John


On Tue, Aug 18, 2020 at 5:23 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, Aug 18, 2020 at 2:12 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
> >
> > Additionally if one cares about latency
>
> I think everybody really deep down cares about latency, they just
> don't always know it, and the benchmarks are very seldom about it
> because it's so much harder to measure.
>
> > they will not use HDDs for the workflow and HDDs have much higher IO latency than CPU scheduling.
>
> I think by now we can just say that anybody who uses HDD's doesn't care
> about performance as a primary issue.
>
> I don't think they are really interesting as a benchmark target - at
> least from the standpoint of what the kernel should optimize for.
>
> People have HDD's for legacy reasons or because they care much more
> about capacity than performance.  Why should _we_ then worry about
> performance that the user doesn't worry about?
>
> I'm not saying we should penalize HDD's, but I don't think they are
> things we should primarily care deeply about any more.
>
>                Linus
>
Ignat Korchagin Aug. 19, 2020, 10:29 a.m. UTC | #9
Thank you John and Damien for the extensive measurements. I don't have
much to add, as my measurements are probably just a subset of Damien's
and repeat his results.

One interesting thing I noticed when experimenting with this: we
mostly talk about average throughput, but sometimes it is interesting
to see the instant values (measured over small time slices). For
example, for a 4k block size, qd=1, 50/50 randrw job on a dm-crypt
encrypted ramdisk with the ecb(cipher_null) cipher, which I just run
continuously in the terminal, I can see the instant throughput having a
somewhat bimodal distribution: it reliably jumps between ~120 MiB/s and
~80 MiB/s medians (the overall average throughput being ~100 MiB/s, of
course). This is for dm-crypt with workqueues. If I disable the
workqueues, the distribution of the instant throughput becomes
"normal".

Without looking into much detail, I wonder if HT has some side effects
on dm-crypt processing (I have it enabled), because it seems not all
"cores" are equal for dm-crypt even with the null cipher.

I might get my hands on an arm64 server soon and am curious to see how
dm-crypt and workqueues will compare there.

Regards,
Ignat


On Wed, Aug 19, 2020 at 8:10 AM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>
> John,
>
> On 2020/08/19 13:25, John Dorminy wrote:
> > Your points are good. I don't know a good macrobenchmark at present,
> > but at least various latency numbers are easy to get out of fio.
> >
> > I ran a similar set of tests on an Optane 900P with results below.
> > 'clat' is, as fio reports, the completion latency, measured in usec.
> > 'configuration' is [block size], [iodepth], [jobs]; picked to be a
> > varied selection that obtained excellent throughput from the drive.
> > Table reports average, and 99th percentile, latency times as well as
> > throughput. It matches Ignat's report that large block sizes using the
> > new option can have worse latency and throughput on top-end drives,
> > although that result doesn't make any sense to me.
> >
> > Happy to run some more there or elsewhere if there are suggestions.
> >
> > devicetype    configuration    MB/s    clat mean    clat 99%
> > ------------------------------------------------------------------
> > nvme base        1m,32,4     2259    59280        67634
> > crypt default    1m,32,4     2267    59050       182000
> > crypt no_w_wq    1m,32,4     1758    73954.54     84411
> > nvme base       64k,1024,1   2273    29500.92     30540
> > crypt default   64k,1024,1   2167    29518.89     50594
> > crypt no_w_wq   64k,1024,1   2056    31090.23     31327
> > nvme base        4k,128,4    2159      924.57      1106
> > crypt default    4k,128,4    1256     1663.67      3294
> > crypt no_w_wq    4k,128,4    1703     1165.69      1319
>
> I have been doing a lot of testing recently on dm-crypt, mostly for zoned
> storage, that is with write workqueue disabled, but also with regular disks to
> have something to compare to and verify my results. I confirm that I see similar
> changes in throughput/latency in my tests: disabling workqueues improves
> throughput for small IO sizes thanks to the lower latency (removed context
> switch overhead), but the benefits of disabling the workqueues become dubious
> for large IO sizes and deep queue depths. See the heat-map attached for more
> results (nullblk device used for these measurements with 1 job per CPU).
>
> I also pushed things further in my tests, as I primarily focused on enterprise
> systems with a large number of storage devices being used with a single server.
> To flatten things out and avoid any performance limitations due to the storage
> devices, PCIe and/or HBA bus speed and memory bus speed, I ended up performing
> lots of tests using nullblk with different settings:
>
> 1) SSD-like multiqueue setting with the "none" scheduler, with irq_mode=0
> (immediate completion in submission context) and irq_mode=1 for softirq
> completion (different completion context than submission).
> 2) HDD-like single queue with mq-deadline as the scheduler, and the different
> irq_mode settings.
>
> I also played with CPU assignments for the fio jobs and tried various things.
>
> My observations are as follows, in no particular order:
>
> 1) Maximum throughput clearly depends directly on the number of CPUs involved
> in the crypto work. The crypto acceleration is limited per core, and so the
> number of issuer contexts (for writes) and/or completion contexts (for reads)
> almost directly determines maximum performance with workqueues disabled. I
> measured about 1.4GB/s at best on my system with a single writer 128KB/QD=4.
>
> 2) For a multi-drive setup with IO issuers limited to a small set of CPUs,
> performance does not scale with the number of disks, as the crypto engine speed
> of the CPUs being used is the limiting factor: both write encryption and read
> decryption happen on that set of CPUs, regardless of the other CPUs' load.
>
> 3) For single-queue devices, write performance scales badly with the number of
> CPUs used for IO issuing: the one CPU that runs the device queue to dispatch
> commands ends up doing a lot of crypto work for requests queued through other
> CPUs too.
>
> 4) On a very busy system with a very large number of disks and CPUs used for
> IOs, the total throughput I get is very close for all settings with workqueues
> enabled and disabled, about 50GB/s total on my dual socket Xeon system. There
> was a small advantage for the none scheduler/multiqueue setting that gave up to
> 56GB/s with workqueues on and 47GB/s with workqueues off. The single
> queue/mq-deadline case gave 51 GB/s and 48 GB/s with workqueues on/off.
>
> 5) For the tests with the large number of drives and CPUs, things got
> interesting with the average latency: I saw about the same average with
> workqueues on and off. But the p99 latency was about 3 times lower with
> workqueues off than workqueues on. When all CPUs are busy, reducing overhead by
> avoiding additional context switches clearly helps.
>
> 6) With an arguably more realistic workload of 66% reads and 34% writes (read
> size is 64KB/1MB with a 60%/40% ratio and write size is fixed at 1MB), I ended
> up with higher total throughput with workqueues disabled (44GB/s) vs enabled
> (38GB/s). Average write latency was also 30% lower with workqueues disabled
> without any significant change to the average read latency.
>
> From all these tests, I am currently considering that for a large system with
> lots of devices, disabling workqueues is a win, as long as IO issuers are not
> limited to a small set of CPUs.
>
> The benefits of disabling workqueues for a desktop-like system or a server
> system with one (or very few) super-fast drives are much less clear in my
> opinion. Average and p99 latency are generally better with workqueues off, but
> total throughput may significantly suffer if only a small number of IO contexts
> are involved, that is, a small number of CPUs participate in the crypto
> processing. Then crypto hardware speed dictates the results and using workqueues
> to get parallelism between more CPU cores can give better throughput.
>
> That said, I am thinking that from all this, we can extract some hints to
> automate decision for using workqueues or not:
> 1) Small IOs (e.g. 4K) would probably benefit from having the workqueues
> disabled, especially for 4Kn storage devices, as such a request would be
> processed as a single block with a single crypto API call.
> 2) It may be good to process any BIO marked with REQ_HIPRI (polling BIO) without
> any workqueue, to reduce latency, as intended by the caller.
> 3) We may want to have read-ahead reads use workqueues, especially for single
> queue devices (HDDs) to avoid increasing latency for other reads completing
> together with these read-ahead requests.
>
> In the end, I am still scratching my head trying to figure out what the best
> default setup may be.
>
> Best regards.
>
>
> --
> Damien Le Moal
> Western Digital Research