[0/4] dm: Introduce dm-qcow2 driver to attach QCOW2 files as block device

Message ID: 164846619932.251310.3668540533992131988.stgit@pro

Message

Kirill Tkhai March 28, 2022, 11:18 a.m. UTC
This patchset adds a new driver that allows attaching QCOW2 files
as block devices. The idea is to implement in the kernel only the
features that affect runtime IO performance (IO request processing).
Maintenance operations are processed synchronously in userspace
while the device is suspended.

Userspace is only allowed to perform operations that never modify
the virtual disk's data; it may only modify the QCOW2 file metadata
that describes that data. Examples of allowed operations are
snapshot creation and resize.

The userspace part is handled by already existing tools (qemu-img).

For instance, snapshot creation on an attached dm-qcow2 device looks like:

# dmsetup suspend $device
# qemu-img snapshot -c <snapshot_name> $device.qcow2
# dmsetup resume $device

1) Suspend flushes all pending IO and related metadata to the file,
   leaving the file in consistent QCOW2 format.
   The driver's .postsuspend throws out all of the image's cached metadata.
2) qemu-img creates the snapshot: it changes/moves metadata inside the QCOW2 file.
3) The driver's .preresume reads the new version of the metadata
   from the file (1 page is required), and the device is ready
   to continue handling IO requests.
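
Resize follows the same pattern. A hedged sketch, assuming qemu-img grows
the image and the dm table is then reloaded with the new length (the exact
table line depends on the dm-qcow2 target syntax introduced in patch [3]):

# dmsetup suspend $device
# qemu-img resize $device.qcow2 20G
# dmsetup reload $device --table "0 <new_size_in_sectors> qcow2 <target args>"   # hypothetical table line
# dmsetup resume $device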

These examples show how the device-mapper infrastructure allows
implementing drivers that follow this kernel/userspace demarcation.
Thus, the driver takes advantage of device-mapper instead of
implementing its own suspend/resume engine.

The following fio test was used to measure performance:

# fio --name=test --ioengine=libaio --direct=1 --bs=$bs --filename=$dev
      --readwrite=$rw --runtime=60 --numjobs=2 --iodepth=8

The collected results consist of both the fio measurement and the
system load taken from /proc/loadavg. Since the minimum loadavg
period is 60 seconds, fio's runtime is 60 seconds too.
The table below shows the averages of 5 runs (IO/loadavg is also
the average of the per-run IO/loadavg values).
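
A minimal sketch of how one such run might be collected (the actual
harness used for these numbers is not part of this series; $bs, $rw and
$dev are assumed to be set as in the fio command above):

  for i in $(seq 5); do
      fio --name=test --ioengine=libaio --direct=1 --bs=$bs --filename=$dev \
          --readwrite=$rw --runtime=60 --numjobs=2 --iodepth=8 &
      sleep 60; awk '{print $1}' /proc/loadavg    # 1-minute load average of this run
      wait
  done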

----+-------+------------------------------+---------------------------------+--------------------------+
    |       |    qemu-nbd (native aio)     |             dm-qcow2            |          diff, %         |
----+-------+------------------------------+---------------------------------+--------------------------+
bs  |  RW   | IO,MiB/s  loadavg  IO/loadavg|  IO,MiB/s   loadavg   IO/loadavg|IO     loadavg  IO/loadavg|
----+-------+------------------------------+---------------------------------+--------------------------+
4K  | READ  |  279       1.986     147     |  512        2.088     248       |+83.7    +5.1     +68.4   |
4K  | WRITE |  242       2.31      105     |  770        2.172     357       |+217.9   -5.9     +239.7  |
----+-------+------------------------------+---------------------------------+--------------------------+
64K | READ  |  1199      1.794     691     |  1218       1.118     1217      |+1.6     -37.7    +76     |
64K | WRITE |  946       1.084     877     |  1003       0.466     2144      |+6.1     -57      +144.5  |
----+-------+------------------------------+---------------------------------+--------------------------+
512K| READ  |  1741      1.142     1526    |  2196       0.546     4197      |+26.1    -52.2    +175.1  |
512K| WRITE |  1016      1.084     941     |  993        0.306     3267      |-2.2     -71.7    +246.9  |
----+-------+------------------------------+---------------------------------+--------------------------+
1M  | READ  |  1793      1.174     1542    |  2373       0.566     4384      |+32.4    -51.8    +184.2  |
1M  | WRITE |  1037      0.894     1165    |  1068       0.892     1196      |+2.9     -0.2     +2.7    |
----+-------+------------------------------+---------------------------------+--------------------------+
2M  | READ  |  1784      1.084     1654    |  2431       0.788     3090      |+36.3    -27.3    +86.8   |
2M  | WRITE |  1027      0.878     1172    |  1063       0.878     1212      |+3.6     0        +3.4    |
----+-------+------------------------------+---------------------------------+--------------------------+
(NBD attach command: qemu-nbd -c $dev --aio=native --nocache file.qcow2)

As the diff column shows, the dm-qcow2 driver has the better
throughput (the only exception is 512K WRITE) and the lower
loadavg (the only exception is 4K READ). The density of dm-qcow2
(IO per unit of load) is significantly better.

(Note that the tests are made on preallocated images, where the
 whole L2 table is allocated, since QEMU has a lazy L2 allocation
 feature which is not implemented in dm-qcow2 yet.)
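
For reference, one way such a preallocated image might be created is
qemu-img's metadata preallocation mode (a sketch; the size is arbitrary,
and preallocation=falloc/full additionally reserves the data clusters):

# qemu-img create -f qcow2 -o preallocation=metadata image.qcow2 10G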

So, one of the reasons for implementing the driver is to provide
better performance and density than qemu-nbd does. The second
reason is the possibility of unifying the virtual disk format for
VMs and containers, so that the same disk image can be used to
start either of them.

This patchset consists of 4 patches. Patches [1-2] make small
changes to the dm code: [1] exports a function, while [2] makes
.io_hints be called for drivers that do not have .iterate_devices.
Patch [3] adds dm-qcow2, while patch [4] adds a userspace
wrapper for attaching such devices.
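
A hedged sketch of attaching a device with that wrapper (the subcommand
and device name below are hypothetical; the real syntax is defined by
scripts/qcow2-dm.sh in patch [4]):

# scripts/qcow2-dm.sh create qcow2_dev /path/to/image.qcow2   # hypothetical invocation
# blockdev --getsize64 /dev/mapper/qcow2_dev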

---

Kirill Tkhai (4):
      dm: Export dm_complete_request()
      dm: Process .io_hints for drivers not having underlying devices
      dm-qcow2: Introduce driver to create block devices over QCOW2 files
      dm-qcow2: Add helper for working with dm-qcow2 devices


 drivers/md/Kconfig           |   17 +
 drivers/md/Makefile          |    2 +
 drivers/md/dm-qcow2-cmd.c    |  383 +++
 drivers/md/dm-qcow2-map.c    | 4256 ++++++++++++++++++++++++++++++++++
 drivers/md/dm-qcow2-target.c | 1026 ++++++++
 drivers/md/dm-qcow2.h        |  368 +++
 drivers/md/dm-rq.c           |    3 +-
 drivers/md/dm-rq.h           |    2 +
 drivers/md/dm-table.c        |    5 +-
 scripts/qcow2-dm.sh          |  249 ++
 10 files changed, 6309 insertions(+), 2 deletions(-)
 create mode 100644 drivers/md/dm-qcow2-cmd.c
 create mode 100644 drivers/md/dm-qcow2-map.c
 create mode 100644 drivers/md/dm-qcow2-target.c
 create mode 100644 drivers/md/dm-qcow2.h
 create mode 100755 scripts/qcow2-dm.sh

--
Signed-off-by: Kirill Tkhai <kirill.tkhai@openvz.org>


Comments

Christoph Hellwig March 29, 2022, 1:08 p.m. UTC | #1
On Mon, Mar 28, 2022 at 02:18:16PM +0300, Kirill Tkhai wrote:
> This patchset adds a new driver that allows attaching QCOW2 files
> as block devices. The idea is to implement in the kernel only the
> features that affect runtime IO performance (IO request processing).

From a quick look it seems like this should be a block driver
just like the loop driver and not use device mapper.  Why would
you use device mapper to basically reimplement a fancy loop driver
to start with?

> Maintenance operations are processed synchronously in userspace
> while the device is suspended.
> 
> Userspace is only allowed to perform operations that never modify
> the virtual disk's data; it may only modify the QCOW2 file metadata
> that describes that data. Examples of allowed operations are
> snapshot creation and resize.

And this sounds like a pretty fragile design.  It basically requires
both userspace and the kernel driver to access metadata on disk, which
sounds rather dangerous.

> These examples show how the device-mapper infrastructure allows
> implementing drivers that follow this kernel/userspace demarcation.
> Thus, the driver takes advantage of device-mapper instead of
> implementing its own suspend/resume engine.

What do you need more than a queue freeze?

Kirill Tkhai March 29, 2022, 3:14 p.m. UTC | #2
On 29.03.2022 16:08, Christoph Hellwig wrote:
> On Mon, Mar 28, 2022 at 02:18:16PM +0300, Kirill Tkhai wrote:
>> This patchset adds a new driver that allows attaching QCOW2 files
>> as block devices. The idea is to implement in the kernel only the
>> features that affect runtime IO performance (IO request processing).
> 
> From a quick look it seems like this should be a block driver
> just like the loop driver and not use device mapper.  Why would
> you use device mapper to basically reimplement a fancy loop driver
> to start with?

This is a driver for containers and virtual machines. One of the basic
features for them is migration and backups. There are several drivers for
that which are already implemented in device-mapper, for example dm-era
and dm-snap. Instead of implementing such functionality in the QCOW2 driver
once again, the sane approach is to use the already implemented drivers.
The modular approach is better for maintenance and error elimination simply
because there is less code.

1) A device-mapper based driver does not require the migration and backup
   devices to be built into a stack for the whole device lifetime:

   a) Normal operation, almost 100% of the time: there is only /dev/mapper/qcow2_dev.
   b) Migration: /dev/mapper/qcow2_dev is reloaded with a migration target,
      which points to a new qcow2_dev.real:

      /dev/mapper/qcow2_dev          [migration driver]
        /dev/mapper/qcow2_dev.real   [dm-qcow2 driver]

   After migration is completed, we reload /dev/mapper/qcow2_dev back
   to use the dm-qcow2 driver (see the dmsetup sketch after this list).
   So, there are no excess dm layers during normal operation.

2) In case the driver is not device-mapper based, it is necessary to keep
   the stack built for the whole device lifetime, since it is impossible
   to reload a bare block driver with a dm-based driver on demand:

      /dev/mapper/qcow2_dev          [migration driver]
        /dev/qcow2_dev.real          [bare qcow2 driver]

   So, we would have an excess dm layer during the whole device lifetime.
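
A hedged sketch of the reload cycle from case 1) in dmsetup terms (the
table lines are illustrative only; the real arguments depend on the
dm-qcow2 and migration targets, and qcow2_dev.real is assumed to have
been created beforehand):

# dmsetup suspend qcow2_dev
# dmsetup reload qcow2_dev --table "0 <sectors> <migration-target> /dev/mapper/qcow2_dev.real ..."   # hypothetical table line
# dmsetup resume qcow2_dev
  ... migration runs ...
# dmsetup suspend qcow2_dev
# dmsetup reload qcow2_dev --table "0 <sectors> qcow2 <dm-qcow2 args>"   # back to dm-qcow2
# dmsetup resume qcow2_dev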

Our performance tests show that a single dm layer may cause up to a 10%
performance decrease on NVMe, so the goal is to eliminate such a drop.
Also, general reasoning says that an excess layer is the wrong way to go.

Another reason is previous experience implementing file-backed block drivers.
We had the ploop driver before. The ploop format is much simpler than the
QCOW2 format, but it is about 17K lines of code, while the dm-qcow2 driver
took about 6K lines. Device mapper allows avoiding a lot of code: the only
thing you need is to implement proper .ctr and .dtr functions, while the rest
of the configuration actions are done by a simple device-mapper reload.

>> Maintenance operations are processed synchronously in userspace
>> while the device is suspended.
>>
>> Userspace is only allowed to perform operations that never modify
>> the virtual disk's data; it may only modify the QCOW2 file metadata
>> that describes that data. Examples of allowed operations are
>> snapshot creation and resize.
> 
> And this sounds like a pretty fragile design.  It basically requires
> both userspace and the kernel driver to access metadata on disk, which
> sounds rather dangerous.

I don't think so. Device-mapper already allows replacing one device driver
with another. Nobody blames dm-linear for the fact that it may be reloaded
to point to the wrong partition, though it can be. Nobody blames loop for
the fact that someone in userspace may corrupt its blocks and break the
filesystem on that device.

The thing is that the kernel and userspace never access the file at the
same time. If maintenance actions can be handled in userspace, they must
be, since that reduces the amount of kernel code.
 
>> These examples show how the device-mapper infrastructure allows
>> implementing drivers that follow this kernel/userspace demarcation.
>> Thus, the driver takes advantage of device-mapper instead of
>> implementing its own suspend/resume engine.
> 
> What do you need more than a queue freeze?

Theoretically, I could, provided it flushes all pending requests. But this
would increase the driver code significantly, since it would not be possible
to use the reload mechanism, and there are other problems, such as the
performance issues I wrote about above. So, this approach looks
significantly worse.

Kirill
