[v4,00/25] block layer: split block APIs in global state and I/O

Message ID	20211025101735.2060852-1-eesposit@redhat.com (mailing list archive)
Headers	show Return-Path: <SRS0=rKP4=PN=nongnu.org=qemu-devel-bounces+qemu-devel=archiver.kernel.org@kernel.org> DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 37E9160F46 From: Emanuele Giuseppe Esposito <eesposit@redhat.com> To: qemu-block@nongnu.org Subject: [PATCH v4 00/25] block layer: split block APIs in global state and I/O Date: Mon, 25 Oct 2021 06:17:10 -0400 Message-Id: <20211025101735.2060852-1-eesposit@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset="US-ASCII" Received-SPF: pass client-ip=216.205.24.124; envelope-from=eesposit@redhat.com; helo=us-smtp-delivery-124.mimecast.com X-Spam_score_int: -27 X-Spam_score: -2.8 X-Spam_bar: -- X-Spam_report: (-2.8 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Cc: Kevin Wolf <kwolf@redhat.com>, Fam Zheng <fam@euphon.net>, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>, =?utf-8?q?Daniel_P?= =?utf-8?q?=2E_Berrang=C3=A9?= <berrange@redhat.com>, Eduardo Habkost <ehabkost@redhat.com>, Juan Quintela <quintela@redhat.com>, qemu-devel@nongnu.org, John Snow <jsnow@redhat.com>, Emanuele Giuseppe Esposito <eesposit@redhat.com>, Richard Henderson <richard.henderson@linaro.org>, Markus Armbruster <armbru@redhat.com>, "Dr. David Alan Gilbert" <dgilbert@redhat.com>, Hanna Reitz <hreitz@redhat.com>, Stefan Hajnoczi <stefanha@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, Eric Blake <eblake@redhat.com> Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org>
Series	block layer: split block APIs in global state and I/O \| expand [v4,00/25] block layer: split block APIs in global state and I/O [v4,01/25] main-loop.h: introduce qemu_in_main_thread() [v4,02/25] include/block/block: split header into I/O and global state API [v4,03/25] assertions for block global state API [v4,04/25] include/sysemu/block-backend: split header into I/O and global state (GS) API [v4,05/25] block/block-backend.c: assertions for block-backend [v4,06/25] include/block/block_int: split header into I/O and global state API [v4,07/25] assertions for block_int global state API [v4,08/25] block: introduce assert_bdrv_graph_writable [v4,09/25] include/block/blockjob_int.h: split header into I/O and GS API [v4,10/25] assertions for blockjob_int.h [v4,11/25] include/block/blockjob.h: global state API [v4,12/25] assertions for blockob.h global state API [v4,13/25] include/sysemu/blockdev.h: move drive_add and inline drive_def [v4,14/25] include/systemu/blockdev.h: global state API [v4,15/25] assertions for blockdev.h global state API [v4,16/25] include/block/snapshot: global state API + assertions [v4,17/25] block/copy-before-write.h: global state API + assertions [v4,18/25] block/coroutines: I/O API [v4,19/25] block_int-common.h: split function pointers in BlockDriver [v4,20/25] block_int-common.h: assertion in the callers of BlockDriver function pointers [v4,21/25] block_int-common.h: split function pointers in BdrvChildClass [v4,22/25] block_int-common.h: assertions in the callers of BdrvChildClass function pointers [v4,23/25] block-backend-common.h: split function pointers in BlockDevOps [v4,24/25] job.h: split function pointers in JobDriver [v4,25/25] job.h: assertions in the callers of JobDriver funcion pointers

Emanuele Giuseppe Esposito Oct. 25, 2021, 10:17 a.m. UTC

Currently, block layer APIs like block-backend.h contain a mix of
functions that are either running in the main loop and under the
BQL, or are thread-safe functions and run in iothreads performing I/O.
The functions running under BQL also take care of modifying the
block graph, by using drain and/or aio_context_acquire/release.
This makes it very confusing to understand where each function
runs, and what assumptions it provided with regards to thread
safety.

We call the functions running under BQL "global state (GS) API", and
distinguish them from the thread-safe "I/O API".

The aim of this series is to split the relevant block headers in
global state and I/O sub-headers. The division will be done in
this way:
header.h will be split in header-global-state.h, header-io.h and
header-common.h. The latter will just contain the data structures
needed by header-global-state and header-io, and common helpers
that are neither in GS nor in I/O. header.h will remain for
legacy and to avoid changing all includes in all QEMU c files,
but will only include the two new headers. No function shall be
added in header.c .
Once we split all relevant headers, it will be much easier to see what
uses the AioContext lock and remove it, which is the overall main
goal of this and other series that I posted/will post.

In addition to splitting the relevant headers shown in this series,
it is also very helpful splitting the function pointers in some
block structures, to understand what runs under AioContext lock and
what doesn't. This is what patches 19-25 do.

Each function in the GS API will have an assertion, checking
that it is always running under BQL.
I/O functions are instead thread safe (or so should be), meaning
that they *can* run under BQL, but also in an iothread in another
AioContext. Therefore they do not provide any assertion, and
need to be audited manually to verify the correctness.

Adding assetions has helped finding 2 bugs already, as shown in
my series "Migration: fix missing iothread locking".

Tested this series by running unit tests, qemu-iotests and qtests
(x86_64).
Some functions in the GS API are used everywhere but not
properly tested. Therefore their assertion is never actually run in
the tests, so despite my very careful auditing, it is not impossible
to exclude that some will trigger while actually using QEMU.

Patch 1 introduces qemu_in_main_thread(), the function used in
all assertions. This had to be introduced otherwise all unit tests
would fail, since they run in the main loop but use the code in
stubs/iothread.c
Patches 2-14 and 19-25 (with the exception of patch 9, that is an additional
assert) are all structured in the same way: first we split the header
and in the next (even) patch we add assertions.
The rest of the patches ontain either both assertions and split,
or have no assertions.

Next steps once this get reviewed:
1) audit the GS API and replace the AioContext lock with drains,
or remove them when not necessary (requires further discussion).
2) [optional as it should be already the case] audit the I/O API
and check that thread safety is guaranteed

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
v3 -> v4:
* blockdev.h (patch 14): blockdev_mark_auto_del, blockdev_auto_del
  and blk_legacy_dinfo as GS API.
* add copyright header to block.h, block-io.h and block-global-state.h
* rebase on current master (c5b2f55981)

v2 -> v3:
* rename "graph API" into "global state API"
* change order of patches, block.h comes before block-backend.h
* change GS and I/O comment headers to avoid redundancy, all other
  headers refer to block-global-state.h and block-io.h
* fix typo on GS and I/O headers
* use assert instead of g_assert
* move bdrv_pwrite_sync, bdrv_block_status and bdrv_co_copy_range_{from/to}
  to the I/O API
* change assert_bdrv_graph_writable implementation, since we need
  to introduce additional drains
* remove transactions API split
* add preparation patch for blockdev.h (patch 13)
* backup-top -> copy-on-write
* change I/O comment in job.h into a better meaningful explanation
* fix all warnings given by checkpatch, mostly due to /* */ to be
  split in separate lines
* rebase on current master (c09124dcb8), and split the following new functions:
	blk_replace_bs (I/O)
	bdrv_bsc_is_data (I/O)
	bdrv_bsc_invalidate_range (I/O)
	bdrv_bsc_fill (I/O)
	bdrv_new_open_driver_opts (GS)
	blk_get_max_hw_iov (I/O)
  they are all added in patches 4 and 6.

v1 -> v2:
* remove the iothread locking bug fix, and send it as separate patch
* rename graph API -> global state API
* better documented patch 1 (qemu_in_main_thread)
* add and split all other block layer headers
* fix warnings given by checkpatch on multiline comments

Emanuele Giuseppe Esposito (25):
  main-loop.h: introduce qemu_in_main_thread()
  include/block/block: split header into I/O and global state API
  assertions for block global state API
  include/sysemu/block-backend: split header into I/O and global state
    (GS) API
  block/block-backend.c: assertions for block-backend
  include/block/block_int: split header into I/O and global state API
  assertions for block_int global state API
  block: introduce assert_bdrv_graph_writable
  include/block/blockjob_int.h: split header into I/O and GS API
  assertions for blockjob_int.h
  include/block/blockjob.h: global state API
  assertions for blockob.h global state API
  include/sysemu/blockdev.h: move drive_add and inline drive_def
  include/systemu/blockdev.h: global state API
  assertions for blockdev.h global state API
  include/block/snapshot: global state API + assertions
  block/copy-before-write.h: global state API + assertions
  block/coroutines: I/O API
  block_int-common.h: split function pointers in BlockDriver
  block_int-common.h: assertion in the callers of BlockDriver function
    pointers
  block_int-common.h: split function pointers in BdrvChildClass
  block_int-common.h: assertions in the callers of BdrvChildClass
    function pointers
  block-backend-common.h: split function pointers in BlockDevOps
  job.h: split function pointers in JobDriver
  job.h: assertions in the callers of JobDriver funcion pointers

 block.c                                     |  188 ++-
 block/backup.c                              |    1 +
 block/block-backend.c                       |  105 +-
 block/commit.c                              |    4 +
 block/copy-before-write.c                   |    2 +
 block/copy-before-write.h                   |    7 +
 block/coroutines.h                          |    6 +
 block/dirty-bitmap.c                        |    1 +
 block/io.c                                  |   37 +
 block/meson.build                           |    7 +-
 block/mirror.c                              |    4 +
 block/monitor/bitmap-qmp-cmds.c             |    6 +
 block/monitor/block-hmp-cmds.c              |    2 +-
 block/snapshot.c                            |   28 +
 block/stream.c                              |    2 +
 blockdev.c                                  |   55 +-
 blockjob.c                                  |   14 +
 include/block/block-common.h                |  389 +++++
 include/block/block-global-state.h          |  286 ++++
 include/block/block-io.h                    |  306 ++++
 include/block/block.h                       |  878 +----------
 include/block/block_int-common.h            | 1193 +++++++++++++++
 include/block/block_int-global-state.h      |  327 ++++
 include/block/block_int-io.h                |  163 ++
 include/block/block_int.h                   | 1478 +------------------
 include/block/blockjob.h                    |    9 +
 include/block/blockjob_int.h                |   28 +
 include/block/snapshot.h                    |   13 +-
 include/qemu/job.h                          |   16 +
 include/qemu/main-loop.h                    |   13 +
 include/sysemu/block-backend-common.h       |   92 ++
 include/sysemu/block-backend-global-state.h |  122 ++
 include/sysemu/block-backend-io.h           |  139 ++
 include/sysemu/block-backend.h              |  269 +---
 include/sysemu/blockdev.h                   |   24 +-
 job.c                                       |    9 +
 migration/savevm.c                          |    2 +
 softmmu/cpus.c                              |    5 +
 softmmu/qdev-monitor.c                      |    2 +
 softmmu/vl.c                                |   25 +-
 stubs/iothread-lock.c                       |    5 +
 41 files changed, 3619 insertions(+), 2643 deletions(-)
 create mode 100644 include/block/block-common.h
 create mode 100644 include/block/block-global-state.h
 create mode 100644 include/block/block-io.h
 create mode 100644 include/block/block_int-common.h
 create mode 100644 include/block/block_int-global-state.h
 create mode 100644 include/block/block_int-io.h
 create mode 100644 include/sysemu/block-backend-common.h
 create mode 100644 include/sysemu/block-backend-global-state.h
 create mode 100644 include/sysemu/block-backend-io.h

Philippe Mathieu-Daudé Oct. 25, 2021, 2:09 p.m. UTC | #1

On 10/25/21 12:17, Emanuele Giuseppe Esposito wrote:
[...]

> Each function in the GS API will have an assertion, checking
> that it is always running under BQL.
> I/O functions are instead thread safe (or so should be), meaning
> that they *can* run under BQL, but also in an iothread in another
> AioContext. Therefore they do not provide any assertion, and
> need to be audited manually to verify the correctness.
> 
> Adding assetions has helped finding 2 bugs already, as shown in
> my series "Migration: fix missing iothread locking".
> 
> Tested this series by running unit tests, qemu-iotests and qtests
> (x86_64).
> Some functions in the GS API are used everywhere but not
> properly tested. Therefore their assertion is never actually run in
> the tests, so despite my very careful auditing, it is not impossible
> to exclude that some will trigger while actually using QEMU.
> 
> Patch 1 introduces qemu_in_main_thread(), the function used in
> all assertions. This had to be introduced otherwise all unit tests
> would fail, since they run in the main loop but use the code in
> stubs/iothread.c
> Patches 2-14 and 19-25 (with the exception of patch 9, that is an additional
> assert) are all structured in the same way: first we split the header
> and in the next (even) patch we add assertions.
> The rest of the patches ontain either both assertions and split,
> or have no assertions.

This seems a lot of assertions added in hot-path code.

Does it makes sense to use a BLOCK_ASSERT() macro instead,
only expanded when configure with --enable-debug?

Stefan Hajnoczi Oct. 28, 2021, 3:45 p.m. UTC | #2

On Mon, Oct 25, 2021 at 04:09:41PM +0200, Philippe Mathieu-Daudé wrote:
> On 10/25/21 12:17, Emanuele Giuseppe Esposito wrote:
> [...]
> 
> > Each function in the GS API will have an assertion, checking
> > that it is always running under BQL.
> > I/O functions are instead thread safe (or so should be), meaning
> > that they *can* run under BQL, but also in an iothread in another
> > AioContext. Therefore they do not provide any assertion, and
> > need to be audited manually to verify the correctness.
> > 
> > Adding assetions has helped finding 2 bugs already, as shown in
> > my series "Migration: fix missing iothread locking".
> > 
> > Tested this series by running unit tests, qemu-iotests and qtests
> > (x86_64).
> > Some functions in the GS API are used everywhere but not
> > properly tested. Therefore their assertion is never actually run in
> > the tests, so despite my very careful auditing, it is not impossible
> > to exclude that some will trigger while actually using QEMU.
> > 
> > Patch 1 introduces qemu_in_main_thread(), the function used in
> > all assertions. This had to be introduced otherwise all unit tests
> > would fail, since they run in the main loop but use the code in
> > stubs/iothread.c
> > Patches 2-14 and 19-25 (with the exception of patch 9, that is an additional
> > assert) are all structured in the same way: first we split the header
> > and in the next (even) patch we add assertions.
> > The rest of the patches ontain either both assertions and split,
> > or have no assertions.
> 
> This seems a lot of assertions added in hot-path code.
> 
> Does it makes sense to use a BLOCK_ASSERT() macro instead,
> only expanded when configure with --enable-debug?

I think the assertions are only in the slow path (functions that must be
run with the BQL held from the main thread). The I/O request code path
does not have new assertions.

Stefan

Stefan Hajnoczi Oct. 28, 2021, 3:49 p.m. UTC | #3

On Mon, Oct 25, 2021 at 06:17:10AM -0400, Emanuele Giuseppe Esposito wrote:
> Currently, block layer APIs like block-backend.h contain a mix of
> functions that are either running in the main loop and under the
> BQL, or are thread-safe functions and run in iothreads performing I/O.
> The functions running under BQL also take care of modifying the
> block graph, by using drain and/or aio_context_acquire/release.
> This makes it very confusing to understand where each function
> runs, and what assumptions it provided with regards to thread
> safety.
> 
> We call the functions running under BQL "global state (GS) API", and
> distinguish them from the thread-safe "I/O API".
> 
> The aim of this series is to split the relevant block headers in
> global state and I/O sub-headers. The division will be done in

Kevin and Hanna,
Does one of you want to review and merge this? It affects the entire
block layer and your input would be valuable.

Thanks,
Stefan

> this way:
> header.h will be split in header-global-state.h, header-io.h and
> header-common.h. The latter will just contain the data structures
> needed by header-global-state and header-io, and common helpers
> that are neither in GS nor in I/O. header.h will remain for
> legacy and to avoid changing all includes in all QEMU c files,
> but will only include the two new headers. No function shall be
> added in header.c .
> Once we split all relevant headers, it will be much easier to see what
> uses the AioContext lock and remove it, which is the overall main
> goal of this and other series that I posted/will post.
> 
> In addition to splitting the relevant headers shown in this series,
> it is also very helpful splitting the function pointers in some
> block structures, to understand what runs under AioContext lock and
> what doesn't. This is what patches 19-25 do.
> 
> Each function in the GS API will have an assertion, checking
> that it is always running under BQL.
> I/O functions are instead thread safe (or so should be), meaning
> that they *can* run under BQL, but also in an iothread in another
> AioContext. Therefore they do not provide any assertion, and
> need to be audited manually to verify the correctness.
> 
> Adding assetions has helped finding 2 bugs already, as shown in
> my series "Migration: fix missing iothread locking".
> 
> Tested this series by running unit tests, qemu-iotests and qtests
> (x86_64).
> Some functions in the GS API are used everywhere but not
> properly tested. Therefore their assertion is never actually run in
> the tests, so despite my very careful auditing, it is not impossible
> to exclude that some will trigger while actually using QEMU.
> 
> Patch 1 introduces qemu_in_main_thread(), the function used in
> all assertions. This had to be introduced otherwise all unit tests
> would fail, since they run in the main loop but use the code in
> stubs/iothread.c
> Patches 2-14 and 19-25 (with the exception of patch 9, that is an additional
> assert) are all structured in the same way: first we split the header
> and in the next (even) patch we add assertions.
> The rest of the patches ontain either both assertions and split,
> or have no assertions.
> 
> Next steps once this get reviewed:
> 1) audit the GS API and replace the AioContext lock with drains,
> or remove them when not necessary (requires further discussion).
> 2) [optional as it should be already the case] audit the I/O API
> and check that thread safety is guaranteed
> 
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
> v3 -> v4:
> * blockdev.h (patch 14): blockdev_mark_auto_del, blockdev_auto_del
>   and blk_legacy_dinfo as GS API.
> * add copyright header to block.h, block-io.h and block-global-state.h
> * rebase on current master (c5b2f55981)
> 
> v2 -> v3:
> * rename "graph API" into "global state API"
> * change order of patches, block.h comes before block-backend.h
> * change GS and I/O comment headers to avoid redundancy, all other
>   headers refer to block-global-state.h and block-io.h
> * fix typo on GS and I/O headers
> * use assert instead of g_assert
> * move bdrv_pwrite_sync, bdrv_block_status and bdrv_co_copy_range_{from/to}
>   to the I/O API
> * change assert_bdrv_graph_writable implementation, since we need
>   to introduce additional drains
> * remove transactions API split
> * add preparation patch for blockdev.h (patch 13)
> * backup-top -> copy-on-write
> * change I/O comment in job.h into a better meaningful explanation
> * fix all warnings given by checkpatch, mostly due to /* */ to be
>   split in separate lines
> * rebase on current master (c09124dcb8), and split the following new functions:
> 	blk_replace_bs (I/O)
> 	bdrv_bsc_is_data (I/O)
> 	bdrv_bsc_invalidate_range (I/O)
> 	bdrv_bsc_fill (I/O)
> 	bdrv_new_open_driver_opts (GS)
> 	blk_get_max_hw_iov (I/O)
>   they are all added in patches 4 and 6.
> 
> v1 -> v2:
> * remove the iothread locking bug fix, and send it as separate patch
> * rename graph API -> global state API
> * better documented patch 1 (qemu_in_main_thread)
> * add and split all other block layer headers
> * fix warnings given by checkpatch on multiline comments
> 
> Emanuele Giuseppe Esposito (25):
>   main-loop.h: introduce qemu_in_main_thread()
>   include/block/block: split header into I/O and global state API
>   assertions for block global state API
>   include/sysemu/block-backend: split header into I/O and global state
>     (GS) API
>   block/block-backend.c: assertions for block-backend
>   include/block/block_int: split header into I/O and global state API
>   assertions for block_int global state API
>   block: introduce assert_bdrv_graph_writable
>   include/block/blockjob_int.h: split header into I/O and GS API
>   assertions for blockjob_int.h
>   include/block/blockjob.h: global state API
>   assertions for blockob.h global state API
>   include/sysemu/blockdev.h: move drive_add and inline drive_def
>   include/systemu/blockdev.h: global state API
>   assertions for blockdev.h global state API
>   include/block/snapshot: global state API + assertions
>   block/copy-before-write.h: global state API + assertions
>   block/coroutines: I/O API
>   block_int-common.h: split function pointers in BlockDriver
>   block_int-common.h: assertion in the callers of BlockDriver function
>     pointers
>   block_int-common.h: split function pointers in BdrvChildClass
>   block_int-common.h: assertions in the callers of BdrvChildClass
>     function pointers
>   block-backend-common.h: split function pointers in BlockDevOps
>   job.h: split function pointers in JobDriver
>   job.h: assertions in the callers of JobDriver funcion pointers
> 
>  block.c                                     |  188 ++-
>  block/backup.c                              |    1 +
>  block/block-backend.c                       |  105 +-
>  block/commit.c                              |    4 +
>  block/copy-before-write.c                   |    2 +
>  block/copy-before-write.h                   |    7 +
>  block/coroutines.h                          |    6 +
>  block/dirty-bitmap.c                        |    1 +
>  block/io.c                                  |   37 +
>  block/meson.build                           |    7 +-
>  block/mirror.c                              |    4 +
>  block/monitor/bitmap-qmp-cmds.c             |    6 +
>  block/monitor/block-hmp-cmds.c              |    2 +-
>  block/snapshot.c                            |   28 +
>  block/stream.c                              |    2 +
>  blockdev.c                                  |   55 +-
>  blockjob.c                                  |   14 +
>  include/block/block-common.h                |  389 +++++
>  include/block/block-global-state.h          |  286 ++++
>  include/block/block-io.h                    |  306 ++++
>  include/block/block.h                       |  878 +----------
>  include/block/block_int-common.h            | 1193 +++++++++++++++
>  include/block/block_int-global-state.h      |  327 ++++
>  include/block/block_int-io.h                |  163 ++
>  include/block/block_int.h                   | 1478 +------------------
>  include/block/blockjob.h                    |    9 +
>  include/block/blockjob_int.h                |   28 +
>  include/block/snapshot.h                    |   13 +-
>  include/qemu/job.h                          |   16 +
>  include/qemu/main-loop.h                    |   13 +
>  include/sysemu/block-backend-common.h       |   92 ++
>  include/sysemu/block-backend-global-state.h |  122 ++
>  include/sysemu/block-backend-io.h           |  139 ++
>  include/sysemu/block-backend.h              |  269 +---
>  include/sysemu/blockdev.h                   |   24 +-
>  job.c                                       |    9 +
>  migration/savevm.c                          |    2 +
>  softmmu/cpus.c                              |    5 +
>  softmmu/qdev-monitor.c                      |    2 +
>  softmmu/vl.c                                |   25 +-
>  stubs/iothread-lock.c                       |    5 +
>  41 files changed, 3619 insertions(+), 2643 deletions(-)
>  create mode 100644 include/block/block-common.h
>  create mode 100644 include/block/block-global-state.h
>  create mode 100644 include/block/block-io.h
>  create mode 100644 include/block/block_int-common.h
>  create mode 100644 include/block/block_int-global-state.h
>  create mode 100644 include/block/block_int-io.h
>  create mode 100644 include/sysemu/block-backend-common.h
>  create mode 100644 include/sysemu/block-backend-global-state.h
>  create mode 100644 include/sysemu/block-backend-io.h
> 
> -- 
> 2.27.0
>

Hanna Czenczek Nov. 15, 2021, 4:03 p.m. UTC | #4

On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> Currently, block layer APIs like block-backend.h contain a mix of
> functions that are either running in the main loop and under the
> BQL, or are thread-safe functions and run in iothreads performing I/O.
> The functions running under BQL also take care of modifying the
> block graph, by using drain and/or aio_context_acquire/release.
> This makes it very confusing to understand where each function
> runs, and what assumptions it provided with regards to thread
> safety.
>
> We call the functions running under BQL "global state (GS) API", and
> distinguish them from the thread-safe "I/O API".
>
> The aim of this series is to split the relevant block headers in
> global state and I/O sub-headers.

Despite leaving quite some comments, the series and the split seem 
reasonable to me overall.  (This is a pretty big series, after all, so 
those “some comments” stack up against a majority of changes that seem 
OK to me. :))

One thing I noticed while reviewing is that it’s really hard to verify 
that no I/O function calls a GS function.  What would be wonderful is 
some function marker like coroutine_fn that marks GS functions (or I/O 
functions) and that we could then verify the call paths.  But AFAIU 
we’ve always wanted precisely that for coroutine_fn and still don’t have 
it, so this seems like extremely wishful thinking... :(

I think most of the issues I found can be fixed (or are even 
irrelevant), the only thing that really worries me are the two places 
that are clearly I/O paths that call permission functions: Namely first 
block_crypto_amend_options_generic_luks() (part of the luks block 
driver’s .bdrv_co_amend implementation), which calls 
bdrv_child_refresh_perms(); and second fuse_do_truncate(), which calls 
blk_set_perm().

In the first case, we need this call so that we don’t permanently hog 
the WRITE permission for the luks file, which used to be a problem, I 
believe.  We want to unshare the WRITE permission (and apparently also 
CONSISTENT_READ) during the key update, so we need some way to 
temporarily update the permissions.

I only really see four solutions for this:
(1) We somehow make the amend job run in the main context under the BQL 
and have it prevent all concurrent I/O access (seems bad)
(2) We can make the permission functions part of the I/O path (seems 
wrong and probably impossible?)
(3) We can drop the permissions update and permanently require the 
permissions that we need when updating keys (I think this might break 
existing use cases)
(4) We can acquire the BQL around the permission update call and perhaps 
that works?

I don’t know how (4) would work but it’s basically the only reasonable 
solution I can come up with.  Would this be a way to call a BQL function 
from an I/O function?

As for the second case, the same applies as above, with the differences 
that we have no jobs, so this code must always run in the block device’s 
AioContext (I think), which rules out (1); but (3) would become easier 
(i.e. require the RESIZE permission all the time), although that too 
might have an impact on existing users (don’t think so, though).  In any 
case, if we could do (4), that would solve the problem here, too.

And finally, another notable thing I noticed is that the way how 
create-related functions are handled is inconsistent.  I believe they 
should all be GS functions; qmp_blockdev_create() seems to agree with me 
on this, but we currently seem to have some bugs there.  It’s possible 
to invoke blockdev-create on a block device that’s in an I/O thread, and 
then qemu crashes.  Oops.  (The comment in qmp_blockdev_create() says 
that the block drivers’ implementations should prevent this, but 
apparently they don’t...?) In any case, that’s a pre-existing bug, of 
course, that doesn’t concern this series (other than that it suggests 
that “create” functions should be classified as GS).

Hanna

Daniel P. Berrangé Nov. 15, 2021, 4:11 p.m. UTC | #5

On Mon, Nov 15, 2021 at 05:03:28PM +0100, Hanna Reitz wrote:
> On 25.10.21 12:17, Emanuele Giuseppe Esposito wrote:
> > Currently, block layer APIs like block-backend.h contain a mix of
> > functions that are either running in the main loop and under the
> > BQL, or are thread-safe functions and run in iothreads performing I/O.
> > The functions running under BQL also take care of modifying the
> > block graph, by using drain and/or aio_context_acquire/release.
> > This makes it very confusing to understand where each function
> > runs, and what assumptions it provided with regards to thread
> > safety.
> > 
> > We call the functions running under BQL "global state (GS) API", and
> > distinguish them from the thread-safe "I/O API".
> > 
> > The aim of this series is to split the relevant block headers in
> > global state and I/O sub-headers.
> 
> Despite leaving quite some comments, the series and the split seem
> reasonable to me overall.  (This is a pretty big series, after all, so those
> “some comments” stack up against a majority of changes that seem OK to me.
> :))
> 
> One thing I noticed while reviewing is that it’s really hard to verify that
> no I/O function calls a GS function.  What would be wonderful is some
> function marker like coroutine_fn that marks GS functions (or I/O functions)
> and that we could then verify the call paths.  But AFAIU we’ve always wanted
> precisely that for coroutine_fn and still don’t have it, so this seems like
> extremely wishful thinking... :(

Even if we don't programmatically verify these rules, it would be a major
step forward if we simply adopted a standard naming convention for the
APIs that was essentally self-describing. eg block_io_XXX for all I/O
stuff and block_state_XXXX for all global state ,and block_common if
we have stuff applicable to both use cases.

I wouldn't suggest doing it as part of this series, but I think it'd
be worthwhile in general for the medium-long term, despite the code
churn in updating all usage in the short term.


Regards,
Daniel

Paolo Bonzini Nov. 18, 2021, 1:50 p.m. UTC | #6

On 11/15/21 17:03, Hanna Reitz wrote:
> 
> I only really see four solutions for this:
> (1) We somehow make the amend job run in the main context under the BQL 
> and have it prevent all concurrent I/O access (seems bad)
> (2) We can make the permission functions part of the I/O path (seems 
> wrong and probably impossible?)
> (3) We can drop the permissions update and permanently require the 
> permissions that we need when updating keys (I think this might break 
> existing use cases)
> (4) We can acquire the BQL around the permission update call and perhaps 
> that works?
> 
> I don’t know how (4) would work but it’s basically the only reasonable 
> solution I can come up with.  Would this be a way to call a BQL function 
> from an I/O function?

I think that would deadlock:

	main				I/O thread
	--------			-----
	start bdrv_co_amend
					take BQL
	bdrv_drain
	... hangs ...

(2) is definitely wrong.

(3) I have no idea.

Would it be possible or meaningful to do the bdrv_child_refresh_perms in 
qmp_x_blockdev_amend?  It seems that all users need it, and in general 
it seems weird to amend a qcow2 or luks header (and thus the meaning of 
parts of the file) while others can write to the same file.

Paolo

Paolo Bonzini Nov. 18, 2021, 2:04 p.m. UTC | #7

On 11/15/21 17:03, Hanna Reitz wrote:
> and second fuse_do_truncate(), which calls blk_set_perm().

Here it seems that a non-growable export is still growable as long as 
nobody is watching. :)  Is this the desired behavior?

Paolo

Hanna Czenczek Nov. 18, 2021, 3:22 p.m. UTC | #8

On 18.11.21 15:04, Paolo Bonzini wrote:
> On 11/15/21 17:03, Hanna Reitz wrote:
>> and second fuse_do_truncate(), which calls blk_set_perm().
>
> Here it seems that a non-growable export is still growable as long as 
> nobody is watching. :)  Is this the desired behavior?

Yes, absolutely.  “Growable” is documented to mean that writes after the 
end of the exported file will grow it to fit.  Explicit truncating is 
something else, and I believe we should allow it on all writable 
exports.  (Of course only when other potential users of the block node 
in question allow it to be resized, but that’s what the permission is for.)

Hanna

Hanna Czenczek Nov. 18, 2021, 3:31 p.m. UTC | #9

On 18.11.21 14:50, Paolo Bonzini wrote:
> On 11/15/21 17:03, Hanna Reitz wrote:
>>
>> I only really see four solutions for this:
>> (1) We somehow make the amend job run in the main context under the 
>> BQL and have it prevent all concurrent I/O access (seems bad)
>> (2) We can make the permission functions part of the I/O path (seems 
>> wrong and probably impossible?)
>> (3) We can drop the permissions update and permanently require the 
>> permissions that we need when updating keys (I think this might break 
>> existing use cases)
>> (4) We can acquire the BQL around the permission update call and 
>> perhaps that works?
>>
>> I don’t know how (4) would work but it’s basically the only 
>> reasonable solution I can come up with.  Would this be a way to call 
>> a BQL function from an I/O function?
>
> I think that would deadlock:
>
>     main                I/O thread
>     --------            -----
>     start bdrv_co_amend
>                     take BQL
>     bdrv_drain
>     ... hangs ...

:/

Is there really nothing we can do?  Forgive me if I’m talking complete 
nonsense here (because frankly I don’t even really know what a bottom 
half is exactly), but can’t we schedule some coroutine in the main 
thread to do the perm notifications and wait for them in the I/O thread?

> (2) is definitely wrong.
>
> (3) I have no idea.
>
> Would it be possible or meaningful to do the bdrv_child_refresh_perms 
> in qmp_x_blockdev_amend?  It seems that all users need it, and in 
> general it seems weird to amend a qcow2 or luks header (and thus the 
> meaning of parts of the file) while others can write to the same file.

Hmm...  Perhaps.  We would need to undo the permission change when the 
job finishes, though, i.e. in JobDriver.prepare() or JobDriver.clean().  
Doing the change in qmp_x_blockdev_amend() would be asymmetric then, so 
we’d probably want a new JobDriver method that runs in the main thread 
before .run() is invoked. (Unfortunately, “.prepare()” is now taken 
already...)

Doesn’t solve the FUSE problem, but there we could try to just take the 
RESIZE permission permanently and if that fails, we just don’t allow 
truncates for that export.  Not nice, but should work for common cases.

Hanna

Paolo Bonzini Nov. 19, 2021, 3:13 a.m. UTC | #10

El jue., 18 nov. 2021 16:31, Hanna Reitz <hreitz@redhat.com> escribió:

> On 18.11.21 14:50, Paolo Bonzini wrote:
> > On 11/15/21 17:03, Hanna Reitz wrote:
> >>
> >> I only really see four solutions for this:
> >> (1) We somehow make the amend job run in the main context under the
> >> BQL and have it prevent all concurrent I/O access (seems bad)
> >> (2) We can make the permission functions part of the I/O path (seems
> >> wrong and probably impossible?)
> >> (3) We can drop the permissions update and permanently require the
> >> permissions that we need when updating keys (I think this might break
> >> existing use cases)
> >> (4) We can acquire the BQL around the permission update call and
> >> perhaps that works?
> >>
> >> I don’t know how (4) would work but it’s basically the only
> >> reasonable solution I can come up with.  Would this be a way to call
> >> a BQL function from an I/O function?
> >
> > I think that would deadlock:
> >
> >     main                I/O thread
> >     --------            -----
> >     start bdrv_co_amend
> >                     take BQL
> >     bdrv_drain
> >     ... hangs ...
>
> :/
>
> Is there really nothing we can do?  Forgive me if I’m talking complete
> nonsense here (because frankly I don’t even really know what a bottom
> half is exactly), but can’t we schedule some coroutine in the main
> thread to do the perm notifications and wait for them in the I/O thread?
>

I think you still get a deadlock, just one with a longer chain. You still
have a cycle of things depending on each other, but one of them is now the
I/O thread waiting for the bottom half.

Hmm...  Perhaps.  We would need to undo the permission change when the
> job finishes, though, i.e. in JobDriver.prepare() or JobDriver.clean().
> Doing the change in qmp_x_blockdev_amend() would be asymmetric then, so
> we’d probably want a new JobDriver method that runs in the main thread
> before .run() is invoked. (Unfortunately, “.prepare()” is now taken
> already...)
>

Ok at least it's feasible.

Doesn’t solve the FUSE problem, but there we could try to just take the
> RESIZE permission permanently and if that fails, we just don’t allow
> truncates for that export.  Not nice, but should work for common cases.
>

Yeah definitely not nice. Probably permissions could be protected by their
own mutex, even a global one like the one we have for jobs. For now I
suggest just ignoring the problem and adding a comment, since it's not
really something that didn't exist.

Paolo

Emanuele Giuseppe Esposito Nov. 19, 2021, 10:42 a.m. UTC | #11

On 19/11/2021 04:13, Paolo Bonzini wrote:
> 
> 
> El jue., 18 nov. 2021 16:31, Hanna Reitz <hreitz@redhat.com 
> <mailto:hreitz@redhat.com>> escribió:
> 
>     On 18.11.21 14:50, Paolo Bonzini wrote:
>      > On 11/15/21 17:03, Hanna Reitz wrote:
>      >>
>      >> I only really see four solutions for this:
>      >> (1) We somehow make the amend job run in the main context under the
>      >> BQL and have it prevent all concurrent I/O access (seems bad)
>      >> (2) We can make the permission functions part of the I/O path
>     (seems
>      >> wrong and probably impossible?)
>      >> (3) We can drop the permissions update and permanently require the
>      >> permissions that we need when updating keys (I think this might
>     break
>      >> existing use cases)
>      >> (4) We can acquire the BQL around the permission update call and
>      >> perhaps that works?
>      >>
>      >> I don’t know how (4) would work but it’s basically the only
>      >> reasonable solution I can come up with.  Would this be a way to
>     call
>      >> a BQL function from an I/O function?
>      >
>      > I think that would deadlock:
>      >
>      >     main                I/O thread
>      >     --------            -----
>      >     start bdrv_co_amend
>      >                     take BQL
>      >     bdrv_drain
>      >     ... hangs ...
> 
>     :/
> 
>     Is there really nothing we can do?  Forgive me if I’m talking complete
>     nonsense here (because frankly I don’t even really know what a bottom
>     half is exactly), but can’t we schedule some coroutine in the main
>     thread to do the perm notifications and wait for them in the I/O thread?
> 
> 
> I think you still get a deadlock, just one with a longer chain. You 
> still have a cycle of things depending on each other, but one of them is 
> now the I/O thread waiting for the bottom half.
> 
>     Hmm...  Perhaps.  We would need to undo the permission change when the
>     job finishes, though, i.e. in JobDriver.prepare() or JobDriver.clean().
>     Doing the change in qmp_x_blockdev_amend() would be asymmetric then, so
>     we’d probably want a new JobDriver method that runs in the main thread
>     before .run() is invoked. (Unfortunately, “.prepare()” is now taken
>     already...)
> 
> 
> Ok at least it's feasible.

Ok I think I got it. I will create a new callback, maybe "pre_run" or 
something like that to perform the first bdrv_child_refresh_perms and 
implement the .clean callback to perform the "cleanup" 
bdrv_child_refresh_perms in block_crypto_amend_options_generic_luks.

> 
>     Doesn’t solve the FUSE problem, but there we could try to just take the
>     RESIZE permission permanently and if that fails, we just don’t allow
>     truncates for that export.  Not nice, but should work for common cases.
> 
> 
> Yeah definitely not nice. Probably permissions could be protected by 
> their own mutex, even a global one like the one we have for jobs. For 
> now I suggest just ignoring the problem and adding a comment, since it's 
> not really something that didn't exist.
> 

Will add a TODO in blk_set/get permissions explaining the issue.
Last issue we had with regards to permissions in GS had to do with 
bdrv_co_invalidate_cache: however, Paolo suggested me a simple fix to 
simply assert that the function is either under BQL or does not have 
open_flags & BDRV_O_INACTIVE set. This basically skips the permission 
code block, entering it only if we have the BQL.


Ok, apart from this permissions issue and assert_bdrv_graph_writable, I 
should have addressed all main comments of this series. Assume that for 
the others where I did not explicitly answered, I agree and applied your 
comments.

Thank you,
Emanuele

[v4,00/25] block layer: split block APIs in global state and I/O

Message

Comments