diff mbox series

block/blklogwrites: Fix a bug when logging "write zeroes" operations.

Message ID 20240109184646.1128475-1-megari@gmx.com (mailing list archive)
State New, archived
Headers show
Series block/blklogwrites: Fix a bug when logging "write zeroes" operations. | expand

Commit Message

megari@gmx.com Jan. 9, 2024, 6:46 p.m. UTC
From: Ari Sundholm <ari@tuxera.com>

There is a bug in the blklogwrites driver pertaining to logging "write
zeroes" operations, causing log corruption. This can be easily observed
by setting detect-zeroes to something other than "off" for the driver.

The issue is caused by a concurrency bug pertaining to the fact that
"write zeroes" operations have to be logged in two parts: first the log
entry metadata, then the zeroed-out region. While the log entry
metadata is being written by bdrv_co_pwritev(), another operation may
begin in the meanwhile and modify the state of the blklogwrites driver.
This is as intended by the coroutine-driven I/O model in QEMU, of
course.

Unfortunately, this specific scenario is mishandled. A short example:
    1. Initially, in the current operation (#1), the current log sector
number in the driver state is only incremented by the number of sectors
taken by the log entry metadata, after which the log entry metadata is
written. The current operation yields.
    2. Another operation (#2) may start while the log entry metadata is
being written. It uses the current log position as the start offset for
its log entry. This is in the sector right after the operation #1 log
entry metadata, which is bad!
    3. After bdrv_co_pwritev() returns (#1), the current log sector
number is reread from the driver state in order to find out the start
offset for bdrv_co_pwrite_zeroes(). This is an obvious blunder, as the
offset will be the sector right after the (misplaced) operation #2 log
entry, which means that the zeroed-out region begins at the wrong
offset.
    4. As a result of the above, the log is corrupt.

Fix this by only reading the driver metadata once, computing the
offsets and sizes in one go (including the optional zeroed-out region)
and setting the log sector number to the appropriate value for the next
operation in line.

Signed-off-by: Ari Sundholm <ari@tuxera.com>
Cc: qemu-stable@nongnu.org
---
 block/blklogwrites.c | 35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)

--
2.43.0

Comments

Kevin Wolf Jan. 10, 2024, 1:39 p.m. UTC | #1
Am 09.01.2024 um 19:46 hat megari@gmx.com geschrieben:
> From: Ari Sundholm <ari@tuxera.com>
> 
> There is a bug in the blklogwrites driver pertaining to logging "write
> zeroes" operations, causing log corruption. This can be easily observed
> by setting detect-zeroes to something other than "off" for the driver.
> 
> The issue is caused by a concurrency bug pertaining to the fact that
> "write zeroes" operations have to be logged in two parts: first the log
> entry metadata, then the zeroed-out region. While the log entry
> metadata is being written by bdrv_co_pwritev(), another operation may
> begin in the meanwhile and modify the state of the blklogwrites driver.
> This is as intended by the coroutine-driven I/O model in QEMU, of
> course.
> 
> Unfortunately, this specific scenario is mishandled. A short example:
>     1. Initially, in the current operation (#1), the current log sector
> number in the driver state is only incremented by the number of sectors
> taken by the log entry metadata, after which the log entry metadata is
> written. The current operation yields.
>     2. Another operation (#2) may start while the log entry metadata is
> being written. It uses the current log position as the start offset for
> its log entry. This is in the sector right after the operation #1 log
> entry metadata, which is bad!
>     3. After bdrv_co_pwritev() returns (#1), the current log sector
> number is reread from the driver state in order to find out the start
> offset for bdrv_co_pwrite_zeroes(). This is an obvious blunder, as the
> offset will be the sector right after the (misplaced) operation #2 log
> entry, which means that the zeroed-out region begins at the wrong
> offset.
>     4. As a result of the above, the log is corrupt.
> 
> Fix this by only reading the driver metadata once, computing the
> offsets and sizes in one go (including the optional zeroed-out region)
> and setting the log sector number to the appropriate value for the next
> operation in line.
> 
> Signed-off-by: Ari Sundholm <ari@tuxera.com>
> Cc: qemu-stable@nongnu.org

Thanks, applied to the block branch.

Note that while this fixes the single threaded case, it is not thread
safe and will still break with multiqueue enabled (see the new
iothread-vq-mapping option added to virtio-blk very recently). Most
block drivers already take a lock when modifying their global state, and
it looks like we missed blklogwrites when checking what needs to be
prepared for the block layer to be thread safe.

So I think we'll want another patch to add a QemuMutex that can be taken
while you do the calculations on s->cur_log_sector. But this patch is
a good first step because it means that we don't need to keep a lock
across an I/O request (just for the sake of completeness, this would
have had to be a CoMutex rather than a QemuMutex).

Kevin
Ari Sundholm Jan. 10, 2024, 3:21 p.m. UTC | #2
Hi, Kevin!

On 1/10/24 15:39, Kevin Wolf wrote:
> Am 09.01.2024 um 19:46 hat megari@gmx.com geschrieben:
>> From: Ari Sundholm <ari@tuxera.com>
>>
>> There is a bug in the blklogwrites driver pertaining to logging "write
>> zeroes" operations, causing log corruption. This can be easily observed
>> by setting detect-zeroes to something other than "off" for the driver.
>>
>> The issue is caused by a concurrency bug pertaining to the fact that
>> "write zeroes" operations have to be logged in two parts: first the log
>> entry metadata, then the zeroed-out region. While the log entry
>> metadata is being written by bdrv_co_pwritev(), another operation may
>> begin in the meanwhile and modify the state of the blklogwrites driver.
>> This is as intended by the coroutine-driven I/O model in QEMU, of
>> course.
>>
>> Unfortunately, this specific scenario is mishandled. A short example:
>>      1. Initially, in the current operation (#1), the current log sector
>> number in the driver state is only incremented by the number of sectors
>> taken by the log entry metadata, after which the log entry metadata is
>> written. The current operation yields.
>>      2. Another operation (#2) may start while the log entry metadata is
>> being written. It uses the current log position as the start offset for
>> its log entry. This is in the sector right after the operation #1 log
>> entry metadata, which is bad!
>>      3. After bdrv_co_pwritev() returns (#1), the current log sector
>> number is reread from the driver state in order to find out the start
>> offset for bdrv_co_pwrite_zeroes(). This is an obvious blunder, as the
>> offset will be the sector right after the (misplaced) operation #2 log
>> entry, which means that the zeroed-out region begins at the wrong
>> offset.
>>      4. As a result of the above, the log is corrupt.
>>
>> Fix this by only reading the driver metadata once, computing the
>> offsets and sizes in one go (including the optional zeroed-out region)
>> and setting the log sector number to the appropriate value for the next
>> operation in line.
>>
>> Signed-off-by: Ari Sundholm <ari@tuxera.com>
>> Cc: qemu-stable@nongnu.org
> 
> Thanks, applied to the block branch.
> 

Thank you.

> Note that while this fixes the single threaded case, it is not thread
> safe and will still break with multiqueue enabled (see the new
> iothread-vq-mapping option added to virtio-blk very recently). Most
> block drivers already take a lock when modifying their global state, and
> it looks like we missed blklogwrites when checking what needs to be
> prepared for the block layer to be thread safe.
> 

I see. Thanks for the heads up. I had missed this new development.

> So I think we'll want another patch to add a QemuMutex that can be taken
> while you do the calculations on s->cur_log_sector. But this patch is
> a good first step because it means that we don't need to keep a lock
> across an I/O request (just for the sake of completeness, this would
> have had to be a CoMutex rather than a QemuMutex).
> 

OK. I guess I have a bit of additional work to do, then. What release 
would these fixes realistically make it to? Just trying to gauge the 
urgency for the fix for the multi-threaded case for prioritization purposes.

And yes, holding a CoMutex while doing I/O would have fixed this issue, 
with the tiny drawback of killing any concurrency in the driver. ;)

Best regards,
Ari Sundholm
ari@tuxera.com

> Kevin
>
Kevin Wolf Jan. 10, 2024, 5:21 p.m. UTC | #3
Am 10.01.2024 um 16:21 hat Ari Sundholm geschrieben:
> On 1/10/24 15:39, Kevin Wolf wrote:
> > Am 09.01.2024 um 19:46 hat megari@gmx.com geschrieben:
> > > From: Ari Sundholm <ari@tuxera.com>
> > > 
> > > There is a bug in the blklogwrites driver pertaining to logging "write
> > > zeroes" operations, causing log corruption. This can be easily observed
> > > by setting detect-zeroes to something other than "off" for the driver.
> > > 
> > > The issue is caused by a concurrency bug pertaining to the fact that
> > > "write zeroes" operations have to be logged in two parts: first the log
> > > entry metadata, then the zeroed-out region. While the log entry
> > > metadata is being written by bdrv_co_pwritev(), another operation may
> > > begin in the meanwhile and modify the state of the blklogwrites driver.
> > > This is as intended by the coroutine-driven I/O model in QEMU, of
> > > course.
> > > 
> > > Unfortunately, this specific scenario is mishandled. A short example:
> > >      1. Initially, in the current operation (#1), the current log sector
> > > number in the driver state is only incremented by the number of sectors
> > > taken by the log entry metadata, after which the log entry metadata is
> > > written. The current operation yields.
> > >      2. Another operation (#2) may start while the log entry metadata is
> > > being written. It uses the current log position as the start offset for
> > > its log entry. This is in the sector right after the operation #1 log
> > > entry metadata, which is bad!
> > >      3. After bdrv_co_pwritev() returns (#1), the current log sector
> > > number is reread from the driver state in order to find out the start
> > > offset for bdrv_co_pwrite_zeroes(). This is an obvious blunder, as the
> > > offset will be the sector right after the (misplaced) operation #2 log
> > > entry, which means that the zeroed-out region begins at the wrong
> > > offset.
> > >      4. As a result of the above, the log is corrupt.
> > > 
> > > Fix this by only reading the driver metadata once, computing the
> > > offsets and sizes in one go (including the optional zeroed-out region)
> > > and setting the log sector number to the appropriate value for the next
> > > operation in line.
> > > 
> > > Signed-off-by: Ari Sundholm <ari@tuxera.com>
> > > Cc: qemu-stable@nongnu.org
> > 
> > Thanks, applied to the block branch.
> > 
> 
> Thank you.
> 
> > Note that while this fixes the single threaded case, it is not thread
> > safe and will still break with multiqueue enabled (see the new
> > iothread-vq-mapping option added to virtio-blk very recently). Most
> > block drivers already take a lock when modifying their global state, and
> > it looks like we missed blklogwrites when checking what needs to be
> > prepared for the block layer to be thread safe.
> > 
> 
> I see. Thanks for the heads up. I had missed this new development.
> 
> > So I think we'll want another patch to add a QemuMutex that can be taken
> > while you do the calculations on s->cur_log_sector. But this patch is
> > a good first step because it means that we don't need to keep a lock
> > across an I/O request (just for the sake of completeness, this would
> > have had to be a CoMutex rather than a QemuMutex).
> 
> OK. I guess I have a bit of additional work to do, then.

Yes, a bit, but I honestly don't expect the locking to be a big effort
for a small driver like blklogwrites.

> What release would these fixes realistically make it to? Just trying
> to gauge the urgency for the fix for the multi-threaded case for
> prioritization purposes.

Well, this one is queued for 9.0. If we can make sure to get the thread
safety fix into 9.0, too (I expect the soft freeze to be somewhere
around beginning of March and the release in April), that would be good,
because that would also be the first release to be affected.

I'm not sure if and when a 8.2.* stable release is planned before that.

> And yes, holding a CoMutex while doing I/O would have fixed this
> issue, with the tiny drawback of killing any concurrency in the
> driver. ;)

Ah, that's not true, only any concurrency in the log writes. ;)

Kevin
Kevin Wolf Jan. 11, 2024, 2:06 p.m. UTC | #4
Am 10.01.2024 um 20:50 hat Ari Sundholm geschrieben:
> During the review of a fix for a concurrency issue in blklogwrites,
> it was found that the driver needs an additional fix when enabling
> multiqueue, which is a new feature introduced in QEMU 9.0, as the
> driver state may be read and written by multiple threads at the same
> time, which was not the case when the driver was originally written.
> 
> Fix the multi-threaded scenario by introducing a mutex to protect the
> mutable fields in the driver state, and always having the mutex locked
> by the current thread when accessing them.
> 
> Additionally, add the const qualifier to a few BDRVBlkLogWritesState
> pointer targets in contexts where the driver state is not written to.
> 
> Signed-off-by: Ari Sundholm <ari@tuxera.com>
> ---
>  block/blklogwrites.c | 29 +++++++++++++++++++++++------
>  1 file changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blklogwrites.c b/block/blklogwrites.c
> index ba717dab4d..50f68df2a6 100644
> --- a/block/blklogwrites.c
> +++ b/block/blklogwrites.c
> @@ -3,7 +3,7 @@
>   *
>   * Copyright (c) 2017 Tuomas Tynkkynen <tuomas@tuxera.com>
>   * Copyright (c) 2018 Aapo Vienamo <aapo@tuxera.com>
> - * Copyright (c) 2018 Ari Sundholm <ari@tuxera.com>
> + * Copyright (c) 2018-2024 Ari Sundholm <ari@tuxera.com>
>   *
>   * This work is licensed under the terms of the GNU GPL, version 2 or later.
>   * See the COPYING file in the top-level directory.
> @@ -55,9 +55,18 @@ typedef struct {
>      BdrvChild *log_file;
>      uint32_t sectorsize;
>      uint32_t sectorbits;
> +    uint64_t update_interval;
> +
> +    /*
> +     * The mutable state of the driver, consisting of the current log sector
> +     * and the number of log entries.
> +     *
> +     * May be read and/or written from multiple threads, and the mutex must be
> +     * held when accessing these fields.
> +     */
>      uint64_t cur_log_sector;
>      uint64_t nr_entries;
> -    uint64_t update_interval;
> +    QemuMutex mutex;
>  } BDRVBlkLogWritesState;
>  
>  static QemuOptsList runtime_opts = {
> @@ -149,6 +158,7 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict *options, int flags,
>      uint64_t log_sector_size;
>      bool log_append;
>  
> +    qemu_mutex_init(&s->mutex);
>      opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
>      if (!qemu_opts_absorb_qdict(opts, options, errp)) {
>          ret = -EINVAL;
> @@ -255,6 +265,7 @@ fail_log:
>          bdrv_unref_child(bs, s->log_file);
>          bdrv_graph_wrunlock();
>          s->log_file = NULL;
> +        qemu_mutex_destroy(&s->mutex);
>      }
>  fail:
>      qemu_opts_del(opts);
> @@ -269,6 +280,7 @@ static void blk_log_writes_close(BlockDriverState *bs)
>      bdrv_unref_child(bs, s->log_file);
>      s->log_file = NULL;
>      bdrv_graph_wrunlock();
> +    qemu_mutex_destroy(&s->mutex);
>  }
>  
>  static int64_t coroutine_fn GRAPH_RDLOCK
> @@ -295,7 +307,7 @@ static void blk_log_writes_child_perm(BlockDriverState *bs, BdrvChild *c,
>  
>  static void blk_log_writes_refresh_limits(BlockDriverState *bs, Error **errp)
>  {
> -    BDRVBlkLogWritesState *s = bs->opaque;
> +    const BDRVBlkLogWritesState *s = bs->opaque;
>      bs->bl.request_alignment = s->sectorsize;
>  }
>  
> @@ -338,15 +350,18 @@ blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
>       * driver may be modified by other driver operations while waiting for the
>       * I/O to complete.
>       */
> +    qemu_mutex_lock(&s->mutex);
>      const uint64_t entry_start_sector = s->cur_log_sector;
>      const uint64_t entry_offset = entry_start_sector << s->sectorbits;
>      const uint64_t qiov_aligned_size = ROUND_UP(lr->qiov->size, s->sectorsize);
>      const uint64_t entry_aligned_size = qiov_aligned_size +
>          ROUND_UP(lr->zero_size, s->sectorsize);
>      const uint64_t entry_nr_sectors = entry_aligned_size >> s->sectorbits;
> +    const uint64_t entry_seq = s->nr_entries + 1;
>  
> -    s->nr_entries++;
> +    s->nr_entries = entry_seq;
>      s->cur_log_sector += entry_nr_sectors;
> +    qemu_mutex_unlock(&s->mutex);
>  
>      /*
>       * Write the log entry. Note that if this is a "write zeroes" operation,
> @@ -366,14 +381,16 @@ blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
>  
>      /* Update super block on flush or every update interval */
>      if (lr->log_ret == 0 && ((lr->entry.flags & LOG_FLUSH_FLAG)
> -        || (s->nr_entries % s->update_interval == 0)))
> +        || (entry_seq % s->update_interval == 0)))
>      {
> +        qemu_mutex_lock(&s->mutex);
>          struct log_write_super super = {
>              .magic      = cpu_to_le64(WRITE_LOG_MAGIC),
>              .version    = cpu_to_le64(WRITE_LOG_VERSION),
>              .nr_entries = cpu_to_le64(s->nr_entries),
>              .sectorsize = cpu_to_le32(s->sectorsize),
>          };
> +        qemu_mutex_unlock(&s->mutex);

This hunk looks odd. Is the only thing the lock does here that
s->nr_entries is accessed atomically?

Looking a bit closer, if s->nr_entries has been changed (increased)
meanwhile by another request, I assume that we indeed want the newer
value to be stored in the superblock rather than using the value that
was current when we did the calculations. So superficially, this part
looks good.

But the other thing I notice is that a few lines down you can get
concurrent write requests to the superblock, and there is no guaranteed
order, so an older update could overwrite a newer one. Don't we need to
serialise writes to the superblock? (I would actually expect this to be
a problem already without multithreading.)

Kevin
Ari Sundholm Jan. 11, 2024, 2:53 p.m. UTC | #5
On 1/11/24 16:06, Kevin Wolf wrote:
> Am 10.01.2024 um 20:50 hat Ari Sundholm geschrieben:
>> During the review of a fix for a concurrency issue in blklogwrites,
>> it was found that the driver needs an additional fix when enabling
>> multiqueue, which is a new feature introduced in QEMU 9.0, as the
>> driver state may be read and written by multiple threads at the same
>> time, which was not the case when the driver was originally written.
>>
>> Fix the multi-threaded scenario by introducing a mutex to protect the
>> mutable fields in the driver state, and always having the mutex locked
>> by the current thread when accessing them.
>>
>> Additionally, add the const qualifier to a few BDRVBlkLogWritesState
>> pointer targets in contexts where the driver state is not written to.
>>
>> Signed-off-by: Ari Sundholm <ari@tuxera.com>
>> ---
>>   block/blklogwrites.c | 29 +++++++++++++++++++++++------
>>   1 file changed, 23 insertions(+), 6 deletions(-)
>>
>> diff --git a/block/blklogwrites.c b/block/blklogwrites.c
>> index ba717dab4d..50f68df2a6 100644
>> --- a/block/blklogwrites.c
>> +++ b/block/blklogwrites.c
>> @@ -3,7 +3,7 @@
>>    *
>>    * Copyright (c) 2017 Tuomas Tynkkynen <tuomas@tuxera.com>
>>    * Copyright (c) 2018 Aapo Vienamo <aapo@tuxera.com>
>> - * Copyright (c) 2018 Ari Sundholm <ari@tuxera.com>
>> + * Copyright (c) 2018-2024 Ari Sundholm <ari@tuxera.com>
>>    *
>>    * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>    * See the COPYING file in the top-level directory.
>> @@ -55,9 +55,18 @@ typedef struct {
>>       BdrvChild *log_file;
>>       uint32_t sectorsize;
>>       uint32_t sectorbits;
>> +    uint64_t update_interval;
>> +
>> +    /*
>> +     * The mutable state of the driver, consisting of the current log sector
>> +     * and the number of log entries.
>> +     *
>> +     * May be read and/or written from multiple threads, and the mutex must be
>> +     * held when accessing these fields.
>> +     */
>>       uint64_t cur_log_sector;
>>       uint64_t nr_entries;
>> -    uint64_t update_interval;
>> +    QemuMutex mutex;
>>   } BDRVBlkLogWritesState;
>>   
>>   static QemuOptsList runtime_opts = {
>> @@ -149,6 +158,7 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict *options, int flags,
>>       uint64_t log_sector_size;
>>       bool log_append;
>>   
>> +    qemu_mutex_init(&s->mutex);
>>       opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
>>       if (!qemu_opts_absorb_qdict(opts, options, errp)) {
>>           ret = -EINVAL;
>> @@ -255,6 +265,7 @@ fail_log:
>>           bdrv_unref_child(bs, s->log_file);
>>           bdrv_graph_wrunlock();
>>           s->log_file = NULL;
>> +        qemu_mutex_destroy(&s->mutex);
>>       }
>>   fail:
>>       qemu_opts_del(opts);
>> @@ -269,6 +280,7 @@ static void blk_log_writes_close(BlockDriverState *bs)
>>       bdrv_unref_child(bs, s->log_file);
>>       s->log_file = NULL;
>>       bdrv_graph_wrunlock();
>> +    qemu_mutex_destroy(&s->mutex);
>>   }
>>   
>>   static int64_t coroutine_fn GRAPH_RDLOCK
>> @@ -295,7 +307,7 @@ static void blk_log_writes_child_perm(BlockDriverState *bs, BdrvChild *c,
>>   
>>   static void blk_log_writes_refresh_limits(BlockDriverState *bs, Error **errp)
>>   {
>> -    BDRVBlkLogWritesState *s = bs->opaque;
>> +    const BDRVBlkLogWritesState *s = bs->opaque;
>>       bs->bl.request_alignment = s->sectorsize;
>>   }
>>   
>> @@ -338,15 +350,18 @@ blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
>>        * driver may be modified by other driver operations while waiting for the
>>        * I/O to complete.
>>        */
>> +    qemu_mutex_lock(&s->mutex);
>>       const uint64_t entry_start_sector = s->cur_log_sector;
>>       const uint64_t entry_offset = entry_start_sector << s->sectorbits;
>>       const uint64_t qiov_aligned_size = ROUND_UP(lr->qiov->size, s->sectorsize);
>>       const uint64_t entry_aligned_size = qiov_aligned_size +
>>           ROUND_UP(lr->zero_size, s->sectorsize);
>>       const uint64_t entry_nr_sectors = entry_aligned_size >> s->sectorbits;
>> +    const uint64_t entry_seq = s->nr_entries + 1;
>>   
>> -    s->nr_entries++;
>> +    s->nr_entries = entry_seq;
>>       s->cur_log_sector += entry_nr_sectors;
>> +    qemu_mutex_unlock(&s->mutex);
>>   
>>       /*
>>        * Write the log entry. Note that if this is a "write zeroes" operation,
>> @@ -366,14 +381,16 @@ blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
>>   
>>       /* Update super block on flush or every update interval */
>>       if (lr->log_ret == 0 && ((lr->entry.flags & LOG_FLUSH_FLAG)
>> -        || (s->nr_entries % s->update_interval == 0)))
>> +        || (entry_seq % s->update_interval == 0)))
>>       {
>> +        qemu_mutex_lock(&s->mutex);
>>           struct log_write_super super = {
>>               .magic      = cpu_to_le64(WRITE_LOG_MAGIC),
>>               .version    = cpu_to_le64(WRITE_LOG_VERSION),
>>               .nr_entries = cpu_to_le64(s->nr_entries),
>>               .sectorsize = cpu_to_le32(s->sectorsize),
>>           };
>> +        qemu_mutex_unlock(&s->mutex);
> 
> This hunk looks odd. Is the only thing the lock does here that
> s->nr_entries is accessed atomically?
> 

Yes. We want to make sure that the latest value is being used.

> Looking a bit closer, if s->nr_entries has been changed (increased)
> meanwhile by another request, I assume that we indeed want the newer
> value to be stored in the superblock rather than using the value that
> was current when we did the calculations. So superficially, this part
> looks good.
> 

Yes, that is the intention.

> But the other thing I notice is that a few lines down you can get
> concurrent write requests to the superblock, and there is no guaranteed
> order, so an older update could overwrite a newer one. Don't we need to
> serialise writes to the superblock? (I would actually expect this to be
> a problem already without multithreading.)
> 

That's a good point. Maybe I should add a flag indicating that the 
superblock is being updated. But that may be too simplistic, as an older 
superblock update should neither clobber a newer one nor prevent a newer 
one from being performed, and we do not want to  hold the mutex while 
doing I/O. I'll think this through and post a v2.

Thanks,
Ari

> Kevin
>
diff mbox series

Patch

diff --git a/block/blklogwrites.c b/block/blklogwrites.c
index 7207b2e757..ba717dab4d 100644
--- a/block/blklogwrites.c
+++ b/block/blklogwrites.c
@@ -328,22 +328,39 @@  static void coroutine_fn GRAPH_RDLOCK
 blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
 {
     BDRVBlkLogWritesState *s = lr->bs->opaque;
-    uint64_t cur_log_offset = s->cur_log_sector << s->sectorbits;

-    s->nr_entries++;
-    s->cur_log_sector +=
-            ROUND_UP(lr->qiov->size, s->sectorsize) >> s->sectorbits;
+    /*
+     * Determine the offsets and sizes of different parts of the entry, and
+     * update the state of the driver.
+     *
+     * This needs to be done in one go, before any actual I/O is done, as the
+     * log entry may have to be written in two parts, and the state of the
+     * driver may be modified by other driver operations while waiting for the
+     * I/O to complete.
+     */
+    const uint64_t entry_start_sector = s->cur_log_sector;
+    const uint64_t entry_offset = entry_start_sector << s->sectorbits;
+    const uint64_t qiov_aligned_size = ROUND_UP(lr->qiov->size, s->sectorsize);
+    const uint64_t entry_aligned_size = qiov_aligned_size +
+        ROUND_UP(lr->zero_size, s->sectorsize);
+    const uint64_t entry_nr_sectors = entry_aligned_size >> s->sectorbits;

-    lr->log_ret = bdrv_co_pwritev(s->log_file, cur_log_offset, lr->qiov->size,
+    s->nr_entries++;
+    s->cur_log_sector += entry_nr_sectors;
+
+    /*
+     * Write the log entry. Note that if this is a "write zeroes" operation,
+     * only the entry header is written here, with the zeroing being done
+     * separately below.
+     */
+    lr->log_ret = bdrv_co_pwritev(s->log_file, entry_offset, lr->qiov->size,
                                   lr->qiov, 0);

     /* Logging for the "write zeroes" operation */
     if (lr->log_ret == 0 && lr->zero_size) {
-        cur_log_offset = s->cur_log_sector << s->sectorbits;
-        s->cur_log_sector +=
-                ROUND_UP(lr->zero_size, s->sectorsize) >> s->sectorbits;
+        const uint64_t zeroes_offset = entry_offset + qiov_aligned_size;

-        lr->log_ret = bdrv_co_pwrite_zeroes(s->log_file, cur_log_offset,
+        lr->log_ret = bdrv_co_pwrite_zeroes(s->log_file, zeroes_offset,
                                             lr->zero_size, 0);
     }