diff mbox

[v4,1/3] block: ignore flush requests when storage is clean

Message ID 1467038869-11538-2-git-send-email-den@openvz.org (mailing list archive)
State New, archived
Headers show

Commit Message

Denis V. Lunev June 27, 2016, 2:47 p.m. UTC
From: Evgeny Yakovlev <eyakovlev@virtuozzo.com>

Some guests (win2008 server for example) do a lot of unnecessary
flushing when underlying media has not changed. This adds additional
overhead on host when calling fsync/fdatasync.

This change introduces a dirty flag in BlockDriverState which is set
in bdrv_set_dirty and is checked in bdrv_co_flush. This allows us to
avoid unnecessary flushing when storage is clean.

The problem with excessive flushing was found by a performance test
which does parallel directory tree creation (from 2 processes).
Results improved from 0.424 loops/sec to 0.432 loops/sec.
Each loop creates 10^3 directories with 10 files in each.

Signed-off-by: Evgeny Yakovlev <eyakovlev@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: John Snow <jsnow@redhat.com>
---
 block.c                   |  1 +
 block/dirty-bitmap.c      |  3 +++
 block/io.c                | 19 +++++++++++++++++++
 include/block/block_int.h |  1 +
 4 files changed, 24 insertions(+)

Comments

Fam Zheng June 28, 2016, 1:27 a.m. UTC | #1
On Mon, 06/27 17:47, Denis V. Lunev wrote:
> From: Evgeny Yakovlev <eyakovlev@virtuozzo.com>
> 
> Some guests (win2008 server for example) do a lot of unnecessary
> flushing when underlying media has not changed. This adds additional
> overhead on host when calling fsync/fdatasync.
> 
> This change introduces a dirty flag in BlockDriverState which is set
> in bdrv_set_dirty and is checked in bdrv_co_flush. This allows us to
> avoid unnecessary flushing when storage is clean.
> 
> The problem with excessive flushing was found by a performance test
> which does parallel directory tree creation (from 2 processes).
> Results improved from 0.424 loops/sec to 0.432 loops/sec.
> Each loop creates 10^3 directories with 10 files in each.
> 
> Signed-off-by: Evgeny Yakovlev <eyakovlev@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> CC: John Snow <jsnow@redhat.com>
> ---
>  block.c                   |  1 +
>  block/dirty-bitmap.c      |  3 +++
>  block/io.c                | 19 +++++++++++++++++++
>  include/block/block_int.h |  1 +
>  4 files changed, 24 insertions(+)
> 
> diff --git a/block.c b/block.c
> index 947df29..68ae3a0 100644
> --- a/block.c
> +++ b/block.c
> @@ -2581,6 +2581,7 @@ int bdrv_truncate(BlockDriverState *bs, int64_t offset)
>          ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
>          bdrv_dirty_bitmap_truncate(bs);
>          bdrv_parent_cb_resize(bs);
> +        bs->dirty = true; /* file node sync is needed after truncate */
>      }
>      return ret;
>  }
> diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
> index 4902ca5..54e0413 100644
> --- a/block/dirty-bitmap.c
> +++ b/block/dirty-bitmap.c
> @@ -370,6 +370,9 @@ void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector,
>          }
>          hbitmap_set(bitmap->bitmap, cur_sector, nr_sectors);
>      }
> +
> +    /* Set global block driver dirty flag even if bitmap is disabled */
> +    bs->dirty = true;
>  }
>  
>  /**
> diff --git a/block/io.c b/block/io.c
> index b9e53e3..152f5a9 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2247,6 +2247,25 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>          goto flush_parent;
>      }
>  
> +    /* Check if storage is actually dirty before flushing to disk */
> +    if (!bs->dirty) {
> +        /* Flush requests are appended to tracked request list in order so that
> +         * most recent request is at the head of the list. Following code uses
> +         * this ordering to wait for the most recent flush request to complete
> +         * to ensure that requests return in order */
> +        BdrvTrackedRequest *prev_req;
> +        QLIST_FOREACH(prev_req, &bs->tracked_requests, list) {
> +            if (prev_req == &req || prev_req->type != BDRV_TRACKED_FLUSH) {
> +                continue;
> +            }
> +
> +            qemu_co_queue_wait(&prev_req->wait_queue);
> +            break;
> +        }
> +        goto flush_parent;

Should we check bs->dirty again after qemu_co_queue_wait()? I think another
write request could sneak in while this coroutine yields.

> +    }
> +    bs->dirty = false;
> +
>      BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_DISK);
>      if (bs->drv->bdrv_co_flush_to_disk) {
>          ret = bs->drv->bdrv_co_flush_to_disk(bs);
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 0432ba5..59a7def 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -435,6 +435,7 @@ struct BlockDriverState {
>      bool valid_key; /* if true, a valid encryption key has been set */
>      bool sg;        /* if true, the device is a /dev/sg* */
>      bool probed;    /* if true, format was probed rather than specified */
> +    bool dirty;     /* if true, media is dirty and should be flushed */

How about renaming this to "need_flush"? The one "dirty" we had is set by
bdrv_set_dirty, and cleared by bdrv_reset_dirty_bitmap. I'd avoid the
confusion between the two concepts.

Fam

>  
>      int copy_on_read; /* if nonzero, copy read backing sectors into image.
>                           note this is a reference count */
> -- 
> 2.1.4
>
Denis V. Lunev June 28, 2016, 9:10 a.m. UTC | #2
On 06/28/2016 04:27 AM, Fam Zheng wrote:
> On Mon, 06/27 17:47, Denis V. Lunev wrote:
>> From: Evgeny Yakovlev <eyakovlev@virtuozzo.com>
>>
>> Some guests (win2008 server for example) do a lot of unnecessary
>> flushing when underlying media has not changed. This adds additional
>> overhead on host when calling fsync/fdatasync.
>>
>> This change introduces a dirty flag in BlockDriverState which is set
>> in bdrv_set_dirty and is checked in bdrv_co_flush. This allows us to
>> avoid unnecessary flushing when storage is clean.
>>
>> The problem with excessive flushing was found by a performance test
>> which does parallel directory tree creation (from 2 processes).
>> Results improved from 0.424 loops/sec to 0.432 loops/sec.
>> Each loop creates 10^3 directories with 10 files in each.
>>
>> Signed-off-by: Evgeny Yakovlev <eyakovlev@virtuozzo.com>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Max Reitz <mreitz@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Fam Zheng <famz@redhat.com>
>> CC: John Snow <jsnow@redhat.com>
>> ---
>>   block.c                   |  1 +
>>   block/dirty-bitmap.c      |  3 +++
>>   block/io.c                | 19 +++++++++++++++++++
>>   include/block/block_int.h |  1 +
>>   4 files changed, 24 insertions(+)
>>
>> diff --git a/block.c b/block.c
>> index 947df29..68ae3a0 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -2581,6 +2581,7 @@ int bdrv_truncate(BlockDriverState *bs, int64_t offset)
>>           ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
>>           bdrv_dirty_bitmap_truncate(bs);
>>           bdrv_parent_cb_resize(bs);
>> +        bs->dirty = true; /* file node sync is needed after truncate */
>>       }
>>       return ret;
>>   }
>> diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
>> index 4902ca5..54e0413 100644
>> --- a/block/dirty-bitmap.c
>> +++ b/block/dirty-bitmap.c
>> @@ -370,6 +370,9 @@ void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector,
>>           }
>>           hbitmap_set(bitmap->bitmap, cur_sector, nr_sectors);
>>       }
>> +
>> +    /* Set global block driver dirty flag even if bitmap is disabled */
>> +    bs->dirty = true;
>>   }
>>   
>>   /**
>> diff --git a/block/io.c b/block/io.c
>> index b9e53e3..152f5a9 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -2247,6 +2247,25 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>>           goto flush_parent;
>>       }
>>   
>> +    /* Check if storage is actually dirty before flushing to disk */
>> +    if (!bs->dirty) {
>> +        /* Flush requests are appended to tracked request list in order so that
>> +         * most recent request is at the head of the list. Following code uses
>> +         * this ordering to wait for the most recent flush request to complete
>> +         * to ensure that requests return in order */
>> +        BdrvTrackedRequest *prev_req;
>> +        QLIST_FOREACH(prev_req, &bs->tracked_requests, list) {
>> +            if (prev_req == &req || prev_req->type != BDRV_TRACKED_FLUSH) {
>> +                continue;
>> +            }
>> +
>> +            qemu_co_queue_wait(&prev_req->wait_queue);
>> +            break;
>> +        }
>> +        goto flush_parent;
> Should we check bs->dirty again after qemu_co_queue_wait()? I think another
> write request could sneak in while this coroutine yields.
no, we do not care. Any subsequent to FLUSH write does not guaranteed to
be flushed. We have the warranty only that all write requests completed
prior to this flush are really flushed.



>> +    }
>> +    bs->dirty = false;
>> +
>>       BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_DISK);
>>       if (bs->drv->bdrv_co_flush_to_disk) {
>>           ret = bs->drv->bdrv_co_flush_to_disk(bs);
>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>> index 0432ba5..59a7def 100644
>> --- a/include/block/block_int.h
>> +++ b/include/block/block_int.h
>> @@ -435,6 +435,7 @@ struct BlockDriverState {
>>       bool valid_key; /* if true, a valid encryption key has been set */
>>       bool sg;        /* if true, the device is a /dev/sg* */
>>       bool probed;    /* if true, format was probed rather than specified */
>> +    bool dirty;     /* if true, media is dirty and should be flushed */
> How about renaming this to "need_flush"? The one "dirty" we had is set by
> bdrv_set_dirty, and cleared by bdrv_reset_dirty_bitmap. I'd avoid the
> confusion between the two concepts.
>
> Fam

can be

>>   
>>       int copy_on_read; /* if nonzero, copy read backing sectors into image.
>>                            note this is a reference count */
>> -- 
>> 2.1.4
>>
Fam Zheng June 29, 2016, 1:12 a.m. UTC | #3
On Tue, 06/28 12:10, Denis V. Lunev wrote:
> On 06/28/2016 04:27 AM, Fam Zheng wrote:
> > On Mon, 06/27 17:47, Denis V. Lunev wrote:
> > > From: Evgeny Yakovlev <eyakovlev@virtuozzo.com>
> > > 
> > > Some guests (win2008 server for example) do a lot of unnecessary
> > > flushing when underlying media has not changed. This adds additional
> > > overhead on host when calling fsync/fdatasync.
> > > 
> > > This change introduces a dirty flag in BlockDriverState which is set
> > > in bdrv_set_dirty and is checked in bdrv_co_flush. This allows us to
> > > avoid unnecessary flushing when storage is clean.
> > > 
> > > The problem with excessive flushing was found by a performance test
> > > which does parallel directory tree creation (from 2 processes).
> > > Results improved from 0.424 loops/sec to 0.432 loops/sec.
> > > Each loop creates 10^3 directories with 10 files in each.
> > > 
> > > Signed-off-by: Evgeny Yakovlev <eyakovlev@virtuozzo.com>
> > > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > > CC: Kevin Wolf <kwolf@redhat.com>
> > > CC: Max Reitz <mreitz@redhat.com>
> > > CC: Stefan Hajnoczi <stefanha@redhat.com>
> > > CC: Fam Zheng <famz@redhat.com>
> > > CC: John Snow <jsnow@redhat.com>
> > > ---
> > >   block.c                   |  1 +
> > >   block/dirty-bitmap.c      |  3 +++
> > >   block/io.c                | 19 +++++++++++++++++++
> > >   include/block/block_int.h |  1 +
> > >   4 files changed, 24 insertions(+)
> > > 
> > > diff --git a/block.c b/block.c
> > > index 947df29..68ae3a0 100644
> > > --- a/block.c
> > > +++ b/block.c
> > > @@ -2581,6 +2581,7 @@ int bdrv_truncate(BlockDriverState *bs, int64_t offset)
> > >           ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
> > >           bdrv_dirty_bitmap_truncate(bs);
> > >           bdrv_parent_cb_resize(bs);
> > > +        bs->dirty = true; /* file node sync is needed after truncate */
> > >       }
> > >       return ret;
> > >   }
> > > diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
> > > index 4902ca5..54e0413 100644
> > > --- a/block/dirty-bitmap.c
> > > +++ b/block/dirty-bitmap.c
> > > @@ -370,6 +370,9 @@ void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector,
> > >           }
> > >           hbitmap_set(bitmap->bitmap, cur_sector, nr_sectors);
> > >       }
> > > +
> > > +    /* Set global block driver dirty flag even if bitmap is disabled */
> > > +    bs->dirty = true;
> > >   }
> > >   /**
> > > diff --git a/block/io.c b/block/io.c
> > > index b9e53e3..152f5a9 100644
> > > --- a/block/io.c
> > > +++ b/block/io.c
> > > @@ -2247,6 +2247,25 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
> > >           goto flush_parent;
> > >       }
> > > +    /* Check if storage is actually dirty before flushing to disk */
> > > +    if (!bs->dirty) {
> > > +        /* Flush requests are appended to tracked request list in order so that
> > > +         * most recent request is at the head of the list. Following code uses
> > > +         * this ordering to wait for the most recent flush request to complete
> > > +         * to ensure that requests return in order */
> > > +        BdrvTrackedRequest *prev_req;
> > > +        QLIST_FOREACH(prev_req, &bs->tracked_requests, list) {
> > > +            if (prev_req == &req || prev_req->type != BDRV_TRACKED_FLUSH) {
> > > +                continue;
> > > +            }
> > > +
> > > +            qemu_co_queue_wait(&prev_req->wait_queue);
> > > +            break;
> > > +        }
> > > +        goto flush_parent;
> > Should we check bs->dirty again after qemu_co_queue_wait()? I think another
> > write request could sneak in while this coroutine yields.
> no, we do not care. Any subsequent to FLUSH write does not guaranteed to
> be flushed. We have the warranty only that all write requests completed
> prior to this flush are really flushed.

I'm not worried about subsequent requests.

A prior request can be already in progress or be waiting when we check
bs->dirty, though it would be false there, but it will become true soon --
bdrv_set_dirty is only called when a request is completing.

Fam

> 
> 
> 
> > > +    }
> > > +    bs->dirty = false;
> > > +
> > >       BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_DISK);
> > >       if (bs->drv->bdrv_co_flush_to_disk) {
> > >           ret = bs->drv->bdrv_co_flush_to_disk(bs);
> > > diff --git a/include/block/block_int.h b/include/block/block_int.h
> > > index 0432ba5..59a7def 100644
> > > --- a/include/block/block_int.h
> > > +++ b/include/block/block_int.h
> > > @@ -435,6 +435,7 @@ struct BlockDriverState {
> > >       bool valid_key; /* if true, a valid encryption key has been set */
> > >       bool sg;        /* if true, the device is a /dev/sg* */
> > >       bool probed;    /* if true, format was probed rather than specified */
> > > +    bool dirty;     /* if true, media is dirty and should be flushed */
> > How about renaming this to "need_flush"? The one "dirty" we had is set by
> > bdrv_set_dirty, and cleared by bdrv_reset_dirty_bitmap. I'd avoid the
> > confusion between the two concepts.
> > 
> > Fam
> 
> can be
> 
> > >       int copy_on_read; /* if nonzero, copy read backing sectors into image.
> > >                            note this is a reference count */
> > > -- 
> > > 2.1.4
> > > 
> 
>
Denis V. Lunev June 29, 2016, 8:30 a.m. UTC | #4
On 06/29/2016 04:12 AM, Fam Zheng wrote:
> On Tue, 06/28 12:10, Denis V. Lunev wrote:
>> On 06/28/2016 04:27 AM, Fam Zheng wrote:
>>> On Mon, 06/27 17:47, Denis V. Lunev wrote:
>>>> From: Evgeny Yakovlev <eyakovlev@virtuozzo.com>
>>>>
>>>> Some guests (win2008 server for example) do a lot of unnecessary
>>>> flushing when underlying media has not changed. This adds additional
>>>> overhead on host when calling fsync/fdatasync.
>>>>
>>>> This change introduces a dirty flag in BlockDriverState which is set
>>>> in bdrv_set_dirty and is checked in bdrv_co_flush. This allows us to
>>>> avoid unnecessary flushing when storage is clean.
>>>>
>>>> The problem with excessive flushing was found by a performance test
>>>> which does parallel directory tree creation (from 2 processes).
>>>> Results improved from 0.424 loops/sec to 0.432 loops/sec.
>>>> Each loop creates 10^3 directories with 10 files in each.
>>>>
>>>> Signed-off-by: Evgeny Yakovlev <eyakovlev@virtuozzo.com>
>>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>>> CC: Kevin Wolf <kwolf@redhat.com>
>>>> CC: Max Reitz <mreitz@redhat.com>
>>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>>> CC: Fam Zheng <famz@redhat.com>
>>>> CC: John Snow <jsnow@redhat.com>
>>>> ---
>>>>    block.c                   |  1 +
>>>>    block/dirty-bitmap.c      |  3 +++
>>>>    block/io.c                | 19 +++++++++++++++++++
>>>>    include/block/block_int.h |  1 +
>>>>    4 files changed, 24 insertions(+)
>>>>
>>>> diff --git a/block.c b/block.c
>>>> index 947df29..68ae3a0 100644
>>>> --- a/block.c
>>>> +++ b/block.c
>>>> @@ -2581,6 +2581,7 @@ int bdrv_truncate(BlockDriverState *bs, int64_t offset)
>>>>            ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
>>>>            bdrv_dirty_bitmap_truncate(bs);
>>>>            bdrv_parent_cb_resize(bs);
>>>> +        bs->dirty = true; /* file node sync is needed after truncate */
>>>>        }
>>>>        return ret;
>>>>    }
>>>> diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
>>>> index 4902ca5..54e0413 100644
>>>> --- a/block/dirty-bitmap.c
>>>> +++ b/block/dirty-bitmap.c
>>>> @@ -370,6 +370,9 @@ void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector,
>>>>            }
>>>>            hbitmap_set(bitmap->bitmap, cur_sector, nr_sectors);
>>>>        }
>>>> +
>>>> +    /* Set global block driver dirty flag even if bitmap is disabled */
>>>> +    bs->dirty = true;
>>>>    }
>>>>    /**
>>>> diff --git a/block/io.c b/block/io.c
>>>> index b9e53e3..152f5a9 100644
>>>> --- a/block/io.c
>>>> +++ b/block/io.c
>>>> @@ -2247,6 +2247,25 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
>>>>            goto flush_parent;
>>>>        }
>>>> +    /* Check if storage is actually dirty before flushing to disk */
>>>> +    if (!bs->dirty) {
>>>> +        /* Flush requests are appended to tracked request list in order so that
>>>> +         * most recent request is at the head of the list. Following code uses
>>>> +         * this ordering to wait for the most recent flush request to complete
>>>> +         * to ensure that requests return in order */
>>>> +        BdrvTrackedRequest *prev_req;
>>>> +        QLIST_FOREACH(prev_req, &bs->tracked_requests, list) {
>>>> +            if (prev_req == &req || prev_req->type != BDRV_TRACKED_FLUSH) {
>>>> +                continue;
>>>> +            }
>>>> +
>>>> +            qemu_co_queue_wait(&prev_req->wait_queue);
>>>> +            break;
>>>> +        }
>>>> +        goto flush_parent;
>>> Should we check bs->dirty again after qemu_co_queue_wait()? I think another
>>> write request could sneak in while this coroutine yields.
>> no, we do not care. Any subsequent to FLUSH write does not guaranteed to
>> be flushed. We have the warranty only that all write requests completed
>> prior to this flush are really flushed.
> I'm not worried about subsequent requests.
>
> A prior request can be already in progress or be waiting when we check
> bs->dirty, though it would be false there, but it will become true soon --
> bdrv_set_dirty is only called when a request is completing.
>
> Fam
I have written specifically about this situation. FLUSH in the
controller does not guarantee that requests which are in the
progress at the moment when the flush is initiated will be
flushed.

It guarantees that requests, which are completed, i.e. which
status 'COMPLETED' was returned to the guest, will  be flushed.

Den
Stefan Hajnoczi June 29, 2016, 9:09 a.m. UTC | #5
On Wed, Jun 29, 2016 at 09:12:41AM +0800, Fam Zheng wrote:
> On Tue, 06/28 12:10, Denis V. Lunev wrote:
> > On 06/28/2016 04:27 AM, Fam Zheng wrote:
> > > On Mon, 06/27 17:47, Denis V. Lunev wrote:
> > > > From: Evgeny Yakovlev <eyakovlev@virtuozzo.com>
> > > > 
> > > > Some guests (win2008 server for example) do a lot of unnecessary
> > > > flushing when underlying media has not changed. This adds additional
> > > > overhead on host when calling fsync/fdatasync.
> > > > 
> > > > This change introduces a dirty flag in BlockDriverState which is set
> > > > in bdrv_set_dirty and is checked in bdrv_co_flush. This allows us to
> > > > avoid unnecessary flushing when storage is clean.
> > > > 
> > > > The problem with excessive flushing was found by a performance test
> > > > which does parallel directory tree creation (from 2 processes).
> > > > Results improved from 0.424 loops/sec to 0.432 loops/sec.
> > > > Each loop creates 10^3 directories with 10 files in each.
> > > > 
> > > > Signed-off-by: Evgeny Yakovlev <eyakovlev@virtuozzo.com>
> > > > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > > > CC: Kevin Wolf <kwolf@redhat.com>
> > > > CC: Max Reitz <mreitz@redhat.com>
> > > > CC: Stefan Hajnoczi <stefanha@redhat.com>
> > > > CC: Fam Zheng <famz@redhat.com>
> > > > CC: John Snow <jsnow@redhat.com>
> > > > ---
> > > >   block.c                   |  1 +
> > > >   block/dirty-bitmap.c      |  3 +++
> > > >   block/io.c                | 19 +++++++++++++++++++
> > > >   include/block/block_int.h |  1 +
> > > >   4 files changed, 24 insertions(+)
> > > > 
> > > > diff --git a/block.c b/block.c
> > > > index 947df29..68ae3a0 100644
> > > > --- a/block.c
> > > > +++ b/block.c
> > > > @@ -2581,6 +2581,7 @@ int bdrv_truncate(BlockDriverState *bs, int64_t offset)
> > > >           ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
> > > >           bdrv_dirty_bitmap_truncate(bs);
> > > >           bdrv_parent_cb_resize(bs);
> > > > +        bs->dirty = true; /* file node sync is needed after truncate */
> > > >       }
> > > >       return ret;
> > > >   }
> > > > diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
> > > > index 4902ca5..54e0413 100644
> > > > --- a/block/dirty-bitmap.c
> > > > +++ b/block/dirty-bitmap.c
> > > > @@ -370,6 +370,9 @@ void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector,
> > > >           }
> > > >           hbitmap_set(bitmap->bitmap, cur_sector, nr_sectors);
> > > >       }
> > > > +
> > > > +    /* Set global block driver dirty flag even if bitmap is disabled */
> > > > +    bs->dirty = true;
> > > >   }
> > > >   /**
> > > > diff --git a/block/io.c b/block/io.c
> > > > index b9e53e3..152f5a9 100644
> > > > --- a/block/io.c
> > > > +++ b/block/io.c
> > > > @@ -2247,6 +2247,25 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
> > > >           goto flush_parent;
> > > >       }
> > > > +    /* Check if storage is actually dirty before flushing to disk */
> > > > +    if (!bs->dirty) {
> > > > +        /* Flush requests are appended to tracked request list in order so that
> > > > +         * most recent request is at the head of the list. Following code uses
> > > > +         * this ordering to wait for the most recent flush request to complete
> > > > +         * to ensure that requests return in order */
> > > > +        BdrvTrackedRequest *prev_req;
> > > > +        QLIST_FOREACH(prev_req, &bs->tracked_requests, list) {
> > > > +            if (prev_req == &req || prev_req->type != BDRV_TRACKED_FLUSH) {
> > > > +                continue;
> > > > +            }
> > > > +
> > > > +            qemu_co_queue_wait(&prev_req->wait_queue);
> > > > +            break;
> > > > +        }
> > > > +        goto flush_parent;
> > > Should we check bs->dirty again after qemu_co_queue_wait()? I think another
> > > write request could sneak in while this coroutine yields.
> > no, we do not care. Any subsequent to FLUSH write does not guaranteed to
> > be flushed. We have the warranty only that all write requests completed
> > prior to this flush are really flushed.
> 
> I'm not worried about subsequent requests.
> 
> A prior request can be already in progress or be waiting when we check
> bs->dirty, though it would be false there, but it will become true soon --
> bdrv_set_dirty is only called when a request is completing.

Flush only guarantees that already completed writes are persistent.  It
is not a barrier operation.  It does not wait for in-flight writes and
makes no guarantee regarding them.

Stefan
diff mbox

Patch

diff --git a/block.c b/block.c
index 947df29..68ae3a0 100644
--- a/block.c
+++ b/block.c
@@ -2581,6 +2581,7 @@  int bdrv_truncate(BlockDriverState *bs, int64_t offset)
         ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
         bdrv_dirty_bitmap_truncate(bs);
         bdrv_parent_cb_resize(bs);
+        bs->dirty = true; /* file node sync is needed after truncate */
     }
     return ret;
 }
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 4902ca5..54e0413 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -370,6 +370,9 @@  void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector,
         }
         hbitmap_set(bitmap->bitmap, cur_sector, nr_sectors);
     }
+
+    /* Set global block driver dirty flag even if bitmap is disabled */
+    bs->dirty = true;
 }
 
 /**
diff --git a/block/io.c b/block/io.c
index b9e53e3..152f5a9 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2247,6 +2247,25 @@  int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
         goto flush_parent;
     }
 
+    /* Check if storage is actually dirty before flushing to disk */
+    if (!bs->dirty) {
+        /* Flush requests are appended to tracked request list in order so that
+         * most recent request is at the head of the list. Following code uses
+         * this ordering to wait for the most recent flush request to complete
+         * to ensure that requests return in order */
+        BdrvTrackedRequest *prev_req;
+        QLIST_FOREACH(prev_req, &bs->tracked_requests, list) {
+            if (prev_req == &req || prev_req->type != BDRV_TRACKED_FLUSH) {
+                continue;
+            }
+
+            qemu_co_queue_wait(&prev_req->wait_queue);
+            break;
+        }
+        goto flush_parent;
+    }
+    bs->dirty = false;
+
     BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_DISK);
     if (bs->drv->bdrv_co_flush_to_disk) {
         ret = bs->drv->bdrv_co_flush_to_disk(bs);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 0432ba5..59a7def 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -435,6 +435,7 @@  struct BlockDriverState {
     bool valid_key; /* if true, a valid encryption key has been set */
     bool sg;        /* if true, the device is a /dev/sg* */
     bool probed;    /* if true, format was probed rather than specified */
+    bool dirty;     /* if true, media is dirty and should be flushed */
 
     int copy_on_read; /* if nonzero, copy read backing sectors into image.
                          note this is a reference count */