[v4,0/3] qcow2: fix parallel rewrite and discard (rw-lock)

Message ID	20210319100811.342464-1-vsementsov@virtuozzo.com (mailing list archive)
Headers	show Return-Path: <SRS0=Pxic=IR=nongnu.org=qemu-devel-bounces+qemu-devel=archiver.kernel.org@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org C177564EF6 From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> To: qemu-block@nongnu.org Cc: qemu-devel@nongnu.org, mreitz@redhat.com, kwolf@redhat.com, vsementsov@virtuozzo.com, den@openvz.org Subject: [PATCH v4 0/3] qcow2: fix parallel rewrite and discard (rw-lock) Date: Fri, 19 Mar 2021 13:08:08 +0300 Message-Id: <20210319100811.342464-1-vsementsov@virtuozzo.com> Content-Transfer-Encoding: 8bit Content-Type: text/plain MIME-Version: 1.0 X-MS-Exchange-MessageSentRepresentingType: 1 Received-SPF: pass client-ip=40.107.20.125; envelope-from=vsementsov@virtuozzo.com; helo=EUR05-DB8-obe.outbound.protection.outlook.com X-Spam_score_int: -27 X-Spam_score: -2.8 X-Spam_bar: -- X-Spam_report: (-2.8 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, MSGID_FROM_MTA_HEADER=0.001, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org>
Series	qcow2: fix parallel rewrite and discard (rw-lock) \| expand [v4,0/3] qcow2: fix parallel rewrite and discard (rw-lock) [v4,1/3] qemu-io: add aio_discard [v4,2/3] iotests: add qcow2-discard-during-rewrite [v4,3/3] block/qcow2: introduce discard_rw_lock: fix discarding host clusters

Vladimir Sementsov-Ogievskiy March 19, 2021, 10:08 a.m. UTC

Look at 03 for the problem and fix. 01 is preparation and 02 is the
test.

Actually previous version of this thing is 
   [PATCH v2(RFC) 0/3] qcow2: fix parallel rewrite and discard

Still
   [PATCH v3 0/6] qcow2: compressed write cache
includes another fix (more complicated) for the bug, so this is called
v4.

So, what's new:

It's still a CoRwlock based solution as suggested by Kevin.

Now I think that "writer" of the lock should be code in
update_refcount() which wants to set refcount to zero. If we consider
only guest discard request as "writer" we may miss other sources of
discarding host clusters (like rewriting compressed cluster to normal,
maybe some snapshot operations, who knows what's more).

And this means that we want to take rw-lock under qcow2 s->lock. And
this brings ordering restriction for the two locks: if we want both
locks taken, we should always take s->lock first, and never take s->lock
when rw-lock is already taken (otherwise we get classic deadlock).

This leads us to taking rd-lock for in-flight writes under s->lock in
same critical section where cluster is allocated (or just got from
metadata) and releasing after data writing completion.

This in turn leads to a bit tricky logic around transferring rd-lock to
task coroutine on normal write path (see 03).. But this is still simpler
than inflight-write-counters solution in v3..

Vladimir Sementsov-Ogievskiy (3):
  qemu-io: add aio_discard
  iotests: add qcow2-discard-during-rewrite
  block/qcow2: introduce discard_rw_lock: fix discarding host clusters

 block/qcow2.h                                 |  20 +++
 block/qcow2-refcount.c                        |  22 ++++
 block/qcow2.c                                 |  73 +++++++++--
 qemu-io-cmds.c                                | 117 ++++++++++++++++++
 .../tests/qcow2-discard-during-rewrite        |  99 +++++++++++++++
 .../tests/qcow2-discard-during-rewrite.out    |  17 +++
 6 files changed, 341 insertions(+), 7 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/qcow2-discard-during-rewrite
 create mode 100644 tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out

Vladimir Sementsov-Ogievskiy March 25, 2021, 7:12 p.m. UTC | #1

ping. Do we want it for 6.0?

19.03.2021 13:08, Vladimir Sementsov-Ogievskiy wrote:
> Look at 03 for the problem and fix. 01 is preparation and 02 is the
> test.
> 
> Actually previous version of this thing is
>     [PATCH v2(RFC) 0/3] qcow2: fix parallel rewrite and discard
> 
> Still
>     [PATCH v3 0/6] qcow2: compressed write cache
> includes another fix (more complicated) for the bug, so this is called
> v4.
> 
> So, what's new:
> 
> It's still a CoRwlock based solution as suggested by Kevin.
> 
> Now I think that "writer" of the lock should be code in
> update_refcount() which wants to set refcount to zero. If we consider
> only guest discard request as "writer" we may miss other sources of
> discarding host clusters (like rewriting compressed cluster to normal,
> maybe some snapshot operations, who knows what's more).
> 
> And this means that we want to take rw-lock under qcow2 s->lock. And
> this brings ordering restriction for the two locks: if we want both
> locks taken, we should always take s->lock first, and never take s->lock
> when rw-lock is already taken (otherwise we get classic deadlock).
> 
> This leads us to taking rd-lock for in-flight writes under s->lock in
> same critical section where cluster is allocated (or just got from
> metadata) and releasing after data writing completion.
> 
> This in turn leads to a bit tricky logic around transferring rd-lock to
> task coroutine on normal write path (see 03).. But this is still simpler
> than inflight-write-counters solution in v3..
> 
> Vladimir Sementsov-Ogievskiy (3):
>    qemu-io: add aio_discard
>    iotests: add qcow2-discard-during-rewrite
>    block/qcow2: introduce discard_rw_lock: fix discarding host clusters
> 
>   block/qcow2.h                                 |  20 +++
>   block/qcow2-refcount.c                        |  22 ++++
>   block/qcow2.c                                 |  73 +++++++++--
>   qemu-io-cmds.c                                | 117 ++++++++++++++++++
>   .../tests/qcow2-discard-during-rewrite        |  99 +++++++++++++++
>   .../tests/qcow2-discard-during-rewrite.out    |  17 +++
>   6 files changed, 341 insertions(+), 7 deletions(-)
>   create mode 100755 tests/qemu-iotests/tests/qcow2-discard-during-rewrite
>   create mode 100644 tests/qemu-iotests/tests/qcow2-discard-during-rewrite.out
>

Max Reitz March 30, 2021, 9:49 a.m. UTC | #2

On 25.03.21 20:12, Vladimir Sementsov-Ogievskiy wrote:
> ping. Do we want it for 6.0?

I’d rather wait.  I think the conclusion was that guests shouldn’t hit 
this because they serialize discards?

There’s also something Kevin wrote on IRC a couple of weeks ago, for 
which I had hoped he’d sent an email but I don’t think he did, so I’ll 
try to remember and paraphrase as well as I can...

He basically asked whether it wouldn’t be conceptually simpler to take a 
reference to some cluster in get_cluster_offset() and later release it 
with a to-be-added put_cluster_offset().

He also noted that reading is problematic, too, because if you read a 
discarded and reused cluster, this might result in an information leak 
(some guest application might be able to read data it isn’t allowed to 
read); that’s why making get_cluster_offset() the point of locking 
clusters against discarding would be better.

This would probably work with both of your solutions.  For the in-memory 
solutions, you’d take a refcount to an actual cluster; in the CoRwLock 
solution, you’d take that lock.

What do you think?

Max

Vladimir Sementsov-Ogievskiy March 30, 2021, 10:51 a.m. UTC | #3

30.03.2021 12:49, Max Reitz wrote:
> On 25.03.21 20:12, Vladimir Sementsov-Ogievskiy wrote:
>> ping. Do we want it for 6.0?
> 
> I’d rather wait.  I think the conclusion was that guests shouldn’t hit this because they serialize discards?

I think, that we never had bugs, so we of course can wait.

> 
> There’s also something Kevin wrote on IRC a couple of weeks ago, for which I had hoped he’d sent an email but I don’t think he did, so I’ll try to remember and paraphrase as well as I can...
> 
> He basically asked whether it wouldn’t be conceptually simpler to take a reference to some cluster in get_cluster_offset() and later release it with a to-be-added put_cluster_offset().
> 
> He also noted that reading is problematic, too, because if you read a discarded and reused cluster, this might result in an information leak (some guest application might be able to read data it isn’t allowed to read); that’s why making get_cluster_offset() the point of locking clusters against discarding would be better.

Yes, I thought about read too, (RFCed in cover letter of [PATCH v5 0/6] qcow2: fix parallel rewrite and discard (lockless))

> 
> This would probably work with both of your solutions.  For the in-memory solutions, you’d take a refcount to an actual cluster; in the CoRwLock solution, you’d take that lock.
> 
> What do you think?
> 

Hmm. What do you mean? Just rename my qcow2_inflight_writes_inc() and qcow2_inflight_writes_dec() to get_cluster_offset()/put_cluster_offset(), to make it more native to use for read operations as well?

Or to update any kind of "getting cluster offset" in the whole qcow2 driver to take a kind of "dynamic reference count" by get_cluster_offset() and then call corresponding put() somewhere? In this case I'm afraid it's a lot more work.. It would be also the problem that a lot of paths in qcow2 are not in coroutine and don't even take s->lock when they actually should. This will also mean that we do same job as normal qcow2 refcounts already do: no sense in keeping additional "dynamic refcount" for L2 table cluster while reading it, as we already have non-zero qcow2 normal refcount for it..

Max Reitz March 30, 2021, 12:51 p.m. UTC | #4

On 30.03.21 12:51, Vladimir Sementsov-Ogievskiy wrote:
> 30.03.2021 12:49, Max Reitz wrote:
>> On 25.03.21 20:12, Vladimir Sementsov-Ogievskiy wrote:
>>> ping. Do we want it for 6.0?
>>
>> I’d rather wait.  I think the conclusion was that guests shouldn’t hit 
>> this because they serialize discards?
> 
> I think, that we never had bugs, so we of course can wait.
> 
>>
>> There’s also something Kevin wrote on IRC a couple of weeks ago, for 
>> which I had hoped he’d sent an email but I don’t think he did, so I’ll 
>> try to remember and paraphrase as well as I can...
>>
>> He basically asked whether it wouldn’t be conceptually simpler to take 
>> a reference to some cluster in get_cluster_offset() and later release 
>> it with a to-be-added put_cluster_offset().
>>
>> He also noted that reading is problematic, too, because if you read a 
>> discarded and reused cluster, this might result in an information leak 
>> (some guest application might be able to read data it isn’t allowed to 
>> read); that’s why making get_cluster_offset() the point of locking 
>> clusters against discarding would be better.
> 
> Yes, I thought about read too, (RFCed in cover letter of [PATCH v5 0/6] 
> qcow2: fix parallel rewrite and discard (lockless))
> 
>>
>> This would probably work with both of your solutions.  For the 
>> in-memory solutions, you’d take a refcount to an actual cluster; in 
>> the CoRwLock solution, you’d take that lock.
>>
>> What do you think?
>>
> 
> Hmm. What do you mean? Just rename my qcow2_inflight_writes_inc() and 
> qcow2_inflight_writes_dec() to 
> get_cluster_offset()/put_cluster_offset(), to make it more native to use 
> for read operations as well?

Hm.  Our discussion wasn’t so detailed.

I interpreted it to mean all qcow2 functions that find an offset to a 
qcow2 cluster, namely qcow2_get_host_offset(), 
qcow2_alloc_host_offset(), and qcow2_alloc_compressed_cluster_offset().

When those functions return an offset (in)to some cluster, that cluster 
(or the image as a whole) should be locked against discards.  Every 
offset received this way would require an accompanying 
qcow2_put_host_offset().

> Or to update any kind of "getting cluster offset" in the whole qcow2 
> driver to take a kind of "dynamic reference count" by 
> get_cluster_offset() and then call corresponding put() somewhere? In 
> this case I'm afraid it's a lot more work..

Hm, really?  I would have assumed we need to do some locking in all 
functions that get a cluster offset this way, so it should be less work 
to take the lock in the functions they invoke to get the offset.

> It would be also the problem 
> that a lot of paths in qcow2 are not in coroutine and don't even take 
> s->lock when they actually should.

I’m not sure what you mean here, because all functions that invoke any 
of the three functions I listed above are coroutine_fns (or, well, I 
didn’t look it up, but they all have *_co_* in their name).

> This will also mean that we do same 
> job as normal qcow2 refcounts already do: no sense in keeping additional 
> "dynamic refcount" for L2 table cluster while reading it, as we already 
> have non-zero qcow2 normal refcount for it..

I’m afraid I don’t understand how normal refcounts relate to this.  For 
example, qcow2_get_host_offset() doesn’t touch refcounts at all.

Max

Vladimir Sementsov-Ogievskiy March 30, 2021, 1:25 p.m. UTC | #5

30.03.2021 15:51, Max Reitz wrote:
> On 30.03.21 12:51, Vladimir Sementsov-Ogievskiy wrote:
>> 30.03.2021 12:49, Max Reitz wrote:
>>> On 25.03.21 20:12, Vladimir Sementsov-Ogievskiy wrote:
>>>> ping. Do we want it for 6.0?
>>>
>>> I’d rather wait.  I think the conclusion was that guests shouldn’t hit this because they serialize discards?
>>
>> I think, that we never had bugs, so we of course can wait.
>>
>>>
>>> There’s also something Kevin wrote on IRC a couple of weeks ago, for which I had hoped he’d sent an email but I don’t think he did, so I’ll try to remember and paraphrase as well as I can...
>>>
>>> He basically asked whether it wouldn’t be conceptually simpler to take a reference to some cluster in get_cluster_offset() and later release it with a to-be-added put_cluster_offset().
>>>
>>> He also noted that reading is problematic, too, because if you read a discarded and reused cluster, this might result in an information leak (some guest application might be able to read data it isn’t allowed to read); that’s why making get_cluster_offset() the point of locking clusters against discarding would be better.
>>
>> Yes, I thought about read too, (RFCed in cover letter of [PATCH v5 0/6] qcow2: fix parallel rewrite and discard (lockless))
>>
>>>
>>> This would probably work with both of your solutions.  For the in-memory solutions, you’d take a refcount to an actual cluster; in the CoRwLock solution, you’d take that lock.
>>>
>>> What do you think?
>>>
>>
>> Hmm. What do you mean? Just rename my qcow2_inflight_writes_inc() and qcow2_inflight_writes_dec() to get_cluster_offset()/put_cluster_offset(), to make it more native to use for read operations as well?
> 
> Hm.  Our discussion wasn’t so detailed.
> 
> I interpreted it to mean all qcow2 functions that find an offset to a qcow2 cluster, namely qcow2_get_host_offset(), qcow2_alloc_host_offset(), and qcow2_alloc_compressed_cluster_offset().

What about qcow2_alloc_clusters() ?

> 
> When those functions return an offset (in)to some cluster, that cluster (or the image as a whole) should be locked against discards.  Every offset received this way would require an accompanying qcow2_put_host_offset().
> 
>> Or to update any kind of "getting cluster offset" in the whole qcow2 driver to take a kind of "dynamic reference count" by get_cluster_offset() and then call corresponding put() somewhere? In this case I'm afraid it's a lot more work..
> 
> Hm, really?  I would have assumed we need to do some locking in all functions that get a cluster offset this way, so it should be less work to take the lock in the functions they invoke to get the offset.
> 
>> It would be also the problem that a lot of paths in qcow2 are not in coroutine and don't even take s->lock when they actually should.
> 
> I’m not sure what you mean here, because all functions that invoke any of the three functions I listed above are coroutine_fns (or, well, I didn’t look it up, but they all have *_co_* in their name).

qcow2_alloc_clusters() has a lot more callers..

> 
>> This will also mean that we do same job as normal qcow2 refcounts already do: no sense in keeping additional "dynamic refcount" for L2 table cluster while reading it, as we already have non-zero qcow2 normal refcount for it..
> 
> I’m afraid I don’t understand how normal refcounts relate to this.  For example, qcow2_get_host_offset() doesn’t touch refcounts at all.
> 

I mean the following: remember our discussion about what is free-cluster. If we add "dynamic-refcount", or "infligth-write-counter" thing only to count inflight data-writing (or, as discussed, we should count data-reads as well) operations, than "full reference count" of the cluster is inflight-write-count + qcow2-metadata-refcount.

But if we add a kind of "dynamic refcount" for any use of host cluster, for example reading of L2 table, than we duplicate the reference in qcow2-metadata to this L2 table (represented as refcount) by our "dynamic refcount", and we don't have a concept of "full reference count" as the sum above.. We still should treat a cluster as free when both "dynamic refcount" and qcow2-metadata-refcount are zero, but their sum doesn't have a good sense. Not a problem maybe.. But looks like a complication with no benefit.


==

OK, I think now that you didn't mean qcow2_alloc_clusters(). So, we are saying about only functions returning an offset to cluster with "guest data", not to any kind of host cluster. Than what you propose looks like this to me:

  - take my v5
  - rename qcow2_inflight_writes_dec() to put_cluster_offset()
  - call qcow2_inflight_writes_inc() from the three functions you mention

That make sense to me. Still, put_cluster_offset() name doesn't make it obvious that it's only for clusters with "guest data", and we shouldn't call it when work with metadata clusters.

Max Reitz March 30, 2021, 4:39 p.m. UTC | #6

On 30.03.21 15:25, Vladimir Sementsov-Ogievskiy wrote:
> 30.03.2021 15:51, Max Reitz wrote:
>> On 30.03.21 12:51, Vladimir Sementsov-Ogievskiy wrote:
>>> 30.03.2021 12:49, Max Reitz wrote:
>>>> On 25.03.21 20:12, Vladimir Sementsov-Ogievskiy wrote:
>>>>> ping. Do we want it for 6.0?
>>>>
>>>> I’d rather wait.  I think the conclusion was that guests shouldn’t 
>>>> hit this because they serialize discards?
>>>
>>> I think, that we never had bugs, so we of course can wait.
>>>
>>>>
>>>> There’s also something Kevin wrote on IRC a couple of weeks ago, for 
>>>> which I had hoped he’d sent an email but I don’t think he did, so 
>>>> I’ll try to remember and paraphrase as well as I can...
>>>>
>>>> He basically asked whether it wouldn’t be conceptually simpler to 
>>>> take a reference to some cluster in get_cluster_offset() and later 
>>>> release it with a to-be-added put_cluster_offset().
>>>>
>>>> He also noted that reading is problematic, too, because if you read 
>>>> a discarded and reused cluster, this might result in an information 
>>>> leak (some guest application might be able to read data it isn’t 
>>>> allowed to read); that’s why making get_cluster_offset() the point 
>>>> of locking clusters against discarding would be better.
>>>
>>> Yes, I thought about read too, (RFCed in cover letter of [PATCH v5 
>>> 0/6] qcow2: fix parallel rewrite and discard (lockless))
>>>
>>>>
>>>> This would probably work with both of your solutions.  For the 
>>>> in-memory solutions, you’d take a refcount to an actual cluster; in 
>>>> the CoRwLock solution, you’d take that lock.
>>>>
>>>> What do you think?
>>>>
>>>
>>> Hmm. What do you mean? Just rename my qcow2_inflight_writes_inc() and 
>>> qcow2_inflight_writes_dec() to 
>>> get_cluster_offset()/put_cluster_offset(), to make it more native to 
>>> use for read operations as well?
>>
>> Hm.  Our discussion wasn’t so detailed.
>>
>> I interpreted it to mean all qcow2 functions that find an offset to a 
>> qcow2 cluster, namely qcow2_get_host_offset(), 
>> qcow2_alloc_host_offset(), and qcow2_alloc_compressed_cluster_offset().
> 
> What about qcow2_alloc_clusters() ?

Seems like all callers for that but do_alloc_cluster_offset() call it to 
allocate metadata clusters, which cannot be discarded from the guest.

Is it really possible that some metadata cluster is used while qcow2 
discards it internally at the same time, or isn’t all of this only a 
problem for data clusters?

>> When those functions return an offset (in)to some cluster, that 
>> cluster (or the image as a whole) should be locked against discards.  
>> Every offset received this way would require an accompanying 
>> qcow2_put_host_offset().
>>
>>> Or to update any kind of "getting cluster offset" in the whole qcow2 
>>> driver to take a kind of "dynamic reference count" by 
>>> get_cluster_offset() and then call corresponding put() somewhere? In 
>>> this case I'm afraid it's a lot more work..
>>
>> Hm, really?  I would have assumed we need to do some locking in all 
>> functions that get a cluster offset this way, so it should be less 
>> work to take the lock in the functions they invoke to get the offset.
>>
>>> It would be also the problem that a lot of paths in qcow2 are not in 
>>> coroutine and don't even take s->lock when they actually should.
>>
>> I’m not sure what you mean here, because all functions that invoke any 
>> of the three functions I listed above are coroutine_fns (or, well, I 
>> didn’t look it up, but they all have *_co_* in their name).
> 
> qcow2_alloc_clusters() has a lot more callers..

Let’s hope we don’t need to worry about it then. O:)

>>> This will also mean that we do same job as normal qcow2 refcounts 
>>> already do: no sense in keeping additional "dynamic refcount" for L2 
>>> table cluster while reading it, as we already have non-zero qcow2 
>>> normal refcount for it..
>>
>> I’m afraid I don’t understand how normal refcounts relate to this.  
>> For example, qcow2_get_host_offset() doesn’t touch refcounts at all.
>>
> 
> I mean the following: remember our discussion about what is 
> free-cluster. If we add "dynamic-refcount", or "infligth-write-counter" 
> thing only to count inflight data-writing (or, as discussed, we should 
> count data-reads as well) operations, than "full reference count" of the 
> cluster is inflight-write-count + qcow2-metadata-refcount.
> 
> But if we add a kind of "dynamic refcount" for any use of host cluster, 
> for example reading of L2 table, than we duplicate the reference in 
> qcow2-metadata to this L2 table (represented as refcount) by our 
> "dynamic refcount", and we don't have a concept of "full reference 
> count" as the sum above.. We still should treat a cluster as free when 
> both "dynamic refcount" and qcow2-metadata-refcount are zero, but their 
> sum doesn't have a good sense. Not a problem maybe.. But looks like a 
> complication with no benefit.

You’re right, but I don’t think that’s a real problem.  Perhaps the sum 
was even a false equivalency.  There is a difference between the dynamic 
refcount and the on-disk refcount: We must wait with discarding until 
the the dynamic refcount is 0, and discarding will then drop the on-disk 
refcount to 0, too.  I think.  So I’m not sure whether the sum really 
means anything.

But if metadata isn’t a problem and that means we don’t have ask these 
questions at all, well, that’ll be even better.

> ==
> 
> OK, I think now that you didn't mean qcow2_alloc_clusters(). So, we are 
> saying about only functions returning an offset to cluster with "guest 
> data", not to any kind of host cluster. Than what you propose looks like 
> this to me:
> 
>   - take my v5
>   - rename qcow2_inflight_writes_dec() to put_cluster_offset()
>   - call qcow2_inflight_writes_inc() from the three functions you mention

Yes, I think so.  Or you take the CoRwLock in those three functions, 
depending on which implementation we want.

Sorry if we’ve discussed this before and I just forgot, but: What are 
the performance implications of either solution?  As far as I remember, 
the inflight-write-counter solution had the problem of always doing 
stuff on every I/O access.  You said the impact was small and yes, it 
is, but it’s still there.
I haven’t looked at the CoRwLock solution but it I would assume it’s 
basically zero cost for common cases, right?  I.e. the case where the 
guest already serializes discards from other accesses, which I thought 
is what e.g. Linux does.  (And even if it doesn’t, I would assume that 
concurrent I/O and discards are rather rare.)

> That make sense to me. Still, put_cluster_offset() name doesn't make it 
> obvious that it's only for clusters with "guest data", and we shouldn't 
> call it when work with metadata clusters.

Yeah, it was meant for symmetry with qcow2_get_host_offset() (i.e. it 
would be named “qcow2_put_host_offset()”).  Now that there are three 
functions that would take a reference, it should get some other name.  I 
don’t know.  qcow2_put_data_cluster_offset()?

Max

Vladimir Sementsov-Ogievskiy March 30, 2021, 6:33 p.m. UTC | #7

30.03.2021 19:39, Max Reitz wrote:
> On 30.03.21 15:25, Vladimir Sementsov-Ogievskiy wrote:
>> 30.03.2021 15:51, Max Reitz wrote:
>>> On 30.03.21 12:51, Vladimir Sementsov-Ogievskiy wrote:
>>>> 30.03.2021 12:49, Max Reitz wrote:
>>>>> On 25.03.21 20:12, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> ping. Do we want it for 6.0?
>>>>>
>>>>> I’d rather wait.  I think the conclusion was that guests shouldn’t hit this because they serialize discards?
>>>>
>>>> I think, that we never had bugs, so we of course can wait.
>>>>
>>>>>
>>>>> There’s also something Kevin wrote on IRC a couple of weeks ago, for which I had hoped he’d sent an email but I don’t think he did, so I’ll try to remember and paraphrase as well as I can...
>>>>>
>>>>> He basically asked whether it wouldn’t be conceptually simpler to take a reference to some cluster in get_cluster_offset() and later release it with a to-be-added put_cluster_offset().
>>>>>
>>>>> He also noted that reading is problematic, too, because if you read a discarded and reused cluster, this might result in an information leak (some guest application might be able to read data it isn’t allowed to read); that’s why making get_cluster_offset() the point of locking clusters against discarding would be better.
>>>>
>>>> Yes, I thought about read too, (RFCed in cover letter of [PATCH v5 0/6] qcow2: fix parallel rewrite and discard (lockless))
>>>>
>>>>>
>>>>> This would probably work with both of your solutions.  For the in-memory solutions, you’d take a refcount to an actual cluster; in the CoRwLock solution, you’d take that lock.
>>>>>
>>>>> What do you think?
>>>>>
>>>>
>>>> Hmm. What do you mean? Just rename my qcow2_inflight_writes_inc() and qcow2_inflight_writes_dec() to get_cluster_offset()/put_cluster_offset(), to make it more native to use for read operations as well?
>>>
>>> Hm.  Our discussion wasn’t so detailed.
>>>
>>> I interpreted it to mean all qcow2 functions that find an offset to a qcow2 cluster, namely qcow2_get_host_offset(), qcow2_alloc_host_offset(), and qcow2_alloc_compressed_cluster_offset().
>>
>> What about qcow2_alloc_clusters() ?
> 
> Seems like all callers for that but do_alloc_cluster_offset() call it to allocate metadata clusters, which cannot be discarded from the guest.
> 
> Is it really possible that some metadata cluster is used while qcow2 discards it internally at the same time, or isn’t all of this only a problem for data clusters?

It's a good question which I think doesn't have fast answer.

I think:

1. On normal code paths we usually handle qcow2 metadata always under s->lock mutex. So, we can't decrease refcount to zero in parallel, as update_refcount() should be called under s->lock too.

2. [1] is violated at least by code paths that are not in coroutine (for example qcow2-snapshots). But seems that on these paths we are guarded by other things (like drained section).. For sure we should move the whole qcow2 driver to coroutine and do every metadata update under s->lock.

3. Does [1] violated on some coroutine paths I don't know. But I hope that it doesn't, otherwise we should have bugs..

In future we'll probably want to work with metadata without s->lock, at least for IO. Then we'll need same mechanisms like we are now inventing for data writes.. But I'm not sure that for metadata it will be so simple (as for now mutex prevents not only parallel write & discard of metadata, but also parallel updates. (And for parallel guest writes to same cluster we don't care, it's a problem of the guest)

> 
>>> When those functions return an offset (in)to some cluster, that cluster (or the image as a whole) should be locked against discards. Every offset received this way would require an accompanying qcow2_put_host_offset().
>>>
>>>> Or to update any kind of "getting cluster offset" in the whole qcow2 driver to take a kind of "dynamic reference count" by get_cluster_offset() and then call corresponding put() somewhere? In this case I'm afraid it's a lot more work..
>>>
>>> Hm, really?  I would have assumed we need to do some locking in all functions that get a cluster offset this way, so it should be less work to take the lock in the functions they invoke to get the offset.
>>>
>>>> It would be also the problem that a lot of paths in qcow2 are not in coroutine and don't even take s->lock when they actually should.
>>>
>>> I’m not sure what you mean here, because all functions that invoke any of the three functions I listed above are coroutine_fns (or, well, I didn’t look it up, but they all have *_co_* in their name).
>>
>> qcow2_alloc_clusters() has a lot more callers..
> 
> Let’s hope we don’t need to worry about it then. O:)
> 
>>>> This will also mean that we do same job as normal qcow2 refcounts already do: no sense in keeping additional "dynamic refcount" for L2 table cluster while reading it, as we already have non-zero qcow2 normal refcount for it..
>>>
>>> I’m afraid I don’t understand how normal refcounts relate to this. For example, qcow2_get_host_offset() doesn’t touch refcounts at all.
>>>
>>
>> I mean the following: remember our discussion about what is free-cluster. If we add "dynamic-refcount", or "infligth-write-counter" thing only to count inflight data-writing (or, as discussed, we should count data-reads as well) operations, than "full reference count" of the cluster is inflight-write-count + qcow2-metadata-refcount.
>>
>> But if we add a kind of "dynamic refcount" for any use of host cluster, for example reading of L2 table, than we duplicate the reference in qcow2-metadata to this L2 table (represented as refcount) by our "dynamic refcount", and we don't have a concept of "full reference count" as the sum above.. We still should treat a cluster as free when both "dynamic refcount" and qcow2-metadata-refcount are zero, but their sum doesn't have a good sense. Not a problem maybe.. But looks like a complication with no benefit.
> 
> You’re right, but I don’t think that’s a real problem.  Perhaps the sum was even a false equivalency.  There is a difference between the dynamic refcount and the on-disk refcount: We must wait with discarding until the the dynamic refcount is 0, and discarding will then drop the on-disk refcount to 0, too.  I think.  So I’m not sure whether the sum really means anything.
> 
> But if metadata isn’t a problem and that means we don’t have ask these questions at all, well, that’ll be even better.
> 
>> ==
>>
>> OK, I think now that you didn't mean qcow2_alloc_clusters(). So, we are saying about only functions returning an offset to cluster with "guest data", not to any kind of host cluster. Than what you propose looks like this to me:
>>
>>   - take my v5
>>   - rename qcow2_inflight_writes_dec() to put_cluster_offset()
>>   - call qcow2_inflight_writes_inc() from the three functions you mention
> 
> Yes, I think so.  Or you take the CoRwLock in those three functions, depending on which implementation we want.

Yes.

> 
> Sorry if we’ve discussed this before and I just forgot, but: What are the performance implications of either solution?  As far as I remember, the inflight-write-counter solution had the problem of always doing stuff on every I/O access.  You said the impact was small and yes, it is, but it’s still there.

Yes. I think, I can measure it on tmpfs, and if degradation less than 1%, I'd just ignore it. Still inflight-counter solution is more complicated.

Hmm, note that CoRwLock solution has one non-trivial thing too: moving the lock to task coroutine. With inflight-counters we don't need such thing.

> I haven’t looked at the CoRwLock solution but it I would assume it’s basically zero cost for common cases, right?  I.e. the case where the guest already serializes discards from other accesses, which I thought is what e.g. Linux does.  (And even if it doesn’t, I would assume that concurrent I/O and discards are rather rare.)

With CoRwLock any kind of host-cluster-discard becomes blocking. We block the guest io (remember that we want to protect reads too), wait until all operations completed, then we do discard, than unfreeze the whole io.. What about removing persistent dirty bitmap? It becomes such an operation.. Hmm. Can I imagine another example? Switch-to-snapshot should be in a drained section anyway.. Rewriting compressed cluster to normal cluster is rare case. Hmm what else?

We don't have it now, but what about a feature like this:

  During block-commit, discard clusters that are already committed to base from all intermediate images: thus we avoid increasing of disk-usage during commit job.

But, anyway, we don't have such feature yet.

So, I think that in general lockless solution is better.. But currently we don't have a significant use-case which would be a show-stopper for CoRwLock solution. Or at least I can't imagine it.

Ha, or, I have one in mind: mirror. mirror does discards (it has MIRROR_METHOD_DISCARD). Honestly, I always doubt about that fact. I think that mirror job should not use discard at all, it should either write or write-zeroes, nothing else.. But it's my opinion, probably I don't know the use-case. Anyway, with current implementation of mirror-job write-lock on discard may reduce performance of mirror job (and of guest as well) in cases when mirror does discard operations.. Mirror does discard when block_status return neither ZERO nor DATA. And accordingly to my old mail https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg05136.html , only iscsi may reasonably report such block status... Hmm. Any thoughts?


> 
>> That make sense to me. Still, put_cluster_offset() name doesn't make it obvious that it's only for clusters with "guest data", and we shouldn't call it when work with metadata clusters.
> 
> Yeah, it was meant for symmetry with qcow2_get_host_offset() (i.e. it would be named “qcow2_put_host_offset()”).  Now that there are three functions that would take a reference, it should get some other name.  I don’t know.  qcow2_put_data_cluster_offset()?
> 
> Max
>

Vladimir Sementsov-Ogievskiy March 31, 2021, 6:23 a.m. UTC | #8

30.03.2021 19:39, Max Reitz wrote:
>> ==
>>
>> OK, I think now that you didn't mean qcow2_alloc_clusters(). So, we are saying about only functions returning an offset to cluster with "guest data", not to any kind of host cluster. Than what you propose looks like this to me:
>>
>>   - take my v5
>>   - rename qcow2_inflight_writes_dec() to put_cluster_offset()
>>   - call qcow2_inflight_writes_inc() from the three functions you mention
> 
> Yes, I think so.  Or you take the CoRwLock in those three functions, depending on which implementation we want.

CoRwLock brings one inconvenient feature: to pass ownership on the reference to coroutine (for example to task coroutine in qcow2_co_pwritev) we have to unlock the lock in original coroutine and lock in another, as CoRwLock doesn't support transferring to other coroutine. So we'll do put_cluster_offset() in qcow2_co_pwritev() and do get_cluster_offset() in qcow2_co_pwritev_task(). (and we must be sure that the second lock taken before releasing the first one). So in this case it probably better to keep direct CoRwLock interface, not shadowing it by get/put.

[v4,0/3] qcow2: fix parallel rewrite and discard (rw-lock)

Message

Comments