Message ID | 20170411011718.9152-2-eblake@redhat.com (mailing list archive)
State | New, archived
On 11.04.2017 at 03:17, Eric Blake wrote:
> 'qemu-img map' already coalesces information about unallocated
> clusters (those with status 0) and pure zero clusters (those
> with status BDRV_BLOCK_ZERO and no offset). Furthermore, all
> qcow2 images with no backing file already report all unallocated
> clusters (in the preallocation sense of clusters with no offset)
> as BDRV_BLOCK_ZERO, regardless of whether the QCOW_OFLAG_ZERO was
> set in that L2 entry (QCOW_OFLAG_ZERO also implies a return of
> BDRV_BLOCK_ALLOCATED, but we intentionally do not expose that bit
> to external users), thanks to generic block layer code in
> bdrv_co_get_block_status().
>
> So, for an image with no backing file, having bdrv_pwrite_zeroes
> mark clusters as unallocated (defer to backing file) rather than
> reads-as-zero (regardless of backing file) makes no difference
> to normal behavior, but may potentially allow for fewer writes to
> the L2 table of a freshly-created image where the L2 table is
> initially written to all-zeroes (although I don't actually know
> if we skip an L2 update and flush when re-writing the same
> contents as already present).

I don't get this. Allocating a cluster always involves an L2 update, no
matter whether it was previously unallocated or a zero cluster.

> Furthermore, this matches the behavior of discard_single_l2(), in
> favoring an unallocated cluster over a zero cluster when full
> discard is requested.

The only use for "full discard" is qcow2_make_empty(). It explicitly
requests that the backing file becomes visible again. This is a
completely different case.

In other words, in order to stay consistent between discard and
write_zeroes from a guest POV, we need to leave this code alone.

> Meanwhile, version 2 qcow2 files (compat=0.10) lack support for an
> explicit zero cluster. This minor tweak therefore allows us to turn
> write zeroes with unmap into an actual unallocation on those files,
> where they used to return -ENOTSUP and cause an allocation due to
> the fallback to explicitly written zeroes.

Okay, this is true.

But I doubt that making write_zeroes more efficient on v2 images without
a backing file is really worth any extra complexity at this point...

Kevin
On 04/12/2017 04:49 AM, Kevin Wolf wrote:
> On 11.04.2017 at 03:17, Eric Blake wrote:
>> 'qemu-img map' already coalesces information about unallocated
>> clusters (those with status 0) and pure zero clusters (those
>> with status BDRV_BLOCK_ZERO and no offset). Furthermore, all
>> qcow2 images with no backing file already report all unallocated
>> clusters (in the preallocation sense of clusters with no offset)
>> as BDRV_BLOCK_ZERO, regardless of whether the QCOW_OFLAG_ZERO was
>> set in that L2 entry (QCOW_OFLAG_ZERO also implies a return of
>> BDRV_BLOCK_ALLOCATED, but we intentionally do not expose that bit
>> to external users), thanks to generic block layer code in
>> bdrv_co_get_block_status().
>>
>> So, for an image with no backing file, having bdrv_pwrite_zeroes
>> mark clusters as unallocated (defer to backing file) rather than
>> reads-as-zero (regardless of backing file) makes no difference
>> to normal behavior, but may potentially allow for fewer writes to
>> the L2 table of a freshly-created image where the L2 table is
>> initially written to all-zeroes (although I don't actually know
>> if we skip an L2 update and flush when re-writing the same
>> contents as already present).
>
> I don't get this. Allocating a cluster always involves an L2 update, no
> matter whether it was previously unallocated or a zero cluster.

But we're NOT allocating a cluster. Either we are unmapping a cluster,
or we are changing a pure unallocated cluster (bdrv_get_block_status of
0) to a reads-as-zero but unallocated cluster (status 1) [different
from the 'bdrv_is_allocated' sense, where an allocated cluster is any
cluster whose contents are determined by the current layer instead of
by its backing file - I really wish we wouldn't overload 'allocated'
with two different meanings, but I am not sure which of the two
meanings would be better to rename, or to what].

My argument is that for a brand new image with no backing file, all the
clusters are pure unallocated, and that a write zero with unmapping is
doing more work by changing 0 => 1 than it would by leaving 0 => 0,
with identical results.

>> Furthermore, this matches the behavior of discard_single_l2(), in
>> favoring an unallocated cluster over a zero cluster when full
>> discard is requested.
>
> The only use for "full discard" is qcow2_make_empty(). It explicitly
> requests that the backing file becomes visible again. This is a
> completely different case.
>
> In other words, in order to stay consistent between discard and
> write_zeroes from a guest POV, we need to leave this code alone.

I already argued that from a guest POV, the code is identical. The only
situation that I changed is the one where there is no backing file, so
the guest is guaranteed to read zeroes from the cluster whether the
cluster is marked pure unallocated (0) or reads-as-zero unallocated
(1). The guest already requested unmapping, so it doesn't need the
preallocated-reads-as-zero state (1 | offset).

Note that I absolutely agree that we must NOT make this change when
there is a backing file - a guest request to write zeroes MUST use
reads-as-zero (1) or preallocated reads-as-zero (1 | offset), and not
pure unallocated, unless we can guarantee that falling back to the
backing file will also read as zeroes; but taking the time to learn the
backing file's status is not worth the expense.

>> Meanwhile, version 2 qcow2 files (compat=0.10) lack support for an
>> explicit zero cluster. This minor tweak therefore allows us to turn
>> write zeroes with unmap into an actual unallocation on those files,
>> where they used to return -ENOTSUP and cause an allocation due to
>> the fallback to explicitly written zeroes.
>
> Okay, this is true.
>
> But I doubt that making write_zeroes more efficient on v2 images without
> a backing file is really worth any extra complexity at this point...

Is it just my commit message that is wrong? The code itself is fairly
short to review (I spent a lot more time on the commit message than on
the code - and it looks like I still need more time on the commit
message). And later patches in the series DO depend on this, so I'm not
ready to just drop the patch yet.
On 04/12/2017 04:49 AM, Kevin Wolf wrote:
>> Furthermore, this matches the behavior of discard_single_l2(), in
>> favoring an unallocated cluster over a zero cluster when full
>> discard is requested.
>
> The only use for "full discard" is qcow2_make_empty(). It explicitly
> requests that the backing file becomes visible again. This is a
> completely different case.

No, let's look at that code - I'm not referring to the
qcow2_make_empty() case, but to the bdrv_pdiscard() case:

    /*
     * If full_discard is false, make sure that a discarded area reads back
     * as zeroes for v3 images (we cannot do it for v2 without actually
     * writing a zero-filled buffer). We can skip the operation if the
     * cluster is already marked as zero, or if it's unallocated and we
     * don't have a backing file.
     *
     * TODO We might want to use bdrv_get_block_status(bs) here, but we're
     * holding s->lock, so that doesn't work today.
     *
     * If full_discard is true, the sector should not read back as zeroes,
     * but rather fall through to the backing file.
     */
    switch (qcow2_get_cluster_type(old_l2_entry)) {
    case QCOW2_CLUSTER_UNALLOCATED:
        if (full_discard || !bs->backing) {
            continue;
        }
        break;

That says: if our cluster is currently unallocated, and we have no
backing file, then LEAVE it unallocated (rather than marking it
QCOW2_CLUSTER_ZERO), even through bdrv_pdiscard. Meanwhile, the next
lines:

    case QCOW2_CLUSTER_ZERO:
        if (!full_discard) {
            continue;
        }

state that if we are doing bdrv_pdiscard(), and the cluster already
reads as zeroes, then we minimize the churn by not changing the cluster
(and that in turn means we do NOT lose a preallocation).

> In other words, in order to stay consistent between discard and
> write_zeroes from a guest POV, we need to leave this code alone.

Again, my argument is that if we have no backing file, and the cluster
was previously unallocated, then leaving it unallocated is equivalent
to changing it to QCOW2_CLUSTER_ZERO. There is no difference to the
guest POV with this change.

>> Meanwhile, version 2 qcow2 files (compat=0.10) lack support for an
>> explicit zero cluster. This minor tweak therefore allows us to turn
>> write zeroes with unmap into an actual unallocation on those files,
>> where they used to return -ENOTSUP and cause an allocation due to
>> the fallback to explicitly written zeroes.
>
> Okay, this is true.
>
> But I doubt that making write_zeroes more efficient on v2 images without
> a backing file is really worth any extra complexity at this point...

My argument is that it is not much complexity, and that making this
change, coupled with the new qemu-iotest 179 in patch 2/13 (which fails
without this change), enabled me to test qemu-io -c 'w -z -u' (which
nothing else was testing) and ensure that I'm not regressing anything
when the rest of the series further tweaks things to be byte-based.
[revisiting this older patch version, even though the final version in
today's pull request changed somewhat from this approach]

On 04/12/2017 04:49 AM, Kevin Wolf wrote:
> On 11.04.2017 at 03:17, Eric Blake wrote:
>> 'qemu-img map' already coalesces information about unallocated
>> clusters (those with status 0) and pure zero clusters (those
>> with status BDRV_BLOCK_ZERO and no offset). [...]
>>
>> So, for an image with no backing file, having bdrv_pwrite_zeroes
>> mark clusters as unallocated (defer to backing file) rather than
>> reads-as-zero (regardless of backing file) makes no difference
>> to normal behavior, but may potentially allow for fewer writes to
>> the L2 table of a freshly-created image [...]
>
> I don't get this. Allocating a cluster always involves an L2 update, no
> matter whether it was previously unallocated or a zero cluster.

On IRC, John, Kevin, and I were discussing the current situation with
libvirt NBD storage migration. When libvirt creates a file on the
destination (rather than the user pre-creating it), it currently
defaults to 0.10 [v2] images, even if the source was a 1.1 image [v3]
(arguably something that could be improved in libvirt, but
https://bugzilla.redhat.com/show_bug.cgi?id=1371749 was closed as not a
bug).

Therefore, the use case of doing a mirror job to a v2 image, and having
that image become thick even though the source was thin, is happening
more than we'd like
(https://bugzilla.redhat.com/show_bug.cgi?id=1371749). While Kevin had
a point that in the common case we ALWAYS want to turn an unallocated
cluster into a zero cluster (so that we don't have to audit whether all
callers are properly accounting for the case where a backing image is
added later), our conversation on IRC today conceded that we may want
to introduce a new BDRV_REQ_READ_ZERO_NOP (or some such name) that
particular callers can use to request that if a cluster already reads
as zeroes, the write zero request does NOT have to change it. Normal
guest operations would not use the flag, but mirroring IS a case that
would use the flag, so that we can end up with thinner mirrors even to
0.10 images.

The other consideration is that on 0.10 images, even if we have to
allocate, right now our allocation is done by way of failing with
-ENOTSUP and falling back to the normal pwrite() of explicit zeroes. It
may be worth teaching the qcow2 layer to explicitly handle write
zeroes, even on 0.10 images, by allocating a cluster (as needed) but
then telling bs->file to write zeroes (punching a hole as appropriate)
so that the file is still thin. In fact, this matches the code we
already have that, when a qcow2 cluster reports
BDRV_BLOCK_DATA|BDRV_BLOCK_OFFSET_VALID, probes bs->file to see if
there is a hole there, at which point it can add BDRV_BLOCK_ZERO to the
bdrv_get_block_status result.

I don't know which one of us will tackle patches along these lines, but
I wanted to at least capture the gist of the IRC conversation in the
archives for a bit more permanence.
On 05/11/2017 09:56 AM, Eric Blake wrote:
> On IRC, John, Kevin, and I were discussing the current situation with
> libvirt NBD storage migration. [...]
>
> Therefore, the use case of doing a mirror job to a v2 image, and having
> that image become thick even though the source was thin, is happening
> more than we'd like
> (https://bugzilla.redhat.com/show_bug.cgi?id=1371749). While Kevin had

Bah, pasted the wrong link. Correct one is:
https://bugzilla.redhat.com/show_bug.cgi?id=1219541
On 2017-05-11 16:56, Eric Blake wrote:
> [revisiting this older patch version, even though the final version in
> today's pull request changed somewhat from this approach]
>
> On 04/12/2017 04:49 AM, Kevin Wolf wrote:
>> [...]
>> I don't get this. Allocating a cluster always involves an L2 update, no
>> matter whether it was previously unallocated or a zero cluster.
>
> On IRC, John, Kevin, and I were discussing the current situation with
> libvirt NBD storage migration. [...]
>
> I don't know which one of us will tackle patches along these lines, but
> wanted to at least capture the gist of the IRC conversation in the
> archives for a bit more permanence.

Just short ideas:

(1) I do consider it a bug if v2 images are created. The BZ wasn't
closed for a very good reason, but because nobody answered this
question:

> is this really a problem?

And apparently it is.

(2) There is a way of creating zero clusters on v2 images, but we've
never done it (yet): You create a single data cluster containing zeroes
and then you just point to it whenever you need a zero cluster (until
its refcount overflows, at which point you allocate another cluster).
Maybe that helps.

I'd be really careful with optimizations for v2 images. You have
already proven to me that my fears of too much complexity are out of
proportion sometimes, but still. v2 upstream is old legacy and should
be treated as such, at least IMHO.

Max
On 05/12/2017 12:06 PM, Max Reitz wrote:
> On 2017-05-11 16:56, Eric Blake wrote:
>> [revisiting this older patch version, even though the final version in
>> today's pull request changed somewhat from this approach]
>> [...]
>
> Just short ideas:
>
> (1) I do consider it a bug if v2 images are created. The BZ wasn't
> closed for a very good reason, but because nobody answered this question:
>
>> is this really a problem?
>
> And apparently it is.
>
> (2) There is a way of creating zero clusters on v2 images, but we've
> never done it (yet): You create a single data cluster containing zeroes
> and then you just point to it whenever you need a zero cluster (until
> its refcount overflows, at which point you allocate another cluster).
> Maybe that helps.
>
> I'd be really careful with optimizations for v2 images. You have already
> proven to me that my fears of too much complexity are out of proportion
> sometimes, but still. v2 upstream is old legacy and should be treated as
> such, at least IMHO.
>
> Max

I agree that v2 changes should be limited in nature, but there are
less fiddly ways to have thin 0.10 images on sparse filesystems. This
way feels a little too clever and most likely too intrusive for an old
format. I'd also think that it'd probably confuse older copies of
qemu-img check, so maybe this is too hacky of a solution.
On 2017-05-13 01:00, John Snow wrote:
> On 05/12/2017 12:06 PM, Max Reitz wrote:
>> On 2017-05-11 16:56, Eric Blake wrote:
>>> [revisiting this older patch version, even though the final version in
>>> today's pull request changed somewhat from this approach]
>>> [...]
>>>
>>> Therefore, the use case of doing a mirror job to a v2 image, and having
>>> that image become thick even though the source was thin, is happening
>>> more than we'd like
>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1371749). While Kevin had
>>> a point that in the common case we ALWAYS want to turn an unallocated
>>> cluster into a zero cluster (so that we don't have to audit whether all
>>> callers are properly accounting for the case where a backing image is
>>> added later), our conversation on IRC today conceded that we may want to
>>> introduce a new BDRV_REQ_READ_ZERO_NOP (or some such name) that
>>> particular callers can use to request that if a cluster already reads as
>>> zeroes, the write zero request does NOT have to change it. Normal guest
>>> operations would not use the flag, but mirroring IS a case that would
>>> use the flag, so that we can end up with thinner mirrors even to 0.10
>>> images.

Reading this again, you could just pass through the write-zero request.
The issue right now is that the qcow2 driver just returns -ENOTSUP for
every write-zero request on v2 images when it could just create a data
cluster and invoke a write-zero request on the data.

Just not suppressing the BDRV_REQ_ZERO_WRITE flag for the generic
fall-back zero write (with bs->supported_write_flags listing it) and
then issuing a bdrv_co_pwrite_zeroes() in qcow2_co_pwritev() works
(with BDRV_REQ_MAY_UNMAP). (Whether or not to set BDRV_REQ_MAY_UNMAP
should be derived from... Maybe QCOW2_DISCARD_OTHER?)
>>> The other consideration is that on 0.10 images, even if we have to
>>> allocate, right now our allocation is done by way of failing with
>>> -ENOTSUP and falling back to the normal pwrite() of explicit zeroes.
>>> [...]
>>
>> Just short ideas:
>>
>> (1) I do consider it a bug if v2 images are created. [...]
>>
>> (2) There is a way of creating zero clusters on v2 images, but we've
>> never done it (yet): You create a single data cluster containing zeroes
>> and then you just point to it whenever you need a zero cluster (until
>> its refcount overflows, at which point you allocate another cluster).
>> Maybe that helps.
>>
>> I'd be really careful with optimizations for v2 images. [...]
>
> I agree that V2 changes should be limited in nature, but there are
> less-fiddly ways to have thin 0.10 images on sparse filesystems. This
> way feels a little too clever and mostly likely too intrusive for an old
> format.

Yes.

> I'd also think that it'd probably confuse older copies of qemu-img
> check, so maybe this is too hacky of a solution.

I don't. Anyway, feel free to convince me by writing a simple patch
that allows the v2 sparseness you need.

Let's again get to the root of the issue: Mirroring to a new v2 qcow2
image does not create a sparse image. There are multiple questions
here:

(1) Who does mirroring?
(2) Who uses v2 qcow2 for the destination?
(3) Who needs sparse images?

(1) is clear, everybody does it. (3) is rather clear, it would be nice
to have (but not an absolute catastrophe if we didn't). So what remains
is (2).

From qemu's perspective, qcow2v3 has been the default for quite some
time. So just using qemu, nobody will use v2. Maybe you're using an old
qemu, maybe you're using RHEL 6's qemu which only supports v2. This is
an argument, but I'd very much just recommend an update. (Bit of an
internal question: Are you willing to backport the upstream changes
this would bring to RHEL 6?)

And then there's the case the BZ is about: Maybe you're using a
management tool which forces v2 images. I'd call it a bug in the
management tool because we didn't make v3 the default for nothing.
Maybe there are good reasons why some management tools create v2, but I
don't know any, and when the BZ was closed none were given. It was just
closed because there was no apparent reason not to (but there is!).

So I see the following courses of action:

1. Ignore the v2 "issue". It's not an upstream issue, it's a bug in a
   management tool, so let's just reopen the BZ and ignore RHEL 6
   users. Also, v2 just is a limited format; there is a reason v3
   exists! So I wouldn't even call it an issue, it's just how it is.

2. Improve v2. This would "fix" the issue at hand while still not
   letting RHEL 7's libvirt create v3 images (which might very well
   come back to haunt us some other day), and would complicate upstream
   code for no upstream reason.

Now, as I said, I'm open to being convinced that (2) is viable: If the
change is simple enough, sure, let's take it, why not. But I don't see
the point still.

Max
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 100398c..12f44b2 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1579,7 +1579,8 @@ static int zero_single_l2(BlockDriverState *bs, uint64_t offset,
     /* Update L2 entries */
     qcow2_cache_entry_mark_dirty(bs, s->l2_table_cache, l2_table);
     if (old_offset & QCOW_OFLAG_COMPRESSED || flags & BDRV_REQ_MAY_UNMAP) {
-        l2_table[l2_index + i] = cpu_to_be64(QCOW_OFLAG_ZERO);
+        l2_table[l2_index + i] = bs->backing
+            ? cpu_to_be64(QCOW_OFLAG_ZERO) : 0;
         qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
     } else {
         l2_table[l2_index + i] |= cpu_to_be64(QCOW_OFLAG_ZERO);
@@ -1598,8 +1599,11 @@ int qcow2_zero_clusters(BlockDriverState *bs, uint64_t offset, int nb_sectors,
     uint64_t nb_clusters;
     int ret;

-    /* The zero flag is only supported by version 3 and newer */
-    if (s->qcow_version < 3) {
+    /* The zero flag is only supported by version 3 and newer; we
+     * require the use of that flag if there is a backing file or if
+     * we are not allowed to unmap. */
+    if (s->qcow_version < 3 &&
+        (bs->backing || !(flags & BDRV_REQ_MAY_UNMAP))) {
        return -ENOTSUP;
    }
'qemu-img map' already coalesces information about unallocated
clusters (those with status 0) and pure zero clusters (those
with status BDRV_BLOCK_ZERO and no offset). Furthermore, all
qcow2 images with no backing file already report all unallocated
clusters (in the preallocation sense of clusters with no offset)
as BDRV_BLOCK_ZERO, regardless of whether the QCOW_OFLAG_ZERO was
set in that L2 entry (QCOW_OFLAG_ZERO also implies a return of
BDRV_BLOCK_ALLOCATED, but we intentionally do not expose that bit
to external users), thanks to generic block layer code in
bdrv_co_get_block_status().

So, for an image with no backing file, having bdrv_pwrite_zeroes
mark clusters as unallocated (defer to backing file) rather than
reads-as-zero (regardless of backing file) makes no difference
to normal behavior, but may potentially allow for fewer writes to
the L2 table of a freshly-created image where the L2 table is
initially written to all-zeroes (although I don't actually know
if we skip an L2 update and flush when re-writing the same
contents as already present).

Furthermore, this matches the behavior of discard_single_l2(), in
favoring an unallocated cluster over a zero cluster when full
discard is requested.

Meanwhile, version 2 qcow2 files (compat=0.10) lack support for an
explicit zero cluster. This minor tweak therefore allows us to turn
write zeroes with unmap into an actual unallocation on those files,
where they used to return -ENOTSUP and cause an allocation due to
the fallback to explicitly written zeroes.

Note that technically, we _could_ write a cluster as unallocated
rather than zero if a backing file exists but the backing file also
reads as zero, but that's more expensive to determine, so this
optimization is limited to qcow2 without a backing file.
Also note that this patch unmaps a compressed cluster when there is no
backing file, even when BDRV_REQ_MAY_UNMAP was not set; but it is
unlikely to have compressed clusters in a fully preallocated image
(the point of compression is to reduce space requirements), so the
side effect of unmapping a cluster in that case is not deemed to be a
problem.

Finally, note that this change can create a subtle difference if a
backing file is later added with 'qemu-img rebase -u': pre-patch, a v3
file (compat=1.1) will have a cluster that still reads as 0 (the
cluster is not allocated in the sense of preallocation, but is
provided by the current layer), while post-patch the cluster will now
defer to the backing file - but it's already an unsupported action to
modify guest-visible data by messing with backing chains ;)

Signed-off-by: Eric Blake <eblake@redhat.com>
---
v9: new patch
---
 block/qcow2-cluster.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)