Message ID: 20180220222459.8461-3-eblake@redhat.com (mailing list archive)
State: New, archived
On Tue 20 Feb 2018 11:24:59 PM CET, Eric Blake wrote:

I was also preparing a patch to change this, but you arrived first :-)

> So, it's time to cut back on the waste. A compressed cluster
> will NEVER occupy more than an uncompressed cluster (okay, gzip
> DOES document that because the compression stream adds metadata,
> and because of the pigeonhole principle, there are worst case
> scenarios where attempts to compress will actually inflate an
> image - but in those cases, we would just write the cluster
> uncompressed instead of inflating it). And as that is a smaller
> amount of memory, we can get by with the simpler g_malloc.

> -    if (!s->cluster_cache) {
> -        s->cluster_cache = g_malloc(s->cluster_size);
> +    assert(!s->cluster_cache);
> +    s->cluster_data = g_try_malloc(s->cluster_size);
> +    s->cluster_cache = g_try_malloc(s->cluster_size);

There are a few things here:

- QEMU won't write compressed data if the size is >= s->cluster_size (there's an explicit check for that in qcow2_co_pwritev_compressed()).

- The size field of the compressed cluster descriptor *does* allow larger sizes, so you can't simply read csize bytes into s->cluster_data because you could cause a buffer overflow.

- Solution a: check that csize < s->cluster_size and return an error if it's not. However! Although QEMU won't produce an image with a compressed cluster that is larger than the uncompressed one, the qcow2 on-disk format in principle allows for that, so arguably we should accept it.

- Solution b: the width of the 'compressed cluster size' field is (cluster_bits - 8), that's (cluster_size / 256) sectors. Since the size of each sector is 512 bytes, the maximum possible size that the field can store is (cluster_size * 2) bytes. So allocate that amount of space for s->cluster_data, read the data as it is on disk, and let the decompression code return an error if the data is indeed corrupted or if it doesn't fit in the output buffer.

Berto
On 02/21/2018 04:04 AM, Alberto Garcia wrote:
> On Tue 20 Feb 2018 11:24:59 PM CET, Eric Blake wrote:
>
> I was also preparing a patch to change this, but you arrived first :-)
>
>> So, it's time to cut back on the waste. A compressed cluster
>> will NEVER occupy more than an uncompressed cluster (okay, gzip
>> DOES document that because the compression stream adds metadata,
>> and because of the pigeonhole principle, there are worst case
>> scenarios where attempts to compress will actually inflate an
>> image - but in those cases, we would just write the cluster
>> uncompressed instead of inflating it). And as that is a smaller
>> amount of memory, we can get by with the simpler g_malloc.
>
>> -    if (!s->cluster_cache) {
>> -        s->cluster_cache = g_malloc(s->cluster_size);
>> +    assert(!s->cluster_cache);
>> +    s->cluster_data = g_try_malloc(s->cluster_size);
>> +    s->cluster_cache = g_try_malloc(s->cluster_size);

Shoot - I made edits that I forgot to commit before sending; I meant for these to be g_malloc() rather than g_try_malloc().

> There are a few things here:
>
> - QEMU won't write compressed data if the size is >= s->cluster_size
>   (there's an explicit check for that in qcow2_co_pwritev_compressed())

Correct, we never cause inflation (and even if we wanted to, we can't, because the qcow2 format doesn't have enough bits for us to record that many sectors for a compressed stream that occupies more space than the original cluster).

> - The size field of the compressed cluster descriptor *does* allow
>   larger sizes, so you can't simply read csize bytes into
>   s->cluster_data because you could cause a buffer overflow.

Let's step through this:

    nb_csectors = ((cluster_offset >> s->csize_shift) & s->csize_mask) + 1;

Since s->csize_mask is determined by s->cluster_size / 512, then after this assignment, the most nb_csectors can be is exactly s->cluster_size / 512.
    sector_offset = coffset & 511;
    csize = nb_csectors * 512 - sector_offset;

And here, csize can only get smaller than the length picked by nb_csectors. Therefore, csize is GUARANTEED to be <= s->cluster_size.

> - Solution a: check that csize < s->cluster_size and return an error if
>   it's not. However! although QEMU won't produce an image with a
>   compressed cluster that is larger than the uncompressed one, the qcow2
>   on-disk format in principle allows for that, so arguably we should
>   accept it.

No, the qcow2 on-disk format does not have enough bits reserved for that. It is impossible to store an inflated cluster, because you don't have enough bits to record it.

That said, we MAY have a bug, more likely to be visible in 512-byte clusters but possible at any size. While the 'number sectors' field IS sufficient for any compressed cluster starting at offset 0 in the cluster, we run into issues if the starting offset is later in the cluster. That is, even though the compressed length of a 512-byte cluster is <= 512 (or we wouldn't compress it), if we start at offset 510, we NEED to read the next cluster to get the rest of the compressed stream - but at 512-byte clusters, there are 0 bits reserved for 'number sectors'. Per the above calculations with the example offset of 510, nb_csectors would be 1 (it can't be anything else for a 512-byte cluster image), and csize would then be 2 bytes, which is insufficient for reading back enough data to reconstruct the cluster.

We probably need to clarify in the spec that if the starting offset of a compressed cluster falls mid-sector, then the compressed size has to be smaller than cluster_size - (offset % 512) (this requirement is already there implicitly due to the field widths, but being explicit can't hurt).
We probably also need to fix our compression code to actually do the right thing, particularly for 512-byte clusters, where we are most likely to run into a compressed size that overflows the space available for nb_csectors.

> - Solution b: the width of the 'compressed cluster size' field is
>   (cluster_bits - 8), that's (cluster_size / 256) sectors.

Not true. It is (cluster_bits - 9), or (cluster_size / 512). Remember, x = 62 - (cluster_bits - 8); for a 512-byte cluster, x = 61. The 'number sectors' field is then bits x+1 - 61 (but you can't have a bitfield occupying bits 62 up to 61, especially since bit 62 is the bit for a compressed cluster).

>   Since the size of each sector is 512 bytes, the maximum possible size
>   that the field can store is (cluster_size * 2) bytes. So allocate that
>   amount of space for s->cluster_data, read the data as it is on disk
>   and let the decompression code return an error if the data is indeed
>   corrupted or it doesn't fit in the output buffer.

Again, I argue that the qcow2 spec says that the maximum size for a compressed cluster is cluster_size, even if it spills over a host cluster boundary. But if in practice we HAVE allowed a spillover beyond the 'number sectors' field's maximum size because of a non-zero offset, where we weren't careful about what got recorded into the qcow2 image, then we could instead state that reading s->cluster_size + 512 is guaranteed to find the end of the compressed stream, even when 'number sectors' is 0 because of overflow. At that point, sizing the buffer to be one sector larger than a cluster means that we can cope even with 512-byte clusters that were compressed incorrectly.
On Wed 21 Feb 2018 04:00:54 PM CET, Eric Blake wrote:
>> - Solution b: the width of the 'compressed cluster size' field is
>>   (cluster_bits - 8), that's (cluster_size / 256) sectors.
>
> Not true. It is (cluster_bits - 9) or (cluster_size / 512).

It's not, it's (cluster_bits - 8); the documentation is wrong, see the patch that I sent earlier. Here's qcow2_do_open():

    s->csize_mask = (1 << (s->cluster_bits - 8)) - 1;

For a 64k cluster (16 bits) the width is 16 - 8 = 8 bits. That's 2^8 = 256 sectors, and 256 * 512 = 128k, exactly two clusters.

Coming back to your patch: since we know where the compressed data starts and we know that it's not going to be larger than cluster_size, we can read [coffset, coffset + cluster_size) and skip the final bytes of the last sector, because we know that we don't need that data. Or we can read as much as the size field indicates (up to twice the cluster size) and let the decompression code handle the rest.

Berto
On 02/21/2018 09:00 AM, Eric Blake wrote:
> On 02/21/2018 04:04 AM, Alberto Garcia wrote:
>> On Tue 20 Feb 2018 11:24:59 PM CET, Eric Blake wrote:
>>
>> I was also preparing a patch to change this, but you arrived first :-)
>>
>>> So, it's time to cut back on the waste. A compressed cluster
>>> will NEVER occupy more than an uncompressed cluster

> And here, csize can only get smaller than the length picked by
> nb_csectors. Therefore, csize is GUARANTEED to be <= s->cluster_size.
>
>> - Solution a: check that csize < s->cluster_size and return an error if
>>   it's not. However! although QEMU won't produce an image with a
>>   compressed cluster that is larger than the uncompressed one, the
>>   qcow2 on-disk format in principle allows for that, so arguably we
>>   should accept it.
>
> No, the qcow2 on-disk format does not have enough bits reserved for
> that. It is impossible to store an inflated cluster, because you don't
> have enough bits to record it.

Okay, the spec is WRONG, compared to our code base.

> That said, we MAY have a bug, more likely to be visible in 512-byte
> clusters but possible on any size. While the 'number sectors' field IS
> sufficient for any compressed cluster starting at offset 0 in the
> cluster, we run into issues if the starting offset is later in the
> cluster. That is, even though the compressed length of a 512-byte
> cluster is <= 512 (or we wouldn't compress it), if we start at offset
> 510, we NEED to read the next cluster to get the rest of the compressed
> stream - but at 512-byte clusters, there are 0 bits reserved for 'number
> sectors'. Per the above calculations with the example offset of 510,
> nb_csectors would be 1 (it can't be anything else for a 512-byte cluster
> image), and csize would then be 2 bytes, which is insufficient for
> reading back enough data to reconstruct the cluster.
In fact, here's a demonstration of a discrepancy, where qemu-img and John's qcheck tool [1] disagree about the validity of an image created by qemu (although it may just be that qcheck hasn't yet learned about compressed clusters):

[1] https://github.com/jnsnow/qcheck

$ f=12345678
$ f=$f$f$f$f # 32
$ f=$f$f$f$f # 128
$ f=$f$f$f$f # 512
$ f=$f$f$f$f # 2k
$ f=$f$f$f$f # 8k
$ f=$f$f$f$f # 32k
$ f=$f$f$f$f # 128k
$ printf "$f" > data
$ qemu-img convert -c -f raw -O qcow2 -o cluster_size=512 \
    data data.qcow2
$ qemu-img check data.qcow2
No errors were found on the image.
256/256 = 100.00% allocated, 100.00% fragmented, 100.00% compressed clusters
Image end offset: 18944
$ ./qcheck data.qcow2
...
== L2 Tables ==

== L2 cluster l1[0] : 0x0000000000000800 ==
Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
== L2 cluster l1[1] : 0x0000000000000e00 ==
Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
== L2 cluster l1[2] : 0x0000000000001400 ==
Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
== L2 cluster l1[3] : 0x0000000000001a00 ==
Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
L2 tables: Could not complete analysis, 257 problems found

== Reference Count Analysis ==

Refcount analysis: 00 vacant clusters
Refcount analysis: 04 leaked clusters
Refcount analysis: 00 ghost clusters
Refcount analysis: 04 miscounted clusters
Refcount analysis: 8 problems found

== Cluster Counts ==

Metadata: 0x1000
Data: 0x800
Leaked: 0x800
Vacant: 0x0
total: 0x2000
qcheck: 73 problems found

> Not true. It is (cluster_bits - 9) or (cluster_size / 512). Remember,
> x = 62 - (cluster_bits - 8); for a 512-byte cluster, x = 61. The
> 'number sectors' field is then bits x+1 - 61 (but you can't have a
> bitfield occupying bit 62 upto 61; especially since bit 62 is the bit
> for compressed cluster).
So instead of blindly reading the spec, I'm now going to single-step through the 'qemu-img convert' command line above, to see what REALLY happens.

Line numbers from commit a6e0344fa0:

$ gdb --args ./qemu-img convert -c -f raw -O qcow2 -o cluster_size=512 data data.qcow2
...
(gdb) b qcow2_alloc_bytes
Breakpoint 1 at 0x57610: file block/qcow2-refcount.c, line 1052.
(gdb) r
Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes (
    bs=bs@entry=0x555555d87f50, size=size@entry=15)
    at block/qcow2-refcount.c:1052
1052    {
(gdb)

So we are compressing 512 bytes down to 15 every time, which means that after 34 clusters compressed, we should be at offset 510. Let's resume debugging:

(gdb) c 34
Will ignore next 33 crossings of breakpoint 1.  Continuing.
[Thread 0x7fffe3cfe700 (LWP 32229) exited]
[New Thread 0x7fffe3cfe700 (LWP 32300)]
[New Thread 0x7fffe25ed700 (LWP 32301)]

Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes (
    bs=bs@entry=0x555555d87f50, size=size@entry=15)
    at block/qcow2-refcount.c:1052
1052    {
(gdb) n
1053        BDRVQcow2State *s = bs->opaque;
(gdb)
1058        BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_BYTES);
(gdb)
1059        assert(size > 0 && size <= s->cluster_size);
(gdb) p s->free_byte_offset
$2 = 3070
(gdb) p 3070%512
$3 = 510
...
(gdb)
1076        free_in_cluster = s->cluster_size - offset_into_cluster(s, offset);
(gdb)
1078        if (!offset || free_in_cluster < size) {
(gdb) p free_in_cluster
$4 = 2
1079            int64_t new_cluster = alloc_clusters_noref(bs, s->cluster_size);
(gdb)
1080            if (new_cluster < 0) {
(gdb)
1084            if (new_cluster == 0) {
(gdb)
1091            if (!offset || ROUND_UP(offset, s->cluster_size) != new_cluster) {
(gdb)
1095                free_in_cluster += s->cluster_size;
(gdb)
1099        assert(offset);

So we got a contiguous cluster, and our goal is to let the caller bleed the compressed cluster into the tail of the current sector and into the head of the next cluster.
Continuing:

(gdb) fin
Run till exit from #0  qcow2_alloc_bytes (bs=bs@entry=0x555555d87f50,
    size=size@entry=15) at block/qcow2-refcount.c:1118
[Thread 0x7fffe25ed700 (LWP 32301) exited]
[Thread 0x7fffe3cfe700 (LWP 32300) exited]
qcow2_alloc_compressed_cluster_offset (bs=bs@entry=0x555555d87f50,
    offset=offset@entry=17408, compressed_size=compressed_size@entry=15)
    at block/qcow2-cluster.c:768
768         if (cluster_offset < 0) {
Value returned is $5 = 3070

(gdb) n
773         nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
(gdb)
774             (cluster_offset >> 9);
(gdb)
773         nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
(gdb)
777             ((uint64_t)nb_csectors << s->csize_shift);
(gdb) l
772
773         nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
774             (cluster_offset >> 9);
775
776         cluster_offset |= QCOW_OFLAG_COMPRESSED |
777             ((uint64_t)nb_csectors << s->csize_shift);
778
779         /* update L2 table */
780
781         /* compressed clusters never have the copied flag */
(gdb) p nb_csectors
$6 = 1
(gdb) p s->csize_shift
$7 = 61
(gdb) p/x cluster_offset
$8 = 0xbfe
(gdb) n
776         cluster_offset |= QCOW_OFLAG_COMPRESSED |
(gdb)
783         BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED);
(gdb) p/x cluster_offset
$9 = 0x6000000000000bfe

Where is s->csize_shift initialized? In qcow2_do_open():

    s->csize_shift = (62 - (s->cluster_bits - 8));
    s->csize_mask = (1 << (s->cluster_bits - 8)) - 1;
    s->cluster_offset_mask = (1LL << s->csize_shift) - 1;

Revisiting the wording in the spec:

    Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):

        Bit  0 - x:    Host cluster offset. This is usually _not_ aligned
                       to a cluster boundary!

           x+1 - 61:   Compressed size of the images in sectors of 512 bytes

which, for this 512-byte-cluster image, says bits 0-61 are the host cluster offset, and 62-61 is the number of sectors. But our code sets s->csize_shift to divide this differently, at 0-60 and 61-61.
Which means your earlier claim is right: there are enough 'number sectors' bits to allow for up to 2*cluster_size as the size of the compressed stream (rather than my claim of exactly cluster_size). So other implementations CAN inflate a cluster (if we don't document otherwise), and even if they DON'T inflate, they can STILL cause a read larger than a cluster when the offset is near the tail of one sector (most likely at 512-byte clusters, but remotely possible at other cluster sizes as well).
Am 20.02.2018 um 23:24 hat Eric Blake geschrieben:
> When reading a compressed image, we were allocating s->cluster_data
> to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
> with 2M clusters). Let's check out the history:
>
> Back when qcow2 was first written, we used s->cluster_data for
> everything, including copy_sectors() and encryption, where we want
> to operate on more than one cluster at once. Obviously, at that
> point, the buffer had to be aligned for other users, even though
> compression itself doesn't require any alignment.
>
> But commit 1b9f1491 (v1.1!) changed things to allocate parallel
> buffers on demand rather than sharing a single buffer, for encryption
> and COW, leaving compression as the final client of s->cluster_data.
> That use was still preserved, because if a single compressed cluster
> is read more than once, we reuse the cache instead of decompressing
> it a second time (I'm not sure how often this optimization actually
> fires, or if it penalizes us from being able to decompress multiple
> clusters in parallel even though we can now decrypt clusters in
> parallel; the XXX comment in qcow2_co_preadv for
> QCOW2_CLUSTER_COMPRESSED is telling).
>
> Much later, in commit de82815d (v2.2), we noticed that a 64M
> allocation is prone to failure, so we switched over to a graceful
> memory allocation error message. But note that elsewhere in the
> code, we do g_malloc(2 * cluster_size) without ever checking for
> failure.
>
> Then even later, in 3e4c7052 (2.11), we realized that allocating
> a large buffer up front for every qcow2 image is expensive, and
> switched to lazy allocation only for images that actually had
> compressed clusters. But in the process, we never even bothered
> to check whether what we were allocating still made sense in its
> new context!
>
> So, it's time to cut back on the waste.
> A compressed cluster
> will NEVER occupy more than an uncompressed cluster (okay, gzip
> DOES document that because the compression stream adds metadata,
> and because of the pigeonhole principle, there are worst case
> scenarios where attempts to compress will actually inflate an
> image - but in those cases, we would just write the cluster
> uncompressed instead of inflating it). And as that is a smaller
> amount of memory, we can get by with the simpler g_malloc.
>
> Signed-off-by: Eric Blake <eblake@redhat.com>

My comments feel a bit trivial compared to Berto's, but anyway:

> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 85be7d5e340..8c4b26ceaf2 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -1603,15 +1603,9 @@ int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
>           * are freed in .bdrv_close().
>           */
>          if (!s->cluster_data) {
> -            /* one more sector for decompressed data alignment */
> -            s->cluster_data = qemu_try_blockalign(bs->file->bs,
> -                    QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size + 512);
> -            if (!s->cluster_data) {
> -                return -ENOMEM;
> -            }
> -        }
> -        if (!s->cluster_cache) {
> -            s->cluster_cache = g_malloc(s->cluster_size);
> +            assert(!s->cluster_cache);

Wouldn't it be better to assert (!!s->cluster_cache == !!s->cluster_data) unconditionally?

> +            s->cluster_data = g_try_malloc(s->cluster_size);

Why are you going from qemu_try_blockalign() to simple malloc here? This buffer is used with bdrv_read() (or bdrv_pread() after patch 1), so we should avoid unnecessary use of a bounce buffer.

> +            s->cluster_cache = g_try_malloc(s->cluster_size);

As you already said, either g_malloc() or check the result. I actually think that g_try_malloc() and checking the result is nicer; we still allocate up to 2 MB here.

Kevin
On 02/21/2018 10:51 AM, Kevin Wolf wrote:
> Am 20.02.2018 um 23:24 hat Eric Blake geschrieben:
>> When reading a compressed image, we were allocating s->cluster_data
>> to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
>> with 2M clusters). Let's check out the history:
>>
>> Much later, in commit de82815d (v2.2), we noticed that a 64M
>> allocation is prone to failure, so we switched over to a graceful
>> memory allocation error message. But note that elsewhere in the
>> code, we do g_malloc(2 * cluster_size) without ever checking for
>> failure.
>>
>> -        }
>> -        if (!s->cluster_cache) {
>> -            s->cluster_cache = g_malloc(s->cluster_size);
>> +            assert(!s->cluster_cache);
>
> Wouldn't it be better to assert (!!s->cluster_cache ==
> !!s->cluster_data) unconditionally?

Sure.

>> +            s->cluster_data = g_try_malloc(s->cluster_size);
>
> Why are you going from qemu_try_blockalign() to simple malloc here? This
> buffer is used with bdrv_read() (or bdrv_pread() after patch 1), so we
> should avoid unnecessary use of a bounce buffer.

But does bdrv_pread() actually need to use a bounce buffer if we don't have an aligned buffer to read into? Either the underlying protocol already supports byte-aligned reads (direct into our buffer, regardless of alignment, no bouncing required), or it already has to do a full sector read into a bounce buffer anyway (and it doesn't matter whether we aligned our buffer). blockalign() made sense when we had multiple clients for the buffer, but ever since v1.1, when we have only a single client, and that single client is most likely not going to read sector-aligned data in the first place, aligning the buffer doesn't buy us anything.

>> +            s->cluster_cache = g_try_malloc(s->cluster_size);
>
> As you already said, either g_malloc() or check the result. I actually
> think that g_try_malloc() and checking the result is nicer, we still
> allocate up to 2 MB here.
See my commit message comment - we have other spots in the code base that blindly g_malloc(2 * s->cluster_size). And I intended (but sent the email without amending my commit) to use g_malloc(). But as Berto has convinced me that an externally produced image can convince us to read up to 4M (even though we don't need that much to decompress), I suppose that the try_ variant plus checking is reasonable (with care to NULL out if one, but not both, of the allocations succeeds).
Am 21.02.2018 um 17:59 hat Eric Blake geschrieben:
> On 02/21/2018 10:51 AM, Kevin Wolf wrote:
>> Am 20.02.2018 um 23:24 hat Eric Blake geschrieben:
>>> When reading a compressed image, we were allocating s->cluster_data
>>> to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
>>> with 2M clusters). Let's check out the history:
>>>
>>> Much later, in commit de82815d (v2.2), we noticed that a 64M
>>> allocation is prone to failure, so we switched over to a graceful
>>> memory allocation error message. But note that elsewhere in the
>>> code, we do g_malloc(2 * cluster_size) without ever checking for
>>> failure.
>>>
>>> -        }
>>> -        if (!s->cluster_cache) {
>>> -            s->cluster_cache = g_malloc(s->cluster_size);
>>> +            assert(!s->cluster_cache);
>>
>> Wouldn't it be better to assert (!!s->cluster_cache ==
>> !!s->cluster_data) unconditionally?
>
> Sure.
>
>>> +            s->cluster_data = g_try_malloc(s->cluster_size);
>>
>> Why are you going from qemu_try_blockalign() to simple malloc here? This
>> buffer is used with bdrv_read() (or bdrv_pread() after patch 1), so we
>> should avoid unnecessary use of a bounce buffer.
>
> But does bdrv_pread() actually need to use a bounce buffer if we don't
> have an aligned buffer to read into? Either the underlying protocol
> already supports byte-aligned reads (direct into our buffer, regardless
> of alignment, no bouncing required), or it already has to do a full
> sector read into a bounce buffer anyway (and it doesn't matter whether
> we aligned our buffer). blockalign() made sense when we had multiple
> clients for the buffer, but ever since v1.1, when we have only a single
> client, and that single client is most likely not going to read
> sector-aligned data in the first place, aligning the buffer doesn't buy
> us anything.

Good point. To be honest, I don't even analyse each caller, but just consistently use qemu_try_blockalign() whenever a buffer is used for I/O.
It's a simple rule of thumb that generally makes sense. So as you say, in this case it's unlikely, but possible, that we can benefit from an aligned buffer. I guess my point is more about consistency than actual functionality, then. But it's okay either way.

>>> +            s->cluster_cache = g_try_malloc(s->cluster_size);
>>
>> As you already said, either g_malloc() or check the result. I actually
>> think that g_try_malloc() and checking the result is nicer, we still
>> allocate up to 2 MB here.
>
> See my commit message comment - we have other spots in the code base
> that blindly g_malloc(2 * s->cluster_size).

Though is that a reason to do the same in new code, or to phase out such allocations whenever you touch them?

> And I intended (but sent the email without amending my commit) to use
> g_malloc(). But as Berto has convinced me that an externally produced
> image can convince us to read up to 4M (even though we don't need that
> much to decompress), I suppose that the try_ variant plus checking is
> reasonable (and care in NULL'ing out if one but not both allocations
> succeed).

Sounds good.

Another thought I had is whether we should do per-request allocation for compressed clusters, too, instead of having per-BDS buffers.

Kevin
On 02/21/2018 11:39 AM, Kevin Wolf wrote:
>> See my commit message comment - we have other spots in the code base
>> that blindly g_malloc(2 * s->cluster_size).
>
> Though is that a reason to do the same in new code or to phase out such
> allocations whenever you touch them?

Touché.

>> And I intended (but sent the email without amending my commit) to use
>> g_malloc(). But as Berto has convinced me that an externally produced
>> image can convince us to read up to 4M (even though we don't need that
>> much to decompress), I suppose that the try_ variant plus checking is
>> reasonable (and care in NULL'ing out if one but not both allocations
>> succeed).
>
> Sounds good.
>
> Another thought I had is whether we should do per-request allocation for
> compressed clusters, too, instead of having per-BDS buffers.

The only benefit of a per-BDS buffer is that we cache things - multiple sub-cluster reads in a row all from the same compressed cluster benefit from decompressing only once. The drawback of a per-BDS buffer is that we can't do things in parallel (everything else in qcow2 drops the lock around bdrv_co_pread[v]), so the initial read prevents anything else in the qcow2 layer from progressing.

I also wonder: since we ARE allowing multiple parallel readers in other parts of qcow2 (without a patch, decompression is not in this boat, but decryption and even bounce buffers due to lower-layer alignment constraints are), what sort of mechanisms do we have for using a pool of reusable buffers, rather than having each cluster access that requires a buffer malloc and free the buffer on a per-access basis? I don't know how much time the malloc/free per-transaction overhead adds, or if it is already much smaller than the actual I/O time.
But note that while reusable buffers from a pool would cut down on the per-I/O malloc/free overhead if we switch decompression away from a per-BDS buffer, it would still not preserve the caching ability, where multiple sub-cluster requests from the same compressed cluster require only one decompression, since that caching is only possible at the per-BDS level.
On 02/21/2018 10:59 AM, Eric Blake wrote:
> On 02/21/2018 09:00 AM, Eric Blake wrote:
>> On 02/21/2018 04:04 AM, Alberto Garcia wrote:
>>> On Tue 20 Feb 2018 11:24:59 PM CET, Eric Blake wrote:
>>>
>>> I was also preparing a patch to change this, but you arrived first :-)
>>>
>>>> So, it's time to cut back on the waste. A compressed cluster
>>>> will NEVER occupy more than an uncompressed cluster
>
>> And here, csize can only get smaller than the length picked by
>> nb_csectors. Therefore, csize is GUARANTEED to be <= s->cluster_size.
>>
>>> - Solution a: check that csize < s->cluster_size and return an error
>>>   if it's not. However! although QEMU won't produce an image with a
>>>   compressed cluster that is larger than the uncompressed one, the
>>>   qcow2 on-disk format in principle allows for that, so arguably we
>>>   should accept it.
>>
>> No, the qcow2 on-disk format does not have enough bits reserved for
>> that. It is impossible to store an inflated cluster, because you
>> don't have enough bits to record it.
>
> Okay, the spec is WRONG, compared to our code base.
>
>> That said, we MAY have a bug, more likely to be visible in 512-byte
>> clusters but possible on any size. While the 'number sectors' field
>> IS sufficient for any compressed cluster starting at offset 0 in the
>> cluster, we run into issues if the starting offset is later in the
>> cluster. That is, even though the compressed length of a 512-byte
>> cluster is <= 512 (or we wouldn't compress it), if we start at offset
>> 510, we NEED to read the next cluster to get the rest of the
>> compressed stream - but at 512-byte clusters, there are 0 bits
>> reserved for 'number sectors'. Per the above calculations with the
>> example offset of 510, nb_csectors would be 1 (it can't be anything
>> else for a 512-byte cluster image), and csize would then be 2 bytes,
>> which is insufficient for reading back enough data to reconstruct the
>> cluster.
> In fact, here's a demonstration of a discrepancy, where qemu-img and
> John's qcheck tool [1] disagree about the validity of an image created
> by qemu (although it may just be that qcheck hasn't yet learned about
> compressed clusters):
>
> [1] https://github.com/jnsnow/qcheck

I wouldn't trust my tool's ability to understand compressed clusters :) I didn't get very far, though I did run across the fact that there appeared to be a discrepancy between the spec and the actual implementation, IIRC. It looked like you came to the same conclusion when you stepped through it manually.

> $ f=12345678
> $ f=$f$f$f$f # 32
> $ f=$f$f$f$f # 128
> $ f=$f$f$f$f # 512
> $ f=$f$f$f$f # 2k
> $ f=$f$f$f$f # 8k
> $ f=$f$f$f$f # 32k
> $ f=$f$f$f$f # 128k
> $ printf "$f" > data
> $ qemu-img convert -c -f raw -O qcow2 -o cluster_size=512 \
>     data data.qcow2
> $ qemu-img check data.qcow2
> No errors were found on the image.
> 256/256 = 100.00% allocated, 100.00% fragmented, 100.00% compressed
> clusters
> Image end offset: 18944
> $ ./qcheck data.qcow2
> ...
> == L2 Tables ==
>
> == L2 cluster l1[0] : 0x0000000000000800 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> == L2 cluster l1[1] : 0x0000000000000e00 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> == L2 cluster l1[2] : 0x0000000000001400 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> == L2 cluster l1[3] : 0x0000000000001a00 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> L2 tables: Could not complete analysis, 257 problems found
>
> == Reference Count Analysis ==
>
> Refcount analysis: 00 vacant clusters
> Refcount analysis: 04 leaked clusters
> Refcount analysis: 00 ghost clusters
> Refcount analysis: 04 miscounted clusters
> Refcount analysis: 8 problems found
>
> == Cluster Counts ==
>
> Metadata: 0x1000
> Data: 0x800
> Leaked: 0x800
> Vacant: 0x0
> total: 0x2000
> qcheck: 73 problems found
>
>> Not true. It is (cluster_bits - 9) or (cluster_size / 512).
>> Remember, x = 62 - (cluster_bits - 8); for a 512-byte cluster, x =
>> 61. The 'number sectors' field is then bits x+1 - 61 (but you can't
>> have a bitfield occupying bit 62 upto 61; especially since bit 62 is
>> the bit for compressed cluster).
>
> So instead of blindly reading the spec, I'm now going to single-step
> through the 'qemu-img convert' command line above, to see what REALLY
> happens:
>
> Line numbers from commit a6e0344fa0
> $ gdb --args ./qemu-img convert -c -f raw -O qcow2 -o cluster_size=512
>     data data.qcow2
> ...
> (gdb) b qcow2_alloc_bytes
> Breakpoint 1 at 0x57610: file block/qcow2-refcount.c, line 1052.
> (gdb) r
> Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes (
>     bs=bs@entry=0x555555d87f50, size=size@entry=15)
>     at block/qcow2-refcount.c:1052
> 1052    {
> (gdb)
>
> So we are compressing 512 bytes down to 15 every time, which means that
> after 34 clusters compressed, we should be at offset 510.
> Let's resume debugging:
>
> (gdb) c 34
> Will ignore next 33 crossings of breakpoint 1.  Continuing.
> [Thread 0x7fffe3cfe700 (LWP 32229) exited]
> [New Thread 0x7fffe3cfe700 (LWP 32300)]
> [New Thread 0x7fffe25ed700 (LWP 32301)]
>
> Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes (
>     bs=bs@entry=0x555555d87f50, size=size@entry=15)
>     at block/qcow2-refcount.c:1052
> 1052    {
> (gdb) n
> 1053        BDRVQcow2State *s = bs->opaque;
> (gdb)
> 1058        BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_BYTES);
> (gdb)
> 1059        assert(size > 0 && size <= s->cluster_size);
> (gdb) p s->free_byte_offset
> $2 = 3070
> (gdb) p 3070%512
> $3 = 510
> ...
> (gdb)
> 1076        free_in_cluster = s->cluster_size - offset_into_cluster(s, offset);
> (gdb)
> 1078        if (!offset || free_in_cluster < size) {
> (gdb) p free_in_cluster
> $4 = 2
> 1079            int64_t new_cluster = alloc_clusters_noref(bs, s->cluster_size);
> (gdb)
> 1080            if (new_cluster < 0) {
> (gdb)
> 1084            if (new_cluster == 0) {
> (gdb)
> 1091            if (!offset || ROUND_UP(offset, s->cluster_size) != new_cluster) {
> (gdb)
> 1095                free_in_cluster += s->cluster_size;
> (gdb)
> 1099        assert(offset);
>
> So we got a contiguous cluster, and our goal is to let the caller bleed
> the compressed cluster into the tail of the current sector and into the
> head of the next cluster.
> Continuing:
>
> (gdb) fin
> Run till exit from #0  qcow2_alloc_bytes (bs=bs@entry=0x555555d87f50,
>     size=size@entry=15) at block/qcow2-refcount.c:1118
> [Thread 0x7fffe25ed700 (LWP 32301) exited]
> [Thread 0x7fffe3cfe700 (LWP 32300) exited]
> qcow2_alloc_compressed_cluster_offset (bs=bs@entry=0x555555d87f50,
>     offset=offset@entry=17408, compressed_size=compressed_size@entry=15)
>     at block/qcow2-cluster.c:768
> 768         if (cluster_offset < 0) {
> Value returned is $5 = 3070
>
> (gdb) n
> 773         nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
> (gdb)
> 774                       (cluster_offset >> 9);
> (gdb)
> 773         nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
> (gdb)
> 777                       ((uint64_t)nb_csectors << s->csize_shift);
> (gdb) l
> 772
> 773         nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
> 774                       (cluster_offset >> 9);
> 775
> 776         cluster_offset |= QCOW_OFLAG_COMPRESSED |
> 777                           ((uint64_t)nb_csectors << s->csize_shift);
> 778
> 779         /* update L2 table */
> 780
> 781         /* compressed clusters never have the copied flag */
> (gdb) p nb_csectors
> $6 = 1
> (gdb) p s->csize_shift
> $7 = 61
> (gdb) p/x cluster_offset
> $8 = 0xbfe
> (gdb) n
> 776         cluster_offset |= QCOW_OFLAG_COMPRESSED |
> (gdb)
> 783         BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED);
> (gdb) p/x cluster_offset
> $9 = 0x6000000000000bfe
>
> Where is s->csize_shift initialized?  In qcow2_do_open():
>
>     s->csize_shift = (62 - (s->cluster_bits - 8));
>     s->csize_mask = (1 << (s->cluster_bits - 8)) - 1;
>     s->cluster_offset_mask = (1LL << s->csize_shift) - 1;
>
> Revisiting the wording in the spec:
>
>     Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):
>
>     Bit  0 - x:    Host cluster offset. This is usually _not_ aligned to a
>                    cluster boundary!
>
>         x+1 - 61:  Compressed size of the images in sectors of 512 bytes
>
> which says bits 0-61 are the host cluster offset, and 62-61 is the
> number of sectors.  But our code sets s->csize_shift to divide this
> differently, at 0-60 and 61-61.
> Which means your earlier claim that there are enough 'number sector'
> bits to allow for up to 2*cluster_size as the size of the compressed
> stream (rather than my claim of exactly cluster_size) is right, and
> other implementations CAN inflate a cluster (if we don't document
> otherwise), and that even if they DON'T inflate, they can STILL cause
> a read larger than a cluster size when the offset is near the tail of
> one sector (most likely at 512-byte clusters, but remotely possible at
> other cluster sizes as well).
>
On 21.02.2018 at 19:32, Eric Blake wrote:
> On 02/21/2018 11:39 AM, Kevin Wolf wrote:
> > > See my commit message comment - we have other spots in the code
> > > base that blindly g_malloc(2 * s->cluster_size).
> >
> > Though is that a reason to do the same in new code or to phase out
> > such allocations whenever you touch them?
>
> Touché.
>
> > > And I intended (but sent the email without amending my commit) to
> > > use g_malloc().  But as Berto has convinced me that an externally
> > > produced image can convince us to read up to 4M (even though we
> > > don't need that much to decompress), I suppose that the try_
> > > variant plus checking is reasonable (and care in NULL'ing out if
> > > one but not both allocations succeed).
> >
> > Sounds good.
> >
> > Another thought I had is whether we should do per-request allocation
> > for compressed clusters, too, instead of having per-BDS buffers.
>
> The only benefit of a per-BDS buffer is that we cache things - multiple
> sub-cluster reads in a row all from the same compressed cluster benefit
> from decompressing only once.

Oh, you're right. I missed that this is actually used as a cache. I
guess we want to leave it for now then.

Maybe at some point we can actually implement the data cache that I
proposed a few years ago (using Qcow2Cache for data clusters under some
circumstances), then we could probably make that hold the data instead
of having a separate cache.

> The drawbacks of a per-BDS buffer: we can't do things in parallel
> (everything else in qcow2 drops the lock around bdrv_co_pread[v]), so
> the initial read prevents anything else in the qcow2 layer from
> progressing.

Yes, though there are probably other optimisations that could be made
for compression before this becomes relevant, like reading more than
one cluster at a time.
> I also wonder - since we ARE allowing multiple parallel readers in
> other parts of qcow2 (without a patch, decompression is not in this
> boat, but decryption and even bounce buffers due to lower-layer
> alignment constraints are), what sort of mechanisms do we have for
> using a pool of reusable buffers, rather than having each cluster
> access that requires a buffer malloc and free the buffer on a
> per-access basis?  I don't know how much time the malloc/free
> per-transaction overhead adds, or if it is already much smaller than
> the actual I/O time.

I don't either. A while ago, we used g_slice_alloc() in some places (I
remember qemu_aio_get), but it was actually slower than just using
malloc/free each time. So if we do want to pool buffers, we probably
need to implement that manually. I don't think we have a generic memory
pool in qemu yet.

> But note that while reusable buffers from a pool would cut down on the
> per-I/O malloc/free overhead if we switch decompression away from a
> per-BDS buffer, it would still not solve the fact that we only get the
> caching ability where multiple sub-cluster requests from the same
> compressed cluster require only one decompression, since that's only
> possible on a per-BDS caching level.

Yes, as I said above, I didn't notice that it's a real cache. Without
the possibility to use Qcow2Cache instead, we'll want to keep it.

Kevin
On Wed 21 Feb 2018 05:59:58 PM CET, Eric Blake wrote:
> But as Berto has convinced me that an externally produced image can
> convince us to read up to 4M (even though we don't need that much to
> decompress),

A (harmless but funny) consequence of the way this works is that for
any valid compressed cluster you should be able to increase the value
of the size field as much as you want without causing any user-visible
effect.

So if you're working with 2MB clusters but for a particular compressed
cluster the size field is 0x0006 (7 sectors) you can still increase it
to the maximum (0x1fff, or 8192 sectors) and it should work just the
same. QEMU will read 4MB instead of ~4KB but since decompression stops
once the original cluster has been restored there's no harm.

I think I'll write a test case for this; it can be useful to verify
that QEMU can handle this kind of scenario.

Berto
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 85be7d5e340..8c4b26ceaf2 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1603,15 +1603,9 @@ int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
          * are freed in .bdrv_close().
          */
         if (!s->cluster_data) {
-            /* one more sector for decompressed data alignment */
-            s->cluster_data = qemu_try_blockalign(bs->file->bs,
-                    QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size + 512);
-            if (!s->cluster_data) {
-                return -ENOMEM;
-            }
-        }
-        if (!s->cluster_cache) {
-            s->cluster_cache = g_malloc(s->cluster_size);
+            assert(!s->cluster_cache);
+            s->cluster_data = g_try_malloc(s->cluster_size);
+            s->cluster_cache = g_try_malloc(s->cluster_size);
         }

         BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
diff --git a/block/qcow2.c b/block/qcow2.c
index 288b5299d80..6ad3436e0e5 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2103,7 +2103,7 @@ static void qcow2_close(BlockDriverState *bs)
     g_free(s->image_backing_format);

     g_free(s->cluster_cache);
-    qemu_vfree(s->cluster_data);
+    g_free(s->cluster_data);
     qcow2_refcount_close(bs);
     qcow2_free_snapshots(bs);
 }
When reading a compressed image, we were allocating s->cluster_data to
32*cluster_size + 512 (possibly over 64 megabytes, for an image with
2M clusters).  Let's check out the history:

Back when qcow2 was first written, we used s->cluster_data for
everything, including copy_sectors() and encryption, where we want to
operate on more than one cluster at once.  Obviously, at that point,
the buffer had to be aligned for other users, even though compression
itself doesn't require any alignment.

But commit 1b9f1491 (v1.1!) changed things to allocate parallel
buffers on demand rather than sharing a single buffer, for encryption
and COW, leaving compression as the final client of s->cluster_data.
That use was still preserved, because if a single compressed cluster
is read more than once, we reuse the cache instead of decompressing it
a second time (I'm not sure how often this optimization actually
fires, or if it penalizes us from being able to decompress multiple
clusters in parallel even though we can now decrypt clusters in
parallel; the XXX comment in qcow2_co_preadv for
QCOW2_CLUSTER_COMPRESSED is telling).

Much later, in commit de82815d (v2.2), we noticed that a 64M
allocation is prone to failure, so we switched over to a graceful
memory allocation error message.  But note that elsewhere in the code,
we do g_malloc(2 * cluster_size) without ever checking for failure.

Then even later, in 3e4c7052 (2.11), we realized that allocating a
large buffer up front for every qcow2 image is expensive, and switched
to lazy allocation only for images that actually had compressed
clusters.  But in the process, we never even bothered to check whether
what we were allocating still made sense in its new context!

So, it's time to cut back on the waste.
A compressed cluster will NEVER occupy more than an uncompressed
cluster (okay, gzip DOES document that because the compression stream
adds metadata, and because of the pigeonhole principle, there are
worst case scenarios where attempts to compress will actually inflate
an image - but in those cases, we would just write the cluster
uncompressed instead of inflating it).  And as that is a smaller
amount of memory, we can get by with the simpler g_malloc.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/qcow2-cluster.c | 12 +++---------
 block/qcow2.c         |  2 +-
 2 files changed, 4 insertions(+), 10 deletions(-)