Message ID | cover.1615922644.git.osandov@fb.com (mailing list archive) |
---|---|
Headers | show |
Series | fs: interface for directly reading/writing compressed data | expand |
On 3/16/21 3:42 PM, Omar Sandoval wrote: > From: Omar Sandoval <osandov@fb.com> > > This series adds an API for reading compressed data on a filesystem > without decompressing it as well as support for writing compressed data > directly to the filesystem. As with the previous submissions, I've > included a man page patch describing the API. I have test cases > (including fsstress support) and example programs which I'll send up > [1]. > > The main use-case is Btrfs send/receive: currently, when sending data > from one compressed filesystem to another, the sending side decompresses > the data and the receiving side recompresses it before writing it out. > This is wasteful and can be avoided if we can just send and write > compressed extents. The patches implementing the send/receive support > will be sent shortly. > > Patches 1-3 add the VFS support and UAPI. Patch 4 is a fix that this > series depends on; it can be merged independently. Patches 5-8 are Btrfs > prep patches. Patch 9 adds Btrfs encoded read support and patch 10 adds > Btrfs encoded write support. > > These patches are based on Dave Sterba's Btrfs misc-next branch [2], > which is in turn currently based on v5.12-rc3. > Al, Can we get some movement on this? Omar is sort of spinning his wheels here trying to get this stuff merged, no major changes have been done in a few postings. Thanks, Josef
On Fri, Mar 19, 2021 at 11:21 AM Josef Bacik <josef@toxicpanda.com> wrote: > > Can we get some movement on this? Omar is sort of spinning his wheels here > trying to get this stuff merged, no major changes have been done in a few > postings. I'm not Al, and I absolutely detest the IOCB_ENCODED thing, and want more explanations of why this should be done that way, and pollute our iov_iter handling EVEN MORE. Our iov_iter stuff isn't the most legible, and I don't understand why anybody would ever think it's a good idea to spread what is clearly a "struct" inside multiple different iov extents. Honestly, this sounds way more like an ioctl interface than read/write. We've done that before. But if it has to be done with an iov_iter, is there *any* reason to not make it be a hard rule that iov[0] should not be the entirely of the struct, and the code shouldn't ever need to iterate? Also I see references to the man-page, but honestly, that's not how the kernel UAPI should be defined ("just read the man-page"), plus I refuse to live in the 70's, and consider troff to be an atrocious format. So make the UAPI explanation for this horror be in a legible format that is actually part of the kernel so that I can read what the intent is, instead of having to decode hieroglypics. Linus
On 3/19/21 4:08 PM, Linus Torvalds wrote: > On Fri, Mar 19, 2021 at 11:21 AM Josef Bacik <josef@toxicpanda.com> wrote: >> >> Can we get some movement on this? Omar is sort of spinning his wheels here >> trying to get this stuff merged, no major changes have been done in a few >> postings. > > I'm not Al, and I absolutely detest the IOCB_ENCODED thing, and want > more explanations of why this should be done that way, and pollute our > iov_iter handling EVEN MORE. > > Our iov_iter stuff isn't the most legible, and I don't understand why > anybody would ever think it's a good idea to spread what is clearly a > "struct" inside multiple different iov extents. > > Honestly, this sounds way more like an ioctl interface than > read/write. We've done that before. That's actually the way this started https://lore.kernel.org/linux-fsdevel/8eae56abb90c0fe87c350322485ce8674e135074.1567623877.git.osandov@fb.com/ it was suggested that Omar make it generic by Dave Chinner, hence this is the direction it took. I'll leave the rest of the comments for Omar to respond to himself. Thanks, Josef
On Fri, Mar 19, 2021 at 01:08:05PM -0700, Linus Torvalds wrote: > On Fri, Mar 19, 2021 at 11:21 AM Josef Bacik <josef@toxicpanda.com> wrote: > > > > Can we get some movement on this? Omar is sort of spinning his wheels here > > trying to get this stuff merged, no major changes have been done in a few > > postings. > > I'm not Al, and I absolutely detest the IOCB_ENCODED thing, and want > more explanations of why this should be done that way, and pollute our > iov_iter handling EVEN MORE. > > Our iov_iter stuff isn't the most legible, and I don't understand why > anybody would ever think it's a good idea to spread what is clearly a > "struct" inside multiple different iov extents. > > Honestly, this sounds way more like an ioctl interface than > read/write. We've done that before. As Josef just mentioned, I started with an ioctl, and Dave Chinner suggested doing it this way: https://lore.kernel.org/linux-fsdevel/20190905021012.GL7777@dread.disaster.area/ The nice thing about it is that it sidesteps all of the issues we had with the dedupe/reflink ioctls over the years where permissions checks and the like were slightly different from read/write. > But if it has to be done with an iov_iter, is there *any* reason to > not make it be a hard rule that iov[0] should not be the entirely of > the struct, and the code shouldn't ever need to iterate? For RWF_ENCODED, iov[0] is always used as the entirety of the struct. I made the helper more generic to support other use cases, but if that's the main objection I can easily make it specifically use iov[0]. > Also I see references to the man-page, but honestly, that's not how > the kernel UAPI should be defined ("just read the man-page"), plus I > refuse to live in the 70's, and consider troff to be an atrocious > format. No disagreement here, troff is horrible to read. > So make the UAPI explanation for this horror be in a legible format > that is actually part of the kernel so that I can read what the intent > is, instead of having to decode hieroglypics. I didn't want to document the UAPI in two places that would need to be kept in sync, but this is fair enough. I'll add UAPI documentation to the kernel. In the meantime, for anyone else following along, here's the formatted man page: ENCODED_IO(7) Linux Programmer's Manual ENCODED_IO(7) NAME encoded_io - overview of encoded I/O DESCRIPTION Several filesystems (e.g., Btrfs) support transparent encoding (e.g., compression, encryption) of data on disk: written data is encoded by the kernel before it is written to disk, and read data is decoded before being returned to the user. In some cases, it is useful to skip this encoding step. For example, the user may want to read the compressed contents of a file or write pre-compressed data directly to a file. This is referred to as "encoded I/O". Encoded I/O API Encoded I/O is specified with the RWF_ENCODED flag to preadv2(2) and pwritev2(2). If RWF_ENCODED is specified, then iov[0].iov_base points to an encoded_iov structure, defined in <linux/fs.h> as: struct encoded_iov { __aligned_u64 len; __aligned_u64 unencoded_len; __aligned_u64 unencoded_offset; __u32 compression; __u32 encryption; }; This may be extended in the future, so iov[0].iov_len must be set to sizeof(struct encoded_iov) for forward/backward compati‐ bility. The remaining buffers contain the encoded data. compression and encryption are the encoding fields. compres‐ sion is ENCODED_IOV_COMPRESSION_NONE (zero) or a filesystem- specific ENCODED_IOV_COMPRESSION_* constant; see Filesystem support below. encryption is currently always ENCODED_IOV_EN‐ CRYPTION_NONE (zero). unencoded_len is the length of the unencoded (i.e., decrypted and decompressed) data. unencoded_offset is the offset from the first byte of the unencoded data to the first byte of logi‐ cal data in the file (less than or equal to unencoded_len). len is the length of the data in the file (less than or equal to unencoded_len - unencoded_offset). See Extent layout below for some examples. If the unencoded data is actually longer than unencoded_len, then it is truncated; if it is shorter, then it is extended with zeroes. pwritev2(2) uses the metadata specified in iov[0], writes the encoded data from the remaining buffers, and returns the number of encoded bytes written (that is, the sum of iov[n].iov_len for 1 <= n < iovcnt; partial writes will not occur). At least one encoding field must be non-zero. Note that the encoded data is not validated when it is written; if it is not valid (e.g., it cannot be decompressed), then a subsequent read may return an error. If the offset argument to pwritev2(2) is -1, then the file offset is incremented by len. If iov[0].iov_len is less than sizeof(struct encoded_iov) in the kernel, then any fields unknown to user space are treated as if they were zero; if it is greater and any fields unknown to the kernel are non- zero, then this returns -1 and sets errno to E2BIG. preadv2(2) populates the metadata in iov[0], the encoded data in the remaining buffers, and returns the number of encoded bytes read. This will only return one extent per call. This can also read data which is not encoded; all encoding fields will be zero in that case. If the offset argument to preadv2(2) is -1, then the file offset is incremented by len. If iov[0].iov_len is less than sizeof(struct encoded_iov) in the kernel and any fields unknown to user space are non-zero, then preadv2(2) returns -1 and sets errno to E2BIG; if it is greater, then any fields unknown to the kernel are returned as zero. If the provided buffers are not large enough to return an entire encoded extent, then preadv2(2) returns -1 and sets errno to ENOBUFS. As the filesystem page cache typically contains decoded data, encoded I/O bypasses the page cache. Extent layout By using len, unencoded_len, and unencoded_offset, it is possi‐ ble to refer to a subset of an unencoded extent. In the simplest case, len is equal to unencoded_len and unen‐ coded_offset is zero. This means that the entire unencoded ex‐ tent is used. However, suppose we read 50 bytes into a file which contains a single compressed extent. The filesystem must still return the entire compressed extent for us to be able to decompress it, so unencoded_len would be the length of the entire decompressed extent. However, because the read was at offset 50, the first 50 bytes should be ignored. Therefore, unencoded_offset would be 50, and len would accordingly be unencoded_len - 50. Additionally, suppose we want to create an encrypted file with length 500, but the file is encrypted with a block cipher using a block size of 4096. The unencoded data would therefore in‐ clude the appropriate padding, and unencoded_len would be 4096. However, to represent the logical size of the file, len would be 500 (and unencoded_offset would be 0). Similar situations can arise in other cases: * If the filesystem pads data to the filesystem block size be‐ fore compressing, then compressed files with a size un‐ aligned to the filesystem block size will end with an extent with len < unencoded_len. * Extents cloned from the middle of a larger encoded extent with FICLONERANGE may have a non-zero unencoded_offset and/or len < unencoded_len. * If the middle of an encoded extent is overwritten, the filesystem may create extents with a non-zero unencoded_off‐ set and/or len < unencoded_len for the parts that were not overwritten. Security Encoded I/O creates the potential for some security issues: * Encoded writes allow writing arbitrary data which the kernel will decode on a subsequent read. Decompression algorithms are complex and may have bugs which can be exploited by ma‐ liciously crafted data. * Encoded reads may return data which is not logically present in the file (see the discussion of len vs unencoded_len above). It may not be intended for this data to be read‐ able. Therefore, encoded I/O requires privilege. Namely, the RWF_EN‐ CODED flag may only be used if the file description has the O_ALLOW_ENCODED file status flag set, and the O_ALLOW_ENCODED flag may only be set by a thread with the CAP_SYS_ADMIN capa‐ bility. The O_ALLOW_ENCODED flag can be set by open(2) or fc‐ ntl(2). It can also be cleared by fcntl(2); clearing it does not require CAP_SYS_ADMIN. Note that it is not cleared on fork(2) or execve(2). One may wish to use O_CLOEXEC with O_AL‐ LOW_ENCODED. Filesystem support Encoded I/O is supported on the following filesystems: Btrfs (since Linux 5.13) Btrfs supports encoded reads and writes of compressed data. The data is encoded as follows: * If compression is ENCODED_IOV_COMPRESSION_BTRFS_ZLIB, then the encoded data is a single zlib stream. * If compression is ENCODED_IOV_COMPRESSION_BTRFS_ZSTD, then the encoded data is a single zstd frame com‐ pressed with the windowLog compression parameter set to no more than 17. * If compression is one of ENCODED_IOV_COMPRES‐ SION_BTRFS_LZO_4K, ENCODED_IOV_COMPRES‐ SION_BTRFS_LZO_8K, ENCODED_IOV_COMPRES‐ SION_BTRFS_LZO_16K, ENCODED_IOV_COMPRES‐ SION_BTRFS_LZO_32K, or ENCODED_IOV_COMPRES‐ SION_BTRFS_LZO_64K, then the encoded data is com‐ pressed page by page (using the page size indicated by the name of the constant) with LZO1X and wrapped in the format documented in the Linux kernel source file fs/btrfs/lzo.c. Additionally, there are some restrictions on pwritev2(2): * offset (or the current file offset if offset is -1) must be aligned to the sector size of the filesystem. * len must be aligned to the sector size of the filesystem unless the data ends at or beyond the cur‐ rent end of the file. * unencoded_len and the length of the encoded data must each be no more than 128 KiB. This limit may in‐ crease in the future. * The length of the encoded data must be less than or equal to unencoded_len. * If using LZO, the filesystem's page size must match the compression page size. SEE ALSO open(2), preadv2(2) Linux 2020-11-11 ENCODED_IO(7)
On Fri, Mar 19, 2021 at 01:27:25PM -0700, Omar Sandoval wrote: > On Fri, Mar 19, 2021 at 01:08:05PM -0700, Linus Torvalds wrote: > > On Fri, Mar 19, 2021 at 11:21 AM Josef Bacik <josef@toxicpanda.com> wrote: > > > > > > Can we get some movement on this? Omar is sort of spinning his wheels here > > > trying to get this stuff merged, no major changes have been done in a few > > > postings. > > > > I'm not Al, and I absolutely detest the IOCB_ENCODED thing, and want > > more explanations of why this should be done that way, and pollute our > > iov_iter handling EVEN MORE. > > > > Our iov_iter stuff isn't the most legible, and I don't understand why > > anybody would ever think it's a good idea to spread what is clearly a > > "struct" inside multiple different iov extents. > > > > Honestly, this sounds way more like an ioctl interface than > > read/write. We've done that before. > > As Josef just mentioned, I started with an ioctl, and Dave Chinner > suggested doing it this way: > https://lore.kernel.org/linux-fsdevel/20190905021012.GL7777@dread.disaster.area/ > > The nice thing about it is that it sidesteps all of the issues we had > with the dedupe/reflink ioctls over the years where permissions checks > and the like were slightly different from read/write. > > > But if it has to be done with an iov_iter, is there *any* reason to > > not make it be a hard rule that iov[0] should not be the entirely of > > the struct, and the code shouldn't ever need to iterate? > > For RWF_ENCODED, iov[0] is always used as the entirety of the struct. I > made the helper more generic to support other use cases, but if that's > the main objection I can easily make it specifically use iov[0]. > > > Also I see references to the man-page, but honestly, that's not how > > the kernel UAPI should be defined ("just read the man-page"), plus I > > refuse to live in the 70's, and consider troff to be an atrocious > > format. > > No disagreement here, troff is horrible to read. Not a fan of troff either: One thing that might help you a little bit is to use pandoc to convert to troff. You can write your manpage in markdown or rst and then convert it into troff via pandoc. Still might require some massaging but it makes it considerably more pleasant to write a manpage. Christian
On Fri, Mar 19, 2021 at 1:27 PM Omar Sandoval <osandov@osandov.com> wrote: > > For RWF_ENCODED, iov[0] is always used as the entirety of the struct. I > made the helper more generic to support other use cases, but if that's > the main objection I can easily make it specifically use iov[0]. Honestly, with new interfaces, I'd prefer to always start off as limited as possible. And read/write is not very limited (but O_ALLOW_ENCODED and RWF_ENCODED at least helps with the "fool suid program to do it"). But at least we could make sure that the structure then has to be as strict as humanly possible. So it's not so much a "main objection" as more about trying to make the rules stricter in the hope that that at least makes only one very particular way of doing things valid. I'd hate for user space to start 'streaming" struct data. > > Also I see references to the man-page, but honestly, that's not how > > the kernel UAPI should be defined ("just read the man-page"), plus I > > refuse to live in the 70's, and consider troff to be an atrocious > > format. > > No disagreement here, troff is horrible to read. > > > So make the UAPI explanation for this horror be in a legible format > > that is actually part of the kernel so that I can read what the intent > > is, instead of having to decode hieroglypics. > > I didn't want to document the UAPI in two places that would need to be > kept in sync Honestly, I would suggest that nobody ever write troff format stuff. I don't think it supports UTF-8 properly, for example, which means that you can't even give credit to people properly etc. RST isn't perfect, but at least it's human-legible, and it's obviously what we're encouraging for kernel use. You can use rst2man to convert to bad formats (and yes, you might lose something in the translation, eg proper names etc). Almost anything else(*) is better than troff. But at least I can read the formatted version. Linus (*) With the possible exception of "info" files. Now *there* is a truly pointless format maximally designed to make it inconvenient for users.
On Fri, Mar 19, 2021 at 01:55:18PM -0700, Linus Torvalds wrote: > On Fri, Mar 19, 2021 at 1:27 PM Omar Sandoval <osandov@osandov.com> wrote: > > > > For RWF_ENCODED, iov[0] is always used as the entirety of the struct. I > > made the helper more generic to support other use cases, but if that's > > the main objection I can easily make it specifically use iov[0]. > > Honestly, with new interfaces, I'd prefer to always start off as > limited as possible. > > And read/write is not very limited (but O_ALLOW_ENCODED and > RWF_ENCODED at least helps with the "fool suid program to do it"). But > at least we could make sure that the structure then has to be as > strict as humanly possible. > > So it's not so much a "main objection" as more about trying to make > the rules stricter in the hope that that at least makes only one very > particular way of doing things valid. I'd hate for user space to start > 'streaming" struct data. After spending a few minutes trying to simplify copy_struct_from_iter(), it's honestly easier to just use the iterate_all_kinds() craziness than open coding it to only operate on iov[0]. But that's an implementation detail, and we can trivially make the interface stricter: diff --git a/lib/iov_iter.c b/lib/iov_iter.c index 395ca89e5d9b..41b6b0325d18 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -969,9 +969,9 @@ EXPORT_SYMBOL(copy_page_from_iter); * On success, the iterator is advanced @usize bytes. On error, the iterator is * not advanced. */ -int copy_struct_from_iter(void *dst, size_t ksize, struct iov_iter *i, - size_t usize) +int copy_struct_from_iter(void *dst, size_t ksize, struct iov_iter *i) { + size_t usize = iov_iter_single_seg_count(i); if (usize <= ksize) { if (!copy_from_iter_full(dst, usize, i)) return -EFAULT; I.e., the size of the userspace structure is always the remaining size of the current segment. Maybe we can even throw in a check that we're either at the beginning of the current segment or the very beginning of the whole iter, what do you think? (Again, this is what RWF_ENCODED already does, it was just easier to write copy_struct_from_iter() more generically). > > > Also I see references to the man-page, but honestly, that's not how > > > the kernel UAPI should be defined ("just read the man-page"), plus I > > > refuse to live in the 70's, and consider troff to be an atrocious > > > format. > > > > No disagreement here, troff is horrible to read. > > > > > So make the UAPI explanation for this horror be in a legible format > > > that is actually part of the kernel so that I can read what the intent > > > is, instead of having to decode hieroglypics. > > > > I didn't want to document the UAPI in two places that would need to be > > kept in sync > > Honestly, I would suggest that nobody ever write troff format stuff. > I don't think it supports UTF-8 properly, for example, which means > that you can't even give credit to people properly etc. > > RST isn't perfect, but at least it's human-legible, and it's obviously > what we're encouraging for kernel use. You can use rst2man to convert > to bad formats (and yes, you might lose something in the translation, > eg proper names etc). It looks like there's precedent for man pages being generated from the kernel documentation [1], so I'll try that. 1: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=53666f6c30451cde022f65d35a8d448f5a7132ba > Almost anything else(*) is better than troff. But at least I can read > the formatted version. > > Linus > > (*) With the possible exception of "info" files. Now *there* is a > truly pointless format maximally designed to make it inconvenient for > users.
On Fri, Mar 19, 2021 at 2:12 PM Omar Sandoval <osandov@osandov.com> wrote: > > After spending a few minutes trying to simplify copy_struct_from_iter(), > it's honestly easier to just use the iterate_all_kinds() craziness than > open coding it to only operate on iov[0]. But that's an implementation > detail, and we can trivially make the interface stricter: This is an improvement, but talking about the iterate_all_kinds() craziness, I think your existing code is broken. That third case (kernel pointer source): + copy = min(ksize - copied, v.iov_len); + memcpy(dst + copied, v.iov_base, copy); + if (memchr_inv(v.iov_base, 0, v.iov_len)) + return -E2BIG; can't be right. Aren't you checking that it's *all* zero, even the part you copied? Our iov_iter stuff is really complicated already, this is part of why I'm not a huge fan of using it. I still suspect you'd be better off not using the iterate_all_kinds() thing at all, and just explicitly checking ITER_BVEC/ITER_KVEC manually. Because you can play games like fooling your "copy_struct_from_iter()" to not copy anything at all with ITER_DISCARD, can't you? Which then sounds like it might end up being useful as a kernel data leak, because it will use some random uninitialized kernel memory for the structure. Now, I don't think you can actually get that ITER_DISCARD case, so this is not *really* a problem, but it's another example of how that iterate_all_kinds() thing has these subtle cases embedded into it. The whole point of copy_struct_from_iter() is presumably to be the same kind of "obviously safe" interface as copy_struct_from_user() is meant to be, so these subtle cases just then make me go "Hmm". I think just open-coding this when you know there is no actual looping going on, and the data has to be at the *beginning*, should be fairly simple. What makes iterate_all_kinds() complicated is that iteration, the fact that there can be empty entries in there, but it's also that "iov_offset" thing etc. For the case where you just (a) require that iov_offset is zero, and (b) everything has to fit into the very first iov entry (regardless of what type that iov entry is), I think you actually end up with a much simpler model. I do realize that I am perhaps concentrating a bit too much on this one part of the patch series, but the iov_iter thing has bitten us before. And it has bitten really core developers and then Al has had to fix up mistakes. In fact, it wasn't that long ago that I asked Al to verify code I wrote, because I was worried about having missed something subtle. So now when I see these iov_iter users, it just makes me go all nervous. Linus
On Fri, Mar 19, 2021 at 02:47:03PM -0700, Linus Torvalds wrote: > On Fri, Mar 19, 2021 at 2:12 PM Omar Sandoval <osandov@osandov.com> wrote: > > > > After spending a few minutes trying to simplify copy_struct_from_iter(), > > it's honestly easier to just use the iterate_all_kinds() craziness than > > open coding it to only operate on iov[0]. But that's an implementation > > detail, and we can trivially make the interface stricter: > > This is an improvement, but talking about the iterate_all_kinds() > craziness, I think your existing code is broken. > > That third case (kernel pointer source): > > + copy = min(ksize - copied, v.iov_len); > + memcpy(dst + copied, v.iov_base, copy); > + if (memchr_inv(v.iov_base, 0, v.iov_len)) > + return -E2BIG; > > can't be right. Aren't you checking that it's *all* zero, even the > part you copied? Oops, that should of course be if (memchr_inv(v.iov_base + copy, 0, v.iov_len - copy)) return -E2BIG; like the other cases. Point taken, though. > Our iov_iter stuff is really complicated already, this is part of why > I'm not a huge fan of using it. > > I still suspect you'd be better off not using the iterate_all_kinds() > thing at all, and just explicitly checking ITER_BVEC/ITER_KVEC > manually. > > Because you can play games like fooling your "copy_struct_from_iter()" > to not copy anything at all with ITER_DISCARD, can't you? > > Which then sounds like it might end up being useful as a kernel data > leak, because it will use some random uninitialized kernel memory for > the structure. > > Now, I don't think you can actually get that ITER_DISCARD case, so > this is not *really* a problem, but it's another example of how that > iterate_all_kinds() thing has these subtle cases embedded into it. Right, that would probably be better off returning EFAULT or something for ITER_DISCARD. > The whole point of copy_struct_from_iter() is presumably to be the > same kind of "obviously safe" interface as copy_struct_from_user() is > meant to be, so these subtle cases just then make me go "Hmm". > > I think just open-coding this when you know there is no actual > looping going on, and the data has to be at the *beginning*, should be > fairly simple. What makes iterate_all_kinds() complicated is that > iteration, the fact that there can be empty entries in there, but it's > also that "iov_offset" thing etc. > > For the case where you just (a) require that iov_offset is zero, and > (b) everything has to fit into the very first iov entry (regardless of > what type that iov entry is), I think you actually end up with a much > simpler model. > > I do realize that I am perhaps concentrating a bit too much on this > one part of the patch series, but the iov_iter thing has bitten us > before. And it has bitten really core developers and then Al has had > to fix up mistakes. > > In fact, it wasn't that long ago that I asked Al to verify code I > wrote, because I was worried about having missed something subtle. So > now when I see these iov_iter users, it just makes me go all nervous. So here's what it looks like with these restrictions (chances are there's a bug or two in here): int copy_struct_from_iter(void *dst, size_t ksize, struct iov_iter *i) { size_t usize; int ret; if (i->iov_offset != 0) return -EINVAL; if (iter_is_iovec(i)) { usize = i->iov->iov_len; might_fault(); if (copyin(dst, i->iov->iov_base, min(ksize, usize))) return -EFAULT; if (usize > ksize) { ret = check_zeroed_user(i->iov->iov_base + ksize, usize - ksize); if (ret < 0) return ret; else if (ret == 0) return -E2BIG; } } else if (iov_iter_is_kvec(i)) { usize = i->kvec->iov_len; memcpy(dst, i->kvec->iov_base, min(ksize, usize)); if (usize > ksize && memchr_inv(i->kvec->iov_base + ksize, 0, usize - ksize)) return -E2BIG; } else if (iov_iter_is_bvec(i)) { char *p; usize = i->bvec->bv_len; p = kmap_atomic(i->bvec->bv_page); memcpy(dst, p + i->bvec->bv_offset, min(ksize, usize)); if (usize > ksize && memchr_inv(p + i->bvec->bv_offset + ksize, 0, usize - ksize)) { kunmap_atomic(p); return -E2BIG; } kunmap_atomic(p); } else { return -EFAULT; } if (usize < ksize) memset(dst + usize, 0, ksize - usize); iov_iter_advance(i, usize); return 0; } Not much shorter, but it is easier to follow.
On Fri, Mar 19, 2021 at 3:46 PM Omar Sandoval <osandov@osandov.com> wrote: > > Not much shorter, but it is easier to follow. Yeah, that looks about right to me. You should probably use kmap_local_page() rather than kmap_atomic() these days, but other than that this looks fairly straightforward, and I much prefer the model where we very much force that "must be the first iovec entry". As you say, maybe not shorter, but a lot more straightforward. That said, looking through the patch series, I see at least one other issue. Look at parisc: +#define O_ALLOW_ENCODED 100000000 yeah, that's completely wrong. I see how it happened, but that's _really_ wrong. I would want others to take a look in case there's something else. I'm not qualified to comment about (nor do I deeply care) about the btrfs parts, but the generic interface parts should most definitely get more attention. By Al, if possible, but other fs people too.. Linus
On Fri, Mar 19, 2021 at 05:31:18PM -0700, Linus Torvalds wrote: > On Fri, Mar 19, 2021 at 3:46 PM Omar Sandoval <osandov@osandov.com> wrote: > > > > Not much shorter, but it is easier to follow. > > Yeah, that looks about right to me. > > You should probably use kmap_local_page() rather than kmap_atomic() > these days, but other than that this looks fairly straightforward, and > I much prefer the model where we very much force that "must be the > first iovec entry". To be exact, this code only enforces that the iov_iter is at the beginning of the current entry. As far as I can tell, iov_iter doesn't track its position overall, so there's no way to tell whether the current entry is the first one. > As you say, maybe not shorter, but a lot more straightforward. > > That said, looking through the patch series, I see at least one other > issue. Look at parisc: > > +#define O_ALLOW_ENCODED 100000000 > > yeah, that's completely wrong. I see how it happened, but that's _really_ wrong. Ugh, that's embarrassing. It also happens to add exactly one new bit to VALID_OPEN_FLAGS, so the BUILD_BUG_ON() in fcntl_init() didn't catch it. > I would want others to take a look in case there's something else. I'm > not qualified to comment about (nor do I deeply care) about the btrfs > parts, but the generic interface parts should most definitely get more > attention. > > By Al, if possible, but other fs people too.. That's all I ask. I'll fix your comments and wait a few days to get some more feedback on the fs side before I resend the patch bomb. Thanks, Omar
From: Omar Sandoval <osandov@fb.com> This series adds an API for reading compressed data on a filesystem without decompressing it as well as support for writing compressed data directly to the filesystem. As with the previous submissions, I've included a man page patch describing the API. I have test cases (including fsstress support) and example programs which I'll send up [1]. The main use-case is Btrfs send/receive: currently, when sending data from one compressed filesystem to another, the sending side decompresses the data and the receiving side recompresses it before writing it out. This is wasteful and can be avoided if we can just send and write compressed extents. The patches implementing the send/receive support will be sent shortly. Patches 1-3 add the VFS support and UAPI. Patch 4 is a fix that this series depends on; it can be merged independently. Patches 5-8 are Btrfs prep patches. Patch 9 adds Btrfs encoded read support and patch 10 adds Btrfs encoded write support. These patches are based on Dave Sterba's Btrfs misc-next branch [2], which is in turn currently based on v5.12-rc3. Changes since v7 [3]: - Rebased on current kdave/misc-next (v5.12-rc3). Omar Sandoval (10): iov_iter: add copy_struct_from_iter() fs: add O_ALLOW_ENCODED open flag fs: add RWF_ENCODED for reading/writing compressed data btrfs: fix check_data_csum() error message for direct I/O btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio() btrfs: add ram_bytes and offset to btrfs_ordered_extent btrfs: support different disk extent size for delalloc btrfs: optionally extend i_size in cow_file_range_inline() btrfs: implement RWF_ENCODED reads btrfs: implement RWF_ENCODED writes 1: https://github.com/osandov/xfstests/tree/rwf-encoded 2: https://github.com/kdave/btrfs-devel/tree/misc-next 3: https://lore.kernel.org/linux-btrfs/cover.1611346706.git.osandov@fb.com/ Documentation/filesystems/encoded_io.rst | 74 ++ Documentation/filesystems/index.rst | 1 + arch/alpha/include/uapi/asm/fcntl.h | 1 + arch/parisc/include/uapi/asm/fcntl.h | 1 + arch/sparc/include/uapi/asm/fcntl.h | 1 + fs/btrfs/compression.c | 12 +- fs/btrfs/compression.h | 6 +- fs/btrfs/ctree.h | 9 +- fs/btrfs/delalloc-space.c | 18 +- fs/btrfs/file-item.c | 35 +- fs/btrfs/file.c | 46 +- fs/btrfs/inode.c | 936 ++++++++++++++++++++--- fs/btrfs/ordered-data.c | 80 +- fs/btrfs/ordered-data.h | 18 +- fs/btrfs/relocation.c | 4 +- fs/fcntl.c | 10 +- fs/namei.c | 4 + fs/read_write.c | 168 +++- include/linux/encoded_io.h | 17 + include/linux/fcntl.h | 2 +- include/linux/fs.h | 13 + include/linux/uio.h | 2 + include/uapi/asm-generic/fcntl.h | 4 + include/uapi/linux/encoded_io.h | 30 + include/uapi/linux/fs.h | 5 +- lib/iov_iter.c | 82 ++ 26 files changed, 1378 insertions(+), 201 deletions(-) create mode 100644 Documentation/filesystems/encoded_io.rst create mode 100644 include/linux/encoded_io.h create mode 100644 include/uapi/linux/encoded_io.h