[v8,00/10] fs: interface for directly reading/writing compressed data

Message ID	cover.1615922644.git.osandov@fb.com (mailing list archive)
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: Omar Sandoval <osandov@osandov.com> To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org, Al Viro <viro@zeniv.linux.org.uk>, Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com>, Jann Horn <jannh@google.com>, Amir Goldstein <amir73il@gmail.com>, Aleksa Sarai <cyphar@cyphar.com>, linux-api@vger.kernel.org, kernel-team@fb.com Subject: [PATCH v8 00/10] fs: interface for directly reading/writing compressed data Date: Tue, 16 Mar 2021 12:42:55 -0700 Message-Id: <cover.1615922644.git.osandov@fb.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	fs: interface for directly reading/writing compressed data \| expand [v8,00/10] fs: interface for directly reading/writing compressed data [v8,01/10] iov_iter: add copy_struct_from_iter() [v8,02/10] fs: add O_ALLOW_ENCODED open flag [v8,03/10] fs: add RWF_ENCODED for reading/writing compressed data [v8,04/10] btrfs: fix check_data_csum() error message for direct I/O [v8,05/10] btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio() [v8,06/10] btrfs: add ram_bytes and offset to btrfs_ordered_extent [v8,07/10] btrfs: support different disk extent size for delalloc [v8,08/10] btrfs: optionally extend i_size in cow_file_range_inline() [v8,09/10] btrfs: implement RWF_ENCODED reads [v8,10/10] btrfs: implement RWF_ENCODED writes

Omar Sandoval March 16, 2021, 7:42 p.m. UTC

From: Omar Sandoval <osandov@fb.com>

This series adds an API for reading compressed data on a filesystem
without decompressing it as well as support for writing compressed data
directly to the filesystem. As with the previous submissions, I've
included a man page patch describing the API. I have test cases
(including fsstress support) and example programs which I'll send up
[1].

The main use-case is Btrfs send/receive: currently, when sending data
from one compressed filesystem to another, the sending side decompresses
the data and the receiving side recompresses it before writing it out.
This is wasteful and can be avoided if we can just send and write
compressed extents. The patches implementing the send/receive support
will be sent shortly.

Patches 1-3 add the VFS support and UAPI. Patch 4 is a fix that this
series depends on; it can be merged independently. Patches 5-8 are Btrfs
prep patches. Patch 9 adds Btrfs encoded read support and patch 10 adds
Btrfs encoded write support.

These patches are based on Dave Sterba's Btrfs misc-next branch [2],
which is in turn currently based on v5.12-rc3.

Changes since v7 [3]:

- Rebased on current kdave/misc-next (v5.12-rc3).

Omar Sandoval (10):
  iov_iter: add copy_struct_from_iter()
  fs: add O_ALLOW_ENCODED open flag
  fs: add RWF_ENCODED for reading/writing compressed data
  btrfs: fix check_data_csum() error message for direct I/O
  btrfs: don't advance offset for compressed bios in
    btrfs_csum_one_bio()
  btrfs: add ram_bytes and offset to btrfs_ordered_extent
  btrfs: support different disk extent size for delalloc
  btrfs: optionally extend i_size in cow_file_range_inline()
  btrfs: implement RWF_ENCODED reads
  btrfs: implement RWF_ENCODED writes

1: https://github.com/osandov/xfstests/tree/rwf-encoded
2: https://github.com/kdave/btrfs-devel/tree/misc-next
3: https://lore.kernel.org/linux-btrfs/cover.1611346706.git.osandov@fb.com/

 Documentation/filesystems/encoded_io.rst |  74 ++
 Documentation/filesystems/index.rst      |   1 +
 arch/alpha/include/uapi/asm/fcntl.h      |   1 +
 arch/parisc/include/uapi/asm/fcntl.h     |   1 +
 arch/sparc/include/uapi/asm/fcntl.h      |   1 +
 fs/btrfs/compression.c                   |  12 +-
 fs/btrfs/compression.h                   |   6 +-
 fs/btrfs/ctree.h                         |   9 +-
 fs/btrfs/delalloc-space.c                |  18 +-
 fs/btrfs/file-item.c                     |  35 +-
 fs/btrfs/file.c                          |  46 +-
 fs/btrfs/inode.c                         | 936 ++++++++++++++++++++---
 fs/btrfs/ordered-data.c                  |  80 +-
 fs/btrfs/ordered-data.h                  |  18 +-
 fs/btrfs/relocation.c                    |   4 +-
 fs/fcntl.c                               |  10 +-
 fs/namei.c                               |   4 +
 fs/read_write.c                          | 168 +++-
 include/linux/encoded_io.h               |  17 +
 include/linux/fcntl.h                    |   2 +-
 include/linux/fs.h                       |  13 +
 include/linux/uio.h                      |   2 +
 include/uapi/asm-generic/fcntl.h         |   4 +
 include/uapi/linux/encoded_io.h          |  30 +
 include/uapi/linux/fs.h                  |   5 +-
 lib/iov_iter.c                           |  82 ++
 26 files changed, 1378 insertions(+), 201 deletions(-)
 create mode 100644 Documentation/filesystems/encoded_io.rst
 create mode 100644 include/linux/encoded_io.h
 create mode 100644 include/uapi/linux/encoded_io.h

Josef Bacik March 19, 2021, 6:21 p.m. UTC | #1

On 3/16/21 3:42 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> This series adds an API for reading compressed data on a filesystem
> without decompressing it as well as support for writing compressed data
> directly to the filesystem. As with the previous submissions, I've
> included a man page patch describing the API. I have test cases
> (including fsstress support) and example programs which I'll send up
> [1].
> 
> The main use-case is Btrfs send/receive: currently, when sending data
> from one compressed filesystem to another, the sending side decompresses
> the data and the receiving side recompresses it before writing it out.
> This is wasteful and can be avoided if we can just send and write
> compressed extents. The patches implementing the send/receive support
> will be sent shortly.
> 
> Patches 1-3 add the VFS support and UAPI. Patch 4 is a fix that this
> series depends on; it can be merged independently. Patches 5-8 are Btrfs
> prep patches. Patch 9 adds Btrfs encoded read support and patch 10 adds
> Btrfs encoded write support.
> 
> These patches are based on Dave Sterba's Btrfs misc-next branch [2],
> which is in turn currently based on v5.12-rc3.
> 

Al,

Can we get some movement on this?  Omar is sort of spinning his wheels here 
trying to get this stuff merged, no major changes have been done in a few 
postings.  Thanks,

Josef

Linus Torvalds March 19, 2021, 8:08 p.m. UTC | #2

On Fri, Mar 19, 2021 at 11:21 AM Josef Bacik <josef@toxicpanda.com> wrote:
>
> Can we get some movement on this?  Omar is sort of spinning his wheels here
> trying to get this stuff merged, no major changes have been done in a few
> postings.

I'm not Al, and I absolutely detest the IOCB_ENCODED thing, and want
more explanations of why this should be done that way, and pollute our
iov_iter handling EVEN MORE.

Our iov_iter stuff isn't the most legible, and I don't understand why
anybody would ever think it's a good idea to spread what is clearly a
"struct" inside multiple different iov extents.

Honestly, this sounds way more like an ioctl interface than
read/write. We've done that before.

But if it has to be done with an iov_iter, is there *any* reason to
not make it be a hard rule that iov[0] should not be the entirely of
the struct, and the code shouldn't ever need to iterate?

Also I see references to the man-page, but honestly, that's not how
the kernel UAPI should be defined ("just read the man-page"), plus I
refuse to live in the 70's, and consider troff to be an atrocious
format.

So make the UAPI explanation for this horror be in a legible format
that is actually part of the kernel so that I can read what the intent
is, instead of having to decode hieroglypics.

            Linus

Josef Bacik March 19, 2021, 8:12 p.m. UTC | #3

On 3/19/21 4:08 PM, Linus Torvalds wrote:
> On Fri, Mar 19, 2021 at 11:21 AM Josef Bacik <josef@toxicpanda.com> wrote:
>>
>> Can we get some movement on this?  Omar is sort of spinning his wheels here
>> trying to get this stuff merged, no major changes have been done in a few
>> postings.
> 
> I'm not Al, and I absolutely detest the IOCB_ENCODED thing, and want
> more explanations of why this should be done that way, and pollute our
> iov_iter handling EVEN MORE.
> 
> Our iov_iter stuff isn't the most legible, and I don't understand why
> anybody would ever think it's a good idea to spread what is clearly a
> "struct" inside multiple different iov extents.
> 
> Honestly, this sounds way more like an ioctl interface than
> read/write. We've done that before.

That's actually the way this started

https://lore.kernel.org/linux-fsdevel/8eae56abb90c0fe87c350322485ce8674e135074.1567623877.git.osandov@fb.com/

it was suggested that Omar make it generic by Dave Chinner, hence this is the 
direction it took.  I'll leave the rest of the comments for Omar to respond to 
himself.  Thanks,

Josef

Omar Sandoval March 19, 2021, 8:27 p.m. UTC | #4

On Fri, Mar 19, 2021 at 01:08:05PM -0700, Linus Torvalds wrote:
> On Fri, Mar 19, 2021 at 11:21 AM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > Can we get some movement on this?  Omar is sort of spinning his wheels here
> > trying to get this stuff merged, no major changes have been done in a few
> > postings.
> 
> I'm not Al, and I absolutely detest the IOCB_ENCODED thing, and want
> more explanations of why this should be done that way, and pollute our
> iov_iter handling EVEN MORE.
> 
> Our iov_iter stuff isn't the most legible, and I don't understand why
> anybody would ever think it's a good idea to spread what is clearly a
> "struct" inside multiple different iov extents.
> 
> Honestly, this sounds way more like an ioctl interface than
> read/write. We've done that before.

As Josef just mentioned, I started with an ioctl, and Dave Chinner
suggested doing it this way:
https://lore.kernel.org/linux-fsdevel/20190905021012.GL7777@dread.disaster.area/

The nice thing about it is that it sidesteps all of the issues we had
with the dedupe/reflink ioctls over the years where permissions checks
and the like were slightly different from read/write.

> But if it has to be done with an iov_iter, is there *any* reason to
> not make it be a hard rule that iov[0] should not be the entirely of
> the struct, and the code shouldn't ever need to iterate?

For RWF_ENCODED, iov[0] is always used as the entirety of the struct. I
made the helper more generic to support other use cases, but if that's
the main objection I can easily make it specifically use iov[0].

> Also I see references to the man-page, but honestly, that's not how
> the kernel UAPI should be defined ("just read the man-page"), plus I
> refuse to live in the 70's, and consider troff to be an atrocious
> format.

No disagreement here, troff is horrible to read.

> So make the UAPI explanation for this horror be in a legible format
> that is actually part of the kernel so that I can read what the intent
> is, instead of having to decode hieroglypics.

I didn't want to document the UAPI in two places that would need to be
kept in sync, but this is fair enough. I'll add UAPI documentation to
the kernel. In the meantime, for anyone else following along, here's the
formatted man page:

ENCODED_IO(7)          Linux Programmer's Manual         ENCODED_IO(7)

NAME
       encoded_io - overview of encoded I/O

DESCRIPTION
       Several  filesystems (e.g., Btrfs) support transparent encoding
       (e.g., compression, encryption) of data on disk:  written  data
       is encoded by the kernel before it is written to disk, and read
       data is decoded before being returned to  the  user.   In  some
       cases,  it  is useful to skip this encoding step.  For example,
       the user may want to read the compressed contents of a file  or
       write pre-compressed data directly to a file.  This is referred
       to as "encoded I/O".

   Encoded I/O API
       Encoded  I/O  is  specified  with  the  RWF_ENCODED   flag   to
       preadv2(2)  and pwritev2(2).  If RWF_ENCODED is specified, then
       iov[0].iov_base points to an encoded_iov structure, defined  in
       <linux/fs.h> as:

           struct encoded_iov {
               __aligned_u64 len;
               __aligned_u64 unencoded_len;
               __aligned_u64 unencoded_offset;
               __u32 compression;
               __u32 encryption;
           };

       This  may  be extended in the future, so iov[0].iov_len must be
       set to sizeof(struct encoded_iov) for forward/backward compati‐
       bility.  The remaining buffers contain the encoded data.

       compression  and  encryption are the encoding fields.  compres‐
       sion is ENCODED_IOV_COMPRESSION_NONE (zero)  or  a  filesystem-
       specific  ENCODED_IOV_COMPRESSION_*  constant;  see  Filesystem
       support below.  encryption is currently always  ENCODED_IOV_EN‐
       CRYPTION_NONE (zero).

       unencoded_len  is  the length of the unencoded (i.e., decrypted
       and decompressed) data.  unencoded_offset is  the  offset  from
       the first byte of the unencoded data to the first byte of logi‐
       cal data in the file (less than  or  equal  to  unencoded_len).
       len  is  the length of the data in the file (less than or equal
       to unencoded_len - unencoded_offset).  See Extent layout  below
       for some examples.

       If  the  unencoded  data is actually longer than unencoded_len,
       then it is truncated; if it is shorter,  then  it  is  extended
       with zeroes.

       pwritev2(2)  uses  the metadata specified in iov[0], writes the
       encoded data from the remaining buffers, and returns the number
       of  encoded  bytes  written (that is, the sum of iov[n].iov_len
       for 1 <= n < iovcnt; partial writes will not occur).  At  least
       one  encoding  field  must  be non-zero.  Note that the encoded
       data is not validated when it is written; if it  is  not  valid
       (e.g.,  it  cannot be decompressed), then a subsequent read may
       return an error.  If the offset argument to pwritev2(2) is  -1,
       then  the file offset is incremented by len.  If iov[0].iov_len
       is less than sizeof(struct encoded_iov) in the kernel, then any
       fields  unknown to user space are treated as if they were zero;
       if it is greater and any fields unknown to the kernel are  non-
       zero, then this returns -1 and sets errno to E2BIG.

       preadv2(2)  populates  the metadata in iov[0], the encoded data
       in the remaining buffers, and returns  the  number  of  encoded
       bytes  read.   This will only return one extent per call.  This
       can also read data which is not encoded;  all  encoding  fields
       will  be  zero  in  that  case.   If  the  offset  argument  to
       preadv2(2) is -1, then the file offset is incremented  by  len.
       If  iov[0].iov_len  is  less than sizeof(struct encoded_iov) in
       the kernel and any fields unknown to user space  are  non-zero,
       then  preadv2(2)  returns  -1 and sets errno to E2BIG; if it is
       greater, then any fields unknown to the kernel are returned  as
       zero.   If  the provided buffers are not large enough to return
       an entire encoded extent, then preadv2(2) returns -1  and  sets
       errno to ENOBUFS.

       As  the  filesystem page cache typically contains decoded data,
       encoded I/O bypasses the page cache.

   Extent layout
       By using len, unencoded_len, and unencoded_offset, it is possi‐
       ble to refer to a subset of an unencoded extent.

       In  the  simplest case, len is equal to unencoded_len and unen‐
       coded_offset is zero.  This means that the entire unencoded ex‐
       tent is used.

       However,  suppose we read 50 bytes into a file which contains a
       single compressed extent.  The filesystem must still return the
       entire compressed extent for us to be able to decompress it, so
       unencoded_len would be the length of  the  entire  decompressed
       extent.   However, because the read was at offset 50, the first
       50 bytes should be ignored.  Therefore, unencoded_offset  would
       be 50, and len would accordingly be unencoded_len - 50.

       Additionally,  suppose we want to create an encrypted file with
       length 500, but the file is encrypted with a block cipher using
       a  block  size of 4096.  The unencoded data would therefore in‐
       clude the appropriate padding, and unencoded_len would be 4096.
       However,  to  represent the logical size of the file, len would
       be 500 (and unencoded_offset would be 0).

       Similar situations can arise in other cases:

       *  If the filesystem pads data to the filesystem block size be‐
          fore  compressing,  then  compressed  files  with a size un‐
          aligned to the filesystem block size will end with an extent
          with len < unencoded_len.

       *  Extents  cloned  from  the middle of a larger encoded extent
          with  FICLONERANGE  may  have  a  non-zero  unencoded_offset
          and/or len < unencoded_len.

       *  If  the  middle  of  an  encoded  extent is overwritten, the
          filesystem may create extents with a non-zero unencoded_off‐
          set  and/or  len < unencoded_len for the parts that were not
          overwritten.

   Security
       Encoded I/O creates the potential for some security issues:

       *  Encoded writes allow writing arbitrary data which the kernel
          will  decode on a subsequent read.  Decompression algorithms
          are complex and may have bugs which can be exploited by  ma‐
          liciously crafted data.

       *  Encoded reads may return data which is not logically present
          in the file (see the  discussion  of  len  vs  unencoded_len
          above).   It  may  not be intended for this data to be read‐
          able.

       Therefore, encoded I/O requires privilege.  Namely, the RWF_EN‐
       CODED  flag  may  only  be used if the file description has the
       O_ALLOW_ENCODED file status flag set, and  the  O_ALLOW_ENCODED
       flag  may  only be set by a thread with the CAP_SYS_ADMIN capa‐
       bility.  The O_ALLOW_ENCODED flag can be set by open(2) or  fc‐
       ntl(2).   It  can also be cleared by fcntl(2); clearing it does
       not require CAP_SYS_ADMIN.  Note that  it  is  not  cleared  on
       fork(2) or execve(2).  One may wish to use O_CLOEXEC with O_AL‐
       LOW_ENCODED.

   Filesystem support
       Encoded I/O is supported on the following filesystems:

       Btrfs (since Linux 5.13)

              Btrfs supports encoded reads and  writes  of  compressed
              data.  The data is encoded as follows:

              *  If compression is ENCODED_IOV_COMPRESSION_BTRFS_ZLIB,
                 then the encoded data is a single zlib stream.

              *  If compression is ENCODED_IOV_COMPRESSION_BTRFS_ZSTD,
                 then  the  encoded  data  is a single zstd frame com‐
                 pressed with the windowLog compression parameter  set
                 to no more than 17.

              *  If   compression   is   one  of  ENCODED_IOV_COMPRES‐
                 SION_BTRFS_LZO_4K,               ENCODED_IOV_COMPRES‐
                 SION_BTRFS_LZO_8K,               ENCODED_IOV_COMPRES‐
                 SION_BTRFS_LZO_16K,              ENCODED_IOV_COMPRES‐
                 SION_BTRFS_LZO_32K,      or      ENCODED_IOV_COMPRES‐
                 SION_BTRFS_LZO_64K, then the  encoded  data  is  com‐
                 pressed  page  by page (using the page size indicated
                 by the name of the constant) with LZO1X  and  wrapped
                 in  the  format documented in the Linux kernel source
                 file fs/btrfs/lzo.c.

              Additionally,   there   are   some    restrictions    on
              pwritev2(2):

              *  offset  (or  the current file offset if offset is -1)
                 must be aligned to the sector size of the filesystem.

              *  len must  be  aligned  to  the  sector  size  of  the
                 filesystem unless the data ends at or beyond the cur‐
                 rent end of the file.

              *  unencoded_len and the length of the encoded data must
                 each  be  no  more  than 128 KiB.  This limit may in‐
                 crease in the future.

              *  The length of the encoded data must be less  than  or
                 equal to unencoded_len.

              *  If  using  LZO, the filesystem's page size must match
                 the compression page size.

SEE ALSO
       open(2), preadv2(2)

Linux                         2020-11-11                 ENCODED_IO(7)

Christian Brauner March 19, 2021, 8:43 p.m. UTC | #5

On Fri, Mar 19, 2021 at 01:27:25PM -0700, Omar Sandoval wrote:
> On Fri, Mar 19, 2021 at 01:08:05PM -0700, Linus Torvalds wrote:
> > On Fri, Mar 19, 2021 at 11:21 AM Josef Bacik <josef@toxicpanda.com> wrote:
> > >
> > > Can we get some movement on this?  Omar is sort of spinning his wheels here
> > > trying to get this stuff merged, no major changes have been done in a few
> > > postings.
> > 
> > I'm not Al, and I absolutely detest the IOCB_ENCODED thing, and want
> > more explanations of why this should be done that way, and pollute our
> > iov_iter handling EVEN MORE.
> > 
> > Our iov_iter stuff isn't the most legible, and I don't understand why
> > anybody would ever think it's a good idea to spread what is clearly a
> > "struct" inside multiple different iov extents.
> > 
> > Honestly, this sounds way more like an ioctl interface than
> > read/write. We've done that before.
> 
> As Josef just mentioned, I started with an ioctl, and Dave Chinner
> suggested doing it this way:
> https://lore.kernel.org/linux-fsdevel/20190905021012.GL7777@dread.disaster.area/
> 
> The nice thing about it is that it sidesteps all of the issues we had
> with the dedupe/reflink ioctls over the years where permissions checks
> and the like were slightly different from read/write.
> 
> > But if it has to be done with an iov_iter, is there *any* reason to
> > not make it be a hard rule that iov[0] should not be the entirely of
> > the struct, and the code shouldn't ever need to iterate?
> 
> For RWF_ENCODED, iov[0] is always used as the entirety of the struct. I
> made the helper more generic to support other use cases, but if that's
> the main objection I can easily make it specifically use iov[0].
> 
> > Also I see references to the man-page, but honestly, that's not how
> > the kernel UAPI should be defined ("just read the man-page"), plus I
> > refuse to live in the 70's, and consider troff to be an atrocious
> > format.
> 
> No disagreement here, troff is horrible to read.

Not a fan of troff either: One thing that might help you a little bit is
to use pandoc to convert to troff. You can write your manpage in
markdown or rst and then convert it into troff via pandoc. Still might
require some massaging but it makes it considerably more pleasant to
write a manpage.

Christian

Linus Torvalds March 19, 2021, 8:55 p.m. UTC | #6

On Fri, Mar 19, 2021 at 1:27 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> For RWF_ENCODED, iov[0] is always used as the entirety of the struct. I
> made the helper more generic to support other use cases, but if that's
> the main objection I can easily make it specifically use iov[0].

Honestly, with new interfaces, I'd prefer to always start off as
limited as possible.

And read/write is not very limited (but O_ALLOW_ENCODED and
RWF_ENCODED at least helps with the "fool suid program to do it"). But
at least we could make sure that the structure then has to be as
strict as humanly possible.

So it's not so much a "main objection" as more about trying to make
the rules stricter in the hope that that at least makes only one very
particular way of doing things valid. I'd hate for user space to start
'streaming" struct data.

> > Also I see references to the man-page, but honestly, that's not how
> > the kernel UAPI should be defined ("just read the man-page"), plus I
> > refuse to live in the 70's, and consider troff to be an atrocious
> > format.
>
> No disagreement here, troff is horrible to read.
>
> > So make the UAPI explanation for this horror be in a legible format
> > that is actually part of the kernel so that I can read what the intent
> > is, instead of having to decode hieroglypics.
>
> I didn't want to document the UAPI in two places that would need to be
> kept in sync

Honestly, I would suggest that nobody ever write troff format stuff.
I don't think it supports UTF-8 properly, for example, which means
that you can't even give credit to people properly etc.

RST isn't perfect, but at least it's human-legible, and it's obviously
what we're encouraging for kernel use. You can use rst2man to convert
to bad formats (and yes, you might lose something in the translation,
eg proper names etc).

Almost anything else(*) is better than troff. But at least I can read
the formatted version.

          Linus

(*) With the possible exception of "info" files. Now *there* is a
truly pointless format maximally designed to make it inconvenient for
users.

Omar Sandoval March 19, 2021, 9:11 p.m. UTC | #7

On Fri, Mar 19, 2021 at 01:55:18PM -0700, Linus Torvalds wrote:
> On Fri, Mar 19, 2021 at 1:27 PM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > For RWF_ENCODED, iov[0] is always used as the entirety of the struct. I
> > made the helper more generic to support other use cases, but if that's
> > the main objection I can easily make it specifically use iov[0].
> 
> Honestly, with new interfaces, I'd prefer to always start off as
> limited as possible.
> 
> And read/write is not very limited (but O_ALLOW_ENCODED and
> RWF_ENCODED at least helps with the "fool suid program to do it"). But
> at least we could make sure that the structure then has to be as
> strict as humanly possible.
> 
> So it's not so much a "main objection" as more about trying to make
> the rules stricter in the hope that that at least makes only one very
> particular way of doing things valid. I'd hate for user space to start
> 'streaming" struct data.

After spending a few minutes trying to simplify copy_struct_from_iter(),
it's honestly easier to just use the iterate_all_kinds() craziness than
open coding it to only operate on iov[0]. But that's an implementation
detail, and we can trivially make the interface stricter:

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 395ca89e5d9b..41b6b0325d18 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -969,9 +969,9 @@ EXPORT_SYMBOL(copy_page_from_iter);
  * On success, the iterator is advanced @usize bytes. On error, the iterator is
  * not advanced.
  */
-int copy_struct_from_iter(void *dst, size_t ksize, struct iov_iter *i,
-			  size_t usize)
+int copy_struct_from_iter(void *dst, size_t ksize, struct iov_iter *i)
 {
+	size_t usize = iov_iter_single_seg_count(i);
 	if (usize <= ksize) {
 		if (!copy_from_iter_full(dst, usize, i))
 			return -EFAULT;

I.e., the size of the userspace structure is always the remaining size
of the current segment. Maybe we can even throw in a check that we're
either at the beginning of the current segment or the very beginning of
the whole iter, what do you think?

(Again, this is what RWF_ENCODED already does, it was just easier to
write copy_struct_from_iter() more generically).

> > > Also I see references to the man-page, but honestly, that's not how
> > > the kernel UAPI should be defined ("just read the man-page"), plus I
> > > refuse to live in the 70's, and consider troff to be an atrocious
> > > format.
> >
> > No disagreement here, troff is horrible to read.
> >
> > > So make the UAPI explanation for this horror be in a legible format
> > > that is actually part of the kernel so that I can read what the intent
> > > is, instead of having to decode hieroglypics.
> >
> > I didn't want to document the UAPI in two places that would need to be
> > kept in sync
> 
> Honestly, I would suggest that nobody ever write troff format stuff.
> I don't think it supports UTF-8 properly, for example, which means
> that you can't even give credit to people properly etc.
> 
> RST isn't perfect, but at least it's human-legible, and it's obviously
> what we're encouraging for kernel use. You can use rst2man to convert
> to bad formats (and yes, you might lose something in the translation,
> eg proper names etc).

It looks like there's precedent for man pages being generated from the
kernel documentation [1], so I'll try that.

1: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=53666f6c30451cde022f65d35a8d448f5a7132ba

> Almost anything else(*) is better than troff. But at least I can read
> the formatted version.
> 
>           Linus
> 
> (*) With the possible exception of "info" files. Now *there* is a
> truly pointless format maximally designed to make it inconvenient for
> users.

Linus Torvalds March 19, 2021, 9:47 p.m. UTC | #8

On Fri, Mar 19, 2021 at 2:12 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> After spending a few minutes trying to simplify copy_struct_from_iter(),
> it's honestly easier to just use the iterate_all_kinds() craziness than
> open coding it to only operate on iov[0]. But that's an implementation
> detail, and we can trivially make the interface stricter:

This is an improvement, but talking about the iterate_all_kinds()
craziness, I think your existing code is broken.

That third case (kernel pointer source):

+    copy = min(ksize - copied, v.iov_len);
+    memcpy(dst + copied, v.iov_base, copy);
+    if (memchr_inv(v.iov_base, 0, v.iov_len))
+        return -E2BIG;

can't be right. Aren't you checking that it's *all* zero, even the
part you copied?

Our iov_iter stuff is really complicated already, this is part of why
I'm not a huge fan of using it.

I still suspect you'd be better off not using the iterate_all_kinds()
thing at all, and just explicitly checking ITER_BVEC/ITER_KVEC
manually.

Because you can play games like fooling your "copy_struct_from_iter()"
to not copy anything at all with ITER_DISCARD, can't you?

Which then sounds like it might end up being useful as a kernel data
leak, because it will use some random uninitialized kernel memory for
the structure.

Now, I don't think you can actually get that ITER_DISCARD case, so
this is not *really* a problem, but it's another example of how that
iterate_all_kinds() thing has these subtle cases embedded into it.

The whole point of copy_struct_from_iter() is presumably to be the
same kind of "obviously safe" interface as copy_struct_from_user() is
meant to be, so these subtle cases just then make me go "Hmm".

I think just open-coding this when  you know there is no actual
looping going on, and the data has to be at the *beginning*, should be
fairly simple. What makes iterate_all_kinds() complicated is that
iteration, the fact that there can be empty entries in there, but it's
also that "iov_offset" thing etc.

For the case where you just (a) require that iov_offset is zero, and
(b) everything has to fit into the very first iov entry (regardless of
what type that iov entry is), I think you actually end up with a much
simpler model.

I do realize that I am perhaps concentrating a bit too much on this
one part of the patch series, but the iov_iter thing has bitten us
before. And it has bitten really core developers and then Al has had
to fix up mistakes.

In fact, it wasn't that long ago that I asked Al to verify code I
wrote, because I was worried about having missed something subtle. So
now when I see these iov_iter users, it just makes me go all nervous.

          Linus

Omar Sandoval March 19, 2021, 10:46 p.m. UTC | #9

On Fri, Mar 19, 2021 at 02:47:03PM -0700, Linus Torvalds wrote:
> On Fri, Mar 19, 2021 at 2:12 PM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > After spending a few minutes trying to simplify copy_struct_from_iter(),
> > it's honestly easier to just use the iterate_all_kinds() craziness than
> > open coding it to only operate on iov[0]. But that's an implementation
> > detail, and we can trivially make the interface stricter:
> 
> This is an improvement, but talking about the iterate_all_kinds()
> craziness, I think your existing code is broken.
> 
> That third case (kernel pointer source):
> 
> +    copy = min(ksize - copied, v.iov_len);
> +    memcpy(dst + copied, v.iov_base, copy);
> +    if (memchr_inv(v.iov_base, 0, v.iov_len))
> +        return -E2BIG;
> 
> can't be right. Aren't you checking that it's *all* zero, even the
> part you copied?

Oops, that should of course be

	if (memchr_inv(v.iov_base + copy, 0, v.iov_len - copy))
		return -E2BIG;

like the other cases. Point taken, though.

> Our iov_iter stuff is really complicated already, this is part of why
> I'm not a huge fan of using it.
> 
> I still suspect you'd be better off not using the iterate_all_kinds()
> thing at all, and just explicitly checking ITER_BVEC/ITER_KVEC
> manually.
> 
> Because you can play games like fooling your "copy_struct_from_iter()"
> to not copy anything at all with ITER_DISCARD, can't you?
> 
> Which then sounds like it might end up being useful as a kernel data
> leak, because it will use some random uninitialized kernel memory for
> the structure.
> 
> Now, I don't think you can actually get that ITER_DISCARD case, so
> this is not *really* a problem, but it's another example of how that
> iterate_all_kinds() thing has these subtle cases embedded into it.

Right, that would probably be better off returning EFAULT or something
for ITER_DISCARD.

> The whole point of copy_struct_from_iter() is presumably to be the
> same kind of "obviously safe" interface as copy_struct_from_user() is
> meant to be, so these subtle cases just then make me go "Hmm".
> 
> I think just open-coding this when  you know there is no actual
> looping going on, and the data has to be at the *beginning*, should be
> fairly simple. What makes iterate_all_kinds() complicated is that
> iteration, the fact that there can be empty entries in there, but it's
> also that "iov_offset" thing etc.
> 
> For the case where you just (a) require that iov_offset is zero, and
> (b) everything has to fit into the very first iov entry (regardless of
> what type that iov entry is), I think you actually end up with a much
> simpler model.
> 
> I do realize that I am perhaps concentrating a bit too much on this
> one part of the patch series, but the iov_iter thing has bitten us
> before. And it has bitten really core developers and then Al has had
> to fix up mistakes.
> 
> In fact, it wasn't that long ago that I asked Al to verify code I
> wrote, because I was worried about having missed something subtle. So
> now when I see these iov_iter users, it just makes me go all nervous.

So here's what it looks like with these restrictions (chances are
there's a bug or two in here):

int copy_struct_from_iter(void *dst, size_t ksize, struct iov_iter *i)
{
	size_t usize;
	int ret;

	if (i->iov_offset != 0)
		return -EINVAL;
	if (iter_is_iovec(i)) {
		usize = i->iov->iov_len;
		might_fault();
		if (copyin(dst, i->iov->iov_base, min(ksize, usize)))
			return -EFAULT;
		if (usize > ksize) {
			ret = check_zeroed_user(i->iov->iov_base + ksize,
						usize - ksize);
			if (ret < 0)
				return ret;
			else if (ret == 0)
				return -E2BIG;
		}
	} else if (iov_iter_is_kvec(i)) {
		usize = i->kvec->iov_len;
		memcpy(dst, i->kvec->iov_base, min(ksize, usize));
		if (usize > ksize &&
		    memchr_inv(i->kvec->iov_base + ksize, 0, usize - ksize))
			return -E2BIG;
	} else if (iov_iter_is_bvec(i)) {
		char *p;

		usize = i->bvec->bv_len;
		p = kmap_atomic(i->bvec->bv_page);
		memcpy(dst, p + i->bvec->bv_offset, min(ksize, usize));
		if (usize > ksize &&
		    memchr_inv(p + i->bvec->bv_offset + ksize, 0,
			       usize - ksize)) {
			kunmap_atomic(p);
			return -E2BIG;
		}
		kunmap_atomic(p);
	} else {
		return -EFAULT;
	}
	if (usize < ksize)
		memset(dst + usize, 0, ksize - usize);
	iov_iter_advance(i, usize);
	return 0;
}

Not much shorter, but it is easier to follow.

Linus Torvalds March 20, 2021, 12:31 a.m. UTC | #10

On Fri, Mar 19, 2021 at 3:46 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> Not much shorter, but it is easier to follow.

Yeah, that looks about right to me.

You should probably use kmap_local_page() rather than kmap_atomic()
these days, but other than that this looks fairly straightforward, and
I much prefer the model where we very much force that "must be the
first iovec entry".

As you say, maybe not shorter, but a lot more straightforward.

That said, looking through the patch series, I see at least one other
issue. Look at parisc:

    +#define O_ALLOW_ENCODED 100000000

yeah, that's completely wrong. I see how it happened, but that's _really_ wrong.

I would want others to take a look in case there's something else. I'm
not qualified to comment about (nor do I deeply care) about the btrfs
parts, but the generic interface parts should most definitely get more
attention.

By Al, if possible, but other fs people too..

           Linus

Omar Sandoval March 20, 2021, 8:39 p.m. UTC | #11

On Fri, Mar 19, 2021 at 05:31:18PM -0700, Linus Torvalds wrote:
> On Fri, Mar 19, 2021 at 3:46 PM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > Not much shorter, but it is easier to follow.
> 
> Yeah, that looks about right to me.
> 
> You should probably use kmap_local_page() rather than kmap_atomic()
> these days, but other than that this looks fairly straightforward, and
> I much prefer the model where we very much force that "must be the
> first iovec entry".

To be exact, this code only enforces that the iov_iter is at the
beginning of the current entry. As far as I can tell, iov_iter doesn't
track its position overall, so there's no way to tell whether the
current entry is the first one.

> As you say, maybe not shorter, but a lot more straightforward.
> 
> That said, looking through the patch series, I see at least one other
> issue. Look at parisc:
> 
>     +#define O_ALLOW_ENCODED 100000000
> 
> yeah, that's completely wrong. I see how it happened, but that's _really_ wrong.

Ugh, that's embarrassing. It also happens to add exactly one new bit to
VALID_OPEN_FLAGS, so the BUILD_BUG_ON() in fcntl_init() didn't catch it.

> I would want others to take a look in case there's something else. I'm
> not qualified to comment about (nor do I deeply care) about the btrfs
> parts, but the generic interface parts should most definitely get more
> attention.
> 
> By Al, if possible, but other fs people too..

That's all I ask. I'll fix your comments and wait a few days to get some
more feedback on the fs side before I resend the patch bomb.

Thanks,
Omar

[v8,00/10] fs: interface for directly reading/writing compressed data

Message

Comments