[man-pages,v4] Document encoded I/O

Message ID 00f86ed7c25418599e6067cb1dfb186c90ce7bf3.1582931488.git.osandov@fb.com (mailing list archive)
State New, archived

Commit Message

Omar Sandoval Feb. 28, 2020, 11:13 p.m. UTC
From: Omar Sandoval <osandov@fb.com>

This adds a new page, encoded_io(7), providing an overview of encoded
I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
reference it.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 man2/fcntl.2      |  10 +-
 man2/open.2       |  13 ++
 man2/readv.2      |  64 ++++++++++
 man7/encoded_io.7 | 298 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 384 insertions(+), 1 deletion(-)
 create mode 100644 man7/encoded_io.7

Comments

Amir Goldstein Feb. 29, 2020, 10:28 a.m. UTC | #1
> +encoded_io \- overview of encoded I/O
> +.SH DESCRIPTION
> +Several filesystems (e.g., Btrfs) support transparent encoding
> +(e.g., compression, encryption) of data on disk:
> +written data is encoded by the kernel before it is written to disk,
> +and read data is decoded before being returned to the user.
> +In some cases, it is useful to skip this encoding step.
> +For example, the user may want to read the compressed contents of a file
> +or write pre-compressed data directly to a file.
> +This is referred to as "encoded I/O".
> +.SS Encoded I/O API
> +Encoded I/O is specified with the
> +.B RWF_ENCODED
> +flag to
> +.BR preadv2 (2)
> +and
> +.BR pwritev2 (2).
> +If
> +.B RWF_ENCODED
> +is specified, then
> +.I iov[0].iov_base
> +points to an
> +.I
> +encoded_iov
> +structure, defined in
> +.I <linux/fs.h>
> +as:
> +.PP
> +.in +4n
> +.EX
> +struct encoded_iov {
> +    __aligned_u64 len;
> +    __aligned_u64 unencoded_len;
> +    __aligned_u64 unencoded_offset;
> +    __u32 compression;
> +    __u32 encryption;
> +};

This new API can generate many diverse error conditions that the standard errno
codes are not rich enough to describe.
Adding room for encoded-I/O-specific error codes in the metadata structure
would be good practice, for example:
- compression method not supported
- encryption method not supported
- the combination of enc/comp is not supported
- and so on


> +.EE
> +.in
> +.PP
> +This may be extended in the future, so
> +.I iov[0].iov_len
> +must be set to
> +.I "sizeof(struct\ encoded_iov)"
> +for forward/backward compatibility.
> +The remaining buffers contain the encoded data.
> +.PP
> +.I compression
> +and
> +.I encryption
> +are the encoding fields.
> +.I compression
> +is one of
> +.B ENCODED_IOV_COMPRESSION_NONE
> +(zero),
> +.BR ENCODED_IOV_COMPRESSION_ZLIB ,
> +.BR ENCODED_IOV_COMPRESSION_LZO ,
> +or
> +.BR ENCODED_IOV_COMPRESSION_ZSTD .
> +.I encryption
> +is currently always
> +.B ENCODED_IOV_ENCRYPTION_NONE
> +(zero).
> +.PP
> +.I unencoded_len
> +is the length of the unencoded (i.e., decrypted and decompressed) data.
> +.I unencoded_offset
> +is the offset into the unencoded data where the data in the file begins
> +(less than or equal to
> +.IR unencoded_len ).
> +.I len
> +is the length of the data in the file
> +(less than or equal to
> +.I unencoded_len
> +-
> +.IR unencoded_offset ).
> +.I
> +.PP
> +In most cases,
> +.I len
> +is equal to
> +.I unencoded_len
> +and
> +.I unencoded_offset
> +is zero.
> +However, it may be necessary to refer to a subset of the unencoded data,
> +usually because a read occurred in the middle of an encoded extent,
> +because part of an extent was overwritten or deallocated in some
> +way (e.g., with
> +.BR write (2),
> +.BR truncate (2),
> +or
> +.BR fallocate (2))
> +or because part of an extent was added to the file (e.g., with
> +.BR ioctl_ficlonerange (2)
> +or
> +.BR ioctl_fideduperange (2)).
> +For example, if
> +.I len
> +is 300,
> +.I unencoded_len
> +is 1000,
> +and
> +.I unencoded_offset
> +is 600,
> +then the encoded data is 1000 bytes long when decoded,
> +of which only the 300 bytes starting at offset 600 are used;
> +the first 600 and last 100 bytes should be ignored.
> +.PP
> +If the unencoded data is actually longer than
> +.IR unencoded_len ,
> +then it is truncated;
> +if it is shorter, then it is extended with zeroes.

I find the unencoded_len/unencoded_offset API extremely confusing, and all
the clarifications above did not help to ease this feeling.
Please remind me why the API needs to expose unencoded details at all.
I understand the backup/restore use case for reading/writing encoded data.
I do not understand how the unencoded offset info is relevant to this use case
or which other use cases it is relevant for.

> +.PP
> +For
> +.BR pwritev2 (),
> +the metadata should be specified in
> +.IR iov[0] .
> +If
> +.I iov[0].iov_len
> +is less than
> +.I "sizeof(struct\ encoded_iov)"
> +in the kernel,
> +then any fields unknown to userspace are treated as if they were zero;
> +if it is greater and any fields unknown to the kernel are non-zero,
> +then this returns -1 and sets
> +.I errno
> +to
> +.BR E2BIG .
> +The encoded data should be passed in the remaining buffers.
> +This returns the number of encoded bytes written (that is, the sum of
> +.I iov[n].iov_len
> +for 1 <=
> +.I n
> +<
> +.IR iovcnt ;
> +partial writes will not occur).
> +If the
> +.I offset
> +argument to
> +.BR pwritev2 ()
> +is -1, then the file offset is incremented by
> +.IR len .
> +At least one encoding field must be non-zero.
> +Note that the encoded data is not validated when it is written;
> +if it is not valid (e.g., it cannot be decompressed),
> +then a subsequent read may return an error.
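
The size-compatibility rule quoted above for pwritev2() can be sketched in
userspace C. This is only an illustration of the stated rule, not the kernel
code from the series: copy_encoded_iov() is a hypothetical helper, and the
struct is a local mirror of the one in the patch with fixed-width types
standing in for __aligned_u64 and __u32.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the struct from the patch (not taken from a header). */
struct encoded_iov {
    uint64_t len;
    uint64_t unencoded_len;
    uint64_t unencoded_offset;
    uint32_t compression;
    uint32_t encryption;
};

/*
 * Hypothetical helper: copy a caller-supplied struct of usize bytes into a
 * "kernel-side" struct of ksize bytes, following the rule in the text.
 * Returns 0 on success or -E2BIG.
 */
static int copy_encoded_iov(void *dst, size_t ksize,
                            const void *src, size_t usize)
{
    const unsigned char *p = src;
    size_t n = usize < ksize ? usize : ksize;

    assert(dst != NULL && src != NULL);

    /* Larger caller struct: bytes the kernel does not know must be zero. */
    for (size_t i = ksize; i < usize; i++)
        if (p[i] != 0)
            return -E2BIG;

    /* Smaller caller struct: fields it does not know are treated as zero. */
    memset(dst, 0, ksize);
    memcpy(dst, src, n);
    return 0;
}
```

With this rule, an old binary passing a shorter struct keeps working against a
newer kernel, and a new binary only fails on an older kernel if it actually
sets a field that kernel does not understand.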
> +.PP
> +For
> +.BR preadv2 (),
> +the metadata is returned in
> +.IR iov[0] .
> +If
> +.I iov[0].iov_len
> +is less than
> +.I "sizeof(struct\ encoded_iov)"
> +in the kernel and any fields unknown to userspace are non-zero,
> +then this returns -1 and sets
> +.I errno
> +to
> +.BR E2BIG ;
> +if it is greater,
> +then any fields unknown to the kernel are returned as zero.
> +The encoded data is returned in the remaining buffers.
> +If the provided buffers are not large enough to return an entire encoded
> +extent,
> +then this returns -1 and sets
> +.I errno
> +to
> +.BR ENOBUFS .
> +This returns the number of encoded bytes read.
> +If the
> +.I offset
> +argument to
> +.BR preadv2 ()
> +is -1, then the file offset is incremented by
> +.IR len .
> +This will only return one encoded extent per call.
> +This can also read data which is not encoded;
> +all encoding fields will be zero in that case.
> +.PP
> +As the filesystem page cache typically contains decoded data,
> +encoded I/O bypasses the page cache.
> +.SS Security
> +Encoded I/O creates the potential for some security issues:
> +.IP * 3
> +Encoded writes allow writing arbitrary data which the kernel will decode on
> +a subsequent read. Decompression algorithms are complex and may have bugs
> +which can be exploited by maliciously crafted data.
> +.IP *
> +Encoded reads may return data which is not logically present in the file
> +(see the discussion of
> +.I len
> +vs.
> +.I unencoded_len
> +above).
> +It may not be intended for this data to be readable.
> +.PP
> +Therefore, encoded I/O requires privilege.
> +Namely, the
> +.B RWF_ENCODED
> +flag may only be used when the file was opened with the
> +.B O_ALLOW_ENCODED
> +flag to
> +.BR open (2),
> +which requires the
> +.B CAP_SYS_ADMIN
> +capability.
> +.B O_ALLOW_ENCODED
> +may be set and cleared with
> +.BR fcntl (2).
> +Note that it is not cleared on
> +.BR fork (2)
> +or
> +.BR execve (2);
> +one may wish to use
> +.B O_CLOEXEC
> +with
> +.BR O_ALLOW_ENCODED .

Sigh! If I were an attacker I would be drooling right now.
We want to create a new API to read/write raw encrypted data (even though
you have not implemented any encryption yet) and we use the same old
vulnerable practices that security people have been fighting for decades?
I am not very comfortable with this attitude.
I think we should be much more prudent for the first version of the API.

How about not allowing O_ALLOW_ENCODED to be set without O_CLOEXEC?
We may or may not allow clearing O_CLOEXEC while O_ALLOW_ENCODED
is set, in case this is the user's intention, but leaving the API as it is is just
asking for trouble IMO.

Thanks,
Amir.

Omar Sandoval Feb. 29, 2020, 6:03 p.m. UTC | #2
On Sat, Feb 29, 2020 at 12:28:41PM +0200, Amir Goldstein wrote:
> > +encoded_io \- overview of encoded I/O
> > +.SH DESCRIPTION
> > +Several filesystems (e.g., Btrfs) support transparent encoding
> > +(e.g., compression, encryption) of data on disk:
> > +written data is encoded by the kernel before it is written to disk,
> > +and read data is decoded before being returned to the user.
> > +In some cases, it is useful to skip this encoding step.
> > +For example, the user may want to read the compressed contents of a file
> > +or write pre-compressed data directly to a file.
> > +This is referred to as "encoded I/O".
> > +.SS Encoded I/O API
> > +Encoded I/O is specified with the
> > +.B RWF_ENCODED
> > +flag to
> > +.BR preadv2 (2)
> > +and
> > +.BR pwritev2 (2).
> > +If
> > +.B RWF_ENCODED
> > +is specified, then
> > +.I iov[0].iov_base
> > +points to an
> > +.I
> > +encoded_iov
> > +structure, defined in
> > +.I <linux/fs.h>
> > +as:
> > +.PP
> > +.in +4n
> > +.EX
> > +struct encoded_iov {
> > +    __aligned_u64 len;
> > +    __aligned_u64 unencoded_len;
> > +    __aligned_u64 unencoded_offset;
> > +    __u32 compression;
> > +    __u32 encryption;
> > +};
> 
> This new API can generate many diverse error conditions that the standard errno
> codes are not rich enough to describe.
> Maybe add room for encoded io specific error codes in the metadata structure
> would be good practice, for example:
> - compression method not supported
> - encryption method not supported
> - the combination of enc/comp is not supported
> - and so on

I like this idea, but it feels like even more iovec abuse. Namely, for
pwritev2(), it feels a little off that we'd be copying _to_ user memory
rather than only copying from. It's probably worth it for better errors,
though.

> > +.EE
> > +.in
> > +.PP
> > +This may be extended in the future, so
> > +.I iov[0].iov_len
> > +must be set to
> > +.I "sizeof(struct\ encoded_iov)"
> > +for forward/backward compatibility.
> > +The remaining buffers contain the encoded data.
> > +.PP
> > +.I compression
> > +and
> > +.I encryption
> > +are the encoding fields.
> > +.I compression
> > +is one of
> > +.B ENCODED_IOV_COMPRESSION_NONE
> > +(zero),
> > +.BR ENCODED_IOV_COMPRESSION_ZLIB ,
> > +.BR ENCODED_IOV_COMPRESSION_LZO ,
> > +or
> > +.BR ENCODED_IOV_COMPRESSION_ZSTD .
> > +.I encryption
> > +is currently always
> > +.B ENCODED_IOV_ENCRYPTION_NONE
> > +(zero).
> > +.PP
> > +.I unencoded_len
> > +is the length of the unencoded (i.e., decrypted and decompressed) data.
> > +.I unencoded_offset
> > +is the offset into the unencoded data where the data in the file begins
> > +(less than or equal to
> > +.IR unencoded_len ).
> > +.I len
> > +is the length of the data in the file
> > +(less than or equal to
> > +.I unencoded_len
> > +-
> > +.IR unencoded_offset ).
> > +.I
> > +.PP
> > +In most cases,
> > +.I len
> > +is equal to
> > +.I unencoded_len
> > +and
> > +.I unencoded_offset
> > +is zero.
> > +However, it may be necessary to refer to a subset of the unencoded data,
> > +usually because a read occurred in the middle of an encoded extent,
> > +because part of an extent was overwritten or deallocated in some
> > +way (e.g., with
> > +.BR write (2),
> > +.BR truncate (2),
> > +or
> > +.BR fallocate (2))
> > +or because part of an extent was added to the file (e.g., with
> > +.BR ioctl_ficlonerange (2)
> > +or
> > +.BR ioctl_fideduperange (2)).
> > +For example, if
> > +.I len
> > +is 300,
> > +.I unencoded_len
> > +is 1000,
> > +and
> > +.I unencoded_offset
> > +is 600,
> > +then the encoded data is 1000 bytes long when decoded,
> > +of which only the 300 bytes starting at offset 600 are used;
> > +the first 600 and last 100 bytes should be ignored.
> > +.PP
> > +If the unencoded data is actually longer than
> > +.IR unencoded_len ,
> > +then it is truncated;
> > +if it is shorter, then it is extended with zeroes.
> 
> I find the unencoded_len/unencoded_offset API extremely confusing and all
> the clarifications above did not help to ease this feeling.
> Please remind me why does the API need to expose unencoded details at all.
> I understand the backup/restore use case for read/write encoded data.
> I do not understand how unencoded offset info is relevant to this use case
> or what are the other use cases it is relevant for.

I agree, it's confusing. However, without this concept on the read side,
there's no way to represent some file extent layouts, and without the
write side, those layouts can't be written back out. That would make
this interface much less useful for backups.

These cases arise in a few ways on Btrfs:

1. Files with a size unaligned to the block size.

   Ignoring inline data, Btrfs always pads data to the filesystem block
   size when compressing. So, a file with a size unaligned to the block
   size will end with an extent that decompresses to a multiple of the
   block size, but logically the file only contains the data up to
   i_size. In this case, len (length up to i_size) < unencoded_len (full
   decompressed length). This can arise simply from writing out an
   unaligned file or from truncating a file unaligned.

2. FICLONERANGE from the middle of an extent.

   Suppose file A has a large compressed extent with
   len = unencoded_len = 128k and unencoded_offset = 0. If the user does
   an FICLONERANGE out of the middle of that extent (say, 64k long and
   4k from the start of the extent), Btrfs creates a "partial" extent
   which references the original extent (in my example, the result would
   have len = 64k, unencoded_offset = 4k, and unencoded_len still 128k).

3. Overwriting the middle of an extent.

   In some cases, when the middle of an extent is overwritten (e.g., an
   O_DIRECT write, FICLONERANGE, or FIDEDUPERANGE), Btrfs splits up the
   overwritten extents into partial extents referencing the original
   extent instead of rewriting the whole extent.

These aren't specific to compression or Btrfs' on-disk format. fscrypt
uses block ciphers for file data, so case 1 is just as relevant for
that. The way Btrfs handles case 2 is the only sane way I can see for
supporting FICLONERANGE for encoded data.
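
To make the three fields concrete, cases 1 and 2 above can be written out as a
small C sketch. Everything here is illustrative: the struct is a local mirror
of the one in the patch, and the helper names (encoded_iov_range_ok(),
unaligned_tail(), clone_from_middle()) are made up for this example and are
not part of the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of the struct from the patch (not taken from a header). */
struct encoded_iov {
    uint64_t len;
    uint64_t unencoded_len;
    uint64_t unencoded_offset;
    uint32_t compression;
    uint32_t encryption;
};

/* Constraints from the man page text:
 * unencoded_offset <= unencoded_len and
 * len <= unencoded_len - unencoded_offset. */
static bool encoded_iov_range_ok(const struct encoded_iov *e)
{
    return e->unencoded_offset <= e->unencoded_len &&
           e->len <= e->unencoded_len - e->unencoded_offset;
}

/* True when only part of the decoded extent is logically in the file. */
static bool encoded_iov_is_partial(const struct encoded_iov *e)
{
    return e->unencoded_offset != 0 || e->len != e->unencoded_len;
}

/* Case 1: the last tail_bytes of a file live in an extent that decodes
 * to a multiple of block_size. */
static struct encoded_iov unaligned_tail(uint64_t tail_bytes,
                                         uint64_t block_size)
{
    uint64_t padded = (tail_bytes + block_size - 1) / block_size * block_size;

    return (struct encoded_iov){
        .len = tail_bytes,
        .unencoded_len = padded,
        .unencoded_offset = 0,
    };
}

/* Case 2: reference clone_len bytes starting off bytes into src's data,
 * without re-encoding; the decoded length stays that of the original. */
static struct encoded_iov clone_from_middle(const struct encoded_iov *src,
                                            uint64_t off, uint64_t clone_len)
{
    assert(off <= src->len && clone_len <= src->len - off);

    return (struct encoded_iov){
        .len = clone_len,
        .unencoded_len = src->unencoded_len,
        .unencoded_offset = src->unencoded_offset + off,
        .compression = src->compression,
        .encryption = src->encryption,
    };
}
```

Calling clone_from_middle() on a 128k extent with off = 4k and clone_len = 64k
gives len = 64k, unencoded_offset = 4k, and unencoded_len = 128k, matching the
numbers in case 2.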

> > +.PP
> > +For
> > +.BR pwritev2 (),
> > +the metadata should be specified in
> > +.IR iov[0] .
> > +If
> > +.I iov[0].iov_len
> > +is less than
> > +.I "sizeof(struct\ encoded_iov)"
> > +in the kernel,
> > +then any fields unknown to userspace are treated as if they were zero;
> > +if it is greater and any fields unknown to the kernel are non-zero,
> > +then this returns -1 and sets
> > +.I errno
> > +to
> > +.BR E2BIG .
> > +The encoded data should be passed in the remaining buffers.
> > +This returns the number of encoded bytes written (that is, the sum of
> > +.I iov[n].iov_len
> > +for 1 <=
> > +.I n
> > +<
> > +.IR iovcnt ;
> > +partial writes will not occur).
> > +If the
> > +.I offset
> > +argument to
> > +.BR pwritev2 ()
> > +is -1, then the file offset is incremented by
> > +.IR len .
> > +At least one encoding field must be non-zero.
> > +Note that the encoded data is not validated when it is written;
> > +if it is not valid (e.g., it cannot be decompressed),
> > +then a subsequent read may return an error.
> > +.PP
> > +For
> > +.BR preadv2 (),
> > +the metadata is returned in
> > +.IR iov[0] .
> > +If
> > +.I iov[0].iov_len
> > +is less than
> > +.I "sizeof(struct\ encoded_iov)"
> > +in the kernel and any fields unknown to userspace are non-zero,
> > +then this returns -1 and sets
> > +.I errno
> > +to
> > +.BR E2BIG ;
> > +if it is greater,
> > +then any fields unknown to the kernel are returned as zero.
> > +The encoded data is returned in the remaining buffers.
> > +If the provided buffers are not large enough to return an entire encoded
> > +extent,
> > +then this returns -1 and sets
> > +.I errno
> > +to
> > +.BR ENOBUFS .
> > +This returns the number of encoded bytes read.
> > +If the
> > +.I offset
> > +argument to
> > +.BR preadv2 ()
> > +is -1, then the file offset is incremented by
> > +.IR len .
> > +This will only return one encoded extent per call.
> > +This can also read data which is not encoded;
> > +all encoding fields will be zero in that case.
> > +.PP
> > +As the filesystem page cache typically contains decoded data,
> > +encoded I/O bypasses the page cache.
> > +.SS Security
> > +Encoded I/O creates the potential for some security issues:
> > +.IP * 3
> > +Encoded writes allow writing arbitrary data which the kernel will decode on
> > +a subsequent read. Decompression algorithms are complex and may have bugs
> > +which can be exploited by maliciously crafted data.
> > +.IP *
> > +Encoded reads may return data which is not logically present in the file
> > +(see the discussion of
> > +.I len
> > +vs.
> > +.I unencoded_len
> > +above).
> > +It may not be intended for this data to be readable.
> > +.PP
> > +Therefore, encoded I/O requires privilege.
> > +Namely, the
> > +.B RWF_ENCODED
> > +flag may only be used when the file was opened with the
> > +.B O_ALLOW_ENCODED
> > +flag to
> > +.BR open (2),
> > +which requires the
> > +.B CAP_SYS_ADMIN
> > +capability.
> > +.B O_ALLOW_ENCODED
> > +may be set and cleared with
> > +.BR fcntl (2).
> > +Note that it is not cleared on
> > +.BR fork (2)
> > +or
> > +.BR execve (2);
> > +one may wish to use
> > +.B O_CLOEXEC
> > +with
> > +.BR O_ALLOW_ENCODED .
> 
> Sigh! If I were an attacker I would be drooling right now.
> We want to create a new API to read/write raw encrypted data (even though
> you have not implemented any encryption yet) and we use the same old
> vulnerable practices that security people have been fighting for decades?
> I am not very comfortable with this attitude.
> I think we should be much more prudent for the first version of the API.
> 
> How about not allowing to set O_ALLOW_ENCODED without O_CLOEXEC.
> We may or may not allow to clear O_CLOEXEC while O_ALLOW_ENCODED
> is set, in case this is the user intention, but leaving the API as it is is just
> asking for trouble IMO.

Ok, I'm fine with requiring O_CLOEXEC for O_ALLOW_ENCODED on open. I'm
pretty sure we want to allow clearing it with fcntl, as that is a very
intentional action.

Amir Goldstein March 1, 2020, 7:26 a.m. UTC | #3
On Sat, Feb 29, 2020 at 8:03 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> On Sat, Feb 29, 2020 at 12:28:41PM +0200, Amir Goldstein wrote:
> > > +encoded_io \- overview of encoded I/O
> > > +.SH DESCRIPTION
> > > +Several filesystems (e.g., Btrfs) support transparent encoding
> > > +(e.g., compression, encryption) of data on disk:
> > > +written data is encoded by the kernel before it is written to disk,
> > > +and read data is decoded before being returned to the user.
> > > +In some cases, it is useful to skip this encoding step.
> > > +For example, the user may want to read the compressed contents of a file
> > > +or write pre-compressed data directly to a file.
> > > +This is referred to as "encoded I/O".
> > > +.SS Encoded I/O API
> > > +Encoded I/O is specified with the
> > > +.B RWF_ENCODED
> > > +flag to
> > > +.BR preadv2 (2)
> > > +and
> > > +.BR pwritev2 (2).
> > > +If
> > > +.B RWF_ENCODED
> > > +is specified, then
> > > +.I iov[0].iov_base
> > > +points to an
> > > +.I
> > > +encoded_iov
> > > +structure, defined in
> > > +.I <linux/fs.h>
> > > +as:
> > > +.PP
> > > +.in +4n
> > > +.EX
> > > +struct encoded_iov {
> > > +    __aligned_u64 len;
> > > +    __aligned_u64 unencoded_len;
> > > +    __aligned_u64 unencoded_offset;
> > > +    __u32 compression;
> > > +    __u32 encryption;
> > > +};
> >
> > This new API can generate many diverse error conditions that the standard errno
> > codes are not rich enough to describe.
> > Maybe add room for encoded io specific error codes in the metadata structure
> > would be good practice, for example:
> > - compression method not supported
> > - encryption method not supported
> > - the combination of enc/comp is not supported
> > - and so on
>
> I like this idea, but it feels like even more iovec abuse. Namely, for

That's true.

> pwritev2(), it feels a little off that we'd be copying _to_ user memory
> rather than only copying from. It's probably worth it for better errors,
> though.
>

Apropos iovec abuse: if encoded io is going to interpret iov[0] differently,
why not interpret the iovec argument differently? The result might be less
awkward if the structure passed to preadv2/pwritev2 is struct encoded_iov *
instead of struct iovec *.

> > > +.EE
> > > +.in
> > > +.PP
> > > +This may be extended in the future, so
> > > +.I iov[0].iov_len
> > > +must be set to
> > > +.I "sizeof(struct\ encoded_iov)"
> > > +for forward/backward compatibility.
> > > +The remaining buffers contain the encoded data.
> > > +.PP
> > > +.I compression
> > > +and
> > > +.I encryption
> > > +are the encoding fields.
> > > +.I compression
> > > +is one of
> > > +.B ENCODED_IOV_COMPRESSION_NONE
> > > +(zero),
> > > +.BR ENCODED_IOV_COMPRESSION_ZLIB ,
> > > +.BR ENCODED_IOV_COMPRESSION_LZO ,
> > > +or
> > > +.BR ENCODED_IOV_COMPRESSION_ZSTD .
> > > +.I encryption
> > > +is currently always
> > > +.B ENCODED_IOV_ENCRYPTION_NONE
> > > +(zero).
> > > +.PP
> > > +.I unencoded_len
> > > +is the length of the unencoded (i.e., decrypted and decompressed) data.
> > > +.I unencoded_offset
> > > +is the offset into the unencoded data where the data in the file begins
> > > +(less than or equal to
> > > +.IR unencoded_len ).
> > > +.I len
> > > +is the length of the data in the file
> > > +(less than or equal to
> > > +.I unencoded_len
> > > +-
> > > +.IR unencoded_offset ).
> > > +.I
> > > +.PP
> > > +In most cases,
> > > +.I len
> > > +is equal to
> > > +.I unencoded_len
> > > +and
> > > +.I unencoded_offset
> > > +is zero.
> > > +However, it may be necessary to refer to a subset of the unencoded data,
> > > +usually because a read occurred in the middle of an encoded extent,
> > > +because part of an extent was overwritten or deallocated in some
> > > +way (e.g., with
> > > +.BR write (2),
> > > +.BR truncate (2),
> > > +or
> > > +.BR fallocate (2))
> > > +or because part of an extent was added to the file (e.g., with
> > > +.BR ioctl_ficlonerange (2)
> > > +or
> > > +.BR ioctl_fideduperange (2)).
> > > +For example, if
> > > +.I len
> > > +is 300,
> > > +.I unencoded_len
> > > +is 1000,
> > > +and
> > > +.I unencoded_offset
> > > +is 600,
> > > +then the encoded data is 1000 bytes long when decoded,
> > > +of which only the 300 bytes starting at offset 600 are used;
> > > +the first 600 and last 100 bytes should be ignored.
> > > +.PP
> > > +If the unencoded data is actually longer than
> > > +.IR unencoded_len ,
> > > +then it is truncated;
> > > +if it is shorter, then it is extended with zeroes.
> >
> > I find the unencoded_len/unencoded_offset API extremely confusing and all
> > the clarifications above did not help to ease this feeling.
> > Please remind me why does the API need to expose unencoded details at all.
> > I understand the backup/restore use case for read/write encoded data.
> > I do not understand how unencoded offset info is relevant to this use case
> > or what are the other use cases it is relevant for.
>
> I agree, it's confusing. However, without this concept on the read side,
> there's no way to represent some file extent layouts, and without the
> write side, those layouts can't be written back out. That would make
> this interface much less useful for backups.
>
> These cases arise in a few ways on Btrfs:
>
> 1. Files with a size unaligned to the block size.
>
>    Ignoring inline data, Btrfs always pads data to the filesystem block
>    size when compressing. So, a file with a size unaligned to the block
>    size will end with an extent that decompresses to a multiple of the
>    block size, but logically the file only contains the data up to
>    i_size. In this case, len (length up to i_size) < unencoded_len (full
>    decompressed length). This can arise simply from writing out an
>    unaligned file or from truncating a file unaligned.
>
> 2. FICLONERANGE from the middle of an extent.
>
>    Suppose file A has a large compressed extent with
>    len = unencoded_len = 128k and unencoded_offset = 0. If the user does
>    an FICLONERANGE out of the middle of that extent (say, 64k long and
>    4k from the start of the extent), Btrfs creates a "partial" extent
>    which references the original extent (in my example, the result would
>    have len = 64k, unencoded_offset = 4k, and unencoded_len still 128k).
>
> 3. Overwriting the middle of an extent.
>
>    In some cases, when the middle of an extent is overwritten (e.g., an
>    O_DIRECT write, FICLONERANGE, or FIDEDUPERANGE), Btrfs splits up the
>    overwritten extents into partial extents referencing the original
>    extent instead of rewriting the whole extent.
>
> These aren't specific to compression or Btrfs' on-disk format. fscrypt
> uses block ciphers for file data, so case 1 is just as relevant for
> that. The way Btrfs handles case 2 is the only sane way I can see for
> supporting FICLONERANGE for encoded data.
>

I see... so now I understand the complication, but that doesn't mean
that the developers reading the encoded_io documentation will or that
they will get the implementation details right.

IMO, if the only use case for encoded io is backup/restore, then we
should make the API simpler and more oriented to this use case, namely,
serialization.
For all I care, btrfs can still return struct encoded_iov in iov[0],
but the user need not know about this, and this internal detail should
be neither documented nor exposed in UAPI.
btrfs send reads a stream of encoded data and metadata that describes it.
btrfs receive writes the encoded data stream and metadata descriptors that
tell the file system about overlapping extents and whatnot.

Is that something that can work out, or does userspace have to be aware
of encoded extents layout?

Thanks,
Amir.

Omar Sandoval March 11, 2020, 8:47 a.m. UTC | #4
On Sun, Mar 01, 2020 at 09:26:10AM +0200, Amir Goldstein wrote:
> On Sat, Feb 29, 2020 at 8:03 PM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > On Sat, Feb 29, 2020 at 12:28:41PM +0200, Amir Goldstein wrote:
> > > > +encoded_io \- overview of encoded I/O
> > > > +.SH DESCRIPTION
> > > > +Several filesystems (e.g., Btrfs) support transparent encoding
> > > > +(e.g., compression, encryption) of data on disk:
> > > > +written data is encoded by the kernel before it is written to disk,
> > > > +and read data is decoded before being returned to the user.
> > > > +In some cases, it is useful to skip this encoding step.
> > > > +For example, the user may want to read the compressed contents of a file
> > > > +or write pre-compressed data directly to a file.
> > > > +This is referred to as "encoded I/O".
> > > > +.SS Encoded I/O API
> > > > +Encoded I/O is specified with the
> > > > +.B RWF_ENCODED
> > > > +flag to
> > > > +.BR preadv2 (2)
> > > > +and
> > > > +.BR pwritev2 (2).
> > > > +If
> > > > +.B RWF_ENCODED
> > > > +is specified, then
> > > > +.I iov[0].iov_base
> > > > +points to an
> > > > +.I
> > > > +encoded_iov
> > > > +structure, defined in
> > > > +.I <linux/fs.h>
> > > > +as:
> > > > +.PP
> > > > +.in +4n
> > > > +.EX
> > > > +struct encoded_iov {
> > > > +    __aligned_u64 len;
> > > > +    __aligned_u64 unencoded_len;
> > > > +    __aligned_u64 unencoded_offset;
> > > > +    __u32 compression;
> > > > +    __u32 encryption;
> > > > +};
> > >
> > > This new API can generate many diverse error conditions that the standard errno
> > > codes are not rich enough to describe.
> > > Maybe add room for encoded io specific error codes in the metadata structure
> > > would be good practice, for example:
> > > - compression method not supported
> > > - encryption method not supported
> > > - the combination of enc/comp is not supported
> > > - and so on
> >
> > I like this idea, but it feels like even more iovec abuse. Namely, for
> 
> That's true.
> 
> > pwritev2(), it feels a little off that we'd be copying _to_ user memory
> > rather than only copying from. It's probably worth it for better errors,
> > though.
> >
> 
> Apropos iovec abuse, if encoded io is going to interpret iovec[0] differently
> why not interpret iovec arg differently. The result might be less awkward if
> the structure passed to preadv2/pwritev2 is struct encoded_iov * instead
> of struct iov *.

IMO, that's clunkier both from an API perspective and an implementation
perspective. On the implementation side, we would have to special-case a
bunch of places in the VFS that are expecting a struct iovec *. On the
API side, it's so far from p{read,write}v2 that it might as well be an
ioctl or a new system call. (In fact, v1 of this series was a
Btrfs-specific ioctl, but it's so much nicer to reuse the VFS read/write
infrastructure.)
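
For reference, the iovec layout under discussion looks roughly like this from
userspace. This is a sketch under assumptions: the struct is a local mirror of
the one in the patch, and build_encoded_iov() and encoded_payload_len() are
hypothetical helper names, not part of the series.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Local mirror of the struct from the patch (not taken from a header). */
struct encoded_iov {
    uint64_t len;
    uint64_t unencoded_len;
    uint64_t unencoded_offset;
    uint32_t compression;
    uint32_t encryption;
};

/* iov[0] carries the metadata, iov[1] the encoded data. Returns iovcnt. */
static int build_encoded_iov(struct iovec iov[2], struct encoded_iov *meta,
                             void *data, size_t data_len)
{
    iov[0].iov_base = meta;
    iov[0].iov_len = sizeof(*meta);
    iov[1].iov_base = data;
    iov[1].iov_len = data_len;
    return 2;
}

/* Documented return value of a successful encoded write: the sum of
 * iov[n].iov_len for 1 <= n < iovcnt (the metadata is not counted). */
static size_t encoded_payload_len(const struct iovec *iov, int iovcnt)
{
    size_t total = 0;

    for (int n = 1; n < iovcnt; n++)
        total += iov[n].iov_len;
    return total;
}
```

On a kernel carrying this series, the array would then be passed to
pwritev2() with the RWF_ENCODED flag from the patch; that flag is not in
released headers, so the call itself is left out of the sketch.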

[snip]

> > > I find the unencoded_len/unencoded_offset API extremely confusing and all
> > > the clarifications above did not help to ease this feeling.
> > > Please remind me why does the API need to expose unencoded details at all.
> > > I understand the backup/restore use case for read/write encoded data.
> > > I do not understand how unencoded offset info is relevant to this use case
> > > or what are the other use cases it is relevant for.
> >
> > I agree, it's confusing. However, without this concept on the read side,
> > there's no way to represent some file extent layouts, and without the
> > write side, those layouts can't be written back out. That would make
> > this interface much less useful for backups.
> >
> > These cases arise in a few ways on Btrfs:
> >
> > 1. Files with a size unaligned to the block size.
> >
> >    Ignoring inline data, Btrfs always pads data to the filesystem block
> >    size when compressing. So, a file with a size unaligned to the block
> >    size will end with an extent that decompresses to a multiple of the
> >    block size, but logically the file only contains the data up to
> >    i_size. In this case, len (length up to i_size) < unencoded_len (full
> >    decompressed length). This can arise simply from writing out an
> >    unaligned file or from truncating a file unaligned.
> >
> > 2. FICLONERANGE from the middle of an extent.
> >
> >    Suppose file A has a large compressed extent with
> >    len = unencoded_len = 128k and unencoded_offset = 0. If the user does
> >    an FICLONERANGE out of the middle of that extent (say, 64k long and
> >    4k from the start of the extent), Btrfs creates a "partial" extent
> >    which references the original extent (in my example, the result would
> >    have len = 64k, unencoded_offset = 4k, and unencoded_len still 128k).
> >
> > 3. Overwriting the middle of an extent.
> >
> >    In some cases, when the middle of an extent is overwritten (e.g., an
> >    O_DIRECT write, FICLONERANGE, or FIDEDUPERANGE), Btrfs splits up the
> >    overwritten extents into partial extents referencing the original
> >    extent instead of rewriting the whole extent.
> >
> > These aren't specific to compression or Btrfs' on-disk format. fscrypt
> > uses block ciphers for file data, so case 1 is just as relevant for
> > that. The way Btrfs handles case 2 is the only sane way I can see for
> > supporting FICLONERANGE for encoded data.
> >
> 
> I see... so now I understand the complication, but that doesn't mean
> that the developers reading the encoded_io documentation will or that
> they will get the implementation details right.
> 
> IMO, if the only use case for encoded I/O is backup/restore, then we
> should make the API simpler and more oriented to this use case, namely,
> serialization.
> For all I care, btrfs can still return struct encoded_iov in iov[0],
> but the user need not know about this, and this internal detail should
> be neither documented nor exposed in UAPI.
> btrfs send reads a stream of encoded data and metadata that describes it.
> btrfs receive writes the encoded data stream and metadata descriptors that
> tell the file system about overlapping extents and whatnot.
> 
> Is that something that can work out, or does userspace have to be aware
> of encoded extents layout?

There are use cases outside of backups that would benefit from being
able to make arbitrary encoded writes. Specifically, one of my
colleagues at Facebook expressed interest in using encoded writes for
package distribution. The idea is that a package could be distributed as
a compressed archive and installed via encoded writes and reflinks to
the proper files, avoiding any need to decompress the package contents
before they're actually accessed. This sort of low-level fiddling needs
a proper UAPI. I'd much rather improve the documentation than make it
opaque.

Thanks,
Omar
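
To make the three cases quoted above concrete, here is a minimal C sketch
(an illustration, not code from the patch series) of how a reader of an
encoded extent could work out which decoded bytes are logically part of
the file. The struct and function names (extent_view, decoded_split,
split_decoded_extent) are hypothetical; only the three field names and the
two inequalities come from the proposed encoded_io(7) text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the length fields of the proposed
 * struct encoded_iov; the encoding fields are not needed here. */
struct extent_view {
    uint64_t len;              /* bytes logically present in the file */
    uint64_t unencoded_len;    /* size of the fully decoded extent */
    uint64_t unencoded_offset; /* where the file data starts in it */
};

/* How the decoded extent divides into ignored and used bytes. */
struct decoded_split {
    uint64_t skip_head; /* decoded bytes before the file data */
    uint64_t keep;      /* decoded bytes that belong to the file */
    uint64_t skip_tail; /* decoded bytes after the file data */
};

/* Returns false if the fields violate
 *   unencoded_offset <= unencoded_len  and
 *   len <= unencoded_len - unencoded_offset. */
static bool split_decoded_extent(const struct extent_view *v,
                                 struct decoded_split *out)
{
    assert(v != NULL && out != NULL);

    if (v->unencoded_offset > v->unencoded_len)
        return false;
    if (v->len > v->unencoded_len - v->unencoded_offset)
        return false;

    out->skip_head = v->unencoded_offset;
    out->keep = v->len;
    out->skip_tail = v->unencoded_len - v->unencoded_offset - v->len;
    return true;
}
```

With the FICLONERANGE example from case 2 (len = 64k, unencoded_offset = 4k,
unencoded_len = 128k), this gives 4k to skip, 64k to keep, and 60k of
trailing decoded data to ignore. Case 1 is the same computation with
unencoded_offset = 0 and a non-zero tail.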
Michael Kerrisk (man-pages) April 16, 2020, 12:26 p.m. UTC | #5
Hello Omar,

(Unless you CC both me and mtk.manpages@gmail.com, it's easily
possible that I will miss your man-pages patches.)

What's the status here? I presume the features documented here are not
yet merged, right? Is the aim still to have them merged in the future?

Thanks,

Michael

On Sat, 29 Feb 2020 at 00:16, Omar Sandoval <osandov@osandov.com> wrote:
>
> From: Omar Sandoval <osandov@fb.com>
>
> This adds a new page, encoded_io(7), providing an overview of encoded
> I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
> reference it.
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  man2/fcntl.2      |  10 +-
>  man2/open.2       |  13 ++
>  man2/readv.2      |  64 ++++++++++
>  man7/encoded_io.7 | 298 ++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 384 insertions(+), 1 deletion(-)
>  create mode 100644 man7/encoded_io.7
>
> diff --git a/man2/fcntl.2 b/man2/fcntl.2
> index bb1ac1f5d..15a1010a6 100644
> --- a/man2/fcntl.2
> +++ b/man2/fcntl.2
> @@ -222,8 +222,9 @@ On Linux, this command can change only the
>  .BR O_ASYNC ,
>  .BR O_DIRECT ,
>  .BR O_NOATIME ,
> +.BR O_NONBLOCK ,
>  and
> -.B O_NONBLOCK
> +.B O_ALLOW_ENCODED
>  flags.
>  It is not possible to change the
>  .BR O_DSYNC
> @@ -1821,6 +1822,13 @@ Attempted to clear the
>  flag on a file that has the append-only attribute set.
>  .TP
>  .B EPERM
> +Attempted to set the
> +.B O_ALLOW_ENCODED
> +flag and the calling process did not have the
> +.B CAP_SYS_ADMIN
> +capability.
> +.TP
> +.B EPERM
>  .I cmd
>  was
>  .BR F_ADD_SEALS ,
> diff --git a/man2/open.2 b/man2/open.2
> index 3ab4ee17b..256cb4247 100644
> --- a/man2/open.2
> +++ b/man2/open.2
> @@ -421,6 +421,14 @@ was followed by a call to
>  .BR fdatasync (2)).
>  .IR "See NOTES below" .
>  .TP
> +.B O_ALLOW_ENCODED
> +Open the file with encoded I/O permissions;
> +see
> +.BR encoded_io (7).
> +The caller must have the
> +.B CAP_SYS_ADMIN
> +capability.
> +.TP
>  .B O_EXCL
>  Ensure that this call creates the file:
>  if this flag is specified in conjunction with
> @@ -1176,6 +1184,11 @@ did not match the owner of the file and the caller was not privileged.
>  The operation was prevented by a file seal; see
>  .BR fcntl (2).
>  .TP
> +.B EPERM
> +The
> +.B O_ALLOW_ENCODED
> +flag was specified, but the caller was not privileged.
> +.TP
>  .B EROFS
>  .I pathname
>  refers to a file on a read-only filesystem and write access was
> diff --git a/man2/readv.2 b/man2/readv.2
> index af27aa63e..8b5458023 100644
> --- a/man2/readv.2
> +++ b/man2/readv.2
> @@ -265,6 +265,11 @@ the data is always appended to the end of the file.
>  However, if the
>  .I offset
>  argument is \-1, the current file offset is updated.
> +.TP
> +.BR RWF_ENCODED " (since Linux 5.7)"
> +Read or write encoded (e.g., compressed) data.
> +See
> +.BR encoded_io (7).
>  .SH RETURN VALUE
>  On success,
>  .BR readv (),
> @@ -284,6 +289,13 @@ than requested (see
>  and
>  .BR write (2)).
>  .PP
> +If
> +.B
> +RWF_ENCODED
> +was specified in
> +.IR flags ,
> +then the return value is the number of encoded bytes.
> +.PP
>  On error, \-1 is returned, and \fIerrno\fP is set appropriately.
>  .SH ERRORS
>  The errors are as given for
> @@ -314,6 +326,58 @@ is less than zero or greater than the permitted maximum.
>  .TP
>  .B EOPNOTSUPP
>  An unknown flag is specified in \fIflags\fP.
> +.TP
> +.B EOPNOTSUPP
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the filesystem does not implement encoded I/O.
> +.TP
> +.B EPERM
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the file was not opened with the
> +.B O_ALLOW_ENCODED
> +flag.
> +.PP
> +.BR preadv2 ()
> +can fail for the following reasons:
> +.TP
> +.B E2BIG
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and
> +.I iov[0]
> +is not large enough to return the encoding metadata.
> +.TP
> +.B ENOBUFS
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the buffers in
> +.I iov
> +are not big enough to return the encoded data.
> +.PP
> +.BR pwritev2 ()
> +can fail for the following reasons:
> +.TP
> +.B E2BIG
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and
> +.I iov[0]
> +contains non-zero fields
> +after the kernel's
> +.IR "sizeof(struct\ encoded_iov)" .
> +.TP
> +.B EINVAL
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the alignment and/or size requirements are not met.
>  .SH VERSIONS
>  .BR preadv ()
>  and
> diff --git a/man7/encoded_io.7 b/man7/encoded_io.7
> new file mode 100644
> index 000000000..72b40353f
> --- /dev/null
> +++ b/man7/encoded_io.7
> @@ -0,0 +1,298 @@
> +.\" Copyright (c) 2019 by Omar Sandoval <osandov@fb.com>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.\"
> +.TH ENCODED_IO  7 2019-10-14 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +encoded_io \- overview of encoded I/O
> +.SH DESCRIPTION
> +Several filesystems (e.g., Btrfs) support transparent encoding
> +(e.g., compression, encryption) of data on disk:
> +written data is encoded by the kernel before it is written to disk,
> +and read data is decoded before being returned to the user.
> +In some cases, it is useful to skip this encoding step.
> +For example, the user may want to read the compressed contents of a file
> +or write pre-compressed data directly to a file.
> +This is referred to as "encoded I/O".
> +.SS Encoded I/O API
> +Encoded I/O is specified with the
> +.B RWF_ENCODED
> +flag to
> +.BR preadv2 (2)
> +and
> +.BR pwritev2 (2).
> +If
> +.B RWF_ENCODED
> +is specified, then
> +.I iov[0].iov_base
> +points to an
> +.I
> +encoded_iov
> +structure, defined in
> +.I <linux/fs.h>
> +as:
> +.PP
> +.in +4n
> +.EX
> +struct encoded_iov {
> +    __aligned_u64 len;
> +    __aligned_u64 unencoded_len;
> +    __aligned_u64 unencoded_offset;
> +    __u32 compression;
> +    __u32 encryption;
> +};
> +.EE
> +.in
> +.PP
> +This may be extended in the future, so
> +.I iov[0].iov_len
> +must be set to
> +.I "sizeof(struct\ encoded_iov)"
> +for forward/backward compatibility.
> +The remaining buffers contain the encoded data.
> +.PP
> +.I compression
> +and
> +.I encryption
> +are the encoding fields.
> +.I compression
> +is one of
> +.B ENCODED_IOV_COMPRESSION_NONE
> +(zero),
> +.BR ENCODED_IOV_COMPRESSION_ZLIB ,
> +.BR ENCODED_IOV_COMPRESSION_LZO ,
> +or
> +.BR ENCODED_IOV_COMPRESSION_ZSTD .
> +.I encryption
> +is currently always
> +.B ENCODED_IOV_ENCRYPTION_NONE
> +(zero).
> +.PP
> +.I unencoded_len
> +is the length of the unencoded (i.e., decrypted and decompressed) data.
> +.I unencoded_offset
> +is the offset into the unencoded data where the data in the file begins
> +(less than or equal to
> +.IR unencoded_len ).
> +.I len
> +is the length of the data in the file
> +(less than or equal to
> +.I unencoded_len
> +-
> +.IR unencoded_offset ).
> +.I
> +.PP
> +In most cases,
> +.I len
> +is equal to
> +.I unencoded_len
> +and
> +.I unencoded_offset
> +is zero.
> +However, it may be necessary to refer to a subset of the unencoded data,
> +usually because a read occurred in the middle of an encoded extent,
> +because part of an extent was overwritten or deallocated in some
> +way (e.g., with
> +.BR write (2),
> +.BR truncate (2),
> +or
> +.BR fallocate (2))
> +or because part of an extent was added to the file (e.g., with
> +.BR ioctl_ficlonerange (2)
> +or
> +.BR ioctl_fideduperange (2)).
> +For example, if
> +.I len
> +is 300,
> +.I unencoded_len
> +is 1000,
> +and
> +.I unencoded_offset
> +is 600,
> +then the encoded data is 1000 bytes long when decoded,
> +of which only the 300 bytes starting at offset 600 are used;
> +the first 600 and last 100 bytes should be ignored.
> +.PP
> +If the unencoded data is actually longer than
> +.IR unencoded_len ,
> +then it is truncated;
> +if it is shorter, then it is extended with zeroes.
> +.PP
> +For
> +.BR pwritev2 (),
> +the metadata should be specified in
> +.IR iov[0] .
> +If
> +.I iov[0].iov_len
> +is less than
> +.I "sizeof(struct\ encoded_iov)"
> +in the kernel,
> +then any fields unknown to userspace are treated as if they were zero;
> +if it is greater and any fields unknown to the kernel are non-zero,
> +then this returns -1 and sets
> +.I errno
> +to
> +.BR E2BIG .
> +The encoded data should be passed in the remaining buffers.
> +This returns the number of encoded bytes written (that is, the sum of
> +.I iov[n].iov_len
> +for 1 <=
> +.I n
> +<
> +.IR iovcnt ;
> +partial writes will not occur).
> +If the
> +.I offset
> +argument to
> +.BR pwritev2 ()
> +is -1, then the file offset is incremented by
> +.IR len .
> +At least one encoding field must be non-zero.
> +Note that the encoded data is not validated when it is written;
> +if it is not valid (e.g., it cannot be decompressed),
> +then a subsequent read may return an error.
> +.PP
> +For
> +.BR preadv2 (),
> +the metadata is returned in
> +.IR iov[0] .
> +If
> +.I iov[0].iov_len
> +is less than
> +.I "sizeof(struct\ encoded_iov)"
> +in the kernel and any fields unknown to userspace are non-zero,
> +then this returns -1 and sets
> +.I errno
> +to
> +.BR E2BIG ;
> +if it is greater,
> +then any fields unknown to the kernel are returned as zero.
> +The encoded data is returned in the remaining buffers.
> +If the provided buffers are not large enough to return an entire encoded
> +extent,
> +then this returns -1 and sets
> +.I errno
> +to
> +.BR ENOBUFS .
> +This returns the number of encoded bytes read.
> +If the
> +.I offset
> +argument to
> +.BR preadv2 ()
> +is -1, then the file offset is incremented by
> +.IR len .
> +This will only return one encoded extent per call.
> +This can also read data which is not encoded;
> +all encoding fields will be zero in that case.
> +.PP
> +As the filesystem page cache typically contains decoded data,
> +encoded I/O bypasses the page cache.
> +.SS Security
> +Encoded I/O creates the potential for some security issues:
> +.IP * 3
> +Encoded writes allow writing arbitrary data which the kernel will decode on
> +a subsequent read. Decompression algorithms are complex and may have bugs
> +which can be exploited by maliciously crafted data.
> +.IP *
> +Encoded reads may return data which is not logically present in the file
> +(see the discussion of
> +.I len
> +vs.
> +.I unencoded_len
> +above).
> +It may not be intended for this data to be readable.
> +.PP
> +Therefore, encoded I/O requires privilege.
> +Namely, the
> +.B RWF_ENCODED
> +flag may only be used when the file was opened with the
> +.B O_ALLOW_ENCODED
> +flag to
> +.BR open (2),
> +which requires the
> +.B CAP_SYS_ADMIN
> +capability.
> +.B O_ALLOW_ENCODED
> +may be set and cleared with
> +.BR fcntl (2).
> +Note that it is not cleared on
> +.BR fork (2)
> +or
> +.BR execve (2);
> +one may wish to use
> +.B O_CLOEXEC
> +with
> +.BR O_ALLOW_ENCODED .
> +.SS Filesystem support
> +Encoded I/O is supported on the following filesystems:
> +.TP
> +Btrfs (since Linux 5.8)
> +.IP
> +Btrfs supports encoded reads and writes of compressed data.
> +The data is encoded as follows:
> +.RS
> +.IP * 3
> +If
> +.I compression
> +is
> +.BR ENCODED_IOV_COMPRESSION_ZLIB ,
> +then the encoded data is a single zlib stream.
> +.IP *
> +If
> +.I compression
> +is
> +.BR ENCODED_IOV_COMPRESSION_LZO ,
> +then the encoded data is compressed page by page with LZO1X
> +and wrapped in the format documented in the Linux kernel source file
> +.IR fs/btrfs/lzo.c .
> +.IP *
> +If
> +.I compression
> +is
> +.BR ENCODED_IOV_COMPRESSION_ZSTD ,
> +then the encoded data is a single zstd frame compressed with the
> +.I windowLog
> +compression parameter set to no more than 17.
> +.RE
> +.IP
> +Additionally, there are some restrictions on
> +.BR pwritev2 ():
> +.RS
> +.IP * 3
> +.I offset
> +(or the current file offset if
> +.I offset
> +is -1) must be aligned to the sector size of the filesystem.
> +.IP *
> +.I len
> +must be aligned to the sector size of the filesystem
> +unless the data ends at or beyond the current end of the file.
> +.IP *
> +.I unencoded_len
> +and the length of the encoded data must each be no more than 128 KiB.
> +This limit may increase in the future.
> +.IP *
> +The length of the encoded data must be less than or equal to
> +.IR unencoded_len .
> +.RE
> --
> 2.25.1
>
Omar Sandoval April 16, 2020, 5:02 p.m. UTC | #6
On Thu, Apr 16, 2020 at 02:26:01PM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Omar,
> 
> (Unless you CC both me and mtk.manpages@gmail.com, it's easily
> possible that I will miss your man-pages patches.)

That's good to know, thanks. Do you mind being CCd on man-pages for
features that haven't been finalized yet?

> What's the status here? I presume the features documented here are not
> yet merged, right? Is the aim still to have them merged in the future?

They're not yet merged but I'm still working on having them merged. I'm
still waiting for VFS review.

Thanks!
Michael Kerrisk (man-pages) April 16, 2020, 8:39 p.m. UTC | #7
Hello Omar,

On Thu, 16 Apr 2020 at 19:02, Omar Sandoval <osandov@osandov.com> wrote:
>
> On Thu, Apr 16, 2020 at 02:26:01PM +0200, Michael Kerrisk (man-pages) wrote:
> > Hello Omar,
> >
> > (Unless you CC both me and mtk.manpages@gmail.com, it's easily
> > possible that I will miss your man-pages patches.)
>
> That's good to know, thanks. Do you mind being CCd on man-pages for
> features that haven't been finalized yet?

Please do CC me and linux-man@ on such patches. Just make sure that
the patch notes that the feature is not yet upstream.
>
> > What's the status here? I presume the features documented here are not
> > yet merged, right? Is the aim still to have them merged in the future?
>
> They're not yet merged but I'm still working on having them merged. I'm
> still waiting for VFS review.

Okay.

Thanks,

Michael

Patch

diff --git a/man2/fcntl.2 b/man2/fcntl.2
index bb1ac1f5d..15a1010a6 100644
--- a/man2/fcntl.2
+++ b/man2/fcntl.2
@@ -222,8 +222,9 @@  On Linux, this command can change only the
 .BR O_ASYNC ,
 .BR O_DIRECT ,
 .BR O_NOATIME ,
+.BR O_NONBLOCK ,
 and
-.B O_NONBLOCK
+.B O_ALLOW_ENCODED
 flags.
 It is not possible to change the
 .BR O_DSYNC
@@ -1821,6 +1822,13 @@  Attempted to clear the
 flag on a file that has the append-only attribute set.
 .TP
 .B EPERM
+Attempted to set the
+.B O_ALLOW_ENCODED
+flag and the calling process did not have the
+.B CAP_SYS_ADMIN
+capability.
+.TP
+.B EPERM
 .I cmd
 was
 .BR F_ADD_SEALS ,
diff --git a/man2/open.2 b/man2/open.2
index 3ab4ee17b..256cb4247 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -421,6 +421,14 @@  was followed by a call to
 .BR fdatasync (2)).
 .IR "See NOTES below" .
 .TP
+.B O_ALLOW_ENCODED
+Open the file with encoded I/O permissions;
+see
+.BR encoded_io (7).
+The caller must have the
+.B CAP_SYS_ADMIN
+capability.
+.TP
 .B O_EXCL
 Ensure that this call creates the file:
 if this flag is specified in conjunction with
@@ -1176,6 +1184,11 @@  did not match the owner of the file and the caller was not privileged.
 The operation was prevented by a file seal; see
 .BR fcntl (2).
 .TP
+.B EPERM
+The
+.B O_ALLOW_ENCODED
+flag was specified, but the caller was not privileged.
+.TP
 .B EROFS
 .I pathname
 refers to a file on a read-only filesystem and write access was
diff --git a/man2/readv.2 b/man2/readv.2
index af27aa63e..8b5458023 100644
--- a/man2/readv.2
+++ b/man2/readv.2
@@ -265,6 +265,11 @@  the data is always appended to the end of the file.
 However, if the
 .I offset
 argument is \-1, the current file offset is updated.
+.TP
+.BR RWF_ENCODED " (since Linux 5.7)"
+Read or write encoded (e.g., compressed) data.
+See
+.BR encoded_io (7).
 .SH RETURN VALUE
 On success,
 .BR readv (),
@@ -284,6 +289,13 @@  than requested (see
 and
 .BR write (2)).
 .PP
+If
+.B
+RWF_ENCODED
+was specified in
+.IR flags ,
+then the return value is the number of encoded bytes.
+.PP
 On error, \-1 is returned, and \fIerrno\fP is set appropriately.
 .SH ERRORS
 The errors are as given for
@@ -314,6 +326,58 @@  is less than zero or greater than the permitted maximum.
 .TP
 .B EOPNOTSUPP
 An unknown flag is specified in \fIflags\fP.
+.TP
+.B EOPNOTSUPP
+.B RWF_ENCODED
+is specified in
+.I flags
+and the filesystem does not implement encoded I/O.
+.TP
+.B EPERM
+.B RWF_ENCODED
+is specified in
+.I flags
+and the file was not opened with the
+.B O_ALLOW_ENCODED
+flag.
+.PP
+.BR preadv2 ()
+can fail for the following reasons:
+.TP
+.B E2BIG
+.B RWF_ENCODED
+is specified in
+.I flags
+and
+.I iov[0]
+is not large enough to return the encoding metadata.
+.TP
+.B ENOBUFS
+.B RWF_ENCODED
+is specified in
+.I flags
+and the buffers in
+.I iov
+are not big enough to return the encoded data.
+.PP
+.BR pwritev2 ()
+can fail for the following reasons:
+.TP
+.B E2BIG
+.B RWF_ENCODED
+is specified in
+.I flags
+and
+.I iov[0]
+contains non-zero fields
+after the kernel's
+.IR "sizeof(struct\ encoded_iov)" .
+.TP
+.B EINVAL
+.B RWF_ENCODED
+is specified in
+.I flags
+and the alignment and/or size requirements are not met.
 .SH VERSIONS
 .BR preadv ()
 and
diff --git a/man7/encoded_io.7 b/man7/encoded_io.7
new file mode 100644
index 000000000..72b40353f
--- /dev/null
+++ b/man7/encoded_io.7
@@ -0,0 +1,298 @@ 
+.\" Copyright (c) 2019 by Omar Sandoval <osandov@fb.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.\"
+.TH ENCODED_IO  7 2019-10-14 "Linux" "Linux Programmer's Manual"
+.SH NAME
+encoded_io \- overview of encoded I/O
+.SH DESCRIPTION
+Several filesystems (e.g., Btrfs) support transparent encoding
+(e.g., compression, encryption) of data on disk:
+written data is encoded by the kernel before it is written to disk,
+and read data is decoded before being returned to the user.
+In some cases, it is useful to skip this encoding step.
+For example, the user may want to read the compressed contents of a file
+or write pre-compressed data directly to a file.
+This is referred to as "encoded I/O".
+.SS Encoded I/O API
+Encoded I/O is specified with the
+.B RWF_ENCODED
+flag to
+.BR preadv2 (2)
+and
+.BR pwritev2 (2).
+If
+.B RWF_ENCODED
+is specified, then
+.I iov[0].iov_base
+points to an
+.I
+encoded_iov
+structure, defined in
+.I <linux/fs.h>
+as:
+.PP
+.in +4n
+.EX
+struct encoded_iov {
+    __aligned_u64 len;
+    __aligned_u64 unencoded_len;
+    __aligned_u64 unencoded_offset;
+    __u32 compression;
+    __u32 encryption;
+};
+.EE
+.in
+.PP
+This may be extended in the future, so
+.I iov[0].iov_len
+must be set to
+.I "sizeof(struct\ encoded_iov)"
+for forward/backward compatibility.
+The remaining buffers contain the encoded data.
+.PP
+.I compression
+and
+.I encryption
+are the encoding fields.
+.I compression
+is one of
+.B ENCODED_IOV_COMPRESSION_NONE
+(zero),
+.BR ENCODED_IOV_COMPRESSION_ZLIB ,
+.BR ENCODED_IOV_COMPRESSION_LZO ,
+or
+.BR ENCODED_IOV_COMPRESSION_ZSTD .
+.I encryption
+is currently always
+.B ENCODED_IOV_ENCRYPTION_NONE
+(zero).
+.PP
+.I unencoded_len
+is the length of the unencoded (i.e., decrypted and decompressed) data.
+.I unencoded_offset
+is the offset into the unencoded data where the data in the file begins
+(less than or equal to
+.IR unencoded_len ).
+.I len
+is the length of the data in the file
+(less than or equal to
+.I unencoded_len
+\-
+.IR unencoded_offset ).
+.I
+.PP
+In most cases,
+.I len
+is equal to
+.I unencoded_len
+and
+.I unencoded_offset
+is zero.
+However, it may be necessary to refer to a subset of the unencoded data,
+usually because a read occurred in the middle of an encoded extent,
+because part of an extent was overwritten or deallocated in some
+way (e.g., with
+.BR write (2),
+.BR truncate (2),
+or
+.BR fallocate (2))
+or because part of an extent was added to the file (e.g., with
+.BR ioctl_ficlonerange (2)
+or
+.BR ioctl_fideduperange (2)).
+For example, if
+.I len
+is 300,
+.I unencoded_len
+is 1000,
+and
+.I unencoded_offset
+is 600,
+then the encoded data is 1000 bytes long when decoded,
+of which only the 300 bytes starting at offset 600 are used;
+the first 600 and last 100 bytes should be ignored.
+.PP
+If the unencoded data is actually longer than
+.IR unencoded_len ,
+then it is truncated;
+if it is shorter, then it is extended with zeroes.
+.PP
+For
+.BR pwritev2 (),
+the metadata should be specified in
+.IR iov[0] .
+If
+.I iov[0].iov_len
+is less than
+.I "sizeof(struct\ encoded_iov)"
+in the kernel,
+then any fields unknown to userspace are treated as if they were zero;
+if it is greater and any fields unknown to the kernel are non-zero,
+then this returns \-1 and sets
+.I errno
+to
+.BR E2BIG .
+The encoded data should be passed in the remaining buffers.
+This returns the number of encoded bytes written (that is, the sum of
+.I iov[n].iov_len
+for 1 <=
+.I n
+<
+.IR iovcnt ;
+partial writes will not occur).
+If the
+.I offset
+argument to
+.BR pwritev2 ()
+is \-1, then the file offset is incremented by
+.IR len .
+At least one encoding field must be non-zero.
+Note that the encoded data is not validated when it is written;
+if it is not valid (e.g., it cannot be decompressed),
+then a subsequent read may return an error.
+.PP
+For
+.BR preadv2 (),
+the metadata is returned in
+.IR iov[0] .
+If
+.I iov[0].iov_len
+is less than
+.I "sizeof(struct\ encoded_iov)"
+in the kernel and any fields unknown to userspace are non-zero,
+then this returns \-1 and sets
+.I errno
+to
+.BR E2BIG ;
+if it is greater,
+then any fields unknown to the kernel are returned as zero.
+The encoded data is returned in the remaining buffers.
+If the provided buffers are not large enough to return an entire encoded
+extent,
+then this returns \-1 and sets
+.I errno
+to
+.BR ENOBUFS .
+This returns the number of encoded bytes read.
+If the
+.I offset
+argument to
+.BR preadv2 ()
+is \-1, then the file offset is incremented by
+.IR len .
+This will only return one encoded extent per call.
+This can also read data which is not encoded;
+all encoding fields will be zero in that case.
+.PP
+As the filesystem page cache typically contains decoded data,
+encoded I/O bypasses the page cache.
+.SS Security
+Encoded I/O creates the potential for some security issues:
+.IP * 3
+Encoded writes allow writing arbitrary data which the kernel will decode on
+a subsequent read. Decompression algorithms are complex and may have bugs
+which can be exploited by maliciously crafted data.
+.IP *
+Encoded reads may return data which is not logically present in the file
+(see the discussion of
+.I len
+vs.
+.I unencoded_len
+above).
+It may not be intended for this data to be readable.
+.PP
+Therefore, encoded I/O requires privilege.
+Namely, the
+.B RWF_ENCODED
+flag may only be used when the file was opened with the
+.B O_ALLOW_ENCODED
+flag to
+.BR open (2),
+which requires the
+.B CAP_SYS_ADMIN
+capability.
+.B O_ALLOW_ENCODED
+may be set and cleared with
+.BR fcntl (2).
+Note that it is not cleared on
+.BR fork (2)
+or
+.BR execve (2);
+one may wish to use
+.B O_CLOEXEC
+with
+.BR O_ALLOW_ENCODED .
+.SS Filesystem support
+Encoded I/O is supported on the following filesystems:
+.TP
+Btrfs (since Linux 5.8)
+.IP
+Btrfs supports encoded reads and writes of compressed data.
+The data is encoded as follows:
+.RS
+.IP * 3
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_ZLIB ,
+then the encoded data is a single zlib stream.
+.IP *
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_LZO ,
+then the encoded data is compressed page by page with LZO1X
+and wrapped in the format documented in the Linux kernel source file
+.IR fs/btrfs/lzo.c .
+.IP *
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_ZSTD ,
+then the encoded data is a single zstd frame compressed with the
+.I windowLog
+compression parameter set to no more than 17.
+.RE
+.IP
+Additionally, there are some restrictions on
+.BR pwritev2 ():
+.RS
+.IP * 3
+.I offset
+(or the current file offset if
+.I offset
+is \-1) must be aligned to the sector size of the filesystem.
+.IP *
+.I len
+must be aligned to the sector size of the filesystem
+unless the data ends at or beyond the current end of the file.
+.IP *
+.I unencoded_len
+and the length of the encoded data must each be no more than 128 KiB.
+This limit may increase in the future.
+.IP *
+The length of the encoded data must be less than or equal to
+.IR unencoded_len .
+.RE
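
The Btrfs restrictions on pwritev2() listed at the end of the page could be
checked in userspace before issuing an encoded write, roughly as in the C
sketch below. This is only a sketch under assumptions: struct encoded_iov
and RWF_ENCODED are not in released kernel headers, so the struct is
redeclared locally under a hypothetical name, and the helper names, the
sector_size and file_size parameters, and the SKETCH_MAX_EXTENT constant
are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Local copy of the layout proposed in the patch (hypothetical name). */
struct encoded_iov_proposed {
    uint64_t len;
    uint64_t unencoded_len;
    uint64_t unencoded_offset;
    uint32_t compression;
    uint32_t encryption;
};

/* 128 KiB limit taken from the "Filesystem support" section. */
#define SKETCH_MAX_EXTENT (128 * 1024)

/* Pre-check the documented Btrfs rules for an encoded write.
 * encoded_bytes is the total size of the data buffers (iov[1..]);
 * sector_size and file_size must be supplied by the caller. */
static bool btrfs_encoded_write_ok(const struct encoded_iov_proposed *e,
                                   uint64_t offset, uint64_t encoded_bytes,
                                   uint64_t sector_size, uint64_t file_size)
{
    assert(e != NULL && sector_size != 0);

    /* At least one encoding field must be non-zero. */
    if (e->compression == 0 && e->encryption == 0)
        return false;
    /* Size limits; these also bound len for the checks below. */
    if (e->unencoded_len > SKETCH_MAX_EXTENT ||
        encoded_bytes > SKETCH_MAX_EXTENT)
        return false;
    if (encoded_bytes > e->unencoded_len)
        return false;
    /* General field invariants from the API description. */
    if (e->unencoded_offset > e->unencoded_len ||
        e->len > e->unencoded_len - e->unencoded_offset)
        return false;
    /* The write must start on a sector boundary. */
    if (offset % sector_size != 0)
        return false;
    if (offset > UINT64_MAX - e->len)
        return false;
    /* An unaligned len is allowed only at or beyond end of file. */
    if (e->len % sector_size != 0 && offset + e->len < file_size)
        return false;
    return true;
}

/* Lay out the buffers: metadata in iov[0], encoded data in iov[1]. */
static void fill_encoded_iov(struct iovec iov[2],
                             struct encoded_iov_proposed *e,
                             void *data, size_t data_len)
{
    iov[0].iov_base = e;
    iov[0].iov_len = sizeof(*e);
    iov[1].iov_base = data;
    iov[1].iov_len = data_len;
}
```

A caller would then pass iov with iovcnt = 2 to pwritev2() together with
the proposed RWF_ENCODED flag. The sketch stops short of that call because
the flag's value is not defined in released headers, and the kernel would
still perform its own checks and may reject a write this function accepts.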