diff mbox series

[man-pages,v6] Document encoded I/O

Message ID ec1588a618bd313e5a7c05a7f4954cc2b76ddac3.1605724767.git.osandov@osandov.com (mailing list archive)
State New, archived
Headers show
Series [man-pages,v6] Document encoded I/O | expand

Commit Message

Omar Sandoval Nov. 18, 2020, 7:18 p.m. UTC
From: Omar Sandoval <osandov@fb.com>

This adds a new page, encoded_io(7), providing an overview of encoded
I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
reference it.

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: linux-man <linux-man@vger.kernel.org>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
This feature is not yet upstream.

 man2/fcntl.2      |  10 +-
 man2/open.2       |  23 +++
 man2/readv.2      |  70 +++++++++
 man7/encoded_io.7 | 369 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 471 insertions(+), 1 deletion(-)
 create mode 100644 man7/encoded_io.7

Comments

Alejandro Colomar (mailing lists; readonly) Nov. 19, 2020, 11:29 p.m. UTC | #1
Hi Omar,

Please, see some fixes below:

Michael, I've also some questions for you below
(you can grep for mtk to find those).

Thanks,

Alex

On 11/18/20 8:18 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> This adds a new page, encoded_io(7), providing an overview of encoded
> I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
> reference it.
> 
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: linux-man <linux-man@vger.kernel.org>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
> This feature is not yet upstream.
> 
>  man2/fcntl.2      |  10 +-
>  man2/open.2       |  23 +++
>  man2/readv.2      |  70 +++++++++
>  man7/encoded_io.7 | 369 ++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 471 insertions(+), 1 deletion(-)
>  create mode 100644 man7/encoded_io.7
> 
> diff --git a/man2/fcntl.2 b/man2/fcntl.2
> index 546016617..b0d7fa2c3 100644
> --- a/man2/fcntl.2
> +++ b/man2/fcntl.2
> @@ -221,8 +221,9 @@ On Linux, this command can change only the
>  .BR O_ASYNC ,
>  .BR O_DIRECT ,
>  .BR O_NOATIME ,
> +.BR O_NONBLOCK ,
>  and
> -.B O_NONBLOCK
> +.B O_ALLOW_ENCODED
>  flags.
>  It is not possible to change the
>  .BR O_DSYNC
> @@ -1820,6 +1821,13 @@ Attempted to clear the
>  flag on a file that has the append-only attribute set.
>  .TP
>  .B EPERM
> +Attempted to set the
> +.B O_ALLOW_ENCODED
> +flag and the calling process did not have the
> +.B CAP_SYS_ADMIN
> +capability.
> +.TP
> +.B EPERM
>  .I cmd
>  was
>  .BR F_ADD_SEALS ,
> diff --git a/man2/open.2 b/man2/open.2
> index f587b0d95..84697dfa8 100644
> --- a/man2/open.2
> +++ b/man2/open.2
> @@ -437,6 +437,16 @@ was followed by a call to
>  .BR fdatasync (2)).
>  .IR "See NOTES below" .
>  .TP
> +.B O_ALLOW_ENCODED

The list is alphabetically sorted;
please, follow that
(O_ALLOW_ENCODED should be the first one).

> +Open the file with encoded I/O permissions;
> +see
> +.BR encoded_io (7).
> +.B O_CLOEXEC
> +must be specified in conjuction with this flag.
> +The caller must have the
> +.B CAP_SYS_ADMIN
> +capability.
> +.TP
>  .B O_EXCL
>  Ensure that this call creates the file:
>  if this flag is specified in conjunction with
> @@ -1082,6 +1092,14 @@ is invalid
>  (e.g., it contains characters not permitted by the underlying filesystem).
>  .TP
>  .B EINVAL
> +.B O_ALLOW_ENCODED
> +was specified in
> +.IR flags ,
> +but
> +.B O_CLOEXEC
> +was not specified.
> +.TP
> +.B EINVAL
>  The final component ("basename") of
>  .I pathname
>  is invalid
> @@ -1238,6 +1256,11 @@ did not match the owner of the file and the caller was not privileged.
>  The operation was prevented by a file seal; see
>  .BR fcntl (2).
>  .TP
> +.B EPERM
> +The
> +.B O_ALLOW_ENCODED
> +flag was specified, but the caller was not privileged.
> +.TP
>  .B EROFS
>  .I pathname
>  refers to a file on a read-only filesystem and write access was
> diff --git a/man2/readv.2 b/man2/readv.2
> index 5a8b74168..c9933acf0 100644
> --- a/man2/readv.2
> +++ b/man2/readv.2
> @@ -264,6 +264,11 @@ the data is always appended to the end of the file.
>  However, if the
>  .I offset
>  argument is \-1, the current file offset is updated.
> +.TP
> +.BR RWF_ENCODED " (since Linux 5.12)"
> +Read or write encoded (e.g., compressed) data.
> +See
> +.BR encoded_io (7).
>  .SH RETURN VALUE
>  On success,
>  .BR readv (),
> @@ -283,6 +288,13 @@ than requested (see
>  and
>  .BR write (2)).
>  .PP
> +If
> +.B
> +RWF_ENCODED

RWF_ENCODED should go in the same line as .B:

[
.B RWF_ENCODED
]

> +was specified in
> +.IR flags ,
> +then the return value is the number of encoded bytes.
> +.PP
>  On error, \-1 is returned, and \fIerrno\fP is set appropriately.
>  .SH ERRORS
>  The errors are as given for
> @@ -313,6 +325,64 @@ is less than zero or greater than the permitted maximum.
>  .TP
>  .B EOPNOTSUPP
>  An unknown flag is specified in \fIflags\fP.
> +.TP
> +.B EOPNOTSUPP
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the filesystem does not implement encoded I/O.
> +.TP
> +.B EPERM
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the file was not opened with the
> +.B O_ALLOW_ENCODED
> +flag.
> +.PP
> +.BR preadv2 ()
> +can fail for the following reasons:

The wording is a bit unclear:

Above your additions (old text, not yours),
it says that some errors apply to preadv2
(as well as to other functions):

[
ERRORS
       The errors are as given for read(2) and write(2).  Furthermore,
       preadv(),  preadv2(),  pwritev(),  and pwritev2() can also fail
       for the same reasons as lseek(2).  Additionally, the  following
       errors are defined:

       EINVAL The  sum  of  the  iov_len  values  overflows an ssize_t
              value.

       EINVAL The vector count, iovcnt, is less than zero  or  greater
              than the permitted maximum.

       EOPNOTSUPP
              An unknown flag is specified in flags.

       EOPNOTSUPP
              RWF_ENCODED  is  specified  in  flags and the filesystem
              does not implement encoded I/O.

       EPERM  RWF_ENCODED is specified in flags and the file  was  not
              opened with the O_ALLOW_ENCODED flag.
]

And then you added a line that says:

[
       preadv2() can fail for the following reasons:
]

Which if read strictly, it says that [only] the following errors apply.

Did you mean that
"preadv3() can _additionally_ fail for the following reasons"?

Could you please be a bit more specific?

The same applies for pwritev2() below.

> +.TP
> +.B E2BIG
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and
> +.I iov[0]
> +is not large enough to return the encoding metadata.
> +.TP
> +.B ENOBUFS
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the buffers in
> +.I iov
> +are not big enough to return the encoded data.
> +.PP
> +.BR pwritev2 ()
> +can fail for the following reasons:
> +.TP
> +.B E2BIG
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and
> +.I iov[0]
> +contains non-zero fields
> +after the kernel's
> +.IR "sizeof(struct\ encoded_iov)" .

Don't escape the space, if the string is already in "".

> +.TP
> +.B EINVAL
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the encoding is unknown or not supported by the filesystem.
> +.TP
> +.B EINVAL
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the alignment and/or size requirements are not met.
>  .SH VERSIONS
>  .BR preadv ()
>  and
> diff --git a/man7/encoded_io.7 b/man7/encoded_io.7
> new file mode 100644
> index 000000000..106fa587b
> --- /dev/null
> +++ b/man7/encoded_io.7
> @@ -0,0 +1,369 @@
> +.\" Copyright (c) 2020 by Omar Sandoval <osandov@fb.com>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.\"
> +.TH ENCODED_IO  7 2020-11-11 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +encoded_io \- overview of encoded I/O
> +.SH DESCRIPTION
> +Several filesystems (e.g., Btrfs) support transparent encoding
> +(e.g., compression, encryption) of data on disk:
> +written data is encoded by the kernel before it is written to disk,
> +and read data is decoded before being returned to the user.
> +In some cases, it is useful to skip this encoding step.

Here I would use ';' instead of '.'
(and next letter would be lowercase, then).

> +For example, the user may want to read the compressed contents of a file
> +or write pre-compressed data directly to a file.
> +This is referred to as "encoded I/O".
> +.SS Encoded I/O API
> +Encoded I/O is specified with the
> +.B RWF_ENCODED
> +flag to
> +.BR preadv2 (2)
> +and
> +.BR pwritev2 (2).
> +If
> +.B RWF_ENCODED
> +is specified, then
> +.I iov[0].iov_base
> +points to an
> +.I
> +encoded_iov

On the same line, please.

> +structure, defined in
> +.I <linux/fs.h>
> +as:
> +.PP
> +.in +4n
> +.EX
> +struct encoded_iov {
> +    __aligned_u64 len;
> +    __aligned_u64 unencoded_len;
> +    __aligned_u64 unencoded_offset;
> +    __u32 compression;
> +    __u32 encryption;
> +};
> +.EE
> +.in
> +.PP
> +This may be extended in the future, so
> +.I iov[0].iov_len
> +must be set to
> +.I "sizeof(struct\ encoded_iov)"
> +for forward/backward compatibility.
> +The remaining buffers contain the encoded data.
> +.PP
> +.I compression
> +and
> +.I encryption
> +are the encoding fields.
> +.I compression
> +is
> +.B ENCODED_IOV_COMPRESSION_NONE
> +(zero)
> +or a filesystem-specific
> +.B ENCODED_IOV_COMPRESSION

Maybe s/ENCODED_IOV_COMPRESSION/ENCODED_IOV_COMPRESSION_*/

> +constant;
> +see
> +.BR Filesystem\ support .

Please, write it as [.BR "Filesystem support" .]

and maybe I would change it, to be more specific, to the following:

[
see
.B Filesystem support
below.
]

So that the reader clearly understands it's on the same page.

> +.I encryption
> +is currently always
> +.B ENCODED_IOV_ENCRYPTION_NONE
> +(zero).
> +.PP
> +.I unencoded_len
> +is the length of the unencoded (i.e., decrypted and decompressed) data.
> +.I unencoded_offset
> +is the offset into the unencoded data where the data in the file begins

The above wording is a bit unclear to me.

I suggest the following:

[
.I unencoded_offset
is the offset from the begining of the file
to the first byte of the unencoded data
]

> +(less than or equal to
> +.IR unencoded_len ).
> +.I len
> +is the length of the data in the file
> +(less than or equal to
> +.I unencoded_len
> +-

Here's a question for Michael (mtk):

I've seen (many) cases where these math operations
are written without spaces,
and in the same line (e.g., [.IR a + b]).

I'd like to know your preferences on this,
or what is actually more extended in the manual pages,
to stick with only one of them.

> +.IR unencoded_offset ).
> +See
> +.B Extent layout
> +below for some examples.
> +.I

Were you maybe going to add something there?

If not, please remove that [.I].

> +.PP
> +If the unencoded data is actually longer than
> +.IR unencoded_len ,
> +then it is truncated;
> +if it is shorter, then it is extended with zeroes.
> +.PP
> +

Please, remove that blank line.

> +.BR pwritev2 ()

Should be [.BR pwritev2 (2)]

Michael (mtk),

Am I right in that?  Please, confirm.

> +uses the metadata specified in
> +.IR iov[0] ,
> +writes the encoded data from the remaining buffers,
> +and returns the number of encoded bytes written
> +(that is, the sum of
> +.I iov[n].iov_len
> +for 1 <=
> +.I n
> +<
> +.IR iovcnt ;
> +partial writes will not occur).
> +At least one encoding field must be non-zero.
> +Note that the encoded data is not validated when it is written;
> +if it is not valid (e.g., it cannot be decompressed),
> +then a subsequent read may return an error.
> +If the
> +.I offset
> +argument to
> +.BR pwritev2 ()

Same as above: specify (2).

> +is -1, then the file offset is incremented by
> +.IR len .
> +If
> +.I iov[0].iov_len
> +is less than
> +.I "sizeof(struct\ encoded_iov)"

[.I] allows spaces, so it should be:

[
.I sizeof(struct encoded_iov)
]

> +in the kernel,
> +then any fields unknown to userspace are treated as if they were zero;

s/userspace/user space/

See man-pages(7)::STYLE GUIDE::Preferred terms

> +if it is greater and any fields unknown to the kernel are non-zero,
> +then this returns -1 and sets
> +.I errno
> +to
> +.BR E2BIG .
> +.PP
> +.BR preadv2 ()

Same as above: specify (2).

> +populates the metadata in
> +.IR iov[0] ,
> +the encoded data in the remaining buffers,
> +and returns the number of encoded bytes read.
> +This will only return one extent per call.
> +This can also read data which is not encoded;
> +all encoding fields will be zero in that case.
> +If the
> +.I offset
> +argument to
> +.BR preadv2 ()

Smae as above: specify (2).

> +is -1, then the file offset is incremented by
> +.IR len .
> +If
> +.I iov[0].iov_len
> +is less than
> +.I "sizeof(struct\ encoded_iov)"

Don't need '"' nor '\', as above.

> +in the kernel and any fields unknown to userspace are non-zero,

s/userspace/user space/

> +then
> +.BR preadv2 ()

(2)

> +returns -1 and sets
> +.I errno
> +to
> +.BR E2BIG ;
> +if it is greater,
> +then any fields unknown to the kernel are returned as zero.
> +If the provided buffers are not large enough to return an entire encoded
> +extent,

Please use semantic newlines.
I haven't checked that in the text above,
so if you happen to find that there's any other line
that should also be fixed in that sense, please do so.

To understand 'semantic newlines',
please have a look at
man-pages(7)::STYLE GUIDE::Use semantic newlines

Basically, split lines at the most natural separation point,
instead of just when the line gets over the margin.

> +then
> +.BR preadv2 ()

(2)

> +returns -1 and sets
> +.I errno
> +to
> +.BR ENOBUFS .
> +.PP
> +As the filesystem page cache typically contains decoded data,
> +encoded I/O bypasses the page cache.
> +.SS Extent layout
> +By using
> +.IR len ,
> +.IR unencoded_len ,
> +and
> +.IR unencoded_offset ,
> +it is possible to refer to a subset of an unencoded extent.
> +.PP
> +In the simplest case,
> +.I len
> +is equal to
> +.I unencoded_len
> +and
> +.I unencoded_offset
> +is zero.
> +This means that the entire unencoded extent is used.
> +.PP
> +However, suppose we read 50 bytes into a file
> +which contains a single compressed extent.
> +The filesystem must still return the entire compressed extent
> +for us to be able to decompress it,
> +so
> +.I unencoded_len
> +would be the length of the entire decompressed extent.
> +However, because the read was at offset 50,
> +the first 50 bytes should be ignored.
> +Therefore,
> +.I unencoded_offset
> +would be 50,
> +and
> +.I len
> +would accordingly be
> +.IR unencoded_len\ -\ 50 .

This formats everything as I, except for the last dot.
Replace by:

[
.I unencoded
- 50.
]

Michael (mtk), same as above:
to space, or not to space?  That is the question :p

Personally, I find spaces more clear.

> +.PP
> +Additionally, suppose we want to create an encrypted file with length 500,
> +but the file is encrypted with a block cipher using a block size of 4096.
> +The unencoded data would therefore include the appropriate padding,
> +and
> +.I unencoded_len
> +would be 4096.
> +However, to represent the logical size of the file,
> +.I len
> +would be 500
> +(and
> +.I unencoded_offset
> +would be 0).
> +.PP
> +Similar situations can arise in other cases:
> +.IP * 3
> +If the filesystem pads data to the filesystem block size before compressing,
> +then compressed files with a size unaligned to the filesystem block size will
> +end with an extent with
> +.I len
> +<
> +.IR unencoded_len .
> +.IP *
> +Extents cloned from the middle of a larger encoded extent with
> +.B FICLONERANGE
> +may have a non-zero
> +.I unencoded_offset
> +and/or
> +.I len
> +<
> +.IR unencoded_len .
> +.IP *
> +If the middle of an encoded extent is overwritten,
> +the filesystem may create extents with a non-zero
> +.I unencoded_offset
> +and/or
> +.I len
> +<
> +.I unencoded_len
> +for the parts that were not overwritten.
> +.SS Security
> +Encoded I/O creates the potential for some security issues:
> +.IP * 3
> +Encoded writes allow writing arbitrary data which the kernel will decode on
> +a subsequent read. Decompression algorithms are complex and may have bugs
> +which can be exploited by maliciously crafted data.
> +.IP *
> +Encoded reads may return data which is not logically present in the file
> +(see the discussion of
> +.I len
> +vs.

Please, s/vs./vs/
See the reasons below:

Michael (mtk),

Here the renderer outputs a double space
(as for separating two sentences).

Are you okay with that?

I haven't found any other "\<vs\>\.".
However, I've found a few "\<vs\>[^\.]".

> +.I unencoded_len
> +above).
> +It may not be intended for this data to be readable.
> +.PP
> +Therefore, encoded I/O requires privilege.
> +Namely, the
> +.B RWF_ENCODED
> +flag may only be used when the file was opened with the
> +.B O_ALLOW_ENCODED
> +flag to
> +.BR open (2),
> +which requires the
> +.B CAP_SYS_ADMIN
> +capability.
> +The
> +.B O_CLOEXEC
> +flag must be specified in conjunction with
> +.BR O_ALLOW_ENCODED .
> +This avoids accidentally leaking the encoded I/O privilege
> +(it is not cleared on
> +.BR fork (2)
> +or
> +.BR execve (2)
> +otherwise).
> +If
> +.B O_ALLOW_ENCODED
> +without
> +.B O_CLOEXEC
> +is desired,
> +.B O_CLOEXEC
> +can be cleared afterwards with
> +.BR fnctl (2).
> +.BR fcntl (2)
> +can also clear or set
> +.B O_ALLOW_ENCODED
> +(including without
> +.BR O_CLOEXEC ).
> +.SS Filesystem support
> +Encoded I/O is supported on the following filesystems:
> +.TP
> +Btrfs (since Linux 5.12)
> +.IP
> +Btrfs supports encoded reads and writes of compressed data.
> +The data is encoded as follows:
> +.RS
> +.IP * 3
> +If
> +.I compression
> +is
> +.BR ENCODED_IOV_COMPRESSION_BTRFS_ZLIB ,
> +then the encoded data is a single zlib stream.
> +.IP *
> +If
> +.I compression
> +is
> +.BR ENCODED_IOV_COMPRESSION_BTRFS_ZSTD ,
> +then the encoded data is a single zstd frame compressed with the
> +.I windowLog
> +compression parameter set to no more than 17.
> +.IP *
> +If
> +.I compression
> +is one of
> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K ,
> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K ,
> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K ,
> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K ,
> +or
> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K ,
> +then the encoded data is compressed page by page
> +(using the page size indicated by the name of the constant)
> +with LZO1X
> +and wrapped in the format documented in the Linux kernel source file
> +.IR fs/btrfs/lzo.c .
> +.RE
> +.IP
> +Additionally, there are some restrictions on
> +.BR pwritev2 ():

(2)

> +.RS
> +.IP * 3
> +.I offset
> +(or the current file offset if
> +.I offset
> +is -1) must be aligned to the sector size of the filesystem.
> +.IP *
> +.I len
> +must be aligned to the sector size of the filesystem
> +unless the data ends at or beyond the current end of the file.
> +.IP *
> +.I unencoded_len
> +and the length of the encoded data must each be no more than 128 KiB.
> +This limit may increase in the future.
> +.IP *
> +The length of the encoded data must be less than or equal to
> +.IR unencoded_len .
> +.IP *
> +If using LZO, the filesystem's page size must match the compression page size.
> +.RE
> 

Please, add a SEE ALSO section, which should at least point to
preadv2(2) (or pwritev2(2), if you prefer):

[
.SH SEE ALSO
.BR preadv2 (2)
]
Alejandro Colomar (man-pages) Nov. 20, 2020, 2:06 p.m. UTC | #2
Hi Omar and Michael,

please, see below.

Thanks,

Alex

On 11/20/20 12:29 AM, Alejandro Colomar (mailing lists; readonly) wrote:
> Hi Omar,
> 
> Please, see some fixes below:
> 
> Michael, I've also some questions for you below
> (you can grep for mtk to find those).
> 
> Thanks,
> 
> Alex
> 
> On 11/18/20 8:18 PM, Omar Sandoval wrote:
>> From: Omar Sandoval <osandov@fb.com>
>>
>> This adds a new page, encoded_io(7), providing an overview of encoded
>> I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
>> reference it.
>>
>> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
>> Cc: linux-man <linux-man@vger.kernel.org>
>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>> ---
>> This feature is not yet upstream.
>>
>>  man2/fcntl.2      |  10 +-
>>  man2/open.2       |  23 +++
>>  man2/readv.2      |  70 +++++++++
>>  man7/encoded_io.7 | 369 ++++++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 471 insertions(+), 1 deletion(-)
>>  create mode 100644 man7/encoded_io.7
>>
>> diff --git a/man2/fcntl.2 b/man2/fcntl.2
>> index 546016617..b0d7fa2c3 100644
>> --- a/man2/fcntl.2
>> +++ b/man2/fcntl.2
>> @@ -221,8 +221,9 @@ On Linux, this command can change only the
>>  .BR O_ASYNC ,
>>  .BR O_DIRECT ,
>>  .BR O_NOATIME ,
>> +.BR O_NONBLOCK ,
>>  and
>> -.B O_NONBLOCK
>> +.B O_ALLOW_ENCODED
>>  flags.
>>  It is not possible to change the
>>  .BR O_DSYNC
>> @@ -1820,6 +1821,13 @@ Attempted to clear the
>>  flag on a file that has the append-only attribute set.
>>  .TP
>>  .B EPERM
>> +Attempted to set the
>> +.B O_ALLOW_ENCODED
>> +flag and the calling process did not have the
>> +.B CAP_SYS_ADMIN
>> +capability.
>> +.TP
>> +.B EPERM
>>  .I cmd
>>  was
>>  .BR F_ADD_SEALS ,
>> diff --git a/man2/open.2 b/man2/open.2
>> index f587b0d95..84697dfa8 100644
>> --- a/man2/open.2
>> +++ b/man2/open.2
>> @@ -437,6 +437,16 @@ was followed by a call to
>>  .BR fdatasync (2)).
>>  .IR "See NOTES below" .
>>  .TP
>> +.B O_ALLOW_ENCODED
> 
> The list is alphabetically sorted;
> please, follow that
> (O_ALLOW_ENCODED should be the first one).
> 
>> +Open the file with encoded I/O permissions;
>> +see
>> +.BR encoded_io (7).
>> +.B O_CLOEXEC
>> +must be specified in conjuction with this flag.
>> +The caller must have the
>> +.B CAP_SYS_ADMIN
>> +capability.
>> +.TP
>>  .B O_EXCL
>>  Ensure that this call creates the file:
>>  if this flag is specified in conjunction with
>> @@ -1082,6 +1092,14 @@ is invalid
>>  (e.g., it contains characters not permitted by the underlying filesystem).
>>  .TP
>>  .B EINVAL
>> +.B O_ALLOW_ENCODED
>> +was specified in
>> +.IR flags ,
>> +but
>> +.B O_CLOEXEC
>> +was not specified.
>> +.TP
>> +.B EINVAL
>>  The final component ("basename") of
>>  .I pathname
>>  is invalid
>> @@ -1238,6 +1256,11 @@ did not match the owner of the file and the caller was not privileged.
>>  The operation was prevented by a file seal; see
>>  .BR fcntl (2).
>>  .TP
>> +.B EPERM
>> +The
>> +.B O_ALLOW_ENCODED
>> +flag was specified, but the caller was not privileged.
>> +.TP
>>  .B EROFS
>>  .I pathname
>>  refers to a file on a read-only filesystem and write access was
>> diff --git a/man2/readv.2 b/man2/readv.2
>> index 5a8b74168..c9933acf0 100644
>> --- a/man2/readv.2
>> +++ b/man2/readv.2
>> @@ -264,6 +264,11 @@ the data is always appended to the end of the file.
>>  However, if the
>>  .I offset
>>  argument is \-1, the current file offset is updated.
>> +.TP
>> +.BR RWF_ENCODED " (since Linux 5.12)"
>> +Read or write encoded (e.g., compressed) data.
>> +See
>> +.BR encoded_io (7).
>>  .SH RETURN VALUE
>>  On success,
>>  .BR readv (),
>> @@ -283,6 +288,13 @@ than requested (see
>>  and
>>  .BR write (2)).
>>  .PP
>> +If
>> +.B
>> +RWF_ENCODED
> 
> RWF_ENCODED should go in the same line as .B:
> 
> [
> .B RWF_ENCODED
> ]
> 
>> +was specified in
>> +.IR flags ,
>> +then the return value is the number of encoded bytes.
>> +.PP
>>  On error, \-1 is returned, and \fIerrno\fP is set appropriately.
>>  .SH ERRORS
>>  The errors are as given for
>> @@ -313,6 +325,64 @@ is less than zero or greater than the permitted maximum.
>>  .TP
>>  .B EOPNOTSUPP
>>  An unknown flag is specified in \fIflags\fP.
>> +.TP
>> +.B EOPNOTSUPP
>> +.B RWF_ENCODED
>> +is specified in
>> +.I flags
>> +and the filesystem does not implement encoded I/O.
>> +.TP
>> +.B EPERM
>> +.B RWF_ENCODED
>> +is specified in
>> +.I flags
>> +and the file was not opened with the
>> +.B O_ALLOW_ENCODED
>> +flag.
>> +.PP
>> +.BR preadv2 ()
>> +can fail for the following reasons:
> 
> The wording is a bit unclear:
> 
> Above your additions (old text, not yours),
> it says that some errors apply to preadv2
> (as well as to other functions):
> 
> [
> ERRORS
>        The errors are as given for read(2) and write(2).  Furthermore,
>        preadv(),  preadv2(),  pwritev(),  and pwritev2() can also fail
>        for the same reasons as lseek(2).  Additionally, the  following
>        errors are defined:
> 
>        EINVAL The  sum  of  the  iov_len  values  overflows an ssize_t
>               value.
> 
>        EINVAL The vector count, iovcnt, is less than zero  or  greater
>               than the permitted maximum.
> 
>        EOPNOTSUPP
>               An unknown flag is specified in flags.
> 
>        EOPNOTSUPP
>               RWF_ENCODED  is  specified  in  flags and the filesystem
>               does not implement encoded I/O.
> 
>        EPERM  RWF_ENCODED is specified in flags and the file  was  not
>               opened with the O_ALLOW_ENCODED flag.
> ]
> 
> And then you added a line that says:
> 
> [
>        preadv2() can fail for the following reasons:
> ]
> 
> Which if read strictly, it says that [only] the following errors apply.
> 
> Did you mean that
> "preadv3() can _additionally_ fail for the following reasons"?
> 
> Could you please be a bit more specific?
> 
> The same applies for pwritev2() below.
> 
>> +.TP
>> +.B E2BIG
>> +.B RWF_ENCODED
>> +is specified in
>> +.I flags
>> +and
>> +.I iov[0]
>> +is not large enough to return the encoding metadata.
>> +.TP
>> +.B ENOBUFS
>> +.B RWF_ENCODED
>> +is specified in
>> +.I flags
>> +and the buffers in
>> +.I iov
>> +are not big enough to return the encoded data.
>> +.PP
>> +.BR pwritev2 ()
>> +can fail for the following reasons:
>> +.TP
>> +.B E2BIG
>> +.B RWF_ENCODED
>> +is specified in
>> +.I flags
>> +and
>> +.I iov[0]
>> +contains non-zero fields
>> +after the kernel's
>> +.IR "sizeof(struct\ encoded_iov)" .
> 
> Don't escape the space, if the string is already in "".
> 
>> +.TP
>> +.B EINVAL
>> +.B RWF_ENCODED
>> +is specified in
>> +.I flags
>> +and the encoding is unknown or not supported by the filesystem.
>> +.TP
>> +.B EINVAL
>> +.B RWF_ENCODED
>> +is specified in
>> +.I flags
>> +and the alignment and/or size requirements are not met.
>>  .SH VERSIONS
>>  .BR preadv ()
>>  and
>> diff --git a/man7/encoded_io.7 b/man7/encoded_io.7
>> new file mode 100644
>> index 000000000..106fa587b
>> --- /dev/null
>> +++ b/man7/encoded_io.7
>> @@ -0,0 +1,369 @@
>> +.\" Copyright (c) 2020 by Omar Sandoval <osandov@fb.com>
>> +.\"
>> +.\" %%%LICENSE_START(VERBATIM)
>> +.\" Permission is granted to make and distribute verbatim copies of this
>> +.\" manual provided the copyright notice and this permission notice are
>> +.\" preserved on all copies.
>> +.\"
>> +.\" Permission is granted to copy and distribute modified versions of this
>> +.\" manual under the conditions for verbatim copying, provided that the
>> +.\" entire resulting derived work is distributed under the terms of a
>> +.\" permission notice identical to this one.
>> +.\"
>> +.\" Since the Linux kernel and libraries are constantly changing, this
>> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
>> +.\" responsibility for errors or omissions, or for damages resulting from
>> +.\" the use of the information contained herein.  The author(s) may not
>> +.\" have taken the same level of care in the production of this manual,
>> +.\" which is licensed free of charge, as they might when working
>> +.\" professionally.
>> +.\"
>> +.\" Formatted or processed versions of this manual, if unaccompanied by
>> +.\" the source, must acknowledge the copyright and authors of this work.
>> +.\" %%%LICENSE_END
>> +.\"
>> +.\"
>> +.TH ENCODED_IO  7 2020-11-11 "Linux" "Linux Programmer's Manual"
>> +.SH NAME
>> +encoded_io \- overview of encoded I/O
>> +.SH DESCRIPTION
>> +Several filesystems (e.g., Btrfs) support transparent encoding
>> +(e.g., compression, encryption) of data on disk:
>> +written data is encoded by the kernel before it is written to disk,
>> +and read data is decoded before being returned to the user.
>> +In some cases, it is useful to skip this encoding step.
> 
> Here I would use ';' instead of '.'
> (and next letter would be lowercase, then).
> 
>> +For example, the user may want to read the compressed contents of a file
>> +or write pre-compressed data directly to a file.
>> +This is referred to as "encoded I/O".
>> +.SS Encoded I/O API
>> +Encoded I/O is specified with the
>> +.B RWF_ENCODED
>> +flag to
>> +.BR preadv2 (2)
>> +and
>> +.BR pwritev2 (2).
>> +If
>> +.B RWF_ENCODED
>> +is specified, then
>> +.I iov[0].iov_base
>> +points to an
>> +.I
>> +encoded_iov
> 
> On the same line, please.
> 
>> +structure, defined in
>> +.I <linux/fs.h>
>> +as:
>> +.PP
>> +.in +4n
>> +.EX
>> +struct encoded_iov {
>> +    __aligned_u64 len;
>> +    __aligned_u64 unencoded_len;
>> +    __aligned_u64 unencoded_offset;
>> +    __u32 compression;
>> +    __u32 encryption;
>> +};
>> +.EE
>> +.in
>> +.PP
>> +This may be extended in the future, so
>> +.I iov[0].iov_len
>> +must be set to
>> +.I "sizeof(struct\ encoded_iov)"
>> +for forward/backward compatibility.
>> +The remaining buffers contain the encoded data.
>> +.PP
>> +.I compression
>> +and
>> +.I encryption
>> +are the encoding fields.
>> +.I compression
>> +is
>> +.B ENCODED_IOV_COMPRESSION_NONE
>> +(zero)
>> +or a filesystem-specific
>> +.B ENCODED_IOV_COMPRESSION
> 
> Maybe s/ENCODED_IOV_COMPRESSION/ENCODED_IOV_COMPRESSION_*/

Or s/ENCODED_IOV_COMPRESSION/ENCODED_IOV_COMPRESSION_/

I'm not sure about existing practice.

Michael (mtk), what would you do here?

> 
>> +constant;
>> +see
>> +.BR Filesystem\ support .
> 
> Please, write it as [.BR "Filesystem support" .]
> 
> and maybe I would change it, to be more specific, to the following:
> 
> [
> see
> .B Filesystem support
> below.
> ]
> 
> So that the reader clearly understands it's on the same page.
> 
>> +.I encryption
>> +is currently always
>> +.B ENCODED_IOV_ENCRYPTION_NONE
>> +(zero).
>> +.PP
>> +.I unencoded_len
>> +is the length of the unencoded (i.e., decrypted and decompressed) data.
>> +.I unencoded_offset
>> +is the offset into the unencoded data where the data in the file begins
> 
> The above wording is a bit unclear to me.
> 
> I suggest the following:
> 
> [
> .I unencoded_offset
> is the offset from the begining of the file
> to the first byte of the unencoded data
> ]
> 
>> +(less than or equal to
>> +.IR unencoded_len ).
>> +.I len
>> +is the length of the data in the file
>> +(less than or equal to
>> +.I unencoded_len
>> +-
> 
> Here's a question for Michael (mtk):
> 
> I've seen (many) cases where these math operations
> are written without spaces,
> and in the same line (e.g., [.IR a + b]).
> 
> I'd like to know your preferences on this,
> or what is actually more extended in the manual pages,
> to stick with only one of them.
> 
>> +.IR unencoded_offset ).
>> +See
>> +.B Extent layout
>> +below for some examples.
>> +.I
> 
> Were you maybe going to add something there?
> 
> If not, please remove that [.I].
> 
>> +.PP
>> +If the unencoded data is actually longer than
>> +.IR unencoded_len ,
>> +then it is truncated;
>> +if it is shorter, then it is extended with zeroes.
>> +.PP
>> +
> 
> Please, remove that blank line.
> 
>> +.BR pwritev2 ()
> 
> Should be [.BR pwritev2 (2)]
> 
> Michael (mtk),
> 
> Am I right in that?  Please, confirm.
> 
>> +uses the metadata specified in
>> +.IR iov[0] ,
>> +writes the encoded data from the remaining buffers,
>> +and returns the number of encoded bytes written
>> +(that is, the sum of
>> +.I iov[n].iov_len
>> +for 1 <=
>> +.I n
>> +<
>> +.IR iovcnt ;
>> +partial writes will not occur).
>> +At least one encoding field must be non-zero.
>> +Note that the encoded data is not validated when it is written;
>> +if it is not valid (e.g., it cannot be decompressed),
>> +then a subsequent read may return an error.
>> +If the
>> +.I offset
>> +argument to
>> +.BR pwritev2 ()
> 
> Same as above: specify (2).
> 
>> +is -1, then the file offset is incremented by
>> +.IR len .
>> +If
>> +.I iov[0].iov_len
>> +is less than
>> +.I "sizeof(struct\ encoded_iov)"
> 
> [.I] allows spaces, so it should be:
> 
> [
> .I sizeof(struct encoded_iov)
> ]
> 
>> +in the kernel,
>> +then any fields unknown to userspace are treated as if they were zero;
> 
> s/userspace/user space/
> 
> See man-pages(7)::STYLE GUIDE::Preferred terms
> 
>> +if it is greater and any fields unknown to the kernel are non-zero,
>> +then this returns -1 and sets
>> +.I errno
>> +to
>> +.BR E2BIG .
>> +.PP
>> +.BR preadv2 ()
> 
> Same as above: specify (2).
> 
>> +populates the metadata in
>> +.IR iov[0] ,
>> +the encoded data in the remaining buffers,
>> +and returns the number of encoded bytes read.
>> +This will only return one extent per call.
>> +This can also read data which is not encoded;
>> +all encoding fields will be zero in that case.
>> +If the
>> +.I offset
>> +argument to
>> +.BR preadv2 ()
> 
> Smae as above: specify (2).
> 
>> +is -1, then the file offset is incremented by
>> +.IR len .
>> +If
>> +.I iov[0].iov_len
>> +is less than
>> +.I "sizeof(struct\ encoded_iov)"
> 
> Don't need '"' nor '\', as above.
> 
>> +in the kernel and any fields unknown to userspace are non-zero,
> 
> s/userspace/user space/
> 
>> +then
>> +.BR preadv2 ()
> 
> (2)
> 
>> +returns -1 and sets
>> +.I errno
>> +to
>> +.BR E2BIG ;
>> +if it is greater,
>> +then any fields unknown to the kernel are returned as zero.
>> +If the provided buffers are not large enough to return an entire encoded
>> +extent,
> 
> Please use semantic newlines.
> I haven't checked that in the text above,
> so if you happen to find that there's any other line
> that should also be fixed in that sense, please do so.
> 
> To understand 'semantic newlines',
> please have a look at
> man-pages(7)::STYLE GUIDE::Use semantic newlines
> 
> Basically, split lines at the most natural separation point,
> instead of just when the line gets over the margin.
> 
>> +then
>> +.BR preadv2 ()
> 
> (2)
> 
>> +returns -1 and sets
>> +.I errno
>> +to
>> +.BR ENOBUFS .
>> +.PP
>> +As the filesystem page cache typically contains decoded data,
>> +encoded I/O bypasses the page cache.
>> +.SS Extent layout
>> +By using
>> +.IR len ,
>> +.IR unencoded_len ,
>> +and
>> +.IR unencoded_offset ,
>> +it is possible to refer to a subset of an unencoded extent.
>> +.PP
>> +In the simplest case,
>> +.I len
>> +is equal to
>> +.I unencoded_len
>> +and
>> +.I unencoded_offset
>> +is zero.
>> +This means that the entire unencoded extent is used.
>> +.PP
>> +However, suppose we read 50 bytes into a file
>> +which contains a single compressed extent.
>> +The filesystem must still return the entire compressed extent
>> +for us to be able to decompress it,
>> +so
>> +.I unencoded_len
>> +would be the length of the entire decompressed extent.
>> +However, because the read was at offset 50,
>> +the first 50 bytes should be ignored.
>> +Therefore,
>> +.I unencoded_offset
>> +would be 50,
>> +and
>> +.I len
>> +would accordingly be
>> +.IR unencoded_len\ -\ 50 .
> 
> This formats everything as I, except for the last dot.
> Replace by:
> 
> [
> .I unencoded
> - 50.
> ]
> 
> Michael (mtk), same as above:
> to space, or not to space?  That is the question :p
> 
> Personally, I find spaces more clear.
> 
>> +.PP
>> +Additionally, suppose we want to create an encrypted file with length 500,
>> +but the file is encrypted with a block cipher using a block size of 4096.
>> +The unencoded data would therefore include the appropriate padding,
>> +and
>> +.I unencoded_len
>> +would be 4096.
>> +However, to represent the logical size of the file,
>> +.I len
>> +would be 500
>> +(and
>> +.I unencoded_offset
>> +would be 0).
>> +.PP
>> +Similar situations can arise in other cases:
>> +.IP * 3
>> +If the filesystem pads data to the filesystem block size before compressing,
>> +then compressed files with a size unaligned to the filesystem block size will
>> +end with an extent with
>> +.I len
>> +<
>> +.IR unencoded_len .
>> +.IP *
>> +Extents cloned from the middle of a larger encoded extent with
>> +.B FICLONERANGE
>> +may have a non-zero
>> +.I unencoded_offset
>> +and/or
>> +.I len
>> +<
>> +.IR unencoded_len .
>> +.IP *
>> +If the middle of an encoded extent is overwritten,
>> +the filesystem may create extents with a non-zero
>> +.I unencoded_offset
>> +and/or
>> +.I len
>> +<
>> +.I unencoded_len
>> +for the parts that were not overwritten.
>> +.SS Security
>> +Encoded I/O creates the potential for some security issues:
>> +.IP * 3
>> +Encoded writes allow writing arbitrary data which the kernel will decode on
>> +a subsequent read. Decompression algorithms are complex and may have bugs
>> +which can be exploited by maliciously crafted data.
>> +.IP *
>> +Encoded reads may return data which is not logically present in the file
>> +(see the discussion of
>> +.I len
>> +vs.
> 
> Please, s/vs./vs/
> See the reasons below:
> 
> Michael (mtk),
> 
> Here the renderer outputs a double space
> (as for separating two sentences).
> 
> Are you okay with that?
> 
> I haven't found any other "\<vs\>\.".
> However, I've found a few "\<vs\>[^\.]".
> 
>> +.I unencoded_len
>> +above).
>> +It may not be intended for this data to be readable.
>> +.PP
>> +Therefore, encoded I/O requires privilege.
>> +Namely, the
>> +.B RWF_ENCODED
>> +flag may only be used when the file was opened with the
>> +.B O_ALLOW_ENCODED
>> +flag to
>> +.BR open (2),
>> +which requires the
>> +.B CAP_SYS_ADMIN
>> +capability.
>> +The
>> +.B O_CLOEXEC
>> +flag must be specified in conjunction with
>> +.BR O_ALLOW_ENCODED .
>> +This avoids accidentally leaking the encoded I/O privilege
>> +(it is not cleared on
>> +.BR fork (2)
>> +or
>> +.BR execve (2)
>> +otherwise).
>> +If
>> +.B O_ALLOW_ENCODED
>> +without
>> +.B O_CLOEXEC
>> +is desired,
>> +.B O_CLOEXEC
>> +can be cleared afterwards with
>> +.BR fnctl (2).
>> +.BR fcntl (2)
>> +can also clear or set
>> +.B O_ALLOW_ENCODED
>> +(including without
>> +.BR O_CLOEXEC ).
>> +.SS Filesystem support
>> +Encoded I/O is supported on the following filesystems:
>> +.TP
>> +Btrfs (since Linux 5.12)
>> +.IP
>> +Btrfs supports encoded reads and writes of compressed data.
>> +The data is encoded as follows:
>> +.RS
>> +.IP * 3
>> +If
>> +.I compression
>> +is
>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_ZLIB ,
>> +then the encoded data is a single zlib stream.
>> +.IP *
>> +If
>> +.I compression
>> +is
>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_ZSTD ,
>> +then the encoded data is a single zstd frame compressed with the
>> +.I windowLog
>> +compression parameter set to no more than 17.
>> +.IP *
>> +If
>> +.I compression
>> +is one of
>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K ,
>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K ,
>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K ,
>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K ,
>> +or
>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K ,
>> +then the encoded data is compressed page by page
>> +(using the page size indicated by the name of the constant)
>> +with LZO1X
>> +and wrapped in the format documented in the Linux kernel source file
>> +.IR fs/btrfs/lzo.c .
>> +.RE
>> +.IP
>> +Additionally, there are some restrictions on
>> +.BR pwritev2 ():
> 
> (2)
> 
>> +.RS
>> +.IP * 3
>> +.I offset
>> +(or the current file offset if
>> +.I offset
>> +is -1) must be aligned to the sector size of the filesystem.
>> +.IP *
>> +.I len
>> +must be aligned to the sector size of the filesystem
>> +unless the data ends at or beyond the current end of the file.
>> +.IP *
>> +.I unencoded_len
>> +and the length of the encoded data must each be no more than 128 KiB.
>> +This limit may increase in the future.
>> +.IP *
>> +The length of the encoded data must be less than or equal to
>> +.IR unencoded_len .
>> +.IP *
>> +If using LZO, the filesystem's page size must match the compression page size.
>> +.RE
>>
> 
> Please, add a SEE ALSO section, which should at least point to
> preadv2(2) (or pwritev2(2), if you prefer):
> 
> [
> .SH SEE ALSO
> .BR preadv2 (2)
> ]
>
Alejandro Colomar (man-pages) Nov. 20, 2020, 3:03 p.m. UTC | #3
Hi Omar,

I found a wording of mine to be a bit confusing.
Please see below.

Thanks,

Alex

On 11/20/20 3:06 PM, Alejandro Colomar (man-pages) wrote:
> Hi Omar and Michael,
> 
> please, see below.
> 
> Thanks,
> 
> Alex
> 
> On 11/20/20 12:29 AM, Alejandro Colomar (mailing lists; readonly) wrote:
>> Hi Omar,
>>
>> Please, see some fixes below:
>>
>> Michael, I've also some questions for you below
>> (you can grep for mtk to find those).
>>
>> Thanks,
>>
>> Alex
>>
>> On 11/18/20 8:18 PM, Omar Sandoval wrote:
>>> From: Omar Sandoval <osandov@fb.com>
>>>
>>> This adds a new page, encoded_io(7), providing an overview of encoded
>>> I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
>>> reference it.
>>>
>>> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
>>> Cc: linux-man <linux-man@vger.kernel.org>
>>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>>> ---
>>> This feature is not yet upstream.
>>>
>>>  man2/fcntl.2      |  10 +-
>>>  man2/open.2       |  23 +++
>>>  man2/readv.2      |  70 +++++++++
>>>  man7/encoded_io.7 | 369 ++++++++++++++++++++++++++++++++++++++++++++++
>>>  4 files changed, 471 insertions(+), 1 deletion(-)
>>>  create mode 100644 man7/encoded_io.7
>>>
>>> diff --git a/man2/fcntl.2 b/man2/fcntl.2
>>> index 546016617..b0d7fa2c3 100644
>>> --- a/man2/fcntl.2
>>> +++ b/man2/fcntl.2
>>> @@ -221,8 +221,9 @@ On Linux, this command can change only the
>>>  .BR O_ASYNC ,
>>>  .BR O_DIRECT ,
>>>  .BR O_NOATIME ,
>>> +.BR O_NONBLOCK ,
>>>  and
>>> -.B O_NONBLOCK
>>> +.B O_ALLOW_ENCODED
>>>  flags.
>>>  It is not possible to change the
>>>  .BR O_DSYNC
>>> @@ -1820,6 +1821,13 @@ Attempted to clear the
>>>  flag on a file that has the append-only attribute set.
>>>  .TP
>>>  .B EPERM
>>> +Attempted to set the
>>> +.B O_ALLOW_ENCODED
>>> +flag and the calling process did not have the
>>> +.B CAP_SYS_ADMIN
>>> +capability.
>>> +.TP
>>> +.B EPERM
>>>  .I cmd
>>>  was
>>>  .BR F_ADD_SEALS ,
>>> diff --git a/man2/open.2 b/man2/open.2
>>> index f587b0d95..84697dfa8 100644
>>> --- a/man2/open.2
>>> +++ b/man2/open.2
>>> @@ -437,6 +437,16 @@ was followed by a call to
>>>  .BR fdatasync (2)).
>>>  .IR "See NOTES below" .
>>>  .TP
>>> +.B O_ALLOW_ENCODED
>>
>> The list is alphabetically sorted;
>> please, follow that
>> (O_ALLOW_ENCODED should be the first one).
>>
>>> +Open the file with encoded I/O permissions;
>>> +see
>>> +.BR encoded_io (7).
>>> +.B O_CLOEXEC
>>> +must be specified in conjuction with this flag.
>>> +The caller must have the
>>> +.B CAP_SYS_ADMIN
>>> +capability.
>>> +.TP
>>>  .B O_EXCL
>>>  Ensure that this call creates the file:
>>>  if this flag is specified in conjunction with
>>> @@ -1082,6 +1092,14 @@ is invalid
>>>  (e.g., it contains characters not permitted by the underlying filesystem).
>>>  .TP
>>>  .B EINVAL
>>> +.B O_ALLOW_ENCODED
>>> +was specified in
>>> +.IR flags ,
>>> +but
>>> +.B O_CLOEXEC
>>> +was not specified.
>>> +.TP
>>> +.B EINVAL
>>>  The final component ("basename") of
>>>  .I pathname
>>>  is invalid
>>> @@ -1238,6 +1256,11 @@ did not match the owner of the file and the caller was not privileged.
>>>  The operation was prevented by a file seal; see
>>>  .BR fcntl (2).
>>>  .TP
>>> +.B EPERM
>>> +The
>>> +.B O_ALLOW_ENCODED
>>> +flag was specified, but the caller was not privileged.
>>> +.TP
>>>  .B EROFS
>>>  .I pathname
>>>  refers to a file on a read-only filesystem and write access was
>>> diff --git a/man2/readv.2 b/man2/readv.2
>>> index 5a8b74168..c9933acf0 100644
>>> --- a/man2/readv.2
>>> +++ b/man2/readv.2
>>> @@ -264,6 +264,11 @@ the data is always appended to the end of the file.
>>>  However, if the
>>>  .I offset
>>>  argument is \-1, the current file offset is updated.
>>> +.TP
>>> +.BR RWF_ENCODED " (since Linux 5.12)"
>>> +Read or write encoded (e.g., compressed) data.
>>> +See
>>> +.BR encoded_io (7).
>>>  .SH RETURN VALUE
>>>  On success,
>>>  .BR readv (),
>>> @@ -283,6 +288,13 @@ than requested (see
>>>  and
>>>  .BR write (2)).
>>>  .PP
>>> +If
>>> +.B
>>> +RWF_ENCODED
>>
>> RWF_ENCODED should go in the same line as .B:
>>
>> [
>> .B RWF_ENCODED
>> ]
>>
>>> +was specified in
>>> +.IR flags ,
>>> +then the return value is the number of encoded bytes.
>>> +.PP
>>>  On error, \-1 is returned, and \fIerrno\fP is set appropriately.
>>>  .SH ERRORS
>>>  The errors are as given for
>>> @@ -313,6 +325,64 @@ is less than zero or greater than the permitted maximum.
>>>  .TP
>>>  .B EOPNOTSUPP
>>>  An unknown flag is specified in \fIflags\fP.
>>> +.TP
>>> +.B EOPNOTSUPP
>>> +.B RWF_ENCODED
>>> +is specified in
>>> +.I flags
>>> +and the filesystem does not implement encoded I/O.
>>> +.TP
>>> +.B EPERM
>>> +.B RWF_ENCODED
>>> +is specified in
>>> +.I flags
>>> +and the file was not opened with the
>>> +.B O_ALLOW_ENCODED
>>> +flag.
>>> +.PP
>>> +.BR preadv2 ()
>>> +can fail for the following reasons:
>>
>> The wording is a bit unclear:
>>
>> Above your additions (old text, not yours),
>> it says that some errors apply to preadv2
>> (as well as to other functions):
>>
>> [
>> ERRORS
>>        The errors are as given for read(2) and write(2).  Furthermore,
>>        preadv(),  preadv2(),  pwritev(),  and pwritev2() can also fail
>>        for the same reasons as lseek(2).  Additionally, the  following
>>        errors are defined:
>>
>>        EINVAL The  sum  of  the  iov_len  values  overflows an ssize_t
>>               value.
>>
>>        EINVAL The vector count, iovcnt, is less than zero  or  greater
>>               than the permitted maximum.
>>
>>        EOPNOTSUPP
>>               An unknown flag is specified in flags.
>>
>>        EOPNOTSUPP
>>               RWF_ENCODED  is  specified  in  flags and the filesystem
>>               does not implement encoded I/O.
>>
>>        EPERM  RWF_ENCODED is specified in flags and the file  was  not
>>               opened with the O_ALLOW_ENCODED flag.
>> ]
>>
>> And then you added a line that says:
>>
>> [
>>        preadv2() can fail for the following reasons:
>> ]
>>
>> Which if read strictly, it says that [only] the following errors apply.
>>
>> Did you mean that
>> "preadv3() can _additionally_ fail for the following reasons"?
>>
>> Could you please be a bit more specific?
>>
>> The same applies for pwritev2() below.
>>
>>> +.TP
>>> +.B E2BIG
>>> +.B RWF_ENCODED
>>> +is specified in
>>> +.I flags
>>> +and
>>> +.I iov[0]
>>> +is not large enough to return the encoding metadata.
>>> +.TP
>>> +.B ENOBUFS
>>> +.B RWF_ENCODED
>>> +is specified in
>>> +.I flags
>>> +and the buffers in
>>> +.I iov
>>> +are not big enough to return the encoded data.
>>> +.PP
>>> +.BR pwritev2 ()
>>> +can fail for the following reasons:
>>> +.TP
>>> +.B E2BIG
>>> +.B RWF_ENCODED
>>> +is specified in
>>> +.I flags
>>> +and
>>> +.I iov[0]
>>> +contains non-zero fields
>>> +after the kernel's
>>> +.IR "sizeof(struct\ encoded_iov)" .
>>
>> Don't escape the space, if the string is already in "".
>>
>>> +.TP
>>> +.B EINVAL
>>> +.B RWF_ENCODED
>>> +is specified in
>>> +.I flags
>>> +and the encoding is unknown or not supported by the filesystem.
>>> +.TP
>>> +.B EINVAL
>>> +.B RWF_ENCODED
>>> +is specified in
>>> +.I flags
>>> +and the alignment and/or size requirements are not met.
>>>  .SH VERSIONS
>>>  .BR preadv ()
>>>  and
>>> diff --git a/man7/encoded_io.7 b/man7/encoded_io.7
>>> new file mode 100644
>>> index 000000000..106fa587b
>>> --- /dev/null
>>> +++ b/man7/encoded_io.7
>>> @@ -0,0 +1,369 @@
>>> +.\" Copyright (c) 2020 by Omar Sandoval <osandov@fb.com>
>>> +.\"
>>> +.\" %%%LICENSE_START(VERBATIM)
>>> +.\" Permission is granted to make and distribute verbatim copies of this
>>> +.\" manual provided the copyright notice and this permission notice are
>>> +.\" preserved on all copies.
>>> +.\"
>>> +.\" Permission is granted to copy and distribute modified versions of this
>>> +.\" manual under the conditions for verbatim copying, provided that the
>>> +.\" entire resulting derived work is distributed under the terms of a
>>> +.\" permission notice identical to this one.
>>> +.\"
>>> +.\" Since the Linux kernel and libraries are constantly changing, this
>>> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
>>> +.\" responsibility for errors or omissions, or for damages resulting from
>>> +.\" the use of the information contained herein.  The author(s) may not
>>> +.\" have taken the same level of care in the production of this manual,
>>> +.\" which is licensed free of charge, as they might when working
>>> +.\" professionally.
>>> +.\"
>>> +.\" Formatted or processed versions of this manual, if unaccompanied by
>>> +.\" the source, must acknowledge the copyright and authors of this work.
>>> +.\" %%%LICENSE_END
>>> +.\"
>>> +.\"
>>> +.TH ENCODED_IO  7 2020-11-11 "Linux" "Linux Programmer's Manual"
>>> +.SH NAME
>>> +encoded_io \- overview of encoded I/O
>>> +.SH DESCRIPTION
>>> +Several filesystems (e.g., Btrfs) support transparent encoding
>>> +(e.g., compression, encryption) of data on disk:
>>> +written data is encoded by the kernel before it is written to disk,
>>> +and read data is decoded before being returned to the user.
>>> +In some cases, it is useful to skip this encoding step.
>>
>> Here I would use ';' instead of '.'
>> (and next letter would be lowercase, then).
>>
>>> +For example, the user may want to read the compressed contents of a file
>>> +or write pre-compressed data directly to a file.
>>> +This is referred to as "encoded I/O".
>>> +.SS Encoded I/O API
>>> +Encoded I/O is specified with the
>>> +.B RWF_ENCODED
>>> +flag to
>>> +.BR preadv2 (2)
>>> +and
>>> +.BR pwritev2 (2).
>>> +If
>>> +.B RWF_ENCODED
>>> +is specified, then
>>> +.I iov[0].iov_base
>>> +points to an
>>> +.I
>>> +encoded_iov
>>
>> On the same line, please.
>>
>>> +structure, defined in
>>> +.I <linux/fs.h>
>>> +as:
>>> +.PP
>>> +.in +4n
>>> +.EX
>>> +struct encoded_iov {
>>> +    __aligned_u64 len;
>>> +    __aligned_u64 unencoded_len;
>>> +    __aligned_u64 unencoded_offset;
>>> +    __u32 compression;
>>> +    __u32 encryption;
>>> +};
>>> +.EE
>>> +.in
>>> +.PP
>>> +This may be extended in the future, so
>>> +.I iov[0].iov_len
>>> +must be set to
>>> +.I "sizeof(struct\ encoded_iov)"
>>> +for forward/backward compatibility.
>>> +The remaining buffers contain the encoded data.
>>> +.PP
>>> +.I compression
>>> +and
>>> +.I encryption
>>> +are the encoding fields.
>>> +.I compression
>>> +is
>>> +.B ENCODED_IOV_COMPRESSION_NONE
>>> +(zero)
>>> +or a filesystem-specific
>>> +.B ENCODED_IOV_COMPRESSION
>>
>> Maybe s/ENCODED_IOV_COMPRESSION/ENCODED_IOV_COMPRESSION_*/
> 
> Or s/ENCODED_IOV_COMPRESSION/ENCODED_IOV_COMPRESSION_/
> 
> I'm not sure about existing practice.
> 
> Michael (mtk), what would you do here?
> 
>>
>>> +constant;
>>> +see
>>> +.BR Filesystem\ support .
>>
>> Please, write it as [.BR "Filesystem support" .]
>>
>> and maybe I would change it, to be more specific, to the following:
>>
>> [
>> see
>> .B Filesystem support
>> below.
>> ]
>>
>> So that the reader clearly understands it's on the same page.
>>
>>> +.I encryption
>>> +is currently always
>>> +.B ENCODED_IOV_ENCRYPTION_NONE
>>> +(zero).
>>> +.PP
>>> +.I unencoded_len
>>> +is the length of the unencoded (i.e., decrypted and decompressed) data.
>>> +.I unencoded_offset
>>> +is the offset into the unencoded data where the data in the file begins
>>
>> The above wording is a bit unclear to me.
>>
>> I suggest the following:
>>
>> [
>> .I unencoded_offset
>> is the offset from the begining of the file
>> to the first byte of the unencoded data
>> ]

Now I've read it again, and my wording was even worse than yours.
I think yours can be understood after a few reads.

However, I'll still try to reword mine to see if I add some value:

[
.I unencoded_offset
is the offset from the first byte of the unencoded data
to the first byte of logical data.
]

If you prefer yours, or a mix, that's fine.

>>
>>> +(less than or equal to
>>> +.IR unencoded_len ).
>>> +.I len
>>> +is the length of the data in the file
>>> +(less than or equal to
>>> +.I unencoded_len
>>> +-
>>
>> Here's a question for Michael (mtk):
>>
>> I've seen (many) cases where these math operations
>> are written without spaces,
>> and in the same line (e.g., [.IR a + b]).
>>
>> I'd like to know your preferences on this,
>> or what is actually more extended in the manual pages,
>> to stick with only one of them.
>>
>>> +.IR unencoded_offset ).
>>> +See
>>> +.B Extent layout
>>> +below for some examples.
>>> +.I
>>
>> Were you maybe going to add something there?
>>
>> If not, please remove that [.I].
>>
>>> +.PP
>>> +If the unencoded data is actually longer than
>>> +.IR unencoded_len ,
>>> +then it is truncated;
>>> +if it is shorter, then it is extended with zeroes.
>>> +.PP
>>> +
>>
>> Please, remove that blank line.
>>
>>> +.BR pwritev2 ()
>>
>> Should be [.BR pwritev2 (2)]
>>
>> Michael (mtk),
>>
>> Am I right in that?  Please, confirm.
>>
>>> +uses the metadata specified in
>>> +.IR iov[0] ,
>>> +writes the encoded data from the remaining buffers,
>>> +and returns the number of encoded bytes written
>>> +(that is, the sum of
>>> +.I iov[n].iov_len
>>> +for 1 <=
>>> +.I n
>>> +<
>>> +.IR iovcnt ;
>>> +partial writes will not occur).
>>> +At least one encoding field must be non-zero.
>>> +Note that the encoded data is not validated when it is written;
>>> +if it is not valid (e.g., it cannot be decompressed),
>>> +then a subsequent read may return an error.
>>> +If the
>>> +.I offset
>>> +argument to
>>> +.BR pwritev2 ()
>>
>> Same as above: specify (2).
>>
>>> +is -1, then the file offset is incremented by
>>> +.IR len .
>>> +If
>>> +.I iov[0].iov_len
>>> +is less than
>>> +.I "sizeof(struct\ encoded_iov)"
>>
>> [.I] allows spaces, so it should be:
>>
>> [
>> .I sizeof(struct encoded_iov)
>> ]
>>
>>> +in the kernel,
>>> +then any fields unknown to userspace are treated as if they were zero;
>>
>> s/userspace/user space/
>>
>> See man-pages(7)::STYLE GUIDE::Preferred terms
>>
>>> +if it is greater and any fields unknown to the kernel are non-zero,
>>> +then this returns -1 and sets
>>> +.I errno
>>> +to
>>> +.BR E2BIG .
>>> +.PP
>>> +.BR preadv2 ()
>>
>> Same as above: specify (2).
>>
>>> +populates the metadata in
>>> +.IR iov[0] ,
>>> +the encoded data in the remaining buffers,
>>> +and returns the number of encoded bytes read.
>>> +This will only return one extent per call.
>>> +This can also read data which is not encoded;
>>> +all encoding fields will be zero in that case.
>>> +If the
>>> +.I offset
>>> +argument to
>>> +.BR preadv2 ()
>>
>> Smae as above: specify (2).
>>
>>> +is -1, then the file offset is incremented by
>>> +.IR len .
>>> +If
>>> +.I iov[0].iov_len
>>> +is less than
>>> +.I "sizeof(struct\ encoded_iov)"
>>
>> Don't need '"' nor '\', as above.
>>
>>> +in the kernel and any fields unknown to userspace are non-zero,
>>
>> s/userspace/user space/
>>
>>> +then
>>> +.BR preadv2 ()
>>
>> (2)
>>
>>> +returns -1 and sets
>>> +.I errno
>>> +to
>>> +.BR E2BIG ;
>>> +if it is greater,
>>> +then any fields unknown to the kernel are returned as zero.
>>> +If the provided buffers are not large enough to return an entire encoded
>>> +extent,
>>
>> Please use semantic newlines.
>> I haven't checked that in the text above,
>> so if you happen to find that there's any other line
>> that should also be fixed in that sense, please do so.
>>
>> To understand 'semantic newlines',
>> please have a look at
>> man-pages(7)::STYLE GUIDE::Use semantic newlines
>>
>> Basically, split lines at the most natural separation point,
>> instead of just when the line gets over the margin.
>>
>>> +then
>>> +.BR preadv2 ()
>>
>> (2)
>>
>>> +returns -1 and sets
>>> +.I errno
>>> +to
>>> +.BR ENOBUFS .
>>> +.PP
>>> +As the filesystem page cache typically contains decoded data,
>>> +encoded I/O bypasses the page cache.
>>> +.SS Extent layout
>>> +By using
>>> +.IR len ,
>>> +.IR unencoded_len ,
>>> +and
>>> +.IR unencoded_offset ,
>>> +it is possible to refer to a subset of an unencoded extent.
>>> +.PP
>>> +In the simplest case,
>>> +.I len
>>> +is equal to
>>> +.I unencoded_len
>>> +and
>>> +.I unencoded_offset
>>> +is zero.
>>> +This means that the entire unencoded extent is used.
>>> +.PP
>>> +However, suppose we read 50 bytes into a file
>>> +which contains a single compressed extent.
>>> +The filesystem must still return the entire compressed extent
>>> +for us to be able to decompress it,
>>> +so
>>> +.I unencoded_len
>>> +would be the length of the entire decompressed extent.
>>> +However, because the read was at offset 50,
>>> +the first 50 bytes should be ignored.
>>> +Therefore,
>>> +.I unencoded_offset
>>> +would be 50,
>>> +and
>>> +.I len
>>> +would accordingly be
>>> +.IR unencoded_len\ -\ 50 .
>>
>> This formats everything as I, except for the last dot.
>> Replace by:
>>
>> [
>> .I unencoded
>> - 50.
>> ]
>>
>> Michael (mtk), same as above:
>> to space, or not to space?  That is the question :p
>>
>> Personally, I find spaces more clear.
>>
>>> +.PP
>>> +Additionally, suppose we want to create an encrypted file with length 500,
>>> +but the file is encrypted with a block cipher using a block size of 4096.
>>> +The unencoded data would therefore include the appropriate padding,
>>> +and
>>> +.I unencoded_len
>>> +would be 4096.
>>> +However, to represent the logical size of the file,
>>> +.I len
>>> +would be 500
>>> +(and
>>> +.I unencoded_offset
>>> +would be 0).
>>> +.PP
>>> +Similar situations can arise in other cases:
>>> +.IP * 3
>>> +If the filesystem pads data to the filesystem block size before compressing,
>>> +then compressed files with a size unaligned to the filesystem block size will
>>> +end with an extent with
>>> +.I len
>>> +<
>>> +.IR unencoded_len .
>>> +.IP *
>>> +Extents cloned from the middle of a larger encoded extent with
>>> +.B FICLONERANGE
>>> +may have a non-zero
>>> +.I unencoded_offset
>>> +and/or
>>> +.I len
>>> +<
>>> +.IR unencoded_len .
>>> +.IP *
>>> +If the middle of an encoded extent is overwritten,
>>> +the filesystem may create extents with a non-zero
>>> +.I unencoded_offset
>>> +and/or
>>> +.I len
>>> +<
>>> +.I unencoded_len
>>> +for the parts that were not overwritten.
>>> +.SS Security
>>> +Encoded I/O creates the potential for some security issues:
>>> +.IP * 3
>>> +Encoded writes allow writing arbitrary data which the kernel will decode on
>>> +a subsequent read. Decompression algorithms are complex and may have bugs
>>> +which can be exploited by maliciously crafted data.
>>> +.IP *
>>> +Encoded reads may return data which is not logically present in the file
>>> +(see the discussion of
>>> +.I len
>>> +vs.
>>
>> Please, s/vs./vs/
>> See the reasons below:
>>
>> Michael (mtk),
>>
>> Here the renderer outputs a double space
>> (as for separating two sentences).
>>
>> Are you okay with that?
>>
>> I haven't found any other "\<vs\>\.".
>> However, I've found a few "\<vs\>[^\.]".
>>
>>> +.I unencoded_len
>>> +above).
>>> +It may not be intended for this data to be readable.
>>> +.PP
>>> +Therefore, encoded I/O requires privilege.
>>> +Namely, the
>>> +.B RWF_ENCODED
>>> +flag may only be used when the file was opened with the
>>> +.B O_ALLOW_ENCODED
>>> +flag to
>>> +.BR open (2),
>>> +which requires the
>>> +.B CAP_SYS_ADMIN
>>> +capability.
>>> +The
>>> +.B O_CLOEXEC
>>> +flag must be specified in conjunction with
>>> +.BR O_ALLOW_ENCODED .
>>> +This avoids accidentally leaking the encoded I/O privilege
>>> +(it is not cleared on
>>> +.BR fork (2)
>>> +or
>>> +.BR execve (2)
>>> +otherwise).
>>> +If
>>> +.B O_ALLOW_ENCODED
>>> +without
>>> +.B O_CLOEXEC
>>> +is desired,
>>> +.B O_CLOEXEC
>>> +can be cleared afterwards with
>>> +.BR fnctl (2).
>>> +.BR fcntl (2)
>>> +can also clear or set
>>> +.B O_ALLOW_ENCODED
>>> +(including without
>>> +.BR O_CLOEXEC ).
>>> +.SS Filesystem support
>>> +Encoded I/O is supported on the following filesystems:
>>> +.TP
>>> +Btrfs (since Linux 5.12)
>>> +.IP
>>> +Btrfs supports encoded reads and writes of compressed data.
>>> +The data is encoded as follows:
>>> +.RS
>>> +.IP * 3
>>> +If
>>> +.I compression
>>> +is
>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_ZLIB ,
>>> +then the encoded data is a single zlib stream.
>>> +.IP *
>>> +If
>>> +.I compression
>>> +is
>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_ZSTD ,
>>> +then the encoded data is a single zstd frame compressed with the
>>> +.I windowLog
>>> +compression parameter set to no more than 17.
>>> +.IP *
>>> +If
>>> +.I compression
>>> +is one of
>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K ,
>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K ,
>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K ,
>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K ,
>>> +or
>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K ,
>>> +then the encoded data is compressed page by page
>>> +(using the page size indicated by the name of the constant)
>>> +with LZO1X
>>> +and wrapped in the format documented in the Linux kernel source file
>>> +.IR fs/btrfs/lzo.c .
>>> +.RE
>>> +.IP
>>> +Additionally, there are some restrictions on
>>> +.BR pwritev2 ():
>>
>> (2)
>>
>>> +.RS
>>> +.IP * 3
>>> +.I offset
>>> +(or the current file offset if
>>> +.I offset
>>> +is -1) must be aligned to the sector size of the filesystem.
>>> +.IP *
>>> +.I len
>>> +must be aligned to the sector size of the filesystem
>>> +unless the data ends at or beyond the current end of the file.
>>> +.IP *
>>> +.I unencoded_len
>>> +and the length of the encoded data must each be no more than 128 KiB.
>>> +This limit may increase in the future.
>>> +.IP *
>>> +The length of the encoded data must be less than or equal to
>>> +.IR unencoded_len .
>>> +.IP *
>>> +If using LZO, the filesystem's page size must match the compression page size.
>>> +.RE
>>>
>>
>> Please, add a SEE ALSO section, which should at least point to
>> preadv2(2) (or pwritev2(2), if you prefer):
>>
>> [
>> .SH SEE ALSO
>> .BR preadv2 (2)
>> ]
>>
Omar Sandoval Nov. 30, 2020, 7:35 p.m. UTC | #4
On Fri, Nov 20, 2020 at 04:03:44PM +0100, Alejandro Colomar (man-pages) wrote:
> Hi Omar,
> 
> I found a wording of mine to be a bit confusing.
> Please see below.
> 
> Thanks,
> 
> Alex
> 
> On 11/20/20 3:06 PM, Alejandro Colomar (man-pages) wrote:
> > Hi Omar and Michael,
> > 
> > please, see below.
> > 
> > Thanks,
> > 
> > Alex
> > 
> > On 11/20/20 12:29 AM, Alejandro Colomar (mailing lists; readonly) wrote:
> >> Hi Omar,
> >>
> >> Please, see some fixes below:
> >>
> >> Michael, I've also some questions for you below
> >> (you can grep for mtk to find those).
> >>
> >> Thanks,
> >>
> >> Alex

Thanks for the suggestions, I'll incorporate those into the next
submission.
Alejandro Colomar (man-pages) Dec. 1, 2020, 2:36 p.m. UTC | #5
Hello Michael,

Could you please have a look at a few doubts down there?
Just grep 'mtk' and you'll find them ;)

Thanks,

Alex

On 11/20/20 4:03 PM, Alejandro Colomar (man-pages) wrote:
> Hi Omar,
> 
> I found a wording of mine to be a bit confusing.
> Please see below.
> 
> Thanks,
> 
> Alex
> 
> On 11/20/20 3:06 PM, Alejandro Colomar (man-pages) wrote:
>> Hi Omar and Michael,
>>
>> please, see below.
>>
>> Thanks,
>>
>> Alex
>>
>> On 11/20/20 12:29 AM, Alejandro Colomar (mailing lists; readonly) wrote:
>>> Hi Omar,
>>>
>>> Please, see some fixes below:
>>>
>>> Michael, I've also some questions for you below
>>> (you can grep for mtk to find those).
>>>
>>> Thanks,
>>>
>>> Alex
>>>
>>> On 11/18/20 8:18 PM, Omar Sandoval wrote:
>>>> From: Omar Sandoval <osandov@fb.com>
>>>>
>>>> This adds a new page, encoded_io(7), providing an overview of encoded
>>>> I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
>>>> reference it.
>>>>
>>>> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
>>>> Cc: linux-man <linux-man@vger.kernel.org>
>>>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>>>> ---
>>>> This feature is not yet upstream.
>>>>
>>>>  man2/fcntl.2      |  10 +-
>>>>  man2/open.2       |  23 +++
>>>>  man2/readv.2      |  70 +++++++++
>>>>  man7/encoded_io.7 | 369 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>  4 files changed, 471 insertions(+), 1 deletion(-)
>>>>  create mode 100644 man7/encoded_io.7
>>>>
>>>> diff --git a/man2/fcntl.2 b/man2/fcntl.2
>>>> index 546016617..b0d7fa2c3 100644
>>>> --- a/man2/fcntl.2
>>>> +++ b/man2/fcntl.2
>>>> @@ -221,8 +221,9 @@ On Linux, this command can change only the
>>>>  .BR O_ASYNC ,
>>>>  .BR O_DIRECT ,
>>>>  .BR O_NOATIME ,
>>>> +.BR O_NONBLOCK ,
>>>>  and
>>>> -.B O_NONBLOCK
>>>> +.B O_ALLOW_ENCODED
>>>>  flags.
>>>>  It is not possible to change the
>>>>  .BR O_DSYNC
>>>> @@ -1820,6 +1821,13 @@ Attempted to clear the
>>>>  flag on a file that has the append-only attribute set.
>>>>  .TP
>>>>  .B EPERM
>>>> +Attempted to set the
>>>> +.B O_ALLOW_ENCODED
>>>> +flag and the calling process did not have the
>>>> +.B CAP_SYS_ADMIN
>>>> +capability.
>>>> +.TP
>>>> +.B EPERM
>>>>  .I cmd
>>>>  was
>>>>  .BR F_ADD_SEALS ,
>>>> diff --git a/man2/open.2 b/man2/open.2
>>>> index f587b0d95..84697dfa8 100644
>>>> --- a/man2/open.2
>>>> +++ b/man2/open.2
>>>> @@ -437,6 +437,16 @@ was followed by a call to
>>>>  .BR fdatasync (2)).
>>>>  .IR "See NOTES below" .
>>>>  .TP
>>>> +.B O_ALLOW_ENCODED
>>>
>>> The list is alphabetically sorted;
>>> please, follow that
>>> (O_ALLOW_ENCODED should be the first one).
>>>
>>>> +Open the file with encoded I/O permissions;
>>>> +see
>>>> +.BR encoded_io (7).
>>>> +.B O_CLOEXEC
>>>> +must be specified in conjuction with this flag.
>>>> +The caller must have the
>>>> +.B CAP_SYS_ADMIN
>>>> +capability.
>>>> +.TP
>>>>  .B O_EXCL
>>>>  Ensure that this call creates the file:
>>>>  if this flag is specified in conjunction with
>>>> @@ -1082,6 +1092,14 @@ is invalid
>>>>  (e.g., it contains characters not permitted by the underlying filesystem).
>>>>  .TP
>>>>  .B EINVAL
>>>> +.B O_ALLOW_ENCODED
>>>> +was specified in
>>>> +.IR flags ,
>>>> +but
>>>> +.B O_CLOEXEC
>>>> +was not specified.
>>>> +.TP
>>>> +.B EINVAL
>>>>  The final component ("basename") of
>>>>  .I pathname
>>>>  is invalid
>>>> @@ -1238,6 +1256,11 @@ did not match the owner of the file and the caller was not privileged.
>>>>  The operation was prevented by a file seal; see
>>>>  .BR fcntl (2).
>>>>  .TP
>>>> +.B EPERM
>>>> +The
>>>> +.B O_ALLOW_ENCODED
>>>> +flag was specified, but the caller was not privileged.
>>>> +.TP
>>>>  .B EROFS
>>>>  .I pathname
>>>>  refers to a file on a read-only filesystem and write access was
>>>> diff --git a/man2/readv.2 b/man2/readv.2
>>>> index 5a8b74168..c9933acf0 100644
>>>> --- a/man2/readv.2
>>>> +++ b/man2/readv.2
>>>> @@ -264,6 +264,11 @@ the data is always appended to the end of the file.
>>>>  However, if the
>>>>  .I offset
>>>>  argument is \-1, the current file offset is updated.
>>>> +.TP
>>>> +.BR RWF_ENCODED " (since Linux 5.12)"
>>>> +Read or write encoded (e.g., compressed) data.
>>>> +See
>>>> +.BR encoded_io (7).
>>>>  .SH RETURN VALUE
>>>>  On success,
>>>>  .BR readv (),
>>>> @@ -283,6 +288,13 @@ than requested (see
>>>>  and
>>>>  .BR write (2)).
>>>>  .PP
>>>> +If
>>>> +.B
>>>> +RWF_ENCODED
>>>
>>> RWF_ENCODED should go in the same line as .B:
>>>
>>> [
>>> .B RWF_ENCODED
>>> ]
>>>
>>>> +was specified in
>>>> +.IR flags ,
>>>> +then the return value is the number of encoded bytes.
>>>> +.PP
>>>>  On error, \-1 is returned, and \fIerrno\fP is set appropriately.
>>>>  .SH ERRORS
>>>>  The errors are as given for
>>>> @@ -313,6 +325,64 @@ is less than zero or greater than the permitted maximum.
>>>>  .TP
>>>>  .B EOPNOTSUPP
>>>>  An unknown flag is specified in \fIflags\fP.
>>>> +.TP
>>>> +.B EOPNOTSUPP
>>>> +.B RWF_ENCODED
>>>> +is specified in
>>>> +.I flags
>>>> +and the filesystem does not implement encoded I/O.
>>>> +.TP
>>>> +.B EPERM
>>>> +.B RWF_ENCODED
>>>> +is specified in
>>>> +.I flags
>>>> +and the file was not opened with the
>>>> +.B O_ALLOW_ENCODED
>>>> +flag.
>>>> +.PP
>>>> +.BR preadv2 ()
>>>> +can fail for the following reasons:
>>>
>>> The wording is a bit unclear:
>>>
>>> Above your additions (old text, not yours),
>>> it says that some errors apply to preadv2
>>> (as well as to other functions):
>>>
>>> [
>>> ERRORS
>>>        The errors are as given for read(2) and write(2).  Furthermore,
>>>        preadv(),  preadv2(),  pwritev(),  and pwritev2() can also fail
>>>        for the same reasons as lseek(2).  Additionally, the  following
>>>        errors are defined:
>>>
>>>        EINVAL The  sum  of  the  iov_len  values  overflows an ssize_t
>>>               value.
>>>
>>>        EINVAL The vector count, iovcnt, is less than zero  or  greater
>>>               than the permitted maximum.
>>>
>>>        EOPNOTSUPP
>>>               An unknown flag is specified in flags.
>>>
>>>        EOPNOTSUPP
>>>               RWF_ENCODED  is  specified  in  flags and the filesystem
>>>               does not implement encoded I/O.
>>>
>>>        EPERM  RWF_ENCODED is specified in flags and the file  was  not
>>>               opened with the O_ALLOW_ENCODED flag.
>>> ]
>>>
>>> And then you added a line that says:
>>>
>>> [
>>>        preadv2() can fail for the following reasons:
>>> ]
>>>
>>> Which if read strictly, it says that [only] the following errors apply.
>>>
>>> Did you mean that
>>> "preadv3() can _additionally_ fail for the following reasons"?
>>>
>>> Could you please be a bit more specific?
>>>
>>> The same applies for pwritev2() below.
>>>
>>>> +.TP
>>>> +.B E2BIG
>>>> +.B RWF_ENCODED
>>>> +is specified in
>>>> +.I flags
>>>> +and
>>>> +.I iov[0]
>>>> +is not large enough to return the encoding metadata.
>>>> +.TP
>>>> +.B ENOBUFS
>>>> +.B RWF_ENCODED
>>>> +is specified in
>>>> +.I flags
>>>> +and the buffers in
>>>> +.I iov
>>>> +are not big enough to return the encoded data.
>>>> +.PP
>>>> +.BR pwritev2 ()
>>>> +can fail for the following reasons:
>>>> +.TP
>>>> +.B E2BIG
>>>> +.B RWF_ENCODED
>>>> +is specified in
>>>> +.I flags
>>>> +and
>>>> +.I iov[0]
>>>> +contains non-zero fields
>>>> +after the kernel's
>>>> +.IR "sizeof(struct\ encoded_iov)" .
>>>
>>> Don't escape the space, if the string is already in "".
>>>
>>>> +.TP
>>>> +.B EINVAL
>>>> +.B RWF_ENCODED
>>>> +is specified in
>>>> +.I flags
>>>> +and the encoding is unknown or not supported by the filesystem.
>>>> +.TP
>>>> +.B EINVAL
>>>> +.B RWF_ENCODED
>>>> +is specified in
>>>> +.I flags
>>>> +and the alignment and/or size requirements are not met.
>>>>  .SH VERSIONS
>>>>  .BR preadv ()
>>>>  and
>>>> diff --git a/man7/encoded_io.7 b/man7/encoded_io.7
>>>> new file mode 100644
>>>> index 000000000..106fa587b
>>>> --- /dev/null
>>>> +++ b/man7/encoded_io.7
>>>> @@ -0,0 +1,369 @@
>>>> +.\" Copyright (c) 2020 by Omar Sandoval <osandov@fb.com>
>>>> +.\"
>>>> +.\" %%%LICENSE_START(VERBATIM)
>>>> +.\" Permission is granted to make and distribute verbatim copies of this
>>>> +.\" manual provided the copyright notice and this permission notice are
>>>> +.\" preserved on all copies.
>>>> +.\"
>>>> +.\" Permission is granted to copy and distribute modified versions of this
>>>> +.\" manual under the conditions for verbatim copying, provided that the
>>>> +.\" entire resulting derived work is distributed under the terms of a
>>>> +.\" permission notice identical to this one.
>>>> +.\"
>>>> +.\" Since the Linux kernel and libraries are constantly changing, this
>>>> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
>>>> +.\" responsibility for errors or omissions, or for damages resulting from
>>>> +.\" the use of the information contained herein.  The author(s) may not
>>>> +.\" have taken the same level of care in the production of this manual,
>>>> +.\" which is licensed free of charge, as they might when working
>>>> +.\" professionally.
>>>> +.\"
>>>> +.\" Formatted or processed versions of this manual, if unaccompanied by
>>>> +.\" the source, must acknowledge the copyright and authors of this work.
>>>> +.\" %%%LICENSE_END
>>>> +.\"
>>>> +.\"
>>>> +.TH ENCODED_IO  7 2020-11-11 "Linux" "Linux Programmer's Manual"
>>>> +.SH NAME
>>>> +encoded_io \- overview of encoded I/O
>>>> +.SH DESCRIPTION
>>>> +Several filesystems (e.g., Btrfs) support transparent encoding
>>>> +(e.g., compression, encryption) of data on disk:
>>>> +written data is encoded by the kernel before it is written to disk,
>>>> +and read data is decoded before being returned to the user.
>>>> +In some cases, it is useful to skip this encoding step.
>>>
>>> Here I would use ';' instead of '.'
>>> (and next letter would be lowercase, then).
>>>
>>>> +For example, the user may want to read the compressed contents of a file
>>>> +or write pre-compressed data directly to a file.
>>>> +This is referred to as "encoded I/O".
>>>> +.SS Encoded I/O API
>>>> +Encoded I/O is specified with the
>>>> +.B RWF_ENCODED
>>>> +flag to
>>>> +.BR preadv2 (2)
>>>> +and
>>>> +.BR pwritev2 (2).
>>>> +If
>>>> +.B RWF_ENCODED
>>>> +is specified, then
>>>> +.I iov[0].iov_base
>>>> +points to an
>>>> +.I
>>>> +encoded_iov
>>>
>>> On the same line, please.
>>>
>>>> +structure, defined in
>>>> +.I <linux/fs.h>
>>>> +as:
>>>> +.PP
>>>> +.in +4n
>>>> +.EX
>>>> +struct encoded_iov {
>>>> +    __aligned_u64 len;
>>>> +    __aligned_u64 unencoded_len;
>>>> +    __aligned_u64 unencoded_offset;
>>>> +    __u32 compression;
>>>> +    __u32 encryption;
>>>> +};
>>>> +.EE
>>>> +.in
>>>> +.PP
>>>> +This may be extended in the future, so
>>>> +.I iov[0].iov_len
>>>> +must be set to
>>>> +.I "sizeof(struct\ encoded_iov)"
>>>> +for forward/backward compatibility.
>>>> +The remaining buffers contain the encoded data.
>>>> +.PP
>>>> +.I compression
>>>> +and
>>>> +.I encryption
>>>> +are the encoding fields.
>>>> +.I compression
>>>> +is
>>>> +.B ENCODED_IOV_COMPRESSION_NONE
>>>> +(zero)
>>>> +or a filesystem-specific
>>>> +.B ENCODED_IOV_COMPRESSION
>>>
>>> Maybe s/ENCODED_IOV_COMPRESSION/ENCODED_IOV_COMPRESSION_*/
>>
>> Or s/ENCODED_IOV_COMPRESSION/ENCODED_IOV_COMPRESSION_/
>>
>> I'm not sure about existing practice.
>>
>> Michael (mtk), what would you do here?
>>
>>>
>>>> +constant;
>>>> +see
>>>> +.BR Filesystem\ support .
>>>
>>> Please, write it as [.BR "Filesystem support" .]
>>>
>>> and maybe I would change it, to be more specific, to the following:
>>>
>>> [
>>> see
>>> .B Filesystem support
>>> below.
>>> ]
>>>
>>> So that the reader clearly understands it's on the same page.
>>>
>>>> +.I encryption
>>>> +is currently always
>>>> +.B ENCODED_IOV_ENCRYPTION_NONE
>>>> +(zero).
>>>> +.PP
>>>> +.I unencoded_len
>>>> +is the length of the unencoded (i.e., decrypted and decompressed) data.
>>>> +.I unencoded_offset
>>>> +is the offset into the unencoded data where the data in the file begins
>>>
>>> The above wording is a bit unclear to me.
>>>
>>> I suggest the following:
>>>
>>> [
>>> .I unencoded_offset
>>> is the offset from the begining of the file
>>> to the first byte of the unencoded data
>>> ]
> 
> Now I've read it again, and my wording was even worse than yours.
> I think yours can be understood after a few reads.
> 
> However, I'll still try to reword mine to see if I add some value:
> 
> [
> .I unencoded_offset
> is the offset from the first byte of the unencoded data
> to the first byte of logical data.
> ]
> 
> If you prefer yours, or a mix, that's fine.
> 
>>>
>>>> +(less than or equal to
>>>> +.IR unencoded_len ).
>>>> +.I len
>>>> +is the length of the data in the file
>>>> +(less than or equal to
>>>> +.I unencoded_len
>>>> +-
>>>
>>> Here's a question for Michael (mtk):
>>>
>>> I've seen (many) cases where these math operations
>>> are written without spaces,
>>> and in the same line (e.g., [.IR a + b]).
>>>
>>> I'd like to know your preferences on this,
>>> or what is actually more extended in the manual pages,
>>> to stick with only one of them.
>>>
>>>> +.IR unencoded_offset ).
>>>> +See
>>>> +.B Extent layout
>>>> +below for some examples.
>>>> +.I
>>>
>>> Were you maybe going to add something there?
>>>
>>> If not, please remove that [.I].
>>>
>>>> +.PP
>>>> +If the unencoded data is actually longer than
>>>> +.IR unencoded_len ,
>>>> +then it is truncated;
>>>> +if it is shorter, then it is extended with zeroes.
>>>> +.PP
>>>> +
>>>
>>> Please, remove that blank line.
>>>
>>>> +.BR pwritev2 ()
>>>
>>> Should be [.BR pwritev2 (2)]
>>>
>>> Michael (mtk),
>>>
>>> Am I right in that?  Please, confirm.
>>>
>>>> +uses the metadata specified in
>>>> +.IR iov[0] ,
>>>> +writes the encoded data from the remaining buffers,
>>>> +and returns the number of encoded bytes written
>>>> +(that is, the sum of
>>>> +.I iov[n].iov_len
>>>> +for 1 <=
>>>> +.I n
>>>> +<
>>>> +.IR iovcnt ;
>>>> +partial writes will not occur).
>>>> +At least one encoding field must be non-zero.
>>>> +Note that the encoded data is not validated when it is written;
>>>> +if it is not valid (e.g., it cannot be decompressed),
>>>> +then a subsequent read may return an error.
>>>> +If the
>>>> +.I offset
>>>> +argument to
>>>> +.BR pwritev2 ()
>>>
>>> Same as above: specify (2).
>>>
>>>> +is -1, then the file offset is incremented by
>>>> +.IR len .
>>>> +If
>>>> +.I iov[0].iov_len
>>>> +is less than
>>>> +.I "sizeof(struct\ encoded_iov)"
>>>
>>> [.I] allows spaces, so it should be:
>>>
>>> [
>>> .I sizeof(struct encoded_iov)
>>> ]
>>>
>>>> +in the kernel,
>>>> +then any fields unknown to userspace are treated as if they were zero;
>>>
>>> s/userspace/user space/
>>>
>>> See man-pages(7)::STYLE GUIDE::Preferred terms
>>>
>>>> +if it is greater and any fields unknown to the kernel are non-zero,
>>>> +then this returns -1 and sets
>>>> +.I errno
>>>> +to
>>>> +.BR E2BIG .
>>>> +.PP
>>>> +.BR preadv2 ()
>>>
>>> Same as above: specify (2).
>>>
>>>> +populates the metadata in
>>>> +.IR iov[0] ,
>>>> +the encoded data in the remaining buffers,
>>>> +and returns the number of encoded bytes read.
>>>> +This will only return one extent per call.
>>>> +This can also read data which is not encoded;
>>>> +all encoding fields will be zero in that case.
>>>> +If the
>>>> +.I offset
>>>> +argument to
>>>> +.BR preadv2 ()
>>>
>>> Smae as above: specify (2).
>>>
>>>> +is -1, then the file offset is incremented by
>>>> +.IR len .
>>>> +If
>>>> +.I iov[0].iov_len
>>>> +is less than
>>>> +.I "sizeof(struct\ encoded_iov)"
>>>
>>> Don't need '"' nor '\', as above.
>>>
>>>> +in the kernel and any fields unknown to userspace are non-zero,
>>>
>>> s/userspace/user space/
>>>
>>>> +then
>>>> +.BR preadv2 ()
>>>
>>> (2)
>>>
>>>> +returns -1 and sets
>>>> +.I errno
>>>> +to
>>>> +.BR E2BIG ;
>>>> +if it is greater,
>>>> +then any fields unknown to the kernel are returned as zero.
>>>> +If the provided buffers are not large enough to return an entire encoded
>>>> +extent,
>>>
>>> Please use semantic newlines.
>>> I haven't checked that in the text above,
>>> so if you happen to find that there's any other line
>>> that should also be fixed in that sense, please do so.
>>>
>>> To understand 'semantic newlines',
>>> please have a look at
>>> man-pages(7)::STYLE GUIDE::Use semantic newlines
>>>
>>> Basically, split lines at the most natural separation point,
>>> instead of just when the line gets over the margin.
>>>
>>>> +then
>>>> +.BR preadv2 ()
>>>
>>> (2)
>>>
>>>> +returns -1 and sets
>>>> +.I errno
>>>> +to
>>>> +.BR ENOBUFS .
>>>> +.PP
>>>> +As the filesystem page cache typically contains decoded data,
>>>> +encoded I/O bypasses the page cache.
>>>> +.SS Extent layout
>>>> +By using
>>>> +.IR len ,
>>>> +.IR unencoded_len ,
>>>> +and
>>>> +.IR unencoded_offset ,
>>>> +it is possible to refer to a subset of an unencoded extent.
>>>> +.PP
>>>> +In the simplest case,
>>>> +.I len
>>>> +is equal to
>>>> +.I unencoded_len
>>>> +and
>>>> +.I unencoded_offset
>>>> +is zero.
>>>> +This means that the entire unencoded extent is used.
>>>> +.PP
>>>> +However, suppose we read 50 bytes into a file
>>>> +which contains a single compressed extent.
>>>> +The filesystem must still return the entire compressed extent
>>>> +for us to be able to decompress it,
>>>> +so
>>>> +.I unencoded_len
>>>> +would be the length of the entire decompressed extent.
>>>> +However, because the read was at offset 50,
>>>> +the first 50 bytes should be ignored.
>>>> +Therefore,
>>>> +.I unencoded_offset
>>>> +would be 50,
>>>> +and
>>>> +.I len
>>>> +would accordingly be
>>>> +.IR unencoded_len\ -\ 50 .
>>>
>>> This formats everything as I, except for the last dot.
>>> Replace by:
>>>
>>> [
>>> .I unencoded
>>> - 50.
>>> ]
>>>
>>> Michael (mtk), same as above:
>>> to space, or not to space?  That is the question :p
>>>
>>> Personally, I find spaces more clear.
>>>
>>>> +.PP
>>>> +Additionally, suppose we want to create an encrypted file with length 500,
>>>> +but the file is encrypted with a block cipher using a block size of 4096.
>>>> +The unencoded data would therefore include the appropriate padding,
>>>> +and
>>>> +.I unencoded_len
>>>> +would be 4096.
>>>> +However, to represent the logical size of the file,
>>>> +.I len
>>>> +would be 500
>>>> +(and
>>>> +.I unencoded_offset
>>>> +would be 0).
>>>> +.PP
>>>> +Similar situations can arise in other cases:
>>>> +.IP * 3
>>>> +If the filesystem pads data to the filesystem block size before compressing,
>>>> +then compressed files with a size unaligned to the filesystem block size will
>>>> +end with an extent with
>>>> +.I len
>>>> +<
>>>> +.IR unencoded_len .
>>>> +.IP *
>>>> +Extents cloned from the middle of a larger encoded extent with
>>>> +.B FICLONERANGE
>>>> +may have a non-zero
>>>> +.I unencoded_offset
>>>> +and/or
>>>> +.I len
>>>> +<
>>>> +.IR unencoded_len .
>>>> +.IP *
>>>> +If the middle of an encoded extent is overwritten,
>>>> +the filesystem may create extents with a non-zero
>>>> +.I unencoded_offset
>>>> +and/or
>>>> +.I len
>>>> +<
>>>> +.I unencoded_len
>>>> +for the parts that were not overwritten.
>>>> +.SS Security
>>>> +Encoded I/O creates the potential for some security issues:
>>>> +.IP * 3
>>>> +Encoded writes allow writing arbitrary data which the kernel will decode on
>>>> +a subsequent read. Decompression algorithms are complex and may have bugs
>>>> +which can be exploited by maliciously crafted data.
>>>> +.IP *
>>>> +Encoded reads may return data which is not logically present in the file
>>>> +(see the discussion of
>>>> +.I len
>>>> +vs.
>>>
>>> Please, s/vs./vs/
>>> See the reasons below:
>>>
>>> Michael (mtk),
>>>
>>> Here the renderer outputs a double space
>>> (as for separating two sentences).
>>>
>>> Are you okay with that?
>>>
>>> I haven't found any other "\<vs\>\.".
>>> However, I've found a few "\<vs\>[^\.]".
>>>
>>>> +.I unencoded_len
>>>> +above).
>>>> +It may not be intended for this data to be readable.
>>>> +.PP
>>>> +Therefore, encoded I/O requires privilege.
>>>> +Namely, the
>>>> +.B RWF_ENCODED
>>>> +flag may only be used when the file was opened with the
>>>> +.B O_ALLOW_ENCODED
>>>> +flag to
>>>> +.BR open (2),
>>>> +which requires the
>>>> +.B CAP_SYS_ADMIN
>>>> +capability.
>>>> +The
>>>> +.B O_CLOEXEC
>>>> +flag must be specified in conjunction with
>>>> +.BR O_ALLOW_ENCODED .
>>>> +This avoids accidentally leaking the encoded I/O privilege
>>>> +(it is not cleared on
>>>> +.BR fork (2)
>>>> +or
>>>> +.BR execve (2)
>>>> +otherwise).
>>>> +If
>>>> +.B O_ALLOW_ENCODED
>>>> +without
>>>> +.B O_CLOEXEC
>>>> +is desired,
>>>> +.B O_CLOEXEC
>>>> +can be cleared afterwards with
>>>> +.BR fnctl (2).
>>>> +.BR fcntl (2)
>>>> +can also clear or set
>>>> +.B O_ALLOW_ENCODED
>>>> +(including without
>>>> +.BR O_CLOEXEC ).
>>>> +.SS Filesystem support
>>>> +Encoded I/O is supported on the following filesystems:
>>>> +.TP
>>>> +Btrfs (since Linux 5.12)
>>>> +.IP
>>>> +Btrfs supports encoded reads and writes of compressed data.
>>>> +The data is encoded as follows:
>>>> +.RS
>>>> +.IP * 3
>>>> +If
>>>> +.I compression
>>>> +is
>>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_ZLIB ,
>>>> +then the encoded data is a single zlib stream.
>>>> +.IP *
>>>> +If
>>>> +.I compression
>>>> +is
>>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_ZSTD ,
>>>> +then the encoded data is a single zstd frame compressed with the
>>>> +.I windowLog
>>>> +compression parameter set to no more than 17.
>>>> +.IP *
>>>> +If
>>>> +.I compression
>>>> +is one of
>>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K ,
>>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K ,
>>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K ,
>>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K ,
>>>> +or
>>>> +.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K ,
>>>> +then the encoded data is compressed page by page
>>>> +(using the page size indicated by the name of the constant)
>>>> +with LZO1X
>>>> +and wrapped in the format documented in the Linux kernel source file
>>>> +.IR fs/btrfs/lzo.c .
>>>> +.RE
>>>> +.IP
>>>> +Additionally, there are some restrictions on
>>>> +.BR pwritev2 ():
>>>
>>> (2)
>>>
>>>> +.RS
>>>> +.IP * 3
>>>> +.I offset
>>>> +(or the current file offset if
>>>> +.I offset
>>>> +is -1) must be aligned to the sector size of the filesystem.
>>>> +.IP *
>>>> +.I len
>>>> +must be aligned to the sector size of the filesystem
>>>> +unless the data ends at or beyond the current end of the file.
>>>> +.IP *
>>>> +.I unencoded_len
>>>> +and the length of the encoded data must each be no more than 128 KiB.
>>>> +This limit may increase in the future.
>>>> +.IP *
>>>> +The length of the encoded data must be less than or equal to
>>>> +.IR unencoded_len .
>>>> +.IP *
>>>> +If using LZO, the filesystem's page size must match the compression page size.
>>>> +.RE
>>>>
>>>
>>> Please, add a SEE ALSO section, which should at least point to
>>> preadv2(2) (or pwritev2(2), if you prefer):
>>>
>>> [
>>> .SH SEE ALSO
>>> .BR preadv2 (2)
>>> ]
>>>
Michael Kerrisk (man-pages) Dec. 1, 2020, 8:12 p.m. UTC | #6
Hello Alex,

On 11/20/20 4:03 PM, Alejandro Colomar (man-pages) wrote:
> Hi Omar,
> 
> I found a wording of mine to be a bit confusing.
> Please see below.
> 
> Thanks,
> 
> Alex
> 
> On 11/20/20 3:06 PM, Alejandro Colomar (man-pages) wrote:
>> Hi Omar and Michael,
>>
>> please, see below.
>>
>> Thanks,
>>
>> Alex
>>
>> On 11/20/20 12:29 AM, Alejandro Colomar (mailing lists; readonly) wrote:
>>> Hi Omar,
>>>
>>> Please, see some fixes below:
>>>
>>> Michael, I've also some questions for you below
>>> (you can grep for mtk to find those).
>>>
>>> Thanks,
>>>
>>> Alex
>>>
>>> On 11/18/20 8:18 PM, Omar Sandoval wrote:
>>>> From: Omar Sandoval <osandov@fb.com>
>>>>
>>>> This adds a new page, encoded_io(7), providing an overview of encoded
>>>> I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
>>>> reference it.
>>>>
>>>> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
>>>> Cc: linux-man <linux-man@vger.kernel.org>
>>>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>>>> ---
>>>> This feature is not yet upstream.
>>>>
>>>>  man2/fcntl.2      |  10 +-
>>>>  man2/open.2       |  23 +++
>>>>  man2/readv.2      |  70 +++++++++
>>>>  man7/encoded_io.7 | 369 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>  4 files changed, 471 insertions(+), 1 deletion(-)
>>>>  create mode 100644 man7/encoded_io.7
>>>>
>>>> diff --git a/man2/fcntl.2 b/man2/fcntl.2
>>>> index 546016617..b0d7fa2c3 100644
>>>> --- a/man2/fcntl.2
>>>> +++ b/man2/fcntl.2
>>>> @@ -221,8 +221,9 @@ On Linux, this command can change only the

[...]

>>>> +.PP
>>>> +This may be extended in the future, so
>>>> +.I iov[0].iov_len
>>>> +must be set to
>>>> +.I "sizeof(struct\ encoded_iov)"
>>>> +for forward/backward compatibility.
>>>> +The remaining buffers contain the encoded data.
>>>> +.PP
>>>> +.I compression
>>>> +and
>>>> +.I encryption
>>>> +are the encoding fields.
>>>> +.I compression
>>>> +is
>>>> +.B ENCODED_IOV_COMPRESSION_NONE
>>>> +(zero)
>>>> +or a filesystem-specific
>>>> +.B ENCODED_IOV_COMPRESSION
>>>
>>> Maybe s/ENCODED_IOV_COMPRESSION/ENCODED_IOV_COMPRESSION_*/
>>
>> Or s/ENCODED_IOV_COMPRESSION/ENCODED_IOV_COMPRESSION_/
>>
>> I'm not sure about existing practice.
>>
>> Michael (mtk), what would you do here?

I think I've tended towards the former
(ENCODED_IOV_COMPRESSION_*) in the past.

>>
>>>
>>>> +constant;
>>>> +see
>>>> +.BR Filesystem\ support .
>>>
>>> Please, write it as [.BR "Filesystem support" .]
>>>
>>> and maybe I would change it, to be more specific, to the following:
>>>
>>> [
>>> see
>>> .B Filesystem support
>>> below.
>>> ]
>>>
>>> So that the reader clearly understands it's on the same page.
>>>
>>>> +.I encryption
>>>> +is currently always
>>>> +.B ENCODED_IOV_ENCRYPTION_NONE
>>>> +(zero).
>>>> +.PP
>>>> +.I unencoded_len
>>>> +is the length of the unencoded (i.e., decrypted and decompressed) data.
>>>> +.I unencoded_offset
>>>> +is the offset into the unencoded data where the data in the file begins
>>>
>>> The above wording is a bit unclear to me.
>>>
>>> I suggest the following:
>>>
>>> [
>>> .I unencoded_offset
>>> is the offset from the begining of the file
>>> to the first byte of the unencoded data
>>> ]
> 
> Now I've read it again, and my wording was even worse than yours.
> I think yours can be understood after a few reads.
> 
> However, I'll still try to reword mine to see if I add some value:
> 
> [
> .I unencoded_offset
> is the offset from the first byte of the unencoded data
> to the first byte of logical data.
> ]
> 
> If you prefer yours, or a mix, that's fine.
> 
>>>
>>>> +(less than or equal to
>>>> +.IR unencoded_len ).
>>>> +.I len
>>>> +is the length of the data in the file
>>>> +(less than or equal to
>>>> +.I unencoded_len
>>>> +-
>>>
>>> Here's a question for Michael (mtk):
>>>
>>> I've seen (many) cases where these math operations
>>> are written without spaces,
>>> and in the same line (e.g., [.IR a + b]).
>>>
>>> I'd like to know your preferences on this,
>>> or what is actually more extended in the manual pages,
>>> to stick with only one of them.

I suspect there's a lot of inconsistency across pages. For simple
cases like this, I think writing it without spaces is fine, and
perhaps even preferable.

>>>
>>>> +.IR unencoded_offset ).
>>>> +See
>>>> +.B Extent layout
>>>> +below for some examples.
>>>> +.I
>>>
>>> Were you maybe going to add something there?
>>>
>>> If not, please remove that [.I].
>>>
>>>> +.PP
>>>> +If the unencoded data is actually longer than
>>>> +.IR unencoded_len ,
>>>> +then it is truncated;
>>>> +if it is shorter, then it is extended with zeroes.
>>>> +.PP
>>>> +
>>>
>>> Please, remove that blank line.
>>>
>>>> +.BR pwritev2 ()
>>>
>>> Should be [.BR pwritev2 (2)]
>>>
>>> Michael (mtk),
>>>
>>> Am I right in that?  Please, confirm.

Yes. References to functions documented in other pages should
include the section number in parentheses.

[...]

>>>> +.PP
>>>> +However, suppose we read 50 bytes into a file
>>>> +which contains a single compressed extent.
>>>> +The filesystem must still return the entire compressed extent
>>>> +for us to be able to decompress it,
>>>> +so
>>>> +.I unencoded_len
>>>> +would be the length of the entire decompressed extent.
>>>> +However, because the read was at offset 50,
>>>> +the first 50 bytes should be ignored.
>>>> +Therefore,
>>>> +.I unencoded_offset
>>>> +would be 50,
>>>> +and
>>>> +.I len
>>>> +would accordingly be
>>>> +.IR unencoded_len\ -\ 50 .
>>>
>>> This formats everything as I, except for the last dot.
>>> Replace by:
>>>
>>> [
>>> .I unencoded
>>> - 50.
>>> ]
>>>
>>> Michael (mtk), same as above:
>>> to space, or not to space?  That is the question :p

In this case, perhaps

.IR unencoded \-1

[...]


>>>> +.SS Security
>>>> +Encoded I/O creates the potential for some security issues:
>>>> +.IP * 3
>>>> +Encoded writes allow writing arbitrary data which the kernel will decode on
>>>> +a subsequent read. Decompression algorithms are complex and may have bugs
>>>> +which can be exploited by maliciously crafted data.
>>>> +.IP *
>>>> +Encoded reads may return data which is not logically present in the file
>>>> +(see the discussion of
>>>> +.I len
>>>> +vs.
>>>
>>> Please, s/vs./vs/
>>> See the reasons below:
>>>
>>> Michael (mtk),
>>>
>>> Here the renderer outputs a double space
>>> (as for separating two sentences).
>>>
>>> Are you okay with that?

Yes, that should probably be avoided. I'm not sure what the
correct way is to prevent that in groff though. I mean, one
could write

.RI "vs.\ " unencoded_len

but I think that simply creates a nonbreaking space,
which is not exactly what is desired.

[....]

Thanks,

Michael
Michael Kerrisk (man-pages) Dec. 1, 2020, 8:20 p.m. UTC | #7
>>>>> +.SS Security
>>>>> +Encoded I/O creates the potential for some security issues:
>>>>> +.IP * 3
>>>>> +Encoded writes allow writing arbitrary data which the kernel will decode on
>>>>> +a subsequent read. Decompression algorithms are complex and may have bugs
>>>>> +which can be exploited by maliciously crafted data.
>>>>> +.IP *
>>>>> +Encoded reads may return data which is not logically present in the file
>>>>> +(see the discussion of
>>>>> +.I len
>>>>> +vs.
>>>>
>>>> Please, s/vs./vs/
>>>> See the reasons below:
>>>>
>>>> Michael (mtk),
>>>>
>>>> Here the renderer outputs a double space
>>>> (as for separating two sentences).
>>>>
>>>> Are you okay with that?
> 
> Yes, that should probably be avoided. I'm not sure what the
> correct way is to prevent that in groff though. I mean, one
> could write
> 
> .RI "vs.\ " unencoded_len
> 
> but I think that simply creates a nonbreaking space,
> which is not exactly what is desired.

Ahh -- found it. From https://groff.ffii.org/groff/groff-1.21.pdf,
we can write:

vs.\&

to prevent the double space.

Thanks,

Michael
G. Branden Robinson Dec. 1, 2020, 8:21 p.m. UTC | #8
At 2020-12-01T21:12:47+0100, Michael Kerrisk (man-pages) wrote:
> >>>> +vs.
> >>>
> >>> Please, s/vs./vs/
> >>> See the reasons below:
> >>>
> >>> Michael (mtk),
> >>>
> >>> Here the renderer outputs a double space
> >>> (as for separating two sentences).
> >>>
> >>> Are you okay with that?
> 
> Yes, that should probably be avoided. I'm not sure what the
> correct way is to prevent that in groff though. I mean, one
> could write
> 
> .RI "vs.\ " unencoded_len
> 
> but I think that simply creates a nonbreaking space,
> which is not exactly what is desired.

Use the non-printing input break escape sequence, "\&", to suppress
end-of-sentence detection.  This is not a groffism, it goes back to
1970s nroff and troff.

I'm attaching a couple of pages from some introductory material I wrote
for the groff Texinfo manual in the forthcoming 1.23.0.

Regards,
Branden
Alejandro Colomar (man-pages) Dec. 1, 2020, 9:34 p.m. UTC | #9
Hi Branden,

A good read, as always!

Thanks,

Alex

On 12/1/20 9:21 PM, G. Branden Robinson wrote:
> At 2020-12-01T21:12:47+0100, Michael Kerrisk (man-pages) wrote:
>>>>>> +vs.
>>>>>
>>>>> Please, s/vs./vs/
>>>>> See the reasons below:
>>>>>
>>>>> Michael (mtk),
>>>>>
>>>>> Here the renderer outputs a double space
>>>>> (as for separating two sentences).
>>>>>
>>>>> Are you okay with that?
>>
>> Yes, that should probably be avoided. I'm not sure what the
>> correct way is to prevent that in groff though. I mean, one
>> could write
>>
>> .RI "vs.\ " unencoded_len
>>
>> but I think that simply creates a nonbreaking space,
>> which is not exactly what is desired.
> 
> Use the non-printing input break escape sequence, "\&", to suppress
> end-of-sentence detection.  This is not a groffism, it goes back to
> 1970s nroff and troff.
> 
> I'm attaching a couple of pages from some introductory material I wrote
> for the groff Texinfo manual in the forthcoming 1.23.0.
> 
> Regards,
> Branden
>
Alejandro Colomar (man-pages) Dec. 1, 2020, 9:35 p.m. UTC | #10
Hi Michael,

On 12/1/20 9:20 PM, Michael Kerrisk (man-pages) wrote:
>>>>>> +.SS Security
>>>>>> +Encoded I/O creates the potential for some security issues:
>>>>>> +.IP * 3
>>>>>> +Encoded writes allow writing arbitrary data which the kernel will decode on
>>>>>> +a subsequent read. Decompression algorithms are complex and may have bugs
>>>>>> +which can be exploited by maliciously crafted data.
>>>>>> +.IP *
>>>>>> +Encoded reads may return data which is not logically present in the file
>>>>>> +(see the discussion of
>>>>>> +.I len
>>>>>> +vs.
>>>>>
>>>>> Please, s/vs./vs/
>>>>> See the reasons below:
>>>>>
>>>>> Michael (mtk),
>>>>>
>>>>> Here the renderer outputs a double space
>>>>> (as for separating two sentences).
>>>>>
>>>>> Are you okay with that?
>>
>> Yes, that should probably be avoided. I'm not sure what the
>> correct way is to prevent that in groff though. I mean, one
>> could write
>>
>> .RI "vs.\ " unencoded_len
>>
>> but I think that simply creates a nonbreaking space,
>> which is not exactly what is desired.
> 
> Ahh -- found it. From https://groff.ffii.org/groff/groff-1.21.pdf,
> we can write:
> 
> vs.\&
> 
> to prevent the double space.

Nice to see it's possible.
However, I would argue for simplicity,
and use a simple 'vs',
which is already in use.

Cheers,

Alex

> 
> Thanks,
> 
> Michael
> 
>
Michael Kerrisk (man-pages) Dec. 1, 2020, 9:56 p.m. UTC | #11
Hi Alex,

On 12/1/20 10:35 PM, Alejandro Colomar (man-pages) wrote:
> Hi Michael,
> 
> On 12/1/20 9:20 PM, Michael Kerrisk (man-pages) wrote:
>>>>>>> +.SS Security
>>>>>>> +Encoded I/O creates the potential for some security issues:
>>>>>>> +.IP * 3
>>>>>>> +Encoded writes allow writing arbitrary data which the kernel will decode on
>>>>>>> +a subsequent read. Decompression algorithms are complex and may have bugs
>>>>>>> +which can be exploited by maliciously crafted data.
>>>>>>> +.IP *
>>>>>>> +Encoded reads may return data which is not logically present in the file
>>>>>>> +(see the discussion of
>>>>>>> +.I len
>>>>>>> +vs.
>>>>>>
>>>>>> Please, s/vs./vs/
>>>>>> See the reasons below:
>>>>>>
>>>>>> Michael (mtk),
>>>>>>
>>>>>> Here the renderer outputs a double space
>>>>>> (as for separating two sentences).
>>>>>>
>>>>>> Are you okay with that?
>>>
>>> Yes, that should probably be avoided. I'm not sure what the
>>> correct way is to prevent that in groff though. I mean, one
>>> could write
>>>
>>> .RI "vs.\ " unencoded_len
>>>
>>> but I think that simply creates a nonbreaking space,
>>> which is not exactly what is desired.
>>
>> Ahh -- found it. From https://groff.ffii.org/groff/groff-1.21.pdf,
>> we can write:
>>
>> vs.\&
>>
>> to prevent the double space.
> 
> Nice to see it's possible.
> However, I would argue for simplicity,
> and use a simple 'vs',
> which is already in use.

Indeed better. Thanks for noticing that.

Thanks,

Michael
Michael Kerrisk (man-pages) Dec. 1, 2020, 9:58 p.m. UTC | #12
Hi Branden,

On 12/1/20 9:21 PM, G. Branden Robinson wrote:
> At 2020-12-01T21:12:47+0100, Michael Kerrisk (man-pages) wrote:
>>>>>> +vs.
>>>>>
>>>>> Please, s/vs./vs/
>>>>> See the reasons below:
>>>>>
>>>>> Michael (mtk),
>>>>>
>>>>> Here the renderer outputs a double space
>>>>> (as for separating two sentences).
>>>>>
>>>>> Are you okay with that?
>>
>> Yes, that should probably be avoided. I'm not sure what the
>> correct way is to prevent that in groff though. I mean, one
>> could write
>>
>> .RI "vs.\ " unencoded_len
>>
>> but I think that simply creates a nonbreaking space,
>> which is not exactly what is desired.
> 
> Use the non-printing input break escape sequence, "\&", to suppress
> end-of-sentence detection.  This is not a groffism, it goes back to
> 1970s nroff and troff.

Yes, I spotted it about two minutes before your mail. And before that, I 
was thinking, "should we really bother Branden with a question like 
this?" :-)

> I'm attaching a couple of pages from some introductory material I wrote
> for the groff Texinfo manual in the forthcoming 1.23.0.

As ever, thanks for jumping in, Branden.

Cheers,

Michael
Alejandro Colomar (man-pages) Dec. 18, 2020, 10:32 a.m. UTC | #13
Hi Omar,

Linux 5.10 has been recently released.
Do you have any updates for this patch?

Thanks,

Alex

On 12/1/20 10:56 PM, Michael Kerrisk (man-pages) wrote:
> Hi Alex,
> 
> On 12/1/20 10:35 PM, Alejandro Colomar (man-pages) wrote:
>> Hi Michael,
>>
>> On 12/1/20 9:20 PM, Michael Kerrisk (man-pages) wrote:
>>>>>>>> +.SS Security
>>>>>>>> +Encoded I/O creates the potential for some security issues:
>>>>>>>> +.IP * 3
>>>>>>>> +Encoded writes allow writing arbitrary data which the kernel will decode on
>>>>>>>> +a subsequent read. Decompression algorithms are complex and may have bugs
>>>>>>>> +which can be exploited by maliciously crafted data.
>>>>>>>> +.IP *
>>>>>>>> +Encoded reads may return data which is not logically present in the file
>>>>>>>> +(see the discussion of
>>>>>>>> +.I len
>>>>>>>> +vs.
>>>>>>>
>>>>>>> Please, s/vs./vs/
>>>>>>> See the reasons below:
>>>>>>>
>>>>>>> Michael (mtk),
>>>>>>>
>>>>>>> Here the renderer outputs a double space
>>>>>>> (as for separating two sentences).
>>>>>>>
>>>>>>> Are you okay with that?
>>>>
>>>> Yes, that should probably be avoided. I'm not sure what the
>>>> correct way is to prevent that in groff though. I mean, one
>>>> could write
>>>>
>>>> .RI "vs.\ " unencoded_len
>>>>
>>>> but I think that simply creates a nonbreaking space,
>>>> which is not exactly what is desired.
>>>
>>> Ahh -- found it. From https://groff.ffii.org/groff/groff-1.21.pdf,
>>> we can write:
>>>
>>> vs.\&
>>>
>>> to prevent the double space.
>>
>> Nice to see it's possible.
>> However, I would argue for simplicity,
>> and use a simple 'vs',
>> which is already in use.
> 
> Indeed better. Thanks for noticing that.
> 
> Thanks,
> 
> Michael
> 
>
Omar Sandoval Jan. 12, 2021, 1:12 a.m. UTC | #14
On Fri, Dec 18, 2020 at 11:32:17AM +0100, Alejandro Colomar (man-pages) wrote:
> Hi Omar,
> 
> Linux 5.10 has been recently released.
> Do you have any updates for this patch?
> 
> Thanks,
> 
> Alex

Hi, Alex,

Now that the holidays are over I'm revisiting this series and plan to
send a new version this week or next.

Thanks,
Omar
diff mbox series

Patch

diff --git a/man2/fcntl.2 b/man2/fcntl.2
index 546016617..b0d7fa2c3 100644
--- a/man2/fcntl.2
+++ b/man2/fcntl.2
@@ -221,8 +221,9 @@  On Linux, this command can change only the
 .BR O_ASYNC ,
 .BR O_DIRECT ,
 .BR O_NOATIME ,
+.BR O_NONBLOCK ,
 and
-.B O_NONBLOCK
+.B O_ALLOW_ENCODED
 flags.
 It is not possible to change the
 .BR O_DSYNC
@@ -1820,6 +1821,13 @@  Attempted to clear the
 flag on a file that has the append-only attribute set.
 .TP
 .B EPERM
+Attempted to set the
+.B O_ALLOW_ENCODED
+flag and the calling process did not have the
+.B CAP_SYS_ADMIN
+capability.
+.TP
+.B EPERM
 .I cmd
 was
 .BR F_ADD_SEALS ,
diff --git a/man2/open.2 b/man2/open.2
index f587b0d95..84697dfa8 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -437,6 +437,16 @@  was followed by a call to
 .BR fdatasync (2)).
 .IR "See NOTES below" .
 .TP
+.B O_ALLOW_ENCODED
+Open the file with encoded I/O permissions;
+see
+.BR encoded_io (7).
+.B O_CLOEXEC
+must be specified in conjuction with this flag.
+The caller must have the
+.B CAP_SYS_ADMIN
+capability.
+.TP
 .B O_EXCL
 Ensure that this call creates the file:
 if this flag is specified in conjunction with
@@ -1082,6 +1092,14 @@  is invalid
 (e.g., it contains characters not permitted by the underlying filesystem).
 .TP
 .B EINVAL
+.B O_ALLOW_ENCODED
+was specified in
+.IR flags ,
+but
+.B O_CLOEXEC
+was not specified.
+.TP
+.B EINVAL
 The final component ("basename") of
 .I pathname
 is invalid
@@ -1238,6 +1256,11 @@  did not match the owner of the file and the caller was not privileged.
 The operation was prevented by a file seal; see
 .BR fcntl (2).
 .TP
+.B EPERM
+The
+.B O_ALLOW_ENCODED
+flag was specified, but the caller was not privileged.
+.TP
 .B EROFS
 .I pathname
 refers to a file on a read-only filesystem and write access was
diff --git a/man2/readv.2 b/man2/readv.2
index 5a8b74168..c9933acf0 100644
--- a/man2/readv.2
+++ b/man2/readv.2
@@ -264,6 +264,11 @@  the data is always appended to the end of the file.
 However, if the
 .I offset
 argument is \-1, the current file offset is updated.
+.TP
+.BR RWF_ENCODED " (since Linux 5.12)"
+Read or write encoded (e.g., compressed) data.
+See
+.BR encoded_io (7).
 .SH RETURN VALUE
 On success,
 .BR readv (),
@@ -283,6 +288,13 @@  than requested (see
 and
 .BR write (2)).
 .PP
+If
+.B
+RWF_ENCODED
+was specified in
+.IR flags ,
+then the return value is the number of encoded bytes.
+.PP
 On error, \-1 is returned, and \fIerrno\fP is set appropriately.
 .SH ERRORS
 The errors are as given for
@@ -313,6 +325,64 @@  is less than zero or greater than the permitted maximum.
 .TP
 .B EOPNOTSUPP
 An unknown flag is specified in \fIflags\fP.
+.TP
+.B EOPNOTSUPP
+.B RWF_ENCODED
+is specified in
+.I flags
+and the filesystem does not implement encoded I/O.
+.TP
+.B EPERM
+.B RWF_ENCODED
+is specified in
+.I flags
+and the file was not opened with the
+.B O_ALLOW_ENCODED
+flag.
+.PP
+.BR preadv2 ()
+can fail for the following reasons:
+.TP
+.B E2BIG
+.B RWF_ENCODED
+is specified in
+.I flags
+and
+.I iov[0]
+is not large enough to return the encoding metadata.
+.TP
+.B ENOBUFS
+.B RWF_ENCODED
+is specified in
+.I flags
+and the buffers in
+.I iov
+are not big enough to return the encoded data.
+.PP
+.BR pwritev2 ()
+can fail for the following reasons:
+.TP
+.B E2BIG
+.B RWF_ENCODED
+is specified in
+.I flags
+and
+.I iov[0]
+contains non-zero fields
+after the kernel's
+.IR "sizeof(struct\ encoded_iov)" .
+.TP
+.B EINVAL
+.B RWF_ENCODED
+is specified in
+.I flags
+and the encoding is unknown or not supported by the filesystem.
+.TP
+.B EINVAL
+.B RWF_ENCODED
+is specified in
+.I flags
+and the alignment and/or size requirements are not met.
 .SH VERSIONS
 .BR preadv ()
 and
diff --git a/man7/encoded_io.7 b/man7/encoded_io.7
new file mode 100644
index 000000000..106fa587b
--- /dev/null
+++ b/man7/encoded_io.7
@@ -0,0 +1,369 @@ 
+.\" Copyright (c) 2020 by Omar Sandoval <osandov@fb.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.\"
+.TH ENCODED_IO  7 2020-11-11 "Linux" "Linux Programmer's Manual"
+.SH NAME
+encoded_io \- overview of encoded I/O
+.SH DESCRIPTION
+Several filesystems (e.g., Btrfs) support transparent encoding
+(e.g., compression, encryption) of data on disk:
+written data is encoded by the kernel before it is written to disk,
+and read data is decoded before being returned to the user.
+In some cases, it is useful to skip this encoding step.
+For example, the user may want to read the compressed contents of a file
+or write pre-compressed data directly to a file.
+This is referred to as "encoded I/O".
+.SS Encoded I/O API
+Encoded I/O is specified with the
+.B RWF_ENCODED
+flag to
+.BR preadv2 (2)
+and
+.BR pwritev2 (2).
+If
+.B RWF_ENCODED
+is specified, then
+.I iov[0].iov_base
+points to an
+.I
+encoded_iov
+structure, defined in
+.I <linux/fs.h>
+as:
+.PP
+.in +4n
+.EX
+struct encoded_iov {
+    __aligned_u64 len;
+    __aligned_u64 unencoded_len;
+    __aligned_u64 unencoded_offset;
+    __u32 compression;
+    __u32 encryption;
+};
+.EE
+.in
+.PP
+This may be extended in the future, so
+.I iov[0].iov_len
+must be set to
+.I "sizeof(struct\ encoded_iov)"
+for forward/backward compatibility.
+The remaining buffers contain the encoded data.
+.PP
+.I compression
+and
+.I encryption
+are the encoding fields.
+.I compression
+is
+.B ENCODED_IOV_COMPRESSION_NONE
+(zero)
+or a filesystem-specific
+.B ENCODED_IOV_COMPRESSION
+constant;
+see
+.BR Filesystem\ support .
+.I encryption
+is currently always
+.B ENCODED_IOV_ENCRYPTION_NONE
+(zero).
+.PP
+.I unencoded_len
+is the length of the unencoded (i.e., decrypted and decompressed) data.
+.I unencoded_offset
+is the offset into the unencoded data where the data in the file begins
+(less than or equal to
+.IR unencoded_len ).
+.I len
+is the length of the data in the file
+(less than or equal to
+.I unencoded_len
+-
+.IR unencoded_offset ).
+See
+.B Extent layout
+below for some examples.
+.I
+.PP
+If the unencoded data is actually longer than
+.IR unencoded_len ,
+then it is truncated;
+if it is shorter, then it is extended with zeroes.
+.PP
+
+.BR pwritev2 ()
+uses the metadata specified in
+.IR iov[0] ,
+writes the encoded data from the remaining buffers,
+and returns the number of encoded bytes written
+(that is, the sum of
+.I iov[n].iov_len
+for 1 <=
+.I n
+<
+.IR iovcnt ;
+partial writes will not occur).
+At least one encoding field must be non-zero.
+Note that the encoded data is not validated when it is written;
+if it is not valid (e.g., it cannot be decompressed),
+then a subsequent read may return an error.
+If the
+.I offset
+argument to
+.BR pwritev2 ()
+is -1, then the file offset is incremented by
+.IR len .
+If
+.I iov[0].iov_len
+is less than
+.I "sizeof(struct\ encoded_iov)"
+in the kernel,
+then any fields unknown to userspace are treated as if they were zero;
+if it is greater and any fields unknown to the kernel are non-zero,
+then this returns -1 and sets
+.I errno
+to
+.BR E2BIG .
+.PP
+.BR preadv2 ()
+populates the metadata in
+.IR iov[0] ,
+the encoded data in the remaining buffers,
+and returns the number of encoded bytes read.
+This will only return one extent per call.
+This can also read data which is not encoded;
+all encoding fields will be zero in that case.
+If the
+.I offset
+argument to
+.BR preadv2 ()
+is -1, then the file offset is incremented by
+.IR len .
+If
+.I iov[0].iov_len
+is less than
+.I "sizeof(struct\ encoded_iov)"
+in the kernel and any fields unknown to userspace are non-zero,
+then
+.BR preadv2 ()
+returns -1 and sets
+.I errno
+to
+.BR E2BIG ;
+if it is greater,
+then any fields unknown to the kernel are returned as zero.
+If the provided buffers are not large enough to return an entire encoded
+extent,
+then
+.BR preadv2 ()
+returns -1 and sets
+.I errno
+to
+.BR ENOBUFS .
+.PP
+As the filesystem page cache typically contains decoded data,
+encoded I/O bypasses the page cache.
+.SS Extent layout
+By using
+.IR len ,
+.IR unencoded_len ,
+and
+.IR unencoded_offset ,
+it is possible to refer to a subset of an unencoded extent.
+.PP
+In the simplest case,
+.I len
+is equal to
+.I unencoded_len
+and
+.I unencoded_offset
+is zero.
+This means that the entire unencoded extent is used.
+.PP
+However, suppose we read 50 bytes into a file
+which contains a single compressed extent.
+The filesystem must still return the entire compressed extent
+for us to be able to decompress it,
+so
+.I unencoded_len
+would be the length of the entire decompressed extent.
+However, because the read was at offset 50,
+the first 50 bytes should be ignored.
+Therefore,
+.I unencoded_offset
+would be 50,
+and
+.I len
+would accordingly be
+.IR unencoded_len\ -\ 50 .
+.PP
+Additionally, suppose we want to create an encrypted file with length 500,
+but the file is encrypted with a block cipher using a block size of 4096.
+The unencoded data would therefore include the appropriate padding,
+and
+.I unencoded_len
+would be 4096.
+However, to represent the logical size of the file,
+.I len
+would be 500
+(and
+.I unencoded_offset
+would be 0).
+.PP
+Similar situations can arise in other cases:
+.IP * 3
+If the filesystem pads data to the filesystem block size before compressing,
+then compressed files with a size unaligned to the filesystem block size will
+end with an extent with
+.I len
+<
+.IR unencoded_len .
+.IP *
+Extents cloned from the middle of a larger encoded extent with
+.B FICLONERANGE
+may have a non-zero
+.I unencoded_offset
+and/or
+.I len
+<
+.IR unencoded_len .
+.IP *
+If the middle of an encoded extent is overwritten,
+the filesystem may create extents with a non-zero
+.I unencoded_offset
+and/or
+.I len
+<
+.I unencoded_len
+for the parts that were not overwritten.
+.SS Security
+Encoded I/O creates the potential for some security issues:
+.IP * 3
+Encoded writes allow writing arbitrary data which the kernel will decode on
+a subsequent read. Decompression algorithms are complex and may have bugs
+which can be exploited by maliciously crafted data.
+.IP *
+Encoded reads may return data which is not logically present in the file
+(see the discussion of
+.I len
+vs.
+.I unencoded_len
+above).
+It may not be intended for this data to be readable.
+.PP
+Therefore, encoded I/O requires privilege.
+Namely, the
+.B RWF_ENCODED
+flag may only be used when the file was opened with the
+.B O_ALLOW_ENCODED
+flag to
+.BR open (2),
+which requires the
+.B CAP_SYS_ADMIN
+capability.
+The
+.B O_CLOEXEC
+flag must be specified in conjunction with
+.BR O_ALLOW_ENCODED .
+This avoids accidentally leaking the encoded I/O privilege
+(it is not cleared on
+.BR fork (2)
+or
+.BR execve (2)
+otherwise).
+If
+.B O_ALLOW_ENCODED
+without
+.B O_CLOEXEC
+is desired,
+.B O_CLOEXEC
+can be cleared afterwards with
+.BR fnctl (2).
+.BR fcntl (2)
+can also clear or set
+.B O_ALLOW_ENCODED
+(including without
+.BR O_CLOEXEC ).
+.SS Filesystem support
+Encoded I/O is supported on the following filesystems:
+.TP
+Btrfs (since Linux 5.12)
+.IP
+Btrfs supports encoded reads and writes of compressed data.
+The data is encoded as follows:
+.RS
+.IP * 3
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_BTRFS_ZLIB ,
+then the encoded data is a single zlib stream.
+.IP *
+If
+.I compression
+is
+.BR ENCODED_IOV_COMPRESSION_BTRFS_ZSTD ,
+then the encoded data is a single zstd frame compressed with the
+.I windowLog
+compression parameter set to no more than 17.
+.IP *
+If
+.I compression
+is one of
+.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K ,
+.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K ,
+.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K ,
+.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K ,
+or
+.BR ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K ,
+then the encoded data is compressed page by page
+(using the page size indicated by the name of the constant)
+with LZO1X
+and wrapped in the format documented in the Linux kernel source file
+.IR fs/btrfs/lzo.c .
+.RE
+.IP
+Additionally, there are some restrictions on
+.BR pwritev2 ():
+.RS
+.IP * 3
+.I offset
+(or the current file offset if
+.I offset
+is -1) must be aligned to the sector size of the filesystem.
+.IP *
+.I len
+must be aligned to the sector size of the filesystem
+unless the data ends at or beyond the current end of the file.
+.IP *
+.I unencoded_len
+and the length of the encoded data must each be no more than 128 KiB.
+This limit may increase in the future.
+.IP *
+The length of the encoded data must be less than or equal to
+.IR unencoded_len .
+.IP *
+If using LZO, the filesystem's page size must match the compression page size.
+.RE