mbox series

[RFC,v2,0/5] fs: interface for directly reading/writing compressed data

Message ID cover.1571164762.git.osandov@fb.com (mailing list archive)
Headers show
Series fs: interface for directly reading/writing compressed data | expand

Message

Omar Sandoval Oct. 15, 2019, 6:42 p.m. UTC
From: Omar Sandoval <osandov@fb.com>

Hello,

This series adds an API for reading compressed data on a filesystem
without decompressing it as well as support for writing compressed data
directly to the filesystem. It is based on my previous series which
added a Btrfs-specific ioctl [1], but it is now an extension to
preadv2()/pwritev2() as suggested by Dave Chinner [2]. I've included a
man page patch describing the API in detail. Test cases and examples
programs are available [3].

The use case that I have in mind is Btrfs send/receive: currently, when
sending data from one compressed filesystem to another, the sending side
decompresses the data and the receiving side recompresses it before
writing it out. This is wasteful and can be avoided if we can just send
and write compressed extents. The send part will be implemented in a
separate series, as this API can stand alone.

Patches 1 and 2 add the VFS support. Patch 3 is a Btrfs prep patch.
Patch 4 implements encoded reads for Btrfs, and patch 5 implements
encoded writes.

Changes from v1 [4]:

- Encoded reads are now also implemented.
- The encoded_iov structure now includes metadata for referring to a
  subset of decoded data. This is required to handle certain cases where
  a compressed extent is truncated, hole punched, or otherwise sliced up
  and Btrfs chooses to reflect this in metadata instead of decompressing
  the whole extent and rewriting the pieces. We call these "bookend
  extents" in Btrfs, but any filesystem supporting transparent encoding
  is likely to have a similar concept.
- The behavior of the filesystem when the decompressed data is longer
  than or shorter than expected is more strictly defined (truncate and
  zero extend, respectively).
- As pointed out by Jann Horn [5], the capability check done at
  read/write time in v1 was incorrect; v2 adds an explicit open flag
  (which can be changed with fcntl()). As this can be trivially combined
  with O_CLOEXEC, I did not add any sort of automatic clearing on exec.

I wanted to get the ball rolling on reviewing the interface, so the
Btrfs implementation has a couple of smaller todos:

- Encoded reads do not yet implement repair for disk/checksum failures.
- Encoded writes do not yet support inline extents or bookend extents.

This is based on v5.4-rc3

Please share any comments on the API or implementation. Thanks!

1: https://lore.kernel.org/linux-fsdevel/cover.1567623877.git.osandov@fb.com/
2: https://lore.kernel.org/linux-fsdevel/20190906212710.GI7452@vader/
3: https://github.com/osandov/xfstests/tree/rwf-encoded
4: https://lore.kernel.org/linux-btrfs/cover.1568875700.git.osandov@fb.com/
5: https://lore.kernel.org/linux-btrfs/CAG48ez2GKv15Uj6Wzv0sG5v2bXyrSaCtRTw5Ok_ovja_CiO_fQ@mail.gmail.com/

Omar Sandoval (5):
  fs: add O_ENCODED open flag
  fs: add RWF_ENCODED for reading/writing compressed data
  btrfs: generalize btrfs_lookup_bio_sums_dio()
  btrfs: implement RWF_ENCODED reads
  btrfs: implement RWF_ENCODED writes

 fs/btrfs/compression.c           |   6 +-
 fs/btrfs/compression.h           |   5 +-
 fs/btrfs/ctree.h                 |   9 +-
 fs/btrfs/file-item.c             |  18 +-
 fs/btrfs/file.c                  |  52 ++-
 fs/btrfs/inode.c                 | 663 ++++++++++++++++++++++++++++++-
 fs/fcntl.c                       |  10 +-
 fs/namei.c                       |   4 +
 include/linux/fcntl.h            |   2 +-
 include/linux/fs.h               |  14 +
 include/uapi/asm-generic/fcntl.h |   4 +
 include/uapi/linux/fs.h          |  26 +-
 mm/filemap.c                     |  82 +++-
 13 files changed, 851 insertions(+), 44 deletions(-)

Comments

Dave Chinner Oct. 20, 2019, 11:05 p.m. UTC | #1
On Tue, Oct 15, 2019 at 11:42:38AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Hello,
> 
> This series adds an API for reading compressed data on a filesystem
> without decompressing it as well as support for writing compressed data
> directly to the filesystem. It is based on my previous series which
> added a Btrfs-specific ioctl [1], but it is now an extension to
> preadv2()/pwritev2() as suggested by Dave Chinner [2]. I've included a
> man page patch describing the API in detail. Test cases and examples
> programs are available [3].
> 
> The use case that I have in mind is Btrfs send/receive: currently, when
> sending data from one compressed filesystem to another, the sending side
> decompresses the data and the receiving side recompresses it before
> writing it out. This is wasteful and can be avoided if we can just send
> and write compressed extents. The send part will be implemented in a
> separate series, as this API can stand alone.
> 
> Patches 1 and 2 add the VFS support. Patch 3 is a Btrfs prep patch.
> Patch 4 implements encoded reads for Btrfs, and patch 5 implements
> encoded writes.
> 
> Changes from v1 [4]:
> 
> - Encoded reads are now also implemented.
> - The encoded_iov structure now includes metadata for referring to a
>   subset of decoded data. This is required to handle certain cases where
>   a compressed extent is truncated, hole punched, or otherwise sliced up
>   and Btrfs chooses to reflect this in metadata instead of decompressing
>   the whole extent and rewriting the pieces. We call these "bookend
>   extents" in Btrfs, but any filesystem supporting transparent encoding
>   is likely to have a similar concept.

Where's the in-kernel documentation for this API? You're encoding a
specific set of behaviours into the user API, so this needs a whole
heap of documentation in the generic code to describe how it works
so that other filesystems implementing have a well defined guideline
to what they need to support.

Also, I don't see any test code for this - can you please add
support for RWF_ENCODED to xfs_io and write a suite of unit tests
for fstests that exercise the user API fully?  Given our history of
screwing up new user APIs, this absolutely should not be merged
until there is a full set of generic unit tests written and reviewed
for it and support has been added to fsstress, fsx, and other test
utilities to fuzz and stress the implementation as part of normal
day-to-day filesystem development...

Cheers,

Dave.
Omar Sandoval Oct. 21, 2019, 7:04 p.m. UTC | #2
On Mon, Oct 21, 2019 at 10:05:01AM +1100, Dave Chinner wrote:
> On Tue, Oct 15, 2019 at 11:42:38AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Hello,
> > 
> > This series adds an API for reading compressed data on a filesystem
> > without decompressing it as well as support for writing compressed data
> > directly to the filesystem. It is based on my previous series which
> > added a Btrfs-specific ioctl [1], but it is now an extension to
> > preadv2()/pwritev2() as suggested by Dave Chinner [2]. I've included a
> > man page patch describing the API in detail. Test cases and examples
> > programs are available [3].
> > 
> > The use case that I have in mind is Btrfs send/receive: currently, when
> > sending data from one compressed filesystem to another, the sending side
> > decompresses the data and the receiving side recompresses it before
> > writing it out. This is wasteful and can be avoided if we can just send
> > and write compressed extents. The send part will be implemented in a
> > separate series, as this API can stand alone.
> > 
> > Patches 1 and 2 add the VFS support. Patch 3 is a Btrfs prep patch.
> > Patch 4 implements encoded reads for Btrfs, and patch 5 implements
> > encoded writes.
> > 
> > Changes from v1 [4]:
> > 
> > - Encoded reads are now also implemented.
> > - The encoded_iov structure now includes metadata for referring to a
> >   subset of decoded data. This is required to handle certain cases where
> >   a compressed extent is truncated, hole punched, or otherwise sliced up
> >   and Btrfs chooses to reflect this in metadata instead of decompressing
> >   the whole extent and rewriting the pieces. We call these "bookend
> >   extents" in Btrfs, but any filesystem supporting transparent encoding
> >   is likely to have a similar concept.
> 
> Where's the in-kernel documentation for this API? You're encoding a
> specific set of behaviours into the user API, so this needs a whole
> heap of documentation in the generic code to describe how it works
> so that other filesystems implementing have a well defined guideline
> to what they need to support.

The man-page I sent is quite detailed, but sure, I can add the relevant
information to the generic code, as well.

> Also, I don't see any test code for this -

It's in the cover letter: https://github.com/osandov/xfstests/tree/rwf-encoded

I haven't sent those patches up because it's tedious to rework and
resend them for each little tweak we make to the API.

> can you please add
> support for RWF_ENCODED to xfs_io and write a suite of unit tests
> for fstests that exercise the user API fully?

Reading requires filesystem-specific decoding, and I wasn't sure if that
would be a good fit for xfs_io. Alternatively, it could dump the raw
buffer to stdout, but whatever interprets it also needs the metadata, so
there'd need to be some sort of protocol between xfs_io and whatever
interprets it. I added a btrfs_read_encoded program in my xfstests
branch above instead. It should be easy enough to move the encoded_write
test program to xfs_io pwrite, though.

> Given our history of
> screwing up new user APIs, this absolutely should not be merged
> until there is a full set of generic unit tests written and reviewed
> for it and support has been added to fsstress, fsx, and other test
> utilities to fuzz and stress the implementation as part of normal
> day-to-day filesystem development...

Sure thing, I'll add support to those tools once the API isn't in flux
so much.