[10/13] iomap: use a function pointer for dio submits

Message ID	20190802220048.16142-11-rgoldwyn@suse.de (mailing list archive)
State	New, archived
Headers	Return-Path: <linux-fsdevel-owner@kernel.org>
	From: Goldwyn Rodrigues <rgoldwyn@suse.de>
	To: linux-fsdevel@vger.kernel.org
	Cc: linux-btrfs@vger.kernel.org, hch@lst.de, darrick.wong@oracle.com, ruansy.fnst@cn.fujitsu.com, Goldwyn Rodrigues <rgoldwyn@suse.com>
	Subject: [PATCH 10/13] iomap: use a function pointer for dio submits
	Date: Fri, 2 Aug 2019 17:00:45 -0500
	Message-Id: <20190802220048.16142-11-rgoldwyn@suse.de>
	In-Reply-To: <20190802220048.16142-1-rgoldwyn@suse.de>
	References: <20190802220048.16142-1-rgoldwyn@suse.de>
	Sender: linux-fsdevel-owner@vger.kernel.org
	Precedence: bulk
Series	Btrfs iomap
	[v2,0/13] Btrfs iomap
	[01/13] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O
	[02/13] iomap: Read page from srcmap for IOMAP_COW
	[03/13] btrfs: Eliminate PagePrivate for btrfs data pages
	[04/13] btrfs: Add a simple buffered iomap write
	[05/13] btrfs: Add CoW in iomap based writes
	[06/13] btrfs: remove buffered write code made unnecessary
	[07/13] btrfs: basic direct read operation
	[08/13] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write()
	[09/13] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent()
	[10/13] iomap: use a function pointer for dio submits
	[11/13] btrfs: Use iomap_dio_rw for performing direct I/O writes
	[12/13] btrfs: Remove btrfs_dio_data and __btrfs_direct_write
	[13/13] btrfs: update inode size during bio completion

Goldwyn Rodrigues Aug. 2, 2019, 10 p.m. UTC

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This helps filesystems to perform tasks on the bio while
submitting for I/O. Since btrfs requires the position
we are working on, pass pos to iomap_dio_submit_bio().

The correct place for submit_io() is not page_ops. Would it
be better to rename the structure to something like iomap_io_ops
or put it directly under struct iomap?

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/iomap/direct-io.c  | 16 +++++++++++-----
 include/linux/iomap.h |  1 +
 2 files changed, 12 insertions(+), 5 deletions(-)
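
[Editor's sketch] For readers following along outside the kernel tree, the
dispatch this patch adds can be modelled in plain userspace C. This is only
a sketch under stated assumptions: struct bio, struct inode and
submit_bio_model() below are stand-in stubs rather than the kernel
definitions, the *_model names are invented for illustration, and
fake_fs_submit_io() is a hypothetical filesystem callback. It is meant to
show two things: the hook is handed the file position at which the bio
starts, and the generic submission path is still used when no hook is set.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in stubs; the real kernel types are far richer. */
struct bio { long long seen_pos; int submitted_by_fs; };
struct inode { int unused; };
typedef long long model_loff_t;   /* models the kernel's loff_t */

/* Same argument order as the submit_io() call in the patch. */
typedef void (submit_io_fn)(struct bio *bio, struct inode *inode,
			    model_loff_t pos);

struct page_ops_model { submit_io_fn *submit_io; };
struct iomap_model { const struct page_ops_model *page_ops; };

#define COOKIE_NONE (-1)          /* models BLK_QC_T_NONE */

/* Models the generic submit_bio(): returns a polling cookie. */
static int submit_bio_model(struct bio *bio)
{
	bio->submitted_by_fs = 0;
	return 42;                /* arbitrary stand-in cookie */
}

/* Hypothetical filesystem hook: records the position it was given. */
static void fake_fs_submit_io(struct bio *bio, struct inode *inode,
			      model_loff_t pos)
{
	(void)inode;
	bio->seen_pos = pos;
	bio->submitted_by_fs = 1;
}

/* Models the patched iomap_dio_submit_bio(); returns the stored cookie. */
static int dio_submit_model(struct iomap_model *iomap, struct inode *inode,
			    struct bio *bio, model_loff_t pos)
{
	if (iomap->page_ops && iomap->page_ops->submit_io) {
		iomap->page_ops->submit_io(bio, inode, pos);
		return COOKIE_NONE;   /* no cookie, so nothing to poll on */
	}
	return submit_bio_model(bio);
}
```

In this model, a mapping with the hook set always yields COOKIE_NONE, which
mirrors the patch storing BLK_QC_T_NONE in dio->submit.cookie on that path.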

Darrick J. Wong Aug. 3, 2019, 12:21 a.m. UTC | #1

On Fri, Aug 02, 2019 at 05:00:45PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> This helps filesystems to perform tasks on the bio while
> submitting for I/O. Since btrfs requires the position
> we are working on, pass pos to iomap_dio_submit_bio()

What /does/ btrfs_submit_direct do, anyway?  Looks like it's a custom
submission function that ... does something related to setting
checksums?  And, uh, RAID?

> The correct place for submit_io() is not page_ops. Would it
> be better to rename the structure to something like iomap_io_ops
> or put it directly under struct iomap?

Seeing as the ->iomap_begin handler knows if the requested op is a
buffered write or a direct write, what if we just declare a union of
ops?

e.g.

struct iomap_page_ops;
struct iomap_directio_ops;

struct iomap {
	<usual stuff>
	union {
		const struct iomap_page_ops *page_ops;
		const struct iomap_directio_ops *directio_ops;
	};
};
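
[Editor's sketch] A compilable userspace model of the union idea above,
under stated assumptions: the *_model types, the demo_* callbacks and the
is_direct parameter are all hypothetical, and the anonymous union needs C11
or a GNU C dialect. The point it illustrates is that both ops pointers share
one slot, so whoever fills in the mapping must know which kind of operation
was requested, and each I/O path must read only its own member.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two ops tables. */
struct iomap_page_ops_model { int (*page_prepare)(void); };
struct iomap_directio_ops_model { int (*submit_io)(long long pos); };

struct iomap_model {
	/* One pointer-sized slot; only one member is meaningful at a time. */
	union {
		const struct iomap_page_ops_model *page_ops;
		const struct iomap_directio_ops_model *directio_ops;
	};
};

static int demo_prepare(void) { return 1; }
static int demo_submit(long long pos) { return pos >= 0 ? 2 : -1; }

static const struct iomap_page_ops_model demo_page_ops = { demo_prepare };
static const struct iomap_directio_ops_model demo_dio_ops = { demo_submit };

/* Models ->iomap_begin: it knows which kind of operation was requested. */
static void begin_model(struct iomap_model *iomap, int is_direct)
{
	if (is_direct)
		iomap->directio_ops = &demo_dio_ops;
	else
		iomap->page_ops = &demo_page_ops;
}

/* Models the direct I/O path: it only ever looks at directio_ops. */
static int direct_path_model(const struct iomap_model *iomap, long long pos)
{
	if (iomap->directio_ops && iomap->directio_ops->submit_io)
		return iomap->directio_ops->submit_io(pos);
	return 0;
}

/* Models the buffered path: it only ever looks at page_ops. */
static int buffered_path_model(const struct iomap_model *iomap)
{
	if (iomap->page_ops && iomap->page_ops->page_prepare)
		return iomap->page_ops->page_prepare();
	return 0;
}
```

The trade-off in this model is that nothing in the structure records which
member is live; correctness rests on the convention that a mapping produced
for a direct write is only consumed by the direct I/O path.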

--D

> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/iomap/direct-io.c  | 16 +++++++++++-----
>  include/linux/iomap.h |  1 +
>  2 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 5279029c7a3c..a802e66bf11f 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -59,7 +59,7 @@ int iomap_dio_iopoll(struct kiocb *kiocb, bool spin)
>  EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
>  
>  static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
> -		struct bio *bio)
> +		struct bio *bio, loff_t pos)
>  {
>  	atomic_inc(&dio->ref);
>  
> @@ -67,7 +67,13 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
>  		bio_set_polled(bio, dio->iocb);
>  
>  	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
> -	dio->submit.cookie = submit_bio(bio);
> +	if (iomap->page_ops && iomap->page_ops->submit_io) {
> +		iomap->page_ops->submit_io(bio, file_inode(dio->iocb->ki_filp),
> +				pos);
> +		dio->submit.cookie = BLK_QC_T_NONE;
> +	} else {
> +		dio->submit.cookie = submit_bio(bio);
> +	}
>  }
>  
>  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
> @@ -195,7 +201,7 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>  	get_page(page);
>  	__bio_add_page(bio, page, len, 0);
>  	bio_set_op_attrs(bio, REQ_OP_WRITE, flags);
> -	iomap_dio_submit_bio(dio, iomap, bio);
> +	iomap_dio_submit_bio(dio, iomap, bio, pos);
>  }
>  
>  static loff_t
> @@ -301,11 +307,11 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		iov_iter_advance(dio->submit.iter, n);
>  
>  		dio->size += n;
> -		pos += n;
>  		copied += n;
>  
>  		nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES);
> -		iomap_dio_submit_bio(dio, iomap, bio);
> +		iomap_dio_submit_bio(dio, iomap, bio, pos);
> +		pos += n;
>  	} while (nr_pages);
>  
>  	/*
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 5b2055e8ca8a..6617e4b6fb6d 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -92,6 +92,7 @@ struct iomap_page_ops {
>  			struct iomap *iomap);
>  	void (*page_done)(struct inode *inode, loff_t pos, unsigned copied,
>  			struct page *page, struct iomap *iomap);
> +	dio_submit_t 		*submit_io;
>  };
>  
>  /*
> -- 
> 2.16.4
>

Dave Chinner Aug. 4, 2019, 11:43 p.m. UTC | #2

On Fri, Aug 02, 2019 at 05:00:45PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> This helps filesystems to perform tasks on the bio while
> submitting for I/O. Since btrfs requires the position
> we are working on, pass pos to iomap_dio_submit_bio()
> 
> The correct place for submit_io() is not page_ops. Would it
> be better to rename the structure to something like iomap_io_ops
> or put it directly under struct iomap?
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/iomap/direct-io.c  | 16 +++++++++++-----
>  include/linux/iomap.h |  1 +
>  2 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 5279029c7a3c..a802e66bf11f 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -59,7 +59,7 @@ int iomap_dio_iopoll(struct kiocb *kiocb, bool spin)
>  EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
>  
>  static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
> -		struct bio *bio)
> +		struct bio *bio, loff_t pos)
>  {
>  	atomic_inc(&dio->ref);
>  
> @@ -67,7 +67,13 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
>  		bio_set_polled(bio, dio->iocb);
>  
>  	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
> -	dio->submit.cookie = submit_bio(bio);
> +	if (iomap->page_ops && iomap->page_ops->submit_io) {
> +		iomap->page_ops->submit_io(bio, file_inode(dio->iocb->ki_filp),
> +				pos);
> +		dio->submit.cookie = BLK_QC_T_NONE;
> +	} else {
> +		dio->submit.cookie = submit_bio(bio);
> +	}

I don't really like this at all. Apart from the fact it doesn't work
with block device polling (RWF_HIPRI), the iomap architecture is
supposed to resolve the file offset -> block device + LBA mapping
completely up front and so all that remains to be done is build and
submit the bio(s) to the block device.

What I see here is a hack to work around the fact that btrfs has
implemented both file data transformations and device mapping layer
functionality as a filesystem layer between file data bio building
and device bio submission. And as the btrfs file data mapping
(->iomap_begin) is completely unaware that there is further block
mapping to be done before block device bio submission, any generic
code that btrfs uses requires special IO submission hooks rather
than just calling submit_bio().

I'm not 100% sure what the solution here is, but the one thing we
must resist is turning the iomap code into a mess of custom hooks
that only one filesystem uses. We've been taught this lesson time
and time again - the iomap infrastructure exists because stuff like
bufferheads and the old direct IO code ended up so full of special
case code that it ossified and became unmodifiable and
unmaintainable.

We do not want to go down that path again. 

IMO, the iomap IO model needs to be restructured to support post-IO
and pre-IO data verification/calculation/transformation operations
so all the work that needs to be done at the inode/offset context
level can be done in the iomap path before bio submission/after
bio completion. This will allow infrastructure like fscrypt, data
compression, data checksums, etc to be supported generically, not
just by individual filesystems that provide a ->submit_io hook.

As for the btrfs needing to slice and dice bios for multiple
devices?  That should be done via a block device ->make_request
function, not a custom hook in the iomap code.

That's why I don't like this hook - I think hiding data operations
and/or custom bio manipulations in opaque filesystem callouts is
completely the wrong approach to be taking. We need to do these
things in a generic manner so that all filesystems (and block
devices!) that use the iomap infrastructure can take advantage of
them, not just one of them.

Quite frankly, I don't care if it takes more time and work up front,
I'm tired of expedient hacks to merge code quickly repeatedly biting
us on the arse and wasting far more time sorting out than we would
have spent getting it right in the first place.

Cheers,

Dave.

Goldwyn Rodrigues Aug. 5, 2019, 4:08 p.m. UTC | #3

On Fri, 2019-08-02 at 17:21 -0700,  Darrick J. Wong  wrote:
> On Fri, Aug 02, 2019 at 05:00:45PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > This helps filesystems to perform tasks on the bio while
> > submitting for I/O. Since btrfs requires the position
> > we are working on, pass pos to iomap_dio_submit_bio()
> 
> What /does/ btrfs_submit_direct do, anyway?  Looks like it's a custom
> submission function that ... does something related to setting
> checksums?  And, uh, RAID?

Yes and yes.

> 
> > The correct place for submit_io() is not page_ops. Would it
> > be better to rename the structure to something like iomap_io_ops
> > or put it directly under struct iomap?
> 
> Seeing as the ->iomap_begin handler knows if the requested op is a
> buffered write or a direct write, what if we just declare a union of
> ops?
> 
> e.g.
> 
> struct iomap_page_ops;
> struct iomap_directio_ops;
> 
> struct iomap {
> 	<usual stuff>
> 	union {
> 		const struct iomap_page_ops *page_ops;
> 		const struct iomap_directio_ops *directio_ops;
> 	};
> };

Yes, that looks good. Thanks. I will incorporate it.

> 
> --D
> 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> >  fs/iomap/direct-io.c  | 16 +++++++++++-----
> >  include/linux/iomap.h |  1 +
> >  2 files changed, 12 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index 5279029c7a3c..a802e66bf11f 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -59,7 +59,7 @@ int iomap_dio_iopoll(struct kiocb *kiocb, bool
> > spin)
> >  EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
> >  
> >  static void iomap_dio_submit_bio(struct iomap_dio *dio, struct
> > iomap *iomap,
> > -		struct bio *bio)
> > +		struct bio *bio, loff_t pos)
> >  {
> >  	atomic_inc(&dio->ref);
> >  
> > @@ -67,7 +67,13 @@ static void iomap_dio_submit_bio(struct
> > iomap_dio *dio, struct iomap *iomap,
> >  		bio_set_polled(bio, dio->iocb);
> >  
> >  	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
> > -	dio->submit.cookie = submit_bio(bio);
> > +	if (iomap->page_ops && iomap->page_ops->submit_io) {
> > +		iomap->page_ops->submit_io(bio, file_inode(dio->iocb->ki_filp),
> > +				pos);
> > +		dio->submit.cookie = BLK_QC_T_NONE;
> > +	} else {
> > +		dio->submit.cookie = submit_bio(bio);
> > +	}
> >  }
> >  
> >  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
> > @@ -195,7 +201,7 @@ iomap_dio_zero(struct iomap_dio *dio, struct
> > iomap *iomap, loff_t pos,
> >  	get_page(page);
> >  	__bio_add_page(bio, page, len, 0);
> >  	bio_set_op_attrs(bio, REQ_OP_WRITE, flags);
> > -	iomap_dio_submit_bio(dio, iomap, bio);
> > +	iomap_dio_submit_bio(dio, iomap, bio, pos);
> >  }
> >  
> >  static loff_t
> > @@ -301,11 +307,11 @@ iomap_dio_bio_actor(struct inode *inode,
> > loff_t pos, loff_t length,
> >  		iov_iter_advance(dio->submit.iter, n);
> >  
> >  		dio->size += n;
> > -		pos += n;
> >  		copied += n;
> >  
> >  		nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES);
> > -		iomap_dio_submit_bio(dio, iomap, bio);
> > +		iomap_dio_submit_bio(dio, iomap, bio, pos);
> > +		pos += n;
> >  	} while (nr_pages);
> >  
> >  	/*
> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index 5b2055e8ca8a..6617e4b6fb6d 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -92,6 +92,7 @@ struct iomap_page_ops {
> >  			struct iomap *iomap);
> >  	void (*page_done)(struct inode *inode, loff_t pos,
> > unsigned copied,
> >  			struct page *page, struct iomap *iomap);
> > +	dio_submit_t 		*submit_io;
> >  };
> >  
> >  /*
> > -- 
> > 2.16.4
> > 
> 
>

Goldwyn Rodrigues Aug. 5, 2019, 4:08 p.m. UTC | #4

On Mon, 2019-08-05 at 09:43 +1000, Dave Chinner wrote:
> On Fri, Aug 02, 2019 at 05:00:45PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > This helps filesystems to perform tasks on the bio while
> > submitting for I/O. Since btrfs requires the position
> > we are working on, pass pos to iomap_dio_submit_bio()
> > 
> > The correct place for submit_io() is not page_ops. Would it
> > be better to rename the structure to something like iomap_io_ops
> > or put it directly under struct iomap?
> > 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> >  fs/iomap/direct-io.c  | 16 +++++++++++-----
> >  include/linux/iomap.h |  1 +
> >  2 files changed, 12 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index 5279029c7a3c..a802e66bf11f 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -59,7 +59,7 @@ int iomap_dio_iopoll(struct kiocb *kiocb, bool
> > spin)
> >  EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
> >  
> >  static void iomap_dio_submit_bio(struct iomap_dio *dio, struct
> > iomap *iomap,
> > -		struct bio *bio)
> > +		struct bio *bio, loff_t pos)
> >  {
> >  	atomic_inc(&dio->ref);
> >  
> > @@ -67,7 +67,13 @@ static void iomap_dio_submit_bio(struct
> > iomap_dio *dio, struct iomap *iomap,
> >  		bio_set_polled(bio, dio->iocb);
> >  
> >  	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
> > -	dio->submit.cookie = submit_bio(bio);
> > +	if (iomap->page_ops && iomap->page_ops->submit_io) {
> > +		iomap->page_ops->submit_io(bio, file_inode(dio->iocb->ki_filp),
> > +				pos);
> > +		dio->submit.cookie = BLK_QC_T_NONE;
> > +	} else {
> > +		dio->submit.cookie = submit_bio(bio);
> > +	}
> 
> I don't really like this at all. Apart from the fact it doesn't work
> with block device polling (RWF_HIPRI), the iomap architecture is

That can be added, no? Should be relayed when we clone the bio.

> supposed to resolve the file offset -> block device + LBA mapping
> completely up front and so all that remains to be done is build and
> submit the bio(s) to the block device.
> 
> What I see here is a hack to work around the fact that btrfs has
> implemented both file data transformations and device mapping layer
> functionality as a filesystem layer between file data bio building
> and device bio submission. And as the btrfs file data mapping
> (->iomap_begin) is completely unaware that there is further block
> mapping to be done before block device bio submission, any generic
> code that btrfs uses requires special IO submission hooks rather
> than just calling submit_bio().
> 
> I'm not 100% sure what the solution here is, but the one thing we
> must resist is turning the iomap code into a mess of custom hooks
> that only one filesystem uses. We've been taught this lesson time
> and time again - the iomap infrastructure exists because stuff like
> bufferheads and the old direct IO code ended up so full of special
> case code that it ossified and became unmodifiable and
> unmaintainable.
> 
> We do not want to go down that path again. 
> 
> IMO, the iomap IO model needs to be restructured to support post-IO
> and pre-IO data verification/calculation/transformation operations
> so all the work that needs to be done at the inode/offset context
> level can be done in the iomap path before bio submission/after
> bio completion. This will allow infrastructure like fscrypt, data
> compression, data checksums, etc to be supported generically, not
> just by individual filesystems that provide a ->submit_io hook.
> 
> As for the btrfs needing to slice and dice bios for multiple
> devices?  That should be done via a block device ->make_request
> function, not a custom hook in the iomap code.

btrfs handles, replicates, and stores metadata and data differently.
We would still need an entry point in the iomap code to handle the
I/O submission.

> 
> That's why I don't like this hook - I think hiding data operations
> and/or custom bio manipulations in opaque filesystem callouts is
> completely the wrong approach to be taking. We need to do these
> things in a generic manner so that all filesystems (and block
> devices!) that use the iomap infrastructure can take advantage of
> them, not just one of them.
> 
> Quite frankly, I don't care if it takes more time and work up front,
> I'm tired of expedient hacks to merge code quickly repeatedly biting
> us on the arse and wasting far more time sorting out than we would
> have spent getting it right in the first place.

Sure. I am open to ideas. What are you proposing?

Dave Chinner Aug. 5, 2019, 9:54 p.m. UTC | #5

On Mon, Aug 05, 2019 at 04:08:43PM +0000, Goldwyn Rodrigues wrote:
> On Mon, 2019-08-05 at 09:43 +1000, Dave Chinner wrote:
> > On Fri, Aug 02, 2019 at 05:00:45PM -0500, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > 
> > > This helps filesystems to perform tasks on the bio while
> > > submitting for I/O. Since btrfs requires the position
> > > we are working on, pass pos to iomap_dio_submit_bio()
> > > 
> > > The correct place for submit_io() is not page_ops. Would it
> > > be better to rename the structure to something like iomap_io_ops
> > > or put it directly under struct iomap?
> > > 
> > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > ---
> > >  fs/iomap/direct-io.c  | 16 +++++++++++-----
> > >  include/linux/iomap.h |  1 +
> > >  2 files changed, 12 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index 5279029c7a3c..a802e66bf11f 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -59,7 +59,7 @@ int iomap_dio_iopoll(struct kiocb *kiocb, bool
> > > spin)
> > >  EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
> > >  
> > >  static void iomap_dio_submit_bio(struct iomap_dio *dio, struct
> > > iomap *iomap,
> > > -		struct bio *bio)
> > > +		struct bio *bio, loff_t pos)
> > >  {
> > >  	atomic_inc(&dio->ref);
> > >  
> > > @@ -67,7 +67,13 @@ static void iomap_dio_submit_bio(struct
> > > iomap_dio *dio, struct iomap *iomap,
> > >  		bio_set_polled(bio, dio->iocb);
> > >  
> > >  	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
> > > -	dio->submit.cookie = submit_bio(bio);
> > > +	if (iomap->page_ops && iomap->page_ops->submit_io) {
> > > +		iomap->page_ops->submit_io(bio, file_inode(dio->iocb->ki_filp),
> > > +				pos);
> > > +		dio->submit.cookie = BLK_QC_T_NONE;
> > > +	} else {
> > > +		dio->submit.cookie = submit_bio(bio);
> > > +	}
> > 
> > I don't really like this at all. Apart from the fact it doesn't work
> > with block device polling (RWF_HIPRI), the iomap architecture is
> 
> That can be added, no? Should be relayed when we clone the bio.

No idea how that all is supposed to work when you split a single bio
into multiple bios. I'm pretty sure the iomap code is broken for
that case, too -  Jens was silent on how to fix other than to say
"it wasn't important so we didn't care to make sure it worked". So
it's not clear to me exactly how block polling is supposed to work
when an IO needs to be split into multiple submissions...

> > supposed to resolve the file offset -> block device + LBA mapping
> > completely up front and so all that remains to be done is build and
> > submit the bio(s) to the block device.
> > 
> > What I see here is a hack to work around the fact that btrfs has
> > implemented both file data transformations and device mapping layer
> > functionality as a filesystem layer between file data bio building
> > and device bio submission. And as the btrfs file data mapping
> > (->iomap_begin) is completely unaware that there is further block
> > mapping to be done before block device bio submission, any generic
> > code that btrfs uses requires special IO submission hooks rather
> > than just calling submit_bio().
> > 
> > I'm not 100% sure what the solution here is, but the one thing we
> > must resist is turning the iomap code into a mess of custom hooks
> > that only one filesystem uses. We've been taught this lesson time
> > and time again - the iomap infrastructure exists because stuff like
> > bufferheads and the old direct IO code ended up so full of special
> > case code that it ossified and became unmodifiable and
> > unmaintainable.
> > 
> > We do not want to go down that path again. 
> > 
> > IMO, the iomap IO model needs to be restructured to support post-IO
> > and pre-IO data verification/calculation/transformation operations
> > so all the work that needs to be done at the inode/offset context
> > level can be done in the iomap path before bio submission/after
> > bio completion. This will allow infrastructure like fscrypt, data
> > compression, data checksums, etc to be supported generically, not
> > just by individual filesystems that provide a ->submit_io hook.
> > 
> > As for the btrfs needing to slice and dice bios for multiple
> > devices?  That should be done via a block device ->make_request
> > function, not a custom hook in the iomap code.
> 
> btrfs differentiates the way how metadata and data is
> handled/replicated/stored. We would still need an entry point in the
> iomap code to handle the I/O submission.

This is a data IO path. How metadata is stored/replicated is
irrelevant to this code path...

> > That's why I don't like this hook - I think hiding data operations
> > and/or custom bio manipulations in opaque filesystem callouts is
> > completely the wrong approach to be taking. We need to do these
> > things in a generic manner so that all filesystems (and block
> > devices!) that use the iomap infrastructure can take advantage of
> > them, not just one of them.
> > 
> > Quite frankly, I don't care if it takes more time and work up front,
> > I'm tired of expedient hacks to merge code quickly repeatedly biting
> > us on the arse and wasting far more time sorting out than we would
> > have spent getting it right in the first place.
> 
> Sure. I am open to ideas. What are you proposing?

That you think about how to normalise the btrfs IO path to fit into
the standard iomap/blockdev model, rather than adding special hacks
to iomap to allow an opaque, custom, IO model to be shoe-horned into
the generic code.

For example, post-read validation requires end-io processing,
whether it be encryption, decompression, CRC/T10 validation, etc. The
iomap end-io completion has all the information needed to run these
things, whether it be a callout to the filesystem for custom
processing checking, or a generic "decrypt into supplied data page"
sort of thing. These all need to be done in the same place, so we
should have common support for this. And I suspect the iomap should
also state in a flag that something like this is necessary (e.g.
IOMAP_FL_ENCRYPTED indicates post-IO decryption needs to be run).

Similarly, on the IO submit side we have need for a pre-IO
processing hook. That can be used to encrypt, compress, calculate
data CRCs, do pre-IO COW processing (XFS requires a hook for this),
etc.

These hooks are needed for both buffered and direct IO, and they
are needed for more filesystems than just btrfs. fscrypt will need
them, XFS needs them, etc. So rather than hide data CRCs,
compression, and encryption deep inside the btrfs code, pull it up
into common layers that are called by the generic code. This will
leave us with just the things like mirroring, raid, IO retries, etc
below the iomap code, and that's all stuff that can be done behind a
->make_request function that is passed a bio...
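
[Editor's sketch] The flag-driven pre-IO/post-IO model described above can
be illustrated with a small userspace program. Everything here is a toy and
an assumption: MODEL_FL_ENCRYPTED and MODEL_FL_CHECKSUM are hypothetical
flags patterned on the IOMAP_FL_ENCRYPTED example, toy_crypt() is a
self-inverse XOR standing in for real encryption, and toy_csum() stands in
for a real data checksum. The mapping states which transformations apply,
and generic code runs them before submission and after completion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_FL_ENCRYPTED 0x1u   /* hypothetical flag */
#define MODEL_FL_CHECKSUM  0x2u   /* hypothetical flag */

struct mapping_model {
	unsigned flags;   /* which transformations this mapping needs */
	uint32_t csum;    /* checksum of the bytes as written to disk */
};

/* Toy checksum, not a real CRC. */
static uint32_t toy_csum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31u + buf[i];
	return sum;
}

/* Toy cipher: XOR with a constant, so applying it twice is a no-op. */
static void toy_crypt(uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		buf[i] ^= 0x5a;
}

/* Generic pre-IO step, run before the (modelled) bio is submitted. */
static void pre_io_model(struct mapping_model *map, uint8_t *buf, size_t len)
{
	if (map->flags & MODEL_FL_ENCRYPTED)
		toy_crypt(buf, len);
	if (map->flags & MODEL_FL_CHECKSUM)
		map->csum = toy_csum(buf, len);
}

/* Generic post-IO step, run from the (modelled) end-io completion. */
static int post_io_model(const struct mapping_model *map, uint8_t *buf,
			 size_t len)
{
	if ((map->flags & MODEL_FL_CHECKSUM) && toy_csum(buf, len) != map->csum)
		return -1;        /* data failed verification */
	if (map->flags & MODEL_FL_ENCRYPTED)
		toy_crypt(buf, len);
	return 0;
}
```

Here the filesystem's only job is to set the flags on the mapping; the
order in which the generic steps run is fixed by the common code in this
sketch, which is one possible design and not something the thread settles.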

Cheers,

Dave.

Gao Xiang Aug. 8, 2019, 4:26 a.m. UTC | #6

On Tue, Aug 06, 2019 at 07:54:58AM +1000, Dave Chinner wrote:
> On Mon, Aug 05, 2019 at 04:08:43PM +0000, Goldwyn Rodrigues wrote:
> > On Mon, 2019-08-05 at 09:43 +1000, Dave Chinner wrote:
> > > On Fri, Aug 02, 2019 at 05:00:45PM -0500, Goldwyn Rodrigues wrote:
> > > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > 
> > > > This helps filesystems to perform tasks on the bio while
> > > > submitting for I/O. Since btrfs requires the position
> > > > we are working on, pass pos to iomap_dio_submit_bio()
> > > > 
> > > > The correct place for submit_io() is not page_ops. Would it
> > > > be better to rename the structure to something like iomap_io_ops
> > > > or put it directly under struct iomap?
> > > > 
> > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > ---
> > > >  fs/iomap/direct-io.c  | 16 +++++++++++-----
> > > >  include/linux/iomap.h |  1 +
> > > >  2 files changed, 12 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > > index 5279029c7a3c..a802e66bf11f 100644
> > > > --- a/fs/iomap/direct-io.c
> > > > +++ b/fs/iomap/direct-io.c
> > > > @@ -59,7 +59,7 @@ int iomap_dio_iopoll(struct kiocb *kiocb, bool
> > > > spin)
> > > >  EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
> > > >  
> > > >  static void iomap_dio_submit_bio(struct iomap_dio *dio, struct
> > > > iomap *iomap,
> > > > -		struct bio *bio)
> > > > +		struct bio *bio, loff_t pos)
> > > >  {
> > > >  	atomic_inc(&dio->ref);
> > > >  
> > > > @@ -67,7 +67,13 @@ static void iomap_dio_submit_bio(struct
> > > > iomap_dio *dio, struct iomap *iomap,
> > > >  		bio_set_polled(bio, dio->iocb);
> > > >  
> > > >  	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
> > > > -	dio->submit.cookie = submit_bio(bio);
> > > > +	if (iomap->page_ops && iomap->page_ops->submit_io) {
> > > > +		iomap->page_ops->submit_io(bio, file_inode(dio->iocb->ki_filp),
> > > > +				pos);
> > > > +		dio->submit.cookie = BLK_QC_T_NONE;
> > > > +	} else {
> > > > +		dio->submit.cookie = submit_bio(bio);
> > > > +	}
> > > 
> > > I don't really like this at all. Apart from the fact it doesn't work
> > > with block device polling (RWF_HIPRI), the iomap architecture is
> > 
> > That can be added, no? Should be relayed when we clone the bio.
> 
> No idea how that all is supposed to work when you split a single bio
> into multiple bios. I'm pretty sure the iomap code is broken for
> that case, too -  Jens was silent on how to fix other than to say
> "it wasn't important so we didn't care to make sure it worked". So
> it's not clear to me exactly how block polling is supposed to work
> when an IO needs to be split into multiple submissions...
> 
> > > supposed to resolve the file offset -> block device + LBA mapping
> > > completely up front and so all that remains to be done is build and
> > > submit the bio(s) to the block device.
> > > 
> > > What I see here is a hack to work around the fact that btrfs has
> > > implemented both file data transformations and device mapping layer
> > > functionality as a filesystem layer between file data bio building
> > > and device bio submission. And as the btrfs file data mapping
> > > (->iomap_begin) is completely unaware that there is further block
> > > mapping to be done before block device bio submission, any generic
> > > code that btrfs uses requires special IO submission hooks rather
> > > than just calling submit_bio().
> > > 
> > > I'm not 100% sure what the solution here is, but the one thing we
> > > must resist is turning the iomap code into a mess of custom hooks
> > > that only one filesystem uses. We've been taught this lesson time
> > > and time again - the iomap infrastructure exists because stuff like
> > > bufferheads and the old direct IO code ended up so full of special
> > > case code that it ossified and became unmodifiable and
> > > unmaintainable.
> > > 
> > > We do not want to go down that path again. 
> > > 
> > > IMO, the iomap IO model needs to be restructured to support post-IO
> > > and pre-IO data verification/calculation/transformation operations
> > > so all the work that needs to be done at the inode/offset context
> > > level can be done in the iomap path before bio submission/after
> > > bio completion. This will allow infrastructure like fscrypt, data
> > > compression, data checksums, etc to be supported generically, not
> > > just by individual filesystems that provide a ->submit_io hook.
> > > 
> > > As for the btrfs needing to slice and dice bios for multiple
> > > devices?  That should be done via a block device ->make_request
> > > function, not a custom hook in the iomap code.
> > 
> > btrfs differentiates the way how metadata and data is
> > handled/replicated/stored. We would still need an entry point in the
> > iomap code to handle the I/O submission.
> 
> This is a data IO path. How metadata is stored/replicated is
> irrelevant to this code path...
> 
> > > That's why I don't like this hook - I think hiding data operations
> > > and/or custom bio manipulations in opaque filesystem callouts is
> > > completely the wrong approach to be taking. We need to do these
> > > things in a generic manner so that all filesystems (and block
> > > devices!) that use the iomap infrastructure can take advantage of
> > > them, not just one of them.
> > > 
> > > Quite frankly, I don't care if it takes more time and work up front,
> > > I'm tired of expedient hacks to merge code quickly repeatedly biting
> > > us on the arse and wasting far more time sorting out than we would
> > > have spent getting it right in the first place.
> > 
> > Sure. I am open to ideas. What are you proposing?
> 
> That you think about how to normalise the btrfs IO path to fit into
> the standard iomap/blockdev model, rather than adding special hacks
> to iomap to allow an opaque, custom, IO model to be shoe-horned into
> the generic code.
> 
> For example, post-read validation requires end-io processing,
> whether it be encryption, decompression, CRC/T10 validation, etc. The
> iomap end-io completion has all the information needed to run these
> things, whether it be a callout to the filesystem for custom
> processing checking, or a generic "decrypt into supplied data page"
> sort of thing. These all need to be done in the same place, so we
> should have common support for this. And I suspect the iomap should
> also state in a flag that something like this is necessary (e.g.
> IOMAP_FL_ENCRYPTED indicates post-IO decryption needs to be run).

To add a few words on this topic: I think a generic, all-encompassing
IOMAP approach to encryption, decompression, and verification would
struggle to fit all filesystems, and seems unnecessary, especially when
data compression is involved.

Since decompression expands the data, the logical data size is not the
same as the physical data size:

1) IO submission should be applied to all physical data, but data
   decompression is eventually applied to the logical mapping.
   EROFS, for example, submits all physical pages with page->private
   pointing to a management structure that also tracks all logical
   pages for later decompression. It also uses a time-sharing approach
   to store the L2P mapping array in these allocated pages themselves.

   In addition, IOMAP would also need to handle the difference between
   fixed-sized output and fixed-sized input, which is filesystem
   specific, and I am not sure whether adding that much code for each
   requirement is really good for IOMAP;

2) The post-read processing order is also negotiable.
   There is no benefit in selecting verity->decrypt rather than
   decrypt->verity, but when compression is involved, different
   filesystem users could select different orders:

    1. decrypt->verity->decompress

    2. verity->decompress->decrypt

    3. decompress->decrypt->verity

   1. and 2. could need less computation since they process
   compressed data, and the security is good enough since
   the behavior of the decompression algorithm is deterministic.
   3. could need more computation.

All I want to say is that post-processing is complicated, since we have
many choices if encryption, decompression and verification are all involved.

Maybe introducing a core subset into IOMAP is better for long-term
maintenance and performance. We should consider it
more carefully.
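
To make the computation argument concrete, here is a minimal userspace
sketch (not kernel code; the enum, the function name and the 4:1
compression ratio used below are all made up for illustration) that
counts how many input bytes each ordering has to touch. Steps that run
before decompression see the physical size, steps after it see the
logical size:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical post-read step kinds; none of these names exist in iomap. */
enum step { STEP_DECRYPT, STEP_VERITY, STEP_DECOMPRESS };

/*
 * Sum the input bytes every step has to touch for a given order.
 * Steps before decompression see the physical (on-disk) size,
 * steps after it see the logical (expanded) size.
 */
static size_t bytes_touched(const enum step *order, size_t nsteps,
			    size_t physical, size_t logical)
{
	size_t cur = physical;	/* size of the data entering the next step */
	size_t total = 0;
	size_t i;

	assert(physical <= logical);
	for (i = 0; i < nsteps; i++) {
		total += cur;
		if (order[i] == STEP_DECOMPRESS)
			cur = logical;
	}
	return total;
}

/* Order 1 and order 3 from the list above. */
static const enum step order1[] = { STEP_DECRYPT, STEP_VERITY, STEP_DECOMPRESS };
static const enum step order3[] = { STEP_DECOMPRESS, STEP_DECRYPT, STEP_VERITY };
```

With a hypothetical 4 KiB physical block expanding to 16 KiB,
bytes_touched(order1, 3, 4096, 16384) gives 12288 and
bytes_touched(order3, 3, 4096, 16384) gives 36864. This only counts
bytes; it says nothing about the relative cost per byte of each step.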

Thanks,
Gao Xiang

> 
> Similarly, on the IO submit side we have need for a pre-IO
> processing hook. That can be used to encrypt, compress, calculate
> data CRCs, do pre-IO COW processing (XFS requires a hook for this),
> etc.
> 
> These hooks are needed for both buffered and direct IO, and they
> are needed for more filesystems than just btrfs. fscrypt will need
> them, XFS needs them, etc. So rather than hide data CRCs,
> compression, and encryption deep inside the btrfs code, pull it up
> into common layers that are called by the generic code. This will
> leave with just the things like mirroring, raid, IO retries, etc
> below the iomap code, and that's all stuff that can be done behind a
> ->make_request function that is passed a bio...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Gao Xiang Aug. 8, 2019, 4:52 a.m. UTC | #7

On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> On Tue, Aug 06, 2019 at 07:54:58AM +1000, Dave Chinner wrote:
> > On Mon, Aug 05, 2019 at 04:08:43PM +0000, Goldwyn Rodrigues wrote:
> > > On Mon, 2019-08-05 at 09:43 +1000, Dave Chinner wrote:
> > > > On Fri, Aug 02, 2019 at 05:00:45PM -0500, Goldwyn Rodrigues wrote:
> > > > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > > 
> > > > > This helps filesystems to perform tasks on the bio while
> > > > > submitting for I/O. Since btrfs requires the position
> > > > > we are working on, pass pos to iomap_dio_submit_bio()
> > > > > 
> > > > > The correct place for submit_io() is not page_ops. Would it
> > > > > better to rename the structure to something like iomap_io_ops
> > > > > or put it directly under struct iomap?
> > > > > 
> > > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > > ---
> > > > >  fs/iomap/direct-io.c  | 16 +++++++++++-----
> > > > >  include/linux/iomap.h |  1 +
> > > > >  2 files changed, 12 insertions(+), 5 deletions(-)
> > > > > 
> > > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > > > index 5279029c7a3c..a802e66bf11f 100644
> > > > > --- a/fs/iomap/direct-io.c
> > > > > +++ b/fs/iomap/direct-io.c
> > > > > @@ -59,7 +59,7 @@ int iomap_dio_iopoll(struct kiocb *kiocb, bool
> > > > > spin)
> > > > >  EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
> > > > >  
> > > > >  static void iomap_dio_submit_bio(struct iomap_dio *dio, struct
> > > > > iomap *iomap,
> > > > > -		struct bio *bio)
> > > > > +		struct bio *bio, loff_t pos)
> > > > >  {
> > > > >  	atomic_inc(&dio->ref);
> > > > >  
> > > > > @@ -67,7 +67,13 @@ static void iomap_dio_submit_bio(struct
> > > > > iomap_dio *dio, struct iomap *iomap,
> > > > >  		bio_set_polled(bio, dio->iocb);
> > > > >  
> > > > >  	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
> > > > > -	dio->submit.cookie = submit_bio(bio);
> > > > > +	if (iomap->page_ops && iomap->page_ops->submit_io) {
> > > > > +		iomap->page_ops->submit_io(bio, file_inode(dio-
> > > > > >iocb->ki_filp),
> > > > > +				pos);
> > > > > +		dio->submit.cookie = BLK_QC_T_NONE;
> > > > > +	} else {
> > > > > +		dio->submit.cookie = submit_bio(bio);
> > > > > +	}
> > > > 
> > > > I don't really like this at all. Apart from the fact it doesn't work
> > > > with block device polling (RWF_HIPRI), the iomap architecture is
> > > 
> > > That can be added, no? Should be relayed when we clone the bio.
> > 
> > No idea how that all is supposed to work when you split a single bio
> > into multiple bios. I'm pretty sure the iomap code is broken for
> > that case, too -  Jens was silent on how to fix other than to say
> > "it wasn't important so we didn't care to make sure it worked". So
> > it's not clear to me exactly how block polling is supposed to work
> > when an IO needs to be split into multiple submissions...
> > 
> > > > supposed to resolve the file offset -> block device + LBA mapping
> > > > completely up front and so all that remains to be done is build and
> > > > submit the bio(s) to the block device.
> > > > 
> > > > What I see here is a hack to work around the fact that btrfs has
> > > > implemented both file data transformations and device mapping layer
> > > > functionality as a filesystem layer between file data bio building
> > > > and device bio submission. And as the btrfs file data mapping
> > > > (->iomap_begin) is completely unaware that there is further block
> > > > mapping to be done before block device bio submission, any generic
> > > > code that btrfs uses requires special IO submission hooks rather
> > > > than just calling submit_bio().
> > > > 
> > > > I'm not 100% sure what the solution here is, but the one thing we
> > > > must resist is turning the iomap code into a mess of custom hooks
> > > > that only one filesystem uses. We've been taught this lesson time
> > > > and time again - the iomap infrastructure exists because stuff like
> > > > bufferheads and the old direct IO code ended up so full of special
> > > > case code that it ossified and became unmodifiable and
> > > > unmaintainable.
> > > > 
> > > > We do not want to go down that path again. 
> > > > 
> > > > IMO, the iomap IO model needs to be restructured to support post-IO
> > > > and pre-IO data verification/calculation/transformation operations
> > > > so all the work that needs to be done at the inode/offset context
> > > > level can be done in the iomap path before bio submission/after
> > > > bio completion. This will allow infrastructure like fscrypt, data
> > > > compression, data checksums, etc to be supported generically, not
> > > > just by individual filesystems that provide a ->submit_io hook.
> > > > 
> > > > As for the btrfs needing to slice and dice bios for multiple
> > > > devices?  That should be done via a block device ->make_request
> > > > function, not a custom hook in the iomap code.
> > > 
> > > btrfs differentiates the way how metadata and data is
> > > handled/replicated/stored. We would still need an entry point in the
> > > iomap code to handle the I/O submission.
> > 
> > This is a data IO path. How metadata is stored/replicated is
> > irrelevant to this code path...
> > 
> > > > That's why I don't like this hook - I think hiding data operations
> > > > and/or custom bio manipulations in opaque filesystem callouts is
> > > > completely the wrong approach to be taking. We need to do these
> > > > things in a generic manner so that all filesystems (and block
> > > > devices!) that use the iomap infrastructure can take advantage of
> > > > them, not just one of them.
> > > > 
> > > > Quite frankly, I don't care if it takes more time and work up front,
> > > > I'm tired of expedient hacks to merge code quickly repeatedly biting
> > > > us on the arse and wasting far more time sorting out than we would
> > > > have spent getting it right in the first place.
> > > 
> > > Sure. I am open to ideas. What are you proposing?
> > 
> > That you think about how to normalise the btrfs IO path to fit into
> > the standard iomap/blockdev model, rather than adding special hacks
> > to iomap to allow an opaque, custom, IO model to be shoe-horned into
> > the generic code.
> > 
> > For example, post-read validation requires end-io processing,
> > whether it be encryption, decompression, CRC/T10 validation, etc. The
> > iomap end-io completion has all the information needed to run these
> > things, whether it be a callout to the filesystem for custom
> > processing checking, or a generic "decrypt into supplied data page"
> > sort of thing. These all need to be done in the same place, so we
> > should have common support for this. And I suspect the iomap should
> > also state in a flag that something like this is necessary (e.g.
> > IOMAP_FL_ENCRYPTED indicates post-IO decryption needs to be run).
> 
> Add some word to this topic, I think introducing a generic full approach
> to IOMAP for encryption, decompression, verification is hard to meet all
> filesystems, and seems unnecessary, especially data compression is involved.
> 
> Since the data decompression will expand the data, therefore the logical
> data size is not same as the physical data size:
> 
> 1) IO submission should be applied to all physical data, but data
>    decompression will be eventually applied to logical mapping.
>    As for EROFS, it submits all physical pages with page->private
>    points to management structure which maintain all logical pages
>    as well for further decompression. And time-sharing approach is
>    used to save the L2P mapping array in these allocated pages itself.
> 
>    In addition, IOMAP also needs to consider fixed-sized output/input
>    difference which is filesystem specific and I have no idea whether
>    involveing too many code for each requirement is really good for IOMAP;
> 
> 2) The post-read processing order is another negotiable stuff.
>    Although there is no benefit to select verity->decrypt rather than
>    decrypt->verity; but when compression is involved, the different
>    orders could be selected by different filesystem users:
> 
>     1. decrypt->verity->decompress
> 
>     2. verity->decompress->decrypt
> 
>     3. decompress->decrypt->verity

Maybe "4. decrypt->decompress->verity" is useful as well.

Some post-read processing steps operate on the physical data size and
others operate on the logical data size.

> 
>    1. and 2. could cause less computation since it processes

and less verification data IO as well.

>    compressed data, and the security is good enough since
>    the behavior of decompression algorithm is deterministic.
>    3 could cause more computation.
> 
> All I want to say is the post process is so complicated since we have
> many selection if encryption, decompression, verification are all involved.

To correct the wording above, I mean "all I want to say is that pre/post
processing is complicated", and therefore a fully generic approach for
decryption, decompression and verification is hard.

Thanks,
Gao Xiang

> 
> Maybe introduce a core subset to IOMAP is better for long-term
> maintainment and better performance. And we should consider it
> more carefully.
> 
> Thanks,
> Gao Xiang
> 
> > 
> > Similarly, on the IO submit side we have need for a pre-IO
> > processing hook. That can be used to encrypt, compress, calculate
> > data CRCs, do pre-IO COW processing (XFS requires a hook for this),
> > etc.
> > 
> > These hooks are needed for both buffered and direct IO, and they
> > are needed for more filesystems than just btrfs. fscrypt will need
> > them, XFS needs them, etc. So rather than hide data CRCs,
> > compression, and encryption deep inside the btrfs code, pull it up
> > into common layers that are called by the generic code. This will
> > leave with just the things like mirroring, raid, IO retries, etc
> > below the iomap code, and that's all stuff that can be done behind a
> > ->make_request function that is passed a bio...
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com

Eric Biggers Aug. 8, 2019, 5:49 a.m. UTC | #8

On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> > 
> > > > That's why I don't like this hook - I think hiding data operations
> > > > and/or custom bio manipulations in opaque filesystem callouts is
> > > > completely the wrong approach to be taking. We need to do these
> > > > things in a generic manner so that all filesystems (and block
> > > > devices!) that use the iomap infrastructure can take advantage of
> > > > them, not just one of them.
> > > > 
> > > > Quite frankly, I don't care if it takes more time and work up front,
> > > > I'm tired of expedient hacks to merge code quickly repeatedly biting
> > > > us on the arse and wasting far more time sorting out than we would
> > > > have spent getting it right in the first place.
> > > 
> > > Sure. I am open to ideas. What are you proposing?
> > 
> > That you think about how to normalise the btrfs IO path to fit into
> > the standard iomap/blockdev model, rather than adding special hacks
> > to iomap to allow an opaque, custom, IO model to be shoe-horned into
> > the generic code.
> > 
> > For example, post-read validation requires end-io processing,
> > whether it be encryption, decompression, CRC/T10 validation, etc. The
> > iomap end-io completion has all the information needed to run these
> > things, whether it be a callout to the filesystem for custom
> > processing checking, or a generic "decrypt into supplied data page"
> > sort of thing. These all need to be done in the same place, so we
> > should have common support for this. And I suspect the iomap should
> > also state in a flag that something like this is necessary (e.g.
> > IOMAP_FL_ENCRYPTED indicates post-IO decryption needs to be run).
> 
> Add some word to this topic, I think introducing a generic full approach
> to IOMAP for encryption, decompression, verification is hard to meet all
> filesystems, and seems unnecessary, especially data compression is involved.
> 
> Since the data decompression will expand the data, therefore the logical
> data size is not same as the physical data size:
> 
> 1) IO submission should be applied to all physical data, but data
>    decompression will be eventually applied to logical mapping.
>    As for EROFS, it submits all physical pages with page->private
>    points to management structure which maintain all logical pages
>    as well for further decompression. And time-sharing approach is
>    used to save the L2P mapping array in these allocated pages itself.
> 
>    In addition, IOMAP also needs to consider fixed-sized output/input
>    difference which is filesystem specific and I have no idea whether
>    involveing too many code for each requirement is really good for IOMAP;
> 
> 2) The post-read processing order is another negotiable stuff.
>    Although there is no benefit to select verity->decrypt rather than
>    decrypt->verity; but when compression is involved, the different
>    orders could be selected by different filesystem users:
> 
>     1. decrypt->verity->decompress
> 
>     2. verity->decompress->decrypt
> 
>     3. decompress->decrypt->verity
> 
>    1. and 2. could cause less computation since it processes
>    compressed data, and the security is good enough since
>    the behavior of decompression algorithm is deterministic.
>    3 could cause more computation.
> 
> All I want to say is the post process is so complicated since we have
> many selection if encryption, decompression, verification are all involved.
> 
> Maybe introduce a core subset to IOMAP is better for long-term
> maintainment and better performance. And we should consider it
> more carefully.
> 

FWIW, the only order that actually makes sense is decrypt->decompress->verity.

Decrypt before decompress, i.e. encrypt after compress, because only the
plaintext can be compressible; the ciphertext isn't.

Verity last, on the original data, because otherwise the file hash that
fs-verity reports would be specific to that particular inode on-disk and
therefore would be useless for authenticating the file's user-visible contents.

[By "verity" I mean specifically fs-verity.  Integrity-only block checksums are
a different case; those can be done at any point, but doing them on the
compressed data would make sense as then there would be less to checksum.

And yes, compression+encryption leaks information about the original data, so
may not be advisable.  My point is just that if the two are nevertheless
combined, it only makes sense to compress the plaintext.]
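
A toy illustration of the compress-then-encrypt point, as a userspace
sketch rather than anything taken from fscrypt or btrfs. The run-length
encoder and the XOR keystream "cipher" below are stand-ins invented for
this example (the keystream is not secure). A block of zeroes encodes to
a few dozen bytes, but the same block after encryption has no runs left
to exploit:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy run-length encoder: returns the encoded size only (2 bytes per run). */
static size_t rle_size(const uint8_t *buf, size_t len)
{
	size_t out = 0, i = 0;

	while (i < len) {
		size_t run = 1;

		/* extend the run while the next byte matches, capped at 255 */
		while (i + run < len && run < 255 && buf[i + run] == buf[i])
			run++;
		out += 2;	/* one count byte plus one value byte */
		i += run;
	}
	return out;
}

/* Toy stream "cipher": XOR with an xorshift32 keystream. NOT secure. */
static void toy_encrypt(uint8_t *buf, size_t len, uint32_t key)
{
	uint32_t s = key;
	size_t i;

	assert(key != 0);	/* xorshift32 stays at zero for a zero seed */
	for (i = 0; i < len; i++) {
		s ^= s << 13;
		s ^= s >> 17;
		s ^= s << 5;
		buf[i] ^= (uint8_t)s;
	}
}
```

For a 4096-byte block of zeroes, rle_size() returns 34. After
toy_encrypt() on that block, the encoded size comes out larger than the
input, which is the reason for compressing the plaintext first.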

- Eric

Gao Xiang Aug. 8, 2019, 6:28 a.m. UTC | #9

Hi Eric,

On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> > > 
> > > > > That's why I don't like this hook - I think hiding data operations
> > > > > and/or custom bio manipulations in opaque filesystem callouts is
> > > > > completely the wrong approach to be taking. We need to do these
> > > > > things in a generic manner so that all filesystems (and block
> > > > > devices!) that use the iomap infrastructure can take advantage of
> > > > > them, not just one of them.
> > > > > 
> > > > > Quite frankly, I don't care if it takes more time and work up front,
> > > > > I'm tired of expedient hacks to merge code quickly repeatedly biting
> > > > > us on the arse and wasting far more time sorting out than we would
> > > > > have spent getting it right in the first place.
> > > > 
> > > > Sure. I am open to ideas. What are you proposing?
> > > 
> > > That you think about how to normalise the btrfs IO path to fit into
> > > the standard iomap/blockdev model, rather than adding special hacks
> > > to iomap to allow an opaque, custom, IO model to be shoe-horned into
> > > the generic code.
> > > 
> > > For example, post-read validation requires end-io processing,
> > > whether it be encryption, decompression, CRC/T10 validation, etc. The
> > > iomap end-io completion has all the information needed to run these
> > > things, whether it be a callout to the filesystem for custom
> > > processing checking, or a generic "decrypt into supplied data page"
> > > sort of thing. These all need to be done in the same place, so we
> > > should have common support for this. And I suspect the iomap should
> > > also state in a flag that something like this is necessary (e.g.
> > > IOMAP_FL_ENCRYPTED indicates post-IO decryption needs to be run).
> > 
> > Add some word to this topic, I think introducing a generic full approach
> > to IOMAP for encryption, decompression, verification is hard to meet all
> > filesystems, and seems unnecessary, especially data compression is involved.
> > 
> > Since the data decompression will expand the data, therefore the logical
> > data size is not same as the physical data size:
> > 
> > 1) IO submission should be applied to all physical data, but data
> >    decompression will be eventually applied to logical mapping.
> >    As for EROFS, it submits all physical pages with page->private
> >    points to management structure which maintain all logical pages
> >    as well for further decompression. And time-sharing approach is
> >    used to save the L2P mapping array in these allocated pages itself.
> > 
> >    In addition, IOMAP also needs to consider fixed-sized output/input
> >    difference which is filesystem specific and I have no idea whether
> >    involveing too many code for each requirement is really good for IOMAP;
> > 
> > 2) The post-read processing order is another negotiable stuff.
> >    Although there is no benefit to select verity->decrypt rather than
> >    decrypt->verity; but when compression is involved, the different
> >    orders could be selected by different filesystem users:
> > 
> >     1. decrypt->verity->decompress
> > 
> >     2. verity->decompress->decrypt
> > 
> >     3. decompress->decrypt->verity
> > 
> >    1. and 2. could cause less computation since it processes
> >    compressed data, and the security is good enough since
> >    the behavior of decompression algorithm is deterministic.
> >    3 could cause more computation.
> > 
> > All I want to say is the post process is so complicated since we have
> > many selection if encryption, decompression, verification are all involved.
> > 
> > Maybe introduce a core subset to IOMAP is better for long-term
> > maintainment and better performance. And we should consider it
> > more carefully.
> > 
> 
> FWIW, the only order that actually makes sense is decrypt->decompress->verity.

I am not talking just about fs-verity, which you mention below.

> 
> Decrypt before decompress, i.e. encrypt after compress, because only the
> plaintext can be compressible; the ciphertext isn't.

There could be some potential users who need partial decryption/decompression,
but that is minor. I don't want to go into that detail in this thread.

> 
> Verity last, on the original data, because otherwise the file hash that
> fs-verity reports would be specific to that particular inode on-disk and
> therefore would be useless for authenticating the file's user-visible contents.
> 
> [By "verity" I mean specifically fs-verity.  Integrity-only block checksums are
> a different case; those can be done at any point, but doing them on the
> compressed data would make sense as then there would be less to checksum.
> 
> And yes, compression+encryption leaks information about the original data, so
> may not be advisable.  My point is just that if the two are nevertheless
> combined, it only makes sense to compress the plaintext.]

I cannot fully agree with your point. (I was not talking about fs-verity,
but about verity approaches in general.)

Suppose we later introduce a block-based verity solution for all on-disk
data in EROFS. That means all data, compressed data and metadata at least
already come from a trusted source (like dm-verity).

Either verity->decompress or decompress->verity is safe, since both
decompression algorithms and verity algorithms are _deterministic_ and
should be considered _bug-free_, so there should be only one result.

And if you say the decompression algorithm is untrusted because of bugs or
the like, I think the same applies to the verity algorithm. In other words,
if we consider software/hardware bugs, we cannot trust any combination of
results.

An advantage of verity->decompress over decompress->verity is that
there is less verity data, so
  1) we can have less I/O for most I/O patterns;
and
  2) we can consume less CPU.
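
As rough arithmetic for the two points above, here is a small sketch that
counts data blocks to hash and Merkle tree blocks for a given amount of
data. The 4096-byte block size, the 32-byte digest and the 4:1
compression ratio are assumptions picked for the example, and the helper
names are hypothetical, not taken from any existing verity code:

```c
#include <assert.h>
#include <stdint.h>

#define VERITY_BLOCK_SIZE	4096u	/* assumed hash block size */
#define VERITY_DIGEST_SIZE	32u	/* assumed digest size, e.g. SHA-256 */
#define HASHES_PER_BLOCK	(VERITY_BLOCK_SIZE / VERITY_DIGEST_SIZE)

/* Number of data blocks that have to be hashed. */
static uint64_t data_blocks(uint64_t data_bytes)
{
	return (data_bytes + VERITY_BLOCK_SIZE - 1) / VERITY_BLOCK_SIZE;
}

/* Number of tree blocks over all levels above the data blocks. */
static uint64_t merkle_tree_blocks(uint64_t data_bytes)
{
	uint64_t n = data_blocks(data_bytes);
	uint64_t total = 0;

	assert(HASHES_PER_BLOCK > 1);	/* otherwise the loop cannot converge */
	while (n > 1) {
		/* each level packs HASHES_PER_BLOCK digests into one block */
		n = (n + HASHES_PER_BLOCK - 1) / HASHES_PER_BLOCK;
		total += n;
	}
	return total;
}
```

With these assumed sizes, 64 MiB of decompressed data means 16384 data
blocks to hash plus 129 tree blocks, while the same data stored at a
hypothetical 4:1 ratio (16 MiB on disk) means 4096 data blocks plus 33
tree blocks.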

Taking a step back, there are many compression formats in
user space, like apk or whatever, so "plaintext" is a
relative term. We cannot assume the data presented to the end user is
absolutely right.

Thanks,
Gao Xiang


> 
> - Eric

Dave Chinner Aug. 8, 2019, 8:16 a.m. UTC | #10

On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> FWIW, the only order that actually makes sense is decrypt->decompress->verity.

*nod*

Especially once we get the inline encryption support for fscrypt so
the storage layer can offload the encrypt/decrypt to hardware via
the bio containing plaintext. That pretty much forces fscrypt to be
the lowest layer of the filesystem transformation stack.  This
hardware offload capability also places lots of limits on what you
can do with block-based verity layers below the filesystem. e.g.
using dm-verity when you don't know if there's hardware encryption
below or software encryption on top becomes problematic...

So really, from a filesystem and iomap perspective, what Eric says
is right - it's the only order that makes sense...

Cheers,

Dave.

Gao Xiang Aug. 8, 2019, 8:57 a.m. UTC | #11

Hi Dave,

On Thu, Aug 08, 2019 at 06:16:47PM +1000, Dave Chinner wrote:
> On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> 
> *nod*
> 
> Especially once we get the inline encryption support for fscrypt so
> the storage layer can offload the encrypt/decrypt to hardware via
> the bio containing plaintext. That pretty much forces fscrypt to be
> the lowest layer of the filesystem transformation stack.  This
> hardware offload capability also places lots of limits on what you
> can do with block-based verity layers below the filesystem. e.g.
> using dm-verity when you don't know if there's hardware encryption
> below or software encryption on top becomes problematic...
> 
> So really, from a filesystem and iomap perspective, What Eric says
> is the right - it's the only order that makes sense...

Don't be surprised if there is eventually an all-in-one
decrypt/verity/decompress hardware approach for this. 30% of random IO
can be saved (whether with a hardware or a software approach), which
greatly helps user experience on embedded devices with very limited
resources.

And I really did hit a SHA256 CPU hardware bug years ago.

I don't want to say more about trends; it depends on the real scenario
and the user's choice (server or embedded device).

As far as security is concerned, these approaches are all at the same
level: they all start from the same signed key and storage source, so
the transformations A->B->C and A->C->B are equivalent.

As for being bug-free, we can fuzz the compression/verity algorithms,
or even the whole filesystem stack. That is a separate concern from
security.

Thanks,
Gao Xiang

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Gao Xiang Aug. 8, 2019, 9:29 a.m. UTC | #12

On Thu, Aug 08, 2019 at 06:16:47PM +1000, Dave Chinner wrote:
> On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> 
> *nod*
> 
> Especially once we get the inline encryption support for fscrypt so
> the storage layer can offload the encrypt/decrypt to hardware via
> the bio containing plaintext. That pretty much forces fscrypt to be
> the lowest layer of the filesystem transformation stack.  This
> hardware offload capability also places lots of limits on what you
> can do with block-based verity layers below the filesystem. e.g.
> using dm-verity when you don't know if there's hardware encryption
> below or software encryption on top becomes problematic...

To add a word: I was only comparing the benefits of
"decrypt->decompress->verity" and "decrypt->verity->decompress". I think
both forms are compatible with inline en/decryption. I don't care which
level "decrypt" is at... But maybe some users care. Am I missing something?

Thanks,
Gao Xiang

Gao Xiang Aug. 8, 2019, 11:21 a.m. UTC | #13

On Thu, Aug 08, 2019 at 05:29:47PM +0800, Gao Xiang wrote:
> On Thu, Aug 08, 2019 at 06:16:47PM +1000, Dave Chinner wrote:
> > On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> > 
> > *nod*
> > 
> > Especially once we get the inline encryption support for fscrypt so
> > the storage layer can offload the encrypt/decrypt to hardware via
> > the bio containing plaintext. That pretty much forces fscrypt to be
> > the lowest layer of the filesystem transformation stack.  This
> > hardware offload capability also places lots of limits on what you
> > can do with block-based verity layers below the filesystem. e.g.
> > using dm-verity when you don't know if there's hardware encryption
> > below or software encryption on top becomes problematic...

...and I'm not talking about fs-verity; I personally think fs-verity
is great. I am only talking about the generic case.

In order to know which level becomes problematic, there could even
be another choice, "decrypt->verity1->decompress->verity2", for such a
requirement (assuming verity1/2 themselves are absolutely bug-free).
verity1 can be a strong Merkle tree and verity2 a weak form (such as
a simple Adler-32/crc32 in the compressed block), so we can tell
whether it's a decrypt or a decompress bug.

Many compression containers, such as gzip, already include such a
weak check, so there is no need to add an extra postprocessing
step.

And I have no idea which of decrypt->verity1->decompress->verity2 and
decrypt->decompress->verity is faster, since verity2 is rather simple.
However, if we only use the strong form at the end, there could be
a lot of extra IO and expensive multi-level computation if files
are highly compressible.

On the other hand, such a verity2 can be computed offline, or avoided
by using fuzzer tools for read-only scenarios (for example, by doing a
full image verification with the given kernel after building these
images) in order to make sure of its stability. (In any case, I'm
talking about how to make those algorithms bug-free.)

All I want to say is I think "decrypt->verity->decompress" is
reasonable as well.
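
A minimal sketch of the verity1/verity2 idea, under some assumptions:
the strong check on the compressed data (verity1) is computed elsewhere
and passed in as a boolean, verity2 is a plain Adler-32 over the
decompressed output, and the enum and locate_fault() are hypothetical
names made up for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Adler-32 as defined in RFC 1950 (the zlib format). */
static uint32_t adler32(const uint8_t *buf, size_t len)
{
	uint32_t a = 1, b = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		a = (a + buf[i]) % 65521;
		b = (b + a) % 65521;
	}
	return (b << 16) | a;
}

enum fault { FAULT_NONE, FAULT_BEFORE_DECOMPRESS, FAULT_IN_DECOMPRESS };

/*
 * verity1_ok: result of the strong check on the compressed data.
 * out/out_len: decompressed output, checked against the weak checksum.
 */
static enum fault locate_fault(int verity1_ok, const uint8_t *out,
			       size_t out_len, uint32_t expected_adler)
{
	assert(out != NULL);
	if (!verity1_ok)
		return FAULT_BEFORE_DECOMPRESS;	/* storage or decrypt */
	if (adler32(out, out_len) != expected_adler)
		return FAULT_IN_DECOMPRESS;
	return FAULT_NONE;
}
```

The Adler-32 of the 9-byte string "Wikipedia" is 0x11E60398, which makes
a handy self-check for the checksum routine. Note that the weak check
only localizes bugs; it is not a substitute for the strong one.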

Thanks,
Gao Xiang

> 
> Add a word, I was just talking benefits between "decrypt->decompress->
> verity" and "decrypt->verity->decompress", I think both forms are
> compatible with inline en/decryption. I don't care which level
> "decrypt" is at... But maybe some user cares. Am I missing something?
> 
> Thanks,
> Gao Xiang
>

Gao Xiang Aug. 8, 2019, 1:11 p.m. UTC | #14

On Thu, Aug 08, 2019 at 07:21:39PM +0800, Gao Xiang wrote:
> On Thu, Aug 08, 2019 at 05:29:47PM +0800, Gao Xiang wrote:
> > On Thu, Aug 08, 2019 at 06:16:47PM +1000, Dave Chinner wrote:
> > > On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > > > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> > > 
> > > *nod*
> > > 
> > > Especially once we get the inline encryption support for fscrypt so
> > > the storage layer can offload the encrypt/decrypt to hardware via
> > > the bio containing plaintext. That pretty much forces fscrypt to be
> > > the lowest layer of the filesystem transformation stack.  This
> > > hardware offload capability also places lots of limits on what you
> > > can do with block-based verity layers below the filesystem. e.g.
> > > using dm-verity when you don't know if there's hardware encryption
> > > below or software encryption on top becomes problematic...
> 
> ...and I'm not talking of fs-verity, I personally think fs-verity
> is great. I am only talking about a generic stuff.
> 
> In order to know which level becomes problematic, there even could
> be another choice "decrypt->verity1->decompress->verity2" for such
> requirement (assuming verity1/2 themselves are absolutely bug-free),
> verity1 can be a strong merkle tree and verity2 is a weak form (just
> like a simple Adler-32/crc32 in compressed block), thus we can locate
> whether it's a decrypt or decompress bug.
> 
> Many compression algorithm containers already have such a weak
> form such as gzip algorithm, so there is no need to add such
> an extra step to postprocess.
> 
> and I have no idea which (decrypt->verity1->decompress->verity2 or
> decrypt->decompress->verity) is faster since verity2 is rather simple.
> However, if we use the only strong form in the end, there could be
> a lot of extra IO and expensive multiple-level computations if files
> are highly compressible.
> 
> On the other hand, such verity2 can be computed offline / avoided
> by fuzzer tools for read-only scenerios (for example, after building
> these images and do a full image verification with the given kernel)
> in order to make sure its stability (In any case, I'm talking about
> how to make those algorithms bug-free).
> 
> All I want to say is I think "decrypt->verity->decompress" is
> reasonable as well.

... And another fundamental concern is that if we don't verify early
(I mean the on-disk data), then untrusted data gets transformed
(decompressed, and even decrypted if there is no inline encryption),
which is risky: it looks _vulnerable_ if the decrypt / decompress
algorithms have _security issues_ (such as buffer overflows). That seems
less secure than doing verity earlier.

Thanks,
Gao Xiang

> 
> Thanks,
> Gao Xiang
> 
> > 
> > Add a word, I was just talking benefits between "decrypt->decompress->
> > verity" and "decrypt->verity->decompress", I think both forms are
> > compatible with inline en/decryption. I don't care which level
> > "decrypt" is at... But maybe some user cares. Am I missing something?
> > 
> > Thanks,
> > Gao Xiang
> >

Matthew Wilcox Aug. 9, 2019, 8:45 p.m. UTC | #15

On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> >     1. decrypt->verity->decompress
> > 
> >     2. verity->decompress->decrypt
> > 
> >     3. decompress->decrypt->verity
> > 
> >    1. and 2. could cause less computation since it processes
> >    compressed data, and the security is good enough since
> >    the behavior of decompression algorithm is deterministic.
> >    3 could cause more computation.
> > 
> > All I want to say is the post process is so complicated since we have
> > many selection if encryption, decompression, verification are all involved.
> > 
> > Maybe introduce a core subset to IOMAP is better for long-term
> > maintainment and better performance. And we should consider it
> > more carefully.
> > 
> 
> FWIW, the only order that actually makes sense is decrypt->decompress->verity.

That used to be true, but a paper in 2004 suggested it's not true.
Further work in this space in 2009 based on block ciphers:
https://arxiv.org/pdf/1009.1759

It looks like it'd be computationally expensive to do, but feasible.

> Decrypt before decompress, i.e. encrypt after compress, because only the
> plaintext can be compressible; the ciphertext isn't.

Gao Xiang Aug. 9, 2019, 11:45 p.m. UTC | #16

Hi Willy,

On Fri, Aug 09, 2019 at 01:45:17PM -0700, Matthew Wilcox wrote:
> On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> > >     1. decrypt->verity->decompress
> > > 
> > >     2. verity->decompress->decrypt
> > > 
> > >     3. decompress->decrypt->verity
> > > 
> > >    1. and 2. could cause less computation since it processes
> > >    compressed data, and the security is good enough since
> > >    the behavior of decompression algorithm is deterministic.
> > >    3 could cause more computation.
> > > 
> > > All I want to say is the post process is so complicated since we have
> > > many selection if encryption, decompression, verification are all involved.
> > > 
> > > Maybe introduce a core subset to IOMAP is better for long-term
> > > maintainment and better performance. And we should consider it
> > > more carefully.
> > > 
> > 
> > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> 
> That used to be true, but a paper in 2004 suggested it's not true.
> Further work in this space in 2009 based on block ciphers:
> https://arxiv.org/pdf/1009.1759
> 
> It looks like it'd be computationally expensive to do, but feasible.

Yes, maybe someone cares where encryption sits because of their system
design.

I have thought about this over the past few days, and I have to repeat my
view on verity again :( the meaningful order ought to be
"decrypt->verity->decompress" rather than "decrypt->decompress->verity"
if compression is involved.

Most (de)compression algorithms are complex (they allocate memory and do
a lot of unsafe things such as wildcopy) and may even be unsafe by
design, so for security reasons we cannot do verity at the end: with that
order the whole system can be made vulnerable by malformed on-disk data.
In other words, we need to verify the compressed data.

Fsverity is fine with me since most decryption algorithms are stable and
reliable and, by design, no compression is involved, but if some software
decryption algorithm is complicated enough, I'd suggest "verity->decrypt"
as well to some extent.

Consider the transformation "A->B->C->D->....->verity": if any of "A->B->C
->D->..." is attacked through malformed on-disk data, it could crash or
even root the whole operating system.

All in all, we have to verify data earlier in order to have trusted data
for the later, complex transformation chains.

I described the performance benefit in my previous email, so there seems
to be no need to repeat it... please take it into consideration. I think
it's not easy to get a single generic post-read order for all real systems.

Thanks,
Gao Xiang

> 
> > Decrypt before decompress, i.e. encrypt after compress, because only the
> > plaintext can be compressible; the ciphertext isn't.
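
Gao Xiang's argument for verifying before decompressing can be sketched in
a few lines of C. This is only an illustration under loud assumptions: the
Adler-32 checksum stands in for a real (cryptographic) verity step, the
run-length decoder stands in for a real decompressor, and every name here
(adler32, rle_decode, read_verify_first, read_verify_last, decoder_calls)
is hypothetical and not taken from iomap, fs-verity or any filesystem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Adler-32: a weak checksum used here only as a stand-in for verity. */
static uint32_t adler32(const uint8_t *buf, size_t len)
{
	uint32_t a = 1, b = 0;

	for (size_t i = 0; i < len; i++) {
		a = (a + buf[i]) % 65521;
		b = (b + a) % 65521;
	}
	return (b << 16) | a;
}

/* Toy RLE decoder: input is (count, byte) pairs. Returns the number of
 * bytes written, or -1 if the input is malformed or does not fit. */
static long rle_decode(const uint8_t *in, size_t in_len,
		       uint8_t *out, size_t out_cap)
{
	size_t o = 0;

	if (in_len % 2)
		return -1;
	for (size_t i = 0; i < in_len; i += 2) {
		size_t count = in[i];

		if (count > out_cap - o)
			return -1;
		for (size_t j = 0; j < count; j++)
			out[o++] = in[i + 1];
	}
	return (long)o;
}

/* Counts how often the decoder is handed data from "disk". */
static int decoder_calls;

/* verity -> decompress: the decoder only ever parses data whose
 * checksum over the on-disk (compressed) bytes already matched. */
static long read_verify_first(const uint8_t *disk, size_t len,
			      uint32_t expected_disk_sum,
			      uint8_t *out, size_t cap)
{
	assert(disk && out);
	if (adler32(disk, len) != expected_disk_sum)
		return -1;
	decoder_calls++;
	return rle_decode(disk, len, out, cap);
}

/* decompress -> verity: the decoder parses unverified data; the
 * checksum over the decoded bytes is only compared afterwards. */
static long read_verify_last(const uint8_t *disk, size_t len,
			     uint32_t expected_plain_sum,
			     uint8_t *out, size_t cap)
{
	long n;

	assert(disk && out);
	decoder_calls++;
	n = rle_decode(disk, len, out, cap);
	if (n < 0 || adler32(out, (size_t)n) != expected_plain_sum)
		return -1;
	return n;
}
```

Both helpers reject a tampered block, but only read_verify_last() lets the
decoder touch it first, so in that order the system's safety depends on the
decoder's own bounds checks being bug-free.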

Eric Biggers Aug. 10, 2019, 12:17 a.m. UTC | #17

On Fri, Aug 09, 2019 at 01:45:17PM -0700, Matthew Wilcox wrote:
> On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> > >     1. decrypt->verity->decompress
> > > 
> > >     2. verity->decompress->decrypt
> > > 
> > >     3. decompress->decrypt->verity
> > > 
> > >    1. and 2. could cause less computation since it processes
> > >    compressed data, and the security is good enough since
> > >    the behavior of decompression algorithm is deterministic.
> > >    3 could cause more computation.
> > > 
> > > All I want to say is the post process is so complicated since we have
> > > many selection if encryption, decompression, verification are all involved.
> > > 
> > > Maybe introduce a core subset to IOMAP is better for long-term
> > > maintainment and better performance. And we should consider it
> > > more carefully.
> > > 
> > 
> > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> 
> That used to be true, but a paper in 2004 suggested it's not true.
> Further work in this space in 2009 based on block ciphers:
> https://arxiv.org/pdf/1009.1759
> 
> It looks like it'd be computationally expensive to do, but feasible.
> 
> > Decrypt before decompress, i.e. encrypt after compress, because only the
> > plaintext can be compressible; the ciphertext isn't.

It's an interesting paper, but even assuming that "compress after encrypt" could
provide some actual benefit over the usual order (I can't think of any in this
context), it doesn't sound practical.  From what I understand from that paper:

- It assumes the compressor just *knows* a priori some pattern in the plaintext,
  i.e. it can't be arbitrary data.  E.g. the compressor for CBC encrypted data
  assumes that each 128 bits of plaintext is drawn from a distribution much
  smaller than the 2^128 possible values, e.g. at most a certain number of bits
  are set.  If any other data is encrypted+compressed, then the compressor will
  corrupt it, and it's impossible for it to detect that it did so.

  That alone makes it unusable for any use case we're talking about here.

- It only works for some specific encryption modes, and even then each
  encryption mode needs a custom compression algorithm designed just for it.
  I don't see how it could work for XTS, let alone a wide-block mode.

- The decompressor needs access to the encryption key.  [If that's allowed, why
  can't the compressor have access to it too?]

- It's almost certainly *much* slower and won't compress as well as conventional
  compression algorithms (gzip, LZ4, ZSTD, ...) that operate on the plaintext.

Eric

Eric Biggers Aug. 10, 2019, 12:31 a.m. UTC | #18

On Sat, Aug 10, 2019 at 07:45:59AM +0800, Gao Xiang wrote:
> Hi Willy,
> 
> On Fri, Aug 09, 2019 at 01:45:17PM -0700, Matthew Wilcox wrote:
> > On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > > On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> > > >     1. decrypt->verity->decompress
> > > > 
> > > >     2. verity->decompress->decrypt
> > > > 
> > > >     3. decompress->decrypt->verity
> > > > 
> > > >    1. and 2. could cause less computation since it processes
> > > >    compressed data, and the security is good enough since
> > > >    the behavior of decompression algorithm is deterministic.
> > > >    3 could cause more computation.
> > > > 
> > > > All I want to say is the post process is so complicated since we have
> > > > many selection if encryption, decompression, verification are all involved.
> > > > 
> > > > Maybe introduce a core subset to IOMAP is better for long-term
> > > > maintainment and better performance. And we should consider it
> > > > more carefully.
> > > > 
> > > 
> > > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> > 
> > That used to be true, but a paper in 2004 suggested it's not true.
> > Further work in this space in 2009 based on block ciphers:
> > https://arxiv.org/pdf/1009.1759
> > 
> > It looks like it'd be computationally expensive to do, but feasible.
> 
> Yes, maybe someone cares where encrypt is at due to their system design.
> 
> and I thought over these days, I have to repeat my thought of verity
> again :( the meaningful order ought to be "decrypt->verity->decompress"
> rather than "decrypt->decompress->verity" if compression is involved.
> 
> since most (de)compress algorithms are complex enough (allocate memory and
> do a lot of unsafe stuffes such as wildcopy) and even maybe unsafe by its
> design, we cannot do verity in the end for security consideration thus
> the whole system can be vulnerable by this order from malformed on-disk
> data. In other words, we need to verify on compressed data.
> 
> Fsverity is fine for me since most decrypt algorithms is stable and reliable
> and no compression by its design, but if some decrypt software algorithms is
> complicated enough, I'd suggest "verity->decrypt" as well to some extent.
> 
> Considering transformation "A->B->C->D->....->verity", if any of "A->B->C
> ->D->..." is attacked by the malformed on-disk data... It would crash or
> even root the whole operating system.
> 
> All in all, we have to verify data earlier in order to get trusted data
> for later complex transformation chains.
> 
> The performance benefit I described in my previous email, it seems no need
> to say again... please take them into consideration and I think it's no
> easy to get a unique generic post-read order for all real systems.
> 

While it would be nice to protect against filesystem bugs, it's not the point of
fs-verity.  fs-verity is about authenticating the contents the *user* sees, so
that e.g. a file can be distributed to many computers and it can be
authenticated regardless of exactly what other filesystem features were used
when it was stored on disk.  Different computers may use:

- Different filesystems
- Different compression algorithms (or no compression)
- Different compression strengths, even with same algorithm
- Different divisions of the file into compression units
- Different encryption algorithms (or no encryption)
- Different encryption keys, even with same algorithm
- Different encryption nonces, even with same key

All those change the on-disk data; only the user-visible data stays the same.

Bugs in filesystems may also be exploited regardless of fs-verity, as the
attacker (able to manipulate on-disk image) can create a malicious file without
fs-verity enabled, somewhere else on the filesystem.

If you actually want to authenticate the full filesystem image, you need to use
dm-verity, which is designed for that.

- Eric
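
The point above, that only the user-visible data stays the same across
on-disk representations, can be illustrated with a small C sketch. It is
a toy under stated assumptions: FNV-1a stands in for the cryptographic
hash a real verity implementation would use, a raw layout and a run-length
layout stand in for "different filesystem features", and all names
(toy_file, toy_read, digest_user_visible, digest_on_disk) are hypothetical
and not part of fs-verity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a (64-bit): a non-cryptographic stand-in for a real file hash. */
static uint64_t fnv1a(const uint8_t *buf, size_t len)
{
	uint64_t h = 14695981039346656037ULL;

	for (size_t i = 0; i < len; i++) {
		h ^= buf[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/* Two toy on-disk layouts for the same file contents. */
enum toy_layout { TOY_RAW, TOY_RLE };

struct toy_file {
	enum toy_layout layout;
	const uint8_t *disk;	/* bytes as stored on disk */
	size_t disk_len;
};

/* Read path: turn the on-disk bytes into what the user sees.
 * Returns the number of bytes produced, or -1 on error. */
static long toy_read(const struct toy_file *f, uint8_t *out, size_t cap)
{
	size_t o = 0;

	assert(f && out);
	if (f->layout == TOY_RAW) {
		if (f->disk_len > cap)
			return -1;
		memcpy(out, f->disk, f->disk_len);
		return (long)f->disk_len;
	}
	/* TOY_RLE: (count, byte) pairs */
	if (f->disk_len % 2)
		return -1;
	for (size_t i = 0; i < f->disk_len; i += 2) {
		size_t count = f->disk[i];

		if (count > cap - o)
			return -1;
		memset(out + o, f->disk[i + 1], count);
		o += count;
	}
	return (long)o;
}

/* Digest over the contents the user sees (the fs-verity view). */
static uint64_t digest_user_visible(const struct toy_file *f)
{
	uint8_t buf[64];
	long n = toy_read(f, buf, sizeof(buf));

	return n < 0 ? 0 : fnv1a(buf, (size_t)n);
}

/* Digest over the stored bytes (the block-level view). */
static uint64_t digest_on_disk(const struct toy_file *f)
{
	return fnv1a(f->disk, f->disk_len);
}
```

For a file holding "aaabb" stored once raw and once as the pairs (3,'a'),
(2,'b'), digest_user_visible() agrees between the two copies while
digest_on_disk() does not, which is why a digest meant to be portable
across machines has to be taken after all the other transformations.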

Eric Biggers Aug. 10, 2019, 12:50 a.m. UTC | #19

On Fri, Aug 09, 2019 at 05:31:35PM -0700, Eric Biggers wrote:
> On Sat, Aug 10, 2019 at 07:45:59AM +0800, Gao Xiang wrote:
> > Hi Willy,
> > 
> > On Fri, Aug 09, 2019 at 01:45:17PM -0700, Matthew Wilcox wrote:
> > > On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > > > On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> > > > >     1. decrypt->verity->decompress
> > > > > 
> > > > >     2. verity->decompress->decrypt
> > > > > 
> > > > >     3. decompress->decrypt->verity
> > > > > 
> > > > >    1. and 2. could cause less computation since it processes
> > > > >    compressed data, and the security is good enough since
> > > > >    the behavior of decompression algorithm is deterministic.
> > > > >    3 could cause more computation.
> > > > > 
> > > > > All I want to say is the post process is so complicated since we have
> > > > > many selection if encryption, decompression, verification are all involved.
> > > > > 
> > > > > Maybe introduce a core subset to IOMAP is better for long-term
> > > > > maintainment and better performance. And we should consider it
> > > > > more carefully.
> > > > > 
> > > > 
> > > > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> > > 
> > > That used to be true, but a paper in 2004 suggested it's not true.
> > > Further work in this space in 2009 based on block ciphers:
> > > https://arxiv.org/pdf/1009.1759
> > > 
> > > It looks like it'd be computationally expensive to do, but feasible.
> > 
> > Yes, maybe someone cares where encrypt is at due to their system design.
> > 
> > and I thought over these days, I have to repeat my thought of verity
> > again :( the meaningful order ought to be "decrypt->verity->decompress"
> > rather than "decrypt->decompress->verity" if compression is involved.
> > 
> > since most (de)compress algorithms are complex enough (allocate memory and
> > do a lot of unsafe stuffes such as wildcopy) and even maybe unsafe by its
> > design, we cannot do verity in the end for security consideration thus
> > the whole system can be vulnerable by this order from malformed on-disk
> > data. In other words, we need to verify on compressed data.
> > 
> > Fsverity is fine for me since most decrypt algorithms is stable and reliable
> > and no compression by its design, but if some decrypt software algorithms is
> > complicated enough, I'd suggest "verity->decrypt" as well to some extent.
> > 
> > Considering transformation "A->B->C->D->....->verity", if any of "A->B->C
> > ->D->..." is attacked by the malformed on-disk data... It would crash or
> > even root the whole operating system.
> > 
> > All in all, we have to verify data earlier in order to get trusted data
> > for later complex transformation chains.
> > 
> > The performance benefit I described in my previous email, it seems no need
> > to say again... please take them into consideration and I think it's no
> > easy to get a unique generic post-read order for all real systems.
> > 
> 
> While it would be nice to protect against filesystem bugs, it's not the point of
> fs-verity.  fs-verity is about authenticating the contents the *user* sees, so
> that e.g. a file can be distributed to many computers and it can be
> authenticated regardless of exactly what other filesystem features were used
> when it was stored on disk.  Different computers may use:
> 
> - Different filesystems
> - Different compression algorithms (or no compression)
> - Different compression strengths, even with same algorithm
> - Different divisions of the file into compression units
> - Different encryption algorithms (or no encryption)
> - Different encryption keys, even with same algorithm
> - Different encryption nonces, even with same key
> 
> All those change the on-disk data; only the user-visible data stays the same.
> 
> Bugs in filesystems may also be exploited regardless of fs-verity, as the
> attacker (able to manipulate on-disk image) can create a malicious file without
> fs-verity enabled, somewhere else on the filesystem.
> 
> If you actually want to authenticate the full filesystem image, you need to use
> dm-verity, which is designed for that.
> 

Also keep in mind that ideally the encryption layer would do authenticated
encryption, so that during decrypt->decompress->verity the blocks only get past
the decrypt step if they're authentically from someone with the encryption key.
That's currently missing from fscrypt for practical reasons (read/write
per-block metadata is really hard on most filesystems), but in an ideal world it
would be there.  The fs-verity step is conceptually different, but it seems it's
being conflated with this missing step.

- Eric

Gao Xiang Aug. 10, 2019, 1:13 a.m. UTC | #20

On Fri, Aug 09, 2019 at 05:31:36PM -0700, Eric Biggers wrote:
> On Sat, Aug 10, 2019 at 07:45:59AM +0800, Gao Xiang wrote:
> > Hi Willy,
> > 
> > On Fri, Aug 09, 2019 at 01:45:17PM -0700, Matthew Wilcox wrote:
> > > On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > > > On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> > > > >     1. decrypt->verity->decompress
> > > > > 
> > > > >     2. verity->decompress->decrypt
> > > > > 
> > > > >     3. decompress->decrypt->verity
> > > > > 
> > > > >    1. and 2. could cause less computation since it processes
> > > > >    compressed data, and the security is good enough since
> > > > >    the behavior of decompression algorithm is deterministic.
> > > > >    3 could cause more computation.
> > > > > 
> > > > > All I want to say is the post process is so complicated since we have
> > > > > many selection if encryption, decompression, verification are all involved.
> > > > > 
> > > > > Maybe introduce a core subset to IOMAP is better for long-term
> > > > > maintainment and better performance. And we should consider it
> > > > > more carefully.
> > > > > 
> > > > 
> > > > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> > > 
> > > That used to be true, but a paper in 2004 suggested it's not true.
> > > Further work in this space in 2009 based on block ciphers:
> > > https://arxiv.org/pdf/1009.1759
> > > 
> > > It looks like it'd be computationally expensive to do, but feasible.
> > 
> > Yes, maybe someone cares where encrypt is at due to their system design.
> > 
> > and I thought over these days, I have to repeat my thought of verity
> > again :( the meaningful order ought to be "decrypt->verity->decompress"
> > rather than "decrypt->decompress->verity" if compression is involved.
> > 
> > since most (de)compress algorithms are complex enough (allocate memory and
> > do a lot of unsafe stuffes such as wildcopy) and even maybe unsafe by its
> > design, we cannot do verity in the end for security consideration thus
> > the whole system can be vulnerable by this order from malformed on-disk
> > data. In other words, we need to verify on compressed data.
> > 
> > Fsverity is fine for me since most decrypt algorithms is stable and reliable
> > and no compression by its design, but if some decrypt software algorithms is
> > complicated enough, I'd suggest "verity->decrypt" as well to some extent.
> > 
> > Considering transformation "A->B->C->D->....->verity", if any of "A->B->C
> > ->D->..." is attacked by the malformed on-disk data... It would crash or
> > even root the whole operating system.
> > 
> > All in all, we have to verify data earlier in order to get trusted data
> > for later complex transformation chains.
> > 
> > The performance benefit I described in my previous email, it seems no need
> > to say again... please take them into consideration and I think it's no
> > easy to get a unique generic post-read order for all real systems.
> > 
> 
> While it would be nice to protect against filesystem bugs, it's not the point of
> fs-verity.  fs-verity is about authenticating the contents the *user* sees, so
> that e.g. a file can be distributed to many computers and it can be
> authenticated regardless of exactly what other filesystem features were used
> when it was stored on disk.  Different computers may use:
> 
> - Different filesystems
> - Different compression algorithms (or no compression)
> - Different compression strengths, even with same algorithm
> - Different divisions of the file into compression units
> - Different encryption algorithms (or no encryption)
> - Different encryption keys, even with same algorithm
> - Different encryption nonces, even with same key
> 
> All those change the on-disk data; only the user-visible data stays the same.

Yes, I agree with the fs-verity use case, and I can see some of its
limitations as well. (I am not arguing about fs-verity in this topic at
all...)

> 
> Bugs in filesystems may also be exploited regardless of fs-verity, as the
> attacker (able to manipulate on-disk image) can create a malicious file without
> fs-verity enabled, somewhere else on the filesystem.
> 
> If you actually want to authenticate the full filesystem image, you need to use
> dm-verity, which is designed for that.

Yes, but as a generic consideration, dm-verity has a limitation: it
requires the filesystem to be read-only;

and that raises another consideration -- whether verity should be in the
block layer or in the fs, and I think the answer fscrypt gives also
applies to verity in the fs (since we have dm-crypt as well), that is, we
could consider multiple-key R/W verification as well, and so on...

I think all is fine at the moment. In this topic, again, I am just trying
to say that a generic post-read approach is hard and complicated; I am
not arguing about some specific feature.

Thanks,
Gao Xiang

> 
> - Eric

Gao Xiang Aug. 10, 2019, 1:34 a.m. UTC | #21

On Fri, Aug 09, 2019 at 05:50:40PM -0700, Eric Biggers wrote:
> On Fri, Aug 09, 2019 at 05:31:35PM -0700, Eric Biggers wrote:
> > On Sat, Aug 10, 2019 at 07:45:59AM +0800, Gao Xiang wrote:
> > > Hi Willy,
> > > 
> > > On Fri, Aug 09, 2019 at 01:45:17PM -0700, Matthew Wilcox wrote:
> > > > On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > > > > On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> > > > > >     1. decrypt->verity->decompress
> > > > > > 
> > > > > >     2. verity->decompress->decrypt
> > > > > > 
> > > > > >     3. decompress->decrypt->verity
> > > > > > 
> > > > > >    1. and 2. could cause less computation since it processes
> > > > > >    compressed data, and the security is good enough since
> > > > > >    the behavior of decompression algorithm is deterministic.
> > > > > >    3 could cause more computation.
> > > > > > 
> > > > > > All I want to say is the post process is so complicated since we have
> > > > > > many selection if encryption, decompression, verification are all involved.
> > > > > > 
> > > > > > Maybe introduce a core subset to IOMAP is better for long-term
> > > > > > maintainment and better performance. And we should consider it
> > > > > > more carefully.
> > > > > > 
> > > > > 
> > > > > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> > > > 
> > > > That used to be true, but a paper in 2004 suggested it's not true.
> > > > Further work in this space in 2009 based on block ciphers:
> > > > https://arxiv.org/pdf/1009.1759
> > > > 
> > > > It looks like it'd be computationally expensive to do, but feasible.
> > > 
> > > Yes, maybe someone cares where encrypt is at due to their system design.
> > > 
> > > and I thought over these days, I have to repeat my thought of verity
> > > again :( the meaningful order ought to be "decrypt->verity->decompress"
> > > rather than "decrypt->decompress->verity" if compression is involved.
> > > 
> > > since most (de)compress algorithms are complex enough (allocate memory and
> > > do a lot of unsafe stuffes such as wildcopy) and even maybe unsafe by its
> > > design, we cannot do verity in the end for security consideration thus
> > > the whole system can be vulnerable by this order from malformed on-disk
> > > data. In other words, we need to verify on compressed data.
> > > 
> > > Fsverity is fine for me since most decrypt algorithms is stable and reliable
> > > and no compression by its design, but if some decrypt software algorithms is
> > > complicated enough, I'd suggest "verity->decrypt" as well to some extent.
> > > 
> > > Considering transformation "A->B->C->D->....->verity", if any of "A->B->C
> > > ->D->..." is attacked by the malformed on-disk data... It would crash or
> > > even root the whole operating system.
> > > 
> > > All in all, we have to verify data earlier in order to get trusted data
> > > for later complex transformation chains.
> > > 
> > > The performance benefit I described in my previous email, it seems no need
> > > to say again... please take them into consideration and I think it's no
> > > easy to get a unique generic post-read order for all real systems.
> > > 
> > 
> > While it would be nice to protect against filesystem bugs, it's not the point of
> > fs-verity.  fs-verity is about authenticating the contents the *user* sees, so
> > that e.g. a file can be distributed to many computers and it can be
> > authenticated regardless of exactly what other filesystem features were used
> > when it was stored on disk.  Different computers may use:
> > 
> > - Different filesystems
> > - Different compression algorithms (or no compression)
> > - Different compression strengths, even with same algorithm
> > - Different divisions of the file into compression units
> > - Different encryption algorithms (or no encryption)
> > - Different encryption keys, even with same algorithm
> > - Different encryption nonces, even with same key
> > 
> > All those change the on-disk data; only the user-visible data stays the same.
> > 
> > Bugs in filesystems may also be exploited regardless of fs-verity, as the
> > attacker (able to manipulate on-disk image) can create a malicious file without
> > fs-verity enabled, somewhere else on the filesystem.
> > 
> > If you actually want to authenticate the full filesystem image, you need to use
> > dm-verity, which is designed for that.
> > 
> 
> Also keep in mind that ideally the encryption layer would do authenticated
> encryption, so that during decrypt->decompress->verity the blocks only get past
> the decrypt step if they're authentically from someone with the encryption key.
> That's currently missing from fscrypt for practical reasons (read/write
> per-block metadata is really hard on most filesystems), but in an ideal world it
> would be there.  The fs-verity step is conceptually different, but it seems it's
> being conflated with this missing step.

Yes, but encryption may not be mandatorily enabled for all post-read data,
and not all encryption algorithms are authenticated encryption...

I want to stop here :) I think it depends on real requirements, and I don't
want the generic post-read process to be too limited by specific chains....

Thanks,
Gao Xiang

> 
> - Eric
