
[3/3] xfs: alignment check bio buffers

Message ID 20190821083820.11725-4-david@fromorbit.com (mailing list archive)
State New, archived
Series xfs: avoid IO issues unaligned memory allocation

Commit Message

Dave Chinner Aug. 21, 2019, 8:38 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

Add memory buffer alignment validation checks to bios built in XFS
to catch bugs that will result in silent data corruption in block
drivers that cannot handle unaligned memory buffers but don't
validate the incoming buffer alignment is correct.

Known drivers with these issues are xenblk, brd and pmem.

Despite there being nothing XFS specific to xfs_bio_add_page(), this
function was created to do the required validation because the block
layer developers keep telling us that it is not possible to
validate buffer alignment in bio_add_page(), and even if it were
possible it would be too much overhead to do at runtime.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bio_io.c | 32 +++++++++++++++++++++++++++++---
 fs/xfs/xfs_buf.c    |  2 +-
 fs/xfs/xfs_linux.h  |  3 +++
 fs/xfs/xfs_log.c    |  6 +++++-
 4 files changed, 38 insertions(+), 5 deletions(-)

Comments

Brian Foster Aug. 21, 2019, 1:39 p.m. UTC | #1
On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add memory buffer alignment validation checks to bios built in XFS
> to catch bugs that will result in silent data corruption in block
> drivers that cannot handle unaligned memory buffers but don't
> validate the incoming buffer alignment is correct.
> 
> Known drivers with these issues are xenblk, brd and pmem.
> 
> Despite there being nothing XFS specific to xfs_bio_add_page(), this
> function was created to do the required validation because the block
> layer developers that keep telling us that is not possible to
> validate buffer alignment in bio_add_page(), and even if it was
> possible it would be too much overhead to do at runtime.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bio_io.c | 32 +++++++++++++++++++++++++++++---
>  fs/xfs/xfs_buf.c    |  2 +-
>  fs/xfs/xfs_linux.h  |  3 +++
>  fs/xfs/xfs_log.c    |  6 +++++-
>  4 files changed, 38 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
> index e2148f2d5d6b..fbaea643c000 100644
> --- a/fs/xfs/xfs_bio_io.c
> +++ b/fs/xfs/xfs_bio_io.c
> @@ -9,6 +9,27 @@ static inline unsigned int bio_max_vecs(unsigned int count)
>  	return min_t(unsigned, howmany(count, PAGE_SIZE), BIO_MAX_PAGES);
>  }
>  
> +int
> +xfs_bio_add_page(
> +	struct bio	*bio,
> +	struct page	*page,
> +	unsigned int	len,
> +	unsigned int	offset)
> +{
> +	struct request_queue	*q = bio->bi_disk->queue;
> +	bool		same_page = false;
> +
> +	if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
> +		return -EIO;
> +
> +	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
> +		if (bio_full(bio, len))
> +			return 0;
> +		__bio_add_page(bio, page, len, offset);
> +	}
> +	return len;
> +}
> +

Seems reasonable to me. Looks like bio_add_page() with an error check.
;) Given that, what's the need to open-code bio_add_page() here rather
than just call it after the check?
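
I.e. something like this (untested sketch, just to illustrate the
question):

int
xfs_bio_add_page(
	struct bio	*bio,
	struct page	*page,
	unsigned int	len,
	unsigned int	offset)
{
	struct request_queue	*q = bio->bi_disk->queue;

	/* reject buffers the underlying queue can't DMA to/from */
	if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
		return -EIO;

	/* bio_add_page() already returns len on success and 0 when full */
	return bio_add_page(bio, page, len, offset);
}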

>  int
>  xfs_rw_bdev(
>  	struct block_device	*bdev,
...
> @@ -36,9 +57,12 @@ xfs_rw_bdev(
>  		unsigned int	off = offset_in_page(data);
>  		unsigned int	len = min_t(unsigned, left, PAGE_SIZE - off);
>  
> -		while (bio_add_page(bio, page, len, off) != len) {
> +		while ((ret = xfs_bio_add_page(bio, page, len, off)) != len) {
>  			struct bio	*prev = bio;
>  
> +			if (ret < 0)
> +				goto submit;
> +

Hmm.. is submitting the bio really the right thing to do if we get here
and have failed to add any pages to the bio? If we're already seeing
weird behavior for bios with unaligned data memory, this seems like a
recipe for similar weirdness. We'd also end up doing a partial write in
scenarios where we already know we're returning an error. Perhaps we
should create an error path or use a check similar to what is already in
xfs_buf_ioapply_map() (though I'm not a fan of submitting a partial I/O
when we already know we're going to return an error) to call bio_endio()
to undo any chaining.

>  			bio = bio_alloc(GFP_KERNEL, bio_max_vecs(left));
>  			bio_copy_dev(bio, prev);
>  			bio->bi_iter.bi_sector = bio_end_sector(prev);
...
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 7bd1f31febfc..a2d499baee9c 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1294,7 +1294,7 @@ xfs_buf_ioapply_map(
>  		if (nbytes > size)
>  			nbytes = size;
>  
> -		rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes,
> +		rbytes = xfs_bio_add_page(bio, bp->b_pages[page_index], nbytes,
>  				      offset);

Similar concern here. The higher level code seems to be written under
the assumption that bio_add_page() returns either the full length or
zero. In this case the
error is basically tossed so we can also attempt to split/chain an empty
bio, or even better, submit a partial write and continue operating as if
nothing happened (outside of the warning). The latter case should be
handled as a log I/O error one way or another.

Brian

>  		if (rbytes < nbytes)
>  			break;
> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> index ca15105681ca..e71c7bf3e714 100644
> --- a/fs/xfs/xfs_linux.h
> +++ b/fs/xfs/xfs_linux.h
> @@ -145,6 +145,9 @@ static inline void delay(long ticks)
>  	schedule_timeout_uninterruptible(ticks);
>  }
>  
> +int xfs_bio_add_page(struct bio *bio, struct page *page, unsigned int len,
> +			unsigned int offset);
> +
>  /*
>   * XFS wrapper structure for sysfs support. It depends on external data
>   * structures and is embedded in various internal data structures to implement
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 1830d185d7fc..52f7d840d09e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1673,8 +1673,12 @@ xlog_map_iclog_data(
>  		struct page	*page = kmem_to_page(data);
>  		unsigned int	off = offset_in_page(data);
>  		size_t		len = min_t(size_t, count, PAGE_SIZE - off);
> +		int		ret;
>  
> -		WARN_ON_ONCE(bio_add_page(bio, page, len, off) != len);
> +		ret = xfs_bio_add_page(bio, page, len, off);
> +		WARN_ON_ONCE(ret != len);
> +		if (ret < 0)
> +			break;
>  
>  		data += len;
>  		count -= len;
> -- 
> 2.23.0.rc1
>
Dave Chinner Aug. 21, 2019, 9:39 p.m. UTC | #2
On Wed, Aug 21, 2019 at 09:39:04AM -0400, Brian Foster wrote:
> On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Add memory buffer alignment validation checks to bios built in XFS
> > to catch bugs that will result in silent data corruption in block
> > drivers that cannot handle unaligned memory buffers but don't
> > validate the incoming buffer alignment is correct.
> > 
> > Known drivers with these issues are xenblk, brd and pmem.
> > 
> > Despite there being nothing XFS specific to xfs_bio_add_page(), this
> > function was created to do the required validation because the block
> > layer developers that keep telling us that is not possible to
> > validate buffer alignment in bio_add_page(), and even if it was
> > possible it would be too much overhead to do at runtime.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_bio_io.c | 32 +++++++++++++++++++++++++++++---
> >  fs/xfs/xfs_buf.c    |  2 +-
> >  fs/xfs/xfs_linux.h  |  3 +++
> >  fs/xfs/xfs_log.c    |  6 +++++-
> >  4 files changed, 38 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
> > index e2148f2d5d6b..fbaea643c000 100644
> > --- a/fs/xfs/xfs_bio_io.c
> > +++ b/fs/xfs/xfs_bio_io.c
> > @@ -9,6 +9,27 @@ static inline unsigned int bio_max_vecs(unsigned int count)
> >  	return min_t(unsigned, howmany(count, PAGE_SIZE), BIO_MAX_PAGES);
> >  }
> >  
> > +int
> > +xfs_bio_add_page(
> > +	struct bio	*bio,
> > +	struct page	*page,
> > +	unsigned int	len,
> > +	unsigned int	offset)
> > +{
> > +	struct request_queue	*q = bio->bi_disk->queue;
> > +	bool		same_page = false;
> > +
> > +	if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
> > +		return -EIO;
> > +
> > +	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
> > +		if (bio_full(bio, len))
> > +			return 0;
> > +		__bio_add_page(bio, page, len, offset);
> > +	}
> > +	return len;
> > +}
> > +
> 
> Seems reasonable to me. Looks like bio_add_page() with an error check.
> ;)

It is exactly that.

> Given that, what's the need to open-code bio_add_page() here rather
> than just call it after the check?

Largely because I wasn't sure exactly where the best place was to
add the check. Copy first, hack to pieces, make work, then worry
about other stuff :)

I could change it to call bio_add_page() directly. Not
really that fussed as it's all exported functionality...


> >  int
> >  xfs_rw_bdev(
> >  	struct block_device	*bdev,
> ...
> > @@ -36,9 +57,12 @@ xfs_rw_bdev(
> >  		unsigned int	off = offset_in_page(data);
> >  		unsigned int	len = min_t(unsigned, left, PAGE_SIZE - off);
> >  
> > -		while (bio_add_page(bio, page, len, off) != len) {
> > +		while ((ret = xfs_bio_add_page(bio, page, len, off)) != len) {
> >  			struct bio	*prev = bio;
> >  
> > +			if (ret < 0)
> > +				goto submit;
> > +
> 
> Hmm.. is submitting the bio really the right thing to do if we get here
> and have failed to add any pages to the bio?

Yes, because we really should wait for chained bios to complete
before returning. We can only do that by waiting on completion of
the entire chain, and the simplest way to do that is to submit the
bio...

> If we're already seeing
> weird behavior for bios with unaligned data memory, this seems like a
> recipe for similar weirdness. We'd also end up doing a partial write in
> scenarios where we already know we're returning an error.

The partial write will happen anyway on a chained bio, something we
already know happens from the bug in bio chaining that was found in
this code earlier in the cycle.

> Perhaps we
> should create an error path or use a check similar to what is already in
> xfs_buf_ioapply_map() (though I'm not a fan of submitting a partial I/O
> when we already know we're going to return an error) to call bio_endio()
> to undo any chaining.

xfs_buf_ioapply_map() only fails with an error if it occurs on the
first bio_add_page() call to a bio. If it fails on the second, then
it submits the bio, allocates a new bio, and tries again. If that
fails, then it frees the bio and does error processing.

That code can get away with such behaviour because it is not
chaining bios. Each bio is its own IO, and failures there already
result in partial IO completion with an error on the buffer. This
code here will fail with a partial IO completion and return an error.

So, in effect, this code provides exactly the same behaviour and
result as the xfs_buf_ioapply_map() code does...
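
For reference, the relevant tail of xfs_buf_ioapply_map() is roughly
(trimmed):

	if (likely(bio->bi_iter.bi_size)) {
		/* something was added: submit it, then map the rest */
		submit_bio(bio);
		if (size)
			goto next_chunk;
	} else {
		/* nothing was added to this bio: fail the buffer */
		atomic_dec(&bp->b_io_remaining);
		xfs_buf_ioerror(bp, -EIO);
		bio_put(bio);
	}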

> 
> >  			bio = bio_alloc(GFP_KERNEL, bio_max_vecs(left));
> >  			bio_copy_dev(bio, prev);
> >  			bio->bi_iter.bi_sector = bio_end_sector(prev);
> ...
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 7bd1f31febfc..a2d499baee9c 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -1294,7 +1294,7 @@ xfs_buf_ioapply_map(
> >  		if (nbytes > size)
> >  			nbytes = size;
> >  
> > -		rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes,
> > +		rbytes = xfs_bio_add_page(bio, bp->b_pages[page_index], nbytes,
> >  				      offset);
> 
> Similar concern here. The higher level code seems written under the
> assumption that bio_add_page() returns success or zero. In this case the
> error is basically tossed so we can also attempt to split/chain an empty
> bio, or even better, submit a partial write and continue operating as if
> nothing happened (outside of the warning). The latter case should be
> handled as a log I/O error one way or another.

See above - it does result in a failure when offset/nbytes is not
aligned to the underlying device...

Cheers,

Dave.
Christoph Hellwig Aug. 21, 2019, 11:29 p.m. UTC | #3
On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add memory buffer alignment validation checks to bios built in XFS
> to catch bugs that will result in silent data corruption in block
> drivers that cannot handle unaligned memory buffers but don't
> validate the incoming buffer alignment is correct.
> 
> Known drivers with these issues are xenblk, brd and pmem.
> 
> Despite there being nothing XFS specific to xfs_bio_add_page(), this
> function was created to do the required validation because the block
> layer developers that keep telling us that is not possible to
> validate buffer alignment in bio_add_page(), and even if it was
> possible it would be too much overhead to do at runtime.

I really don't think we should lift this to XFS, but instead fix it
in the block layer.  And that is not only because I have a pending
series lifting bits you are touching to the block layer...

> +int
> +xfs_bio_add_page(
> +	struct bio	*bio,
> +	struct page	*page,
> +	unsigned int	len,
> +	unsigned int	offset)
> +{
> +	struct request_queue	*q = bio->bi_disk->queue;
> +	bool		same_page = false;
> +
> +	if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
> +		return -EIO;
> +
> +	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
> +		if (bio_full(bio, len))
> +			return 0;
> +		__bio_add_page(bio, page, len, offset);
> +	}
> +	return len;

I know Jens disagrees, but with the amount of bugs we've been hitting
thanks to slub (and I'm pretty sure we have more hiding outside of
XFS) I think we need to add the blk_rq_aligned check to bio_add_page.

Note that all current callers of bio_add_page can only really check
for the return value != the added len anyway, so it is not going to
make anything worse.
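
In its simplest form that would just be the same check at the top of
bio_add_page(), e.g. (untested, and as said callers can still only
check for a return != len):

int bio_add_page(struct bio *bio, struct page *page,
		 unsigned int len, unsigned int offset)
{
	bool same_page = false;

	/* a misaligned buffer fails the same way a full bio does */
	if (WARN_ON_ONCE(!blk_rq_aligned(bio->bi_disk->queue, len, offset)))
		return 0;

	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
		if (bio_full(bio, len))
			return 0;
		__bio_add_page(bio, page, len, offset);
	}
	return len;
}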
Christoph Hellwig Aug. 21, 2019, 11:30 p.m. UTC | #4
On Wed, Aug 21, 2019 at 09:39:04AM -0400, Brian Foster wrote:
> > @@ -36,9 +57,12 @@ xfs_rw_bdev(
> >  		unsigned int	off = offset_in_page(data);
> >  		unsigned int	len = min_t(unsigned, left, PAGE_SIZE - off);
> >  
> > -		while (bio_add_page(bio, page, len, off) != len) {
> > +		while ((ret = xfs_bio_add_page(bio, page, len, off)) != len) {
> >  			struct bio	*prev = bio;
> >  
> > +			if (ret < 0)
> > +				goto submit;
> > +
> 
> Hmm.. is submitting the bio really the right thing to do if we get here
> and have failed to add any pages to the bio? If we're already seeing
> weird behavior for bios with unaligned data memory, this seems like a
> recipe for similar weirdness. We'd also end up doing a partial write in
> scenarios where we already know we're returning an error. Perhaps we
> should create an error path or use a check similar to what is already in
> xfs_buf_ioapply_map() (though I'm not a fan of submitting a partial I/O
> when we already know we're going to return an error) to call bio_endio()
> to undo any chaining.

It is not the right thing to do.  Calling bio_endio after setting
an error is the right thing to do (modulo any other cleanup needed).
Dave Chinner Aug. 22, 2019, 12:37 a.m. UTC | #5
On Wed, Aug 21, 2019 at 04:29:45PM -0700, Christoph Hellwig wrote:
> On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Add memory buffer alignment validation checks to bios built in XFS
> > to catch bugs that will result in silent data corruption in block
> > drivers that cannot handle unaligned memory buffers but don't
> > validate the incoming buffer alignment is correct.
> > 
> > Known drivers with these issues are xenblk, brd and pmem.
> > 
> > Despite there being nothing XFS specific to xfs_bio_add_page(), this
> > function was created to do the required validation because the block
> > layer developers that keep telling us that is not possible to
> > validate buffer alignment in bio_add_page(), and even if it was
> > possible it would be too much overhead to do at runtime.
> 
> I really don't think we should life this to XFS, but instead fix it
> in the block layer.  And that is not only because I have a pending
> series lifting bits you are touching to the block layer..

I agree, but....

> 
> > +int
> > +xfs_bio_add_page(
> > +	struct bio	*bio,
> > +	struct page	*page,
> > +	unsigned int	len,
> > +	unsigned int	offset)
> > +{
> > +	struct request_queue	*q = bio->bi_disk->queue;
> > +	bool		same_page = false;
> > +
> > +	if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
> > +		return -EIO;
> > +
> > +	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
> > +		if (bio_full(bio, len))
> > +			return 0;
> > +		__bio_add_page(bio, page, len, offset);
> > +	}
> > +	return len;
> 
> I know Jens disagree, but with the amount of bugs we've been hitting
> thangs to slub (and I'm pretty sure we have a more hiding outside of
> XFS) I think we need to add the blk_rq_aligned check to bio_add_page.

... I'm not prepared to fight this battle to get this initial fix
into the code. Get the fix merged, then we can 

> Note that all current callers of bio_add_page can only really check
> for the return value != the added len anyway, so it is not going to
> make anything worse.

It does make things worse - it turns multi-bio chaining loops like
the one in xfs_rw_bdev() into an endless loop as they don't make
progress - they just keep allocating a new bio and retrying the same
badly aligned buffer and failing. So if we want an alignment failure
to error out, callers need to handle the failure, not treat it like
a full bio.
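
I.e. with a plain "return 0 on misalignment", the existing chaining
loop in xfs_rw_bdev() (roughly) becomes:

	while (bio_add_page(bio, page, len, off) != len) {
		struct bio	*prev = bio;

		/*
		 * A misaligned page also returns 0, so we never make
		 * progress: allocate a new bio, chain it, and fail the
		 * same page add again on the next pass, forever.
		 */
		bio = bio_alloc(GFP_KERNEL, bio_max_vecs(left));
		bio_copy_dev(bio, prev);
		bio->bi_iter.bi_sector = bio_end_sector(prev);
		bio->bi_opf = prev->bi_opf;

		bio_chain(prev, bio);
		submit_bio(prev);
	}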

Cheers,

Dave.
Dave Chinner Aug. 22, 2019, 12:44 a.m. UTC | #6
On Wed, Aug 21, 2019 at 04:30:41PM -0700, Christoph Hellwig wrote:
> On Wed, Aug 21, 2019 at 09:39:04AM -0400, Brian Foster wrote:
> > > @@ -36,9 +57,12 @@ xfs_rw_bdev(
> > >  		unsigned int	off = offset_in_page(data);
> > >  		unsigned int	len = min_t(unsigned, left, PAGE_SIZE - off);
> > >  
> > > -		while (bio_add_page(bio, page, len, off) != len) {
> > > +		while ((ret = xfs_bio_add_page(bio, page, len, off)) != len) {
> > >  			struct bio	*prev = bio;
> > >  
> > > +			if (ret < 0)
> > > +				goto submit;
> > > +
> > 
> > Hmm.. is submitting the bio really the right thing to do if we get here
> > and have failed to add any pages to the bio? If we're already seeing
> > weird behavior for bios with unaligned data memory, this seems like a
> > recipe for similar weirdness. We'd also end up doing a partial write in
> > scenarios where we already know we're returning an error. Perhaps we
> > should create an error path or use a check similar to what is already in
> > xfs_buf_ioapply_map() (though I'm not a fan of submitting a partial I/O
> > when we already know we're going to return an error) to call bio_endio()
> > to undo any chaining.
> 
> It is not the right thing to do.  Calling bio_endio after setting
> an error is the right thing to do (modulo any other cleanup needed).

bio_endio() doesn't wait for completion of all the IO in progress.

In fact, if we have chained bios still in progress, it does
absolutely nothing:

void bio_endio(struct bio *bio)
{
again:
>>>>>   if (!bio_remaining_done(bio))
                return;

and so we return with the memory we've put into the bios still in
active use by the chained bios under IO. On error, we'll free the
allocated buffer immediately, and that means we've got a
use-after-free as the bios in progress still have references to it.
If it's heap memory we are using here, then that's a memory
corruption (read) or kernel memory leak (write) vector.

So we have to wait for IO completion before we return from this
function, and AFAICT, the only way to do that is to call
submit_bio_wait() on the parent of the bio chain to wait for all
child bios to drop their references and call bio_endio() on the
parent of the chain....

Cheers,

Dave.
Ming Lei Aug. 22, 2019, 2:50 a.m. UTC | #7
On Thu, Aug 22, 2019 at 8:06 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Add memory buffer alignment validation checks to bios built in XFS
> > to catch bugs that will result in silent data corruption in block
> > drivers that cannot handle unaligned memory buffers but don't
> > validate the incoming buffer alignment is correct.
> >
> > Known drivers with these issues are xenblk, brd and pmem.
> >
> > Despite there being nothing XFS specific to xfs_bio_add_page(), this
> > function was created to do the required validation because the block
> > layer developers that keep telling us that is not possible to
> > validate buffer alignment in bio_add_page(), and even if it was
> > possible it would be too much overhead to do at runtime.
>
> I really don't think we should life this to XFS, but instead fix it
> in the block layer.  And that is not only because I have a pending
> series lifting bits you are touching to the block layer..
>
> > +int
> > +xfs_bio_add_page(
> > +     struct bio      *bio,
> > +     struct page     *page,
> > +     unsigned int    len,
> > +     unsigned int    offset)
> > +{
> > +     struct request_queue    *q = bio->bi_disk->queue;
> > +     bool            same_page = false;
> > +
> > +     if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
> > +             return -EIO;
> > +
> > +     if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
> > +             if (bio_full(bio, len))
> > +                     return 0;
> > +             __bio_add_page(bio, page, len, offset);
> > +     }
> > +     return len;
>
> I know Jens disagree, but with the amount of bugs we've been hitting
> thangs to slub (and I'm pretty sure we have a more hiding outside of
> XFS) I think we need to add the blk_rq_aligned check to bio_add_page.

It isn't correct to use blk_rq_aligned() here because 'len' has to be
logical block size aligned, not just DMA aligned.

Also, I'm not sure all users set up bio->bi_disk before adding pages to
the bio, since that is allowed now.

If a slub buffer crosses two pages, the block layer may not handle it at
all even though the unaligned 'offset' issue is solved.
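
What I mean is that, for an fs bio, the check would have to be more
like (illustrative only):

	/* 'len' needs logical block size alignment, not just DMA alignment */
	if (!IS_ALIGNED(len, queue_logical_block_size(q)))
		return -EIO;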

Thanks,
Ming Lei
Dave Chinner Aug. 22, 2019, 4:49 a.m. UTC | #8
On Thu, Aug 22, 2019 at 10:50:02AM +0800, Ming Lei wrote:
> On Thu, Aug 22, 2019 at 8:06 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > Add memory buffer alignment validation checks to bios built in XFS
> > > to catch bugs that will result in silent data corruption in block
> > > drivers that cannot handle unaligned memory buffers but don't
> > > validate the incoming buffer alignment is correct.
> > >
> > > Known drivers with these issues are xenblk, brd and pmem.
> > >
> > > Despite there being nothing XFS specific to xfs_bio_add_page(), this
> > > function was created to do the required validation because the block
> > > layer developers that keep telling us that is not possible to
> > > validate buffer alignment in bio_add_page(), and even if it was
> > > possible it would be too much overhead to do at runtime.
> >
> > I really don't think we should life this to XFS, but instead fix it
> > in the block layer.  And that is not only because I have a pending
> > series lifting bits you are touching to the block layer..
> >
> > > +int
> > > +xfs_bio_add_page(
> > > +     struct bio      *bio,
> > > +     struct page     *page,
> > > +     unsigned int    len,
> > > +     unsigned int    offset)
> > > +{
> > > +     struct request_queue    *q = bio->bi_disk->queue;
> > > +     bool            same_page = false;
> > > +
> > > +     if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
> > > +             return -EIO;
> > > +
> > > +     if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
> > > +             if (bio_full(bio, len))
> > > +                     return 0;
> > > +             __bio_add_page(bio, page, len, offset);
> > > +     }
> > > +     return len;
> >
> > I know Jens disagree, but with the amount of bugs we've been hitting
> > thangs to slub (and I'm pretty sure we have a more hiding outside of
> > XFS) I think we need to add the blk_rq_aligned check to bio_add_page.
> 
> It isn't correct to blk_rq_aligned() here because 'len' has to be logical block
> size aligned, instead of DMA aligned only.

News to me.

AFAIA, the overall _IO_ that is being built needs to be a multiple
of the logical block size in total size (i.e. bio->bi_iter.bi_size)
because sub sector IO is not allowed. But queue DMA limits are not
defined in sectors - they define the scatter/gather DMA capability
of the hardware, and that's what individual segments (bvecs) need to
align to.  That's what blk_rq_aligned() checks here - that the bvec
segment aligns to what the underlying driver(s) requires, not that
the entire IO is sector sized and aligned.

Also, think about multipage bvecs - the pages we are spanning here
are contiguous pages, so this should end up merging them and turning
it into a single multipage bvec whose length is sector size
aligned...
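
For reference, blk_rq_aligned() is just this (from
include/linux/blkdev.h, roughly):

static inline int blk_rq_aligned(struct request_queue *q, unsigned long addr,
				 unsigned int len)
{
	unsigned int alignment = queue_dma_alignment(q) | q->dma_pad_mask;

	return !(addr & alignment) && !(len & alignment);
}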

> Also not sure all users may setup bio->bi_disk well before adding page to bio,
> since it is allowed to do that now.

XFS does, so I just don't care about random users of bio_add_page()
in this patch. Somebody else can run the block layer gauntlet to get
these checks moved into generic code and they've already been
rejected twice as unnecessary.

> If slub buffer crosses two pages, block layer may not handle it at all
> even though
> un-aligned 'offset' issue is solved.

A slub buffer crossing two _contiguous_ pages should end up merged
as a multipage bvec. But I'm curious, what does adding multiple
contiguous pages to a bio actually break?

-Dave.
Ming Lei Aug. 22, 2019, 7:23 a.m. UTC | #9
On Thu, Aug 22, 2019 at 02:49:05PM +1000, Dave Chinner wrote:
> On Thu, Aug 22, 2019 at 10:50:02AM +0800, Ming Lei wrote:
> > On Thu, Aug 22, 2019 at 8:06 AM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > >
> > > > Add memory buffer alignment validation checks to bios built in XFS
> > > > to catch bugs that will result in silent data corruption in block
> > > > drivers that cannot handle unaligned memory buffers but don't
> > > > validate the incoming buffer alignment is correct.
> > > >
> > > > Known drivers with these issues are xenblk, brd and pmem.
> > > >
> > > > Despite there being nothing XFS specific to xfs_bio_add_page(), this
> > > > function was created to do the required validation because the block
> > > > layer developers that keep telling us that is not possible to
> > > > validate buffer alignment in bio_add_page(), and even if it was
> > > > possible it would be too much overhead to do at runtime.
> > >
> > > I really don't think we should life this to XFS, but instead fix it
> > > in the block layer.  And that is not only because I have a pending
> > > series lifting bits you are touching to the block layer..
> > >
> > > > +int
> > > > +xfs_bio_add_page(
> > > > +     struct bio      *bio,
> > > > +     struct page     *page,
> > > > +     unsigned int    len,
> > > > +     unsigned int    offset)
> > > > +{
> > > > +     struct request_queue    *q = bio->bi_disk->queue;
> > > > +     bool            same_page = false;
> > > > +
> > > > +     if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
> > > > +             return -EIO;
> > > > +
> > > > +     if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
> > > > +             if (bio_full(bio, len))
> > > > +                     return 0;
> > > > +             __bio_add_page(bio, page, len, offset);
> > > > +     }
> > > > +     return len;
> > >
> > > I know Jens disagree, but with the amount of bugs we've been hitting
> > > thangs to slub (and I'm pretty sure we have a more hiding outside of
> > > XFS) I think we need to add the blk_rq_aligned check to bio_add_page.
> > 
> > It isn't correct to blk_rq_aligned() here because 'len' has to be logical block
> > size aligned, instead of DMA aligned only.
> 
> News to me.
> 
> AFAIA, the overall _IO_ that is being built needs to be a multiple
> of the logical block size in total size (i.e. bio->bi_iter.size)

Right.

> because sub sector IO is not allowed. But queue DMA limits are not
> defined in sectors - they define the scatter/gather DMA capability
> of the hardware, and that's what individual segments (bvecs) need to
> align to.  That's what blk_rq_aligned() checks here - that the bvec

A segment isn't the same thing as a bvec. We build segments via the
scatterlist interface from bvecs when a driver needs segments for DMA
between the CPU and the HBA. The built segments have to respect every
kind of queue limit.

Now there are two kinds of bios: fs bios, and bios for doing IO to/from
the device. The block layer splits fs bios into bios of the proper size
for doing IO.

If a bvec with an unaligned length is added to an fs bio, and this bvec
can't be merged with the following ones, how can the block layer handle
that? For example, if this bvec is unaligned with the virt boundary,
then a single bio is allocated for doing the IO of just this bvec, and
sub-sector IO is generated.

> segment aligns to what the underlying driver(s) requires, not that
> the entire IO is sector sized and aligned.

Not every driver needs to handle segments; some drivers simply handle
single-page bvecs (pmem, brd, zram, ...) or multi-page bvecs (loop).

An unaligned bvec may then cause trouble for the drivers which handle
single-page bvecs.

> 
> Also, think about multipage bvecs - the pages we are spanning here
> are contiguous pages, so this should end up merging them and turning
> it into a single multipage bvec whose length is sector size
> aligned...

That approach works for drivers which use segments, and most drivers
are of this type.

> 
> > Also not sure all users may setup bio->bi_disk well before adding page to bio,
> > since it is allowed to do that now.
> 
> XFS does, so I just don't care about random users of bio_add_page()
> in this patch. Somebody else can run the block layer gauntlet to get
> these checks moved into generic code and they've already been
> rejected twice as unnecessary.

If the check is to be added to bio_add_page(), every user has to be
audited.

> 
> > If slub buffer crosses two pages, block layer may not handle it at all
> > even though
> > un-aligned 'offset' issue is solved.
> 
> A slub buffer crossing two _contiguous_ pages should end up merged
> as a multipage bvec. But I'm curious, what does adding multiple
> contiguous pages to a bio actually break?

Some drivers don't or can't handle multi-page bvecs; we have to split
them into single-page bvecs, and then the unaligned bvec is seen by
those drivers.

thanks,
Ming
Christoph Hellwig Aug. 22, 2019, 8:03 a.m. UTC | #10
On Thu, Aug 22, 2019 at 10:37:45AM +1000, Dave Chinner wrote:
> > I know Jens disagree, but with the amount of bugs we've been hitting
> > thangs to slub (and I'm pretty sure we have a more hiding outside of
> > XFS) I think we need to add the blk_rq_aligned check to bio_add_page.
> 
> ... I'm not prepared to fight this battle to get this initial fix
> into the code. Get the fix merged, then we can 

Well, the initial fix is the first two patches.  This patch really
just adds a safety belt.  I'll happily take over the effort to get
sensible checks into the block code if you give me a couple of weeks;
in the meantime I'd prefer if we could skip this third patch for now.

> > Note that all current callers of bio_add_page can only really check
> > for the return value != the added len anyway, so it is not going to
> > make anything worse.
> 
> It does make things worse - it turns multi-bio chaining loops like
> the one xfs_rw_bdev() into an endless loop as they don't make
> progress - they just keep allocating a new bio and retrying the same
> badly aligned buffer and failing. So if we want an alignment failure
> to error out, callers need to handle the failure, not treat it like
> a full bio.

True.  And we need to improve the interface here, which I'm going
to try to fit into my series that lifts some common bio filling code
to the core.  I'll add you to the cc list.
Christoph Hellwig Aug. 22, 2019, 8:08 a.m. UTC | #11
On Thu, Aug 22, 2019 at 02:49:05PM +1000, Dave Chinner wrote:
> On Thu, Aug 22, 2019 at 10:50:02AM +0800, Ming Lei wrote:
> > It isn't correct to blk_rq_aligned() here because 'len' has to be logical block
> > size aligned, instead of DMA aligned only.

Even if len had to be a multiple of the sector size, that wouldn't
mean calling blk_rq_aligned is incorrect, just that it possibly doesn't
catch all issues.

But as Dave outlined I don't think it is a problem in any way.
Dave Chinner Aug. 22, 2019, 10:17 a.m. UTC | #12
On Thu, Aug 22, 2019 at 01:03:12AM -0700, Christoph Hellwig wrote:
> On Thu, Aug 22, 2019 at 10:37:45AM +1000, Dave Chinner wrote:
> > > I know Jens disagree, but with the amount of bugs we've been hitting
> > > thangs to slub (and I'm pretty sure we have a more hiding outside of
> > > XFS) I think we need to add the blk_rq_aligned check to bio_add_page.
> > 
> > ... I'm not prepared to fight this battle to get this initial fix
> > into the code. Get the fix merged, then we can 
> 
> Well, the initial fix are the first two patches.  This patch really
> just adds a safety belt.  I'll happily take over the effort to get
> sensible checks in the block code if you give me a couple weeks,
> in the meantime I'd prefer if we could skip this third patch for now.

Fine by me. I'll just repost the current versions of the first two
patches (now three) in the morning.

Cheers,

Dave.
Ming Lei Aug. 22, 2019, 10:20 a.m. UTC | #13
On Thu, Aug 22, 2019 at 01:08:52AM -0700, Christoph Hellwig wrote:
> On Thu, Aug 22, 2019 at 02:49:05PM +1000, Dave Chinner wrote:
> > On Thu, Aug 22, 2019 at 10:50:02AM +0800, Ming Lei wrote:
> > > It isn't correct to blk_rq_aligned() here because 'len' has to be logical block
> > > size aligned, instead of DMA aligned only.
> 
> Even if len would have to be a multiple of the sector size, that doesn't
> mean calling blk_rq_aligned would be incorrect, just possibly not
> catching all issues.

In theory, an fs bio shouldn't care about any DMA limits; those should
be handled on the split bios used for doing IO to the device.

Also, .dma_alignment isn't considered in blk_stack_limits(), so in the
case of DM, MD or other stacking drivers, fs code won't know the
accurate .dma_alignment of the underlying queues at all, and the
stacking driver's queue dma alignment is still 512.

Also, supposing the check is added, I am a bit curious how the fs code
handles the failure - could you explain a bit about the failure
handling?

Thanks, 
Ming
Brian Foster Aug. 22, 2019, 1:47 p.m. UTC | #14
On Thu, Aug 22, 2019 at 07:39:30AM +1000, Dave Chinner wrote:
> On Wed, Aug 21, 2019 at 09:39:04AM -0400, Brian Foster wrote:
> > On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Add memory buffer alignment validation checks to bios built in XFS
> > > to catch bugs that will result in silent data corruption in block
> > > drivers that cannot handle unaligned memory buffers but don't
> > > validate the incoming buffer alignment is correct.
> > > 
> > > Known drivers with these issues are xenblk, brd and pmem.
> > > 
> > > Despite there being nothing XFS specific to xfs_bio_add_page(), this
> > > function was created to do the required validation because the block
> > > layer developers that keep telling us that is not possible to
> > > validate buffer alignment in bio_add_page(), and even if it was
> > > possible it would be too much overhead to do at runtime.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_bio_io.c | 32 +++++++++++++++++++++++++++++---
> > >  fs/xfs/xfs_buf.c    |  2 +-
> > >  fs/xfs/xfs_linux.h  |  3 +++
> > >  fs/xfs/xfs_log.c    |  6 +++++-
> > >  4 files changed, 38 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
> > > index e2148f2d5d6b..fbaea643c000 100644
> > > --- a/fs/xfs/xfs_bio_io.c
> > > +++ b/fs/xfs/xfs_bio_io.c
> > > @@ -9,6 +9,27 @@ static inline unsigned int bio_max_vecs(unsigned int count)
> > >  	return min_t(unsigned, howmany(count, PAGE_SIZE), BIO_MAX_PAGES);
> > >  }
> > >  
> > > +int
> > > +xfs_bio_add_page(
> > > +	struct bio	*bio,
> > > +	struct page	*page,
> > > +	unsigned int	len,
> > > +	unsigned int	offset)
> > > +{
> > > +	struct request_queue	*q = bio->bi_disk->queue;
> > > +	bool		same_page = false;
> > > +
> > > +	if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
> > > +		return -EIO;
> > > +
> > > +	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
> > > +		if (bio_full(bio, len))
> > > +			return 0;
> > > +		__bio_add_page(bio, page, len, offset);
> > > +	}
> > > +	return len;
> > > +}
> > > +
> > 
> > Seems reasonable to me. Looks like bio_add_page() with an error check.
> > ;)
> 
> It is exactly that.
> 
> > Given that, what's the need to open-code bio_add_page() here rather
> > than just call it after the check?
> 
> Largely because I wasn't sure exactly where the best place was to
> add the check. Copy first, hack to pieces, make work, then worry
> about other stuff :)
> 
> I could change it to calling calling bio_add_page() directly. Not
> really that fussed as it's all exported functionality...
> 
> 
> > >  int
> > >  xfs_rw_bdev(
> > >  	struct block_device	*bdev,
> > ...
> > > @@ -36,9 +57,12 @@ xfs_rw_bdev(
> > >  		unsigned int	off = offset_in_page(data);
> > >  		unsigned int	len = min_t(unsigned, left, PAGE_SIZE - off);
> > >  
> > > -		while (bio_add_page(bio, page, len, off) != len) {
> > > +		while ((ret = xfs_bio_add_page(bio, page, len, off)) != len) {
> > >  			struct bio	*prev = bio;
> > >  
> > > +			if (ret < 0)
> > > +				goto submit;
> > > +
> > 
> > Hmm.. is submitting the bio really the right thing to do if we get here
> > and have failed to add any pages to the bio?
> 
> Yes, because we really should wait for chained bios to complete
> before returning. we can only do that by waiting on completion for
> the entire chain, and the simplest way to do that is submit the
> bio...
> 

Ok. I guess that makes sense here because these are sync I/Os and this
appears to be the only way to wait on a partially constructed chain
(that I can tell).

> > If we're already seeing
> > weird behavior for bios with unaligned data memory, this seems like a
> > recipe for similar weirdness. We'd also end up doing a partial write in
> > scenarios where we already know we're returning an error.
> 
> THe partial write will happen anyway on a chained bio, something we
> know already happens from the bug in bio chaining that was found in
> this code earlier in the cycle.
> 

Sure, a failure in a bio chain is essentially a partial write, much as
if one such bio in the chain had failed. What about the non-chaining
case? What happens on submission of an empty read/write bio? Does
behavior depend on the underlying storage stack?

Note that I see this patch is likely going away. I'm just saying I think
we should have some confirmation on how the block layer behaves in these
situations before we take an approach that relies on it. Is an empty bio
essentially a no-op that serves as a serialization mechanism? Does the
block layer return an error? Etc.

> > Perhaps we
> > should create an error path or use a check similar to what is already in
> > xfs_buf_ioapply_map() (though I'm not a fan of submitting a partial I/O
> > when we already know we're going to return an error) to call bio_endio()
> > to undo any chaining.
> 
> xfs_buf_ioapply_map() only fails with an error if it occurs on the
> first bio_add_page() call to a bio. If it fails on the second, then
> it submits the bio, allocates a new bio, and tries again. If that
> fails, then it frees the bio and does error processing.
> 
> That code can get away with such behaviour because it is not
> chaining bios. each bio is it's own IO, and failures there already
> result in partial IO completion with an error onthe buffer. This
> code here will fail with a partial IO completion and return an error.
> 

Right, that is the behavior I was referring to above.

> So, in effect, this code provides exactly the same behaviour and
> result as the xfs_buf_ioapply_map() code does...
> 

In this case, we return an error that the caller will handle and
shutdown the fs. That's reasonable behavior here since these are sync
I/Os.

> > 
> > >  			bio = bio_alloc(GFP_KERNEL, bio_max_vecs(left));
> > >  			bio_copy_dev(bio, prev);
> > >  			bio->bi_iter.bi_sector = bio_end_sector(prev);
> > ...
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index 7bd1f31febfc..a2d499baee9c 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -1294,7 +1294,7 @@ xfs_buf_ioapply_map(
> > >  		if (nbytes > size)
> > >  			nbytes = size;
> > >  
> > > -		rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes,
> > > +		rbytes = xfs_bio_add_page(bio, bp->b_pages[page_index], nbytes,
> > >  				      offset);
> > 
> > Similar concern here. The higher level code seems written under the
> > assumption that bio_add_page() returns success or zero. In this case the
> > error is basically tossed so we can also attempt to split/chain an empty
> > bio, or even better, submit a partial write and continue operating as if
> > nothing happened (outside of the warning). The latter case should be
> > handled as a log I/O error one way or another.
> 
> See above - it does result in a failure when offset/nbytes is not
> aligned to the underlying device...
> 

Oops, this comment was misplaced. I meant to write this in response to
the changes to xlog_map_iclog_data(), not xfs_buf_ioapply_map() (hence
the note about handling log I/O errors above).

The former has no such error checks that I can see. AFAICT, if we
construct a partial or empty bio here, we'd warn and return.
xlog_map_iclog_data() is a void function so the caller has no indication
anything went wrong and submits the bio. If an empty (?) or partial bio
doesn't return an error on bio completion, we've completely failed to
account for the fact that we've written a partial iclog to disk.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Aug. 22, 2019, 11:03 p.m. UTC | #15
On Thu, Aug 22, 2019 at 09:47:16AM -0400, Brian Foster wrote:
> On Thu, Aug 22, 2019 at 07:39:30AM +1000, Dave Chinner wrote:
> > On Wed, Aug 21, 2019 at 09:39:04AM -0400, Brian Foster wrote:
> > > On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> > Yes, because we really should wait for chained bios to complete
> > before returning. we can only do that by waiting on completion for
> > the entire chain, and the simplest way to do that is submit the
> > bio...
> > 
> 
> Ok. I guess that makes sense here because these are sync I/Os and this
> appears to be the only way to wait on a partially constructed chain
> (that I can tell).

Same here. I can't find any code that manually waits for a partially
submitted chain to complete...

> > > If we're already seeing
> > > weird behavior for bios with unaligned data memory, this seems like a
> > > recipe for similar weirdness. We'd also end up doing a partial write in
> > > scenarios where we already know we're returning an error.
> > 
> > THe partial write will happen anyway on a chained bio, something we
> > know already happens from the bug in bio chaining that was found in
> > this code earlier in the cycle.
> > 
> 
> Sure, a failure in a bio chain is essentially a partial write similar to
> if one such bio in the chain had failed. What about the non-chaining
> case? What happens on submission of an empty read/write bio? Does
> behavior depend on the underlying storage stack?

Seems to work just fine when I tested the code without the aligned
allocation patch. I do note that the block layer does allow zero
length (non-data) bios for things like flush requests.
Unsurprisingly, there are no checks to disallow completely empty bios
from being submitted, so I'm assuming that it's handled just fine
given the code worked when tested...

> Note that I see this patch is likely going away. I'm just saying I think
> we should have some confirmation on how the block layer behaves in these
> situations before we take an approach that relies on it. Is an empty bio
> essentially a no-op that serves as a serialization mechanism? Does the
> block layer return an error? Etc.

submit_bio_wait() didn't return an error, and even if it did, it would
get overridden by the error from the misalignment.

> > > Perhaps we
> > > should create an error path or use a check similar to what is already in
> > > xfs_buf_ioapply_map() (though I'm not a fan of submitting a partial I/O
> > > when we already know we're going to return an error) to call bio_endio()
> > > to undo any chaining.

bio_endio() doesn't wait for chaining. If there are children chained,
it just returns immediately and that leads to the use-after-free
situation I described to Christoph...

bio chaining just seems incredibly fragile in the face of errors
during chain building...

> > > > --- a/fs/xfs/xfs_buf.c
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -1294,7 +1294,7 @@ xfs_buf_ioapply_map(
> > > >  		if (nbytes > size)
> > > >  			nbytes = size;
> > > >  
> > > > -		rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes,
> > > > +		rbytes = xfs_bio_add_page(bio, bp->b_pages[page_index], nbytes,
> > > >  				      offset);
> > > 
> > > Similar concern here. The higher level code seems written under the
> > > assumption that bio_add_page() returns success or zero. In this case the
> > > error is basically tossed so we can also attempt to split/chain an empty
> > > bio, or even better, submit a partial write and continue operating as if
> > > nothing happened (outside of the warning). The latter case should be
> > > handled as a log I/O error one way or another.
> > 
> > See above - it does result in a failure when offset/nbytes is not
> > aligned to the underlying device...
> > 
> 
> Oops, this comment was misplaced. I meant to write this in response to
> the changes to xlog_map_iclog_data(), not xfs_buf_ioapply_map() (hence
> the note about handling log I/O errors above).
> 
> The former has no such error checks that I can see. AFAICT, if we
> construct a partial or empty bio here, we'd warn and return.
> xlog_map_iclog_data() is a void function so the caller has no indication
> anything went wrong and submits the bio. If an empty (?) or partial bio
> doesn't return an error on bio completion, we've completely failed to
> account for the fact that we've written a partial iclog to disk.

This is more a "make sure we don't get stuck in an endless loop"
situation. If we've got unaligned iclogs, then a large, multi-page
allocation (32-256k) has failed to be page aligned. This is not
memory that comes from the slab, so alignment problems here indicate
a major problem with the page allocation and/or vmalloc code. It
just should never happen and if it does, then log writes failing are
going to be the least of our worries.

Cheers,

Dave.
Christoph Hellwig Aug. 23, 2019, 12:14 a.m. UTC | #16
On Thu, Aug 22, 2019 at 06:20:00PM +0800, Ming Lei wrote:
> In theory, fs bio shouldn't care any DMA limits, which should have been done
> on splitted bio for doing IO to device.
> 
> Also .dma_alignment isn't considered in blk_stack_limits(), so in case
> of DM, MD or other stacking drivers, fs code won't know the accurate
> .dma_alignment of underlying queues at all, and the stacking driver's
> queue dma alignment is still 512.

Trying to handle alignment lower down means bounce buffering, so I
don't think trying to handle it is a sane idea.  I'd be much happier to
say non-passthrough bios need 512 byte alignment, period.  That should
cover all the sane cases and we can easily check for it.  The occasional
device that would need larger alignment just needs to deal with it.

> Also suppose the check is added, I am a bit curious how fs code handles the
> failure, so could you explain a bit about the failure handling?

Even just an assert is a start.  But a bio_add_page variant with
saner return value semantics would be helpful, and I have some ideas
there that I need to try out first.
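
Something along these lines, maybe (name and error values purely
illustrative, not a final API):

/*
 * Sketch only: return 0 on success, -EINVAL for a misaligned buffer,
 * -EAGAIN when the bio is full and the caller needs to chain/submit
 * and allocate a new one.
 */
static int bio_add_page_strict(struct bio *bio, struct page *page,
		unsigned int len, unsigned int offset)
{
	bool same_page = false;

	if (!blk_rq_aligned(bio->bi_disk->queue, len, offset))
		return -EINVAL;
	if (__bio_try_merge_page(bio, page, len, offset, &same_page))
		return 0;
	if (bio_full(bio, len))
		return -EAGAIN;
	__bio_add_page(bio, page, len, offset);
	return 0;
}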
Ming Lei Aug. 23, 2019, 1:19 a.m. UTC | #17
On Thu, Aug 22, 2019 at 05:14:40PM -0700, Christoph Hellwig wrote:
> On Thu, Aug 22, 2019 at 06:20:00PM +0800, Ming Lei wrote:
> > In theory, fs bio shouldn't care any DMA limits, which should have been done
> > on splitted bio for doing IO to device.
> > 
> > Also .dma_alignment isn't considered in blk_stack_limits(), so in case
> > of DM, MD or other stacking drivers, fs code won't know the accurate
> > .dma_alignment of underlying queues at all, and the stacking driver's
> > queue dma alignment is still 512.
> 
> Trying to handling alignment lower down means bounce buffering, so I
> don't think trying to hndle it is a sane idea.  I'd be much happier to
> say non-passthrough bios need 512 byte alignment, period.  That should
> cover all the sane cases and we can easily check for it.  The occasional
> device that would need larger alignment just needs to deal with it.

Yeah, I agree we need to avoid bounce buffers, and it is fine to simply
check for 512.

Also, we should consider the interface/protocol between the fs and the
block layer; it could make both sides happy to always align offset &
length with the logical block size. And that is reasonable for fs bios.


Thanks,
Ming
Brian Foster Aug. 23, 2019, 12:33 p.m. UTC | #18
On Fri, Aug 23, 2019 at 09:03:57AM +1000, Dave Chinner wrote:
> On Thu, Aug 22, 2019 at 09:47:16AM -0400, Brian Foster wrote:
> > On Thu, Aug 22, 2019 at 07:39:30AM +1000, Dave Chinner wrote:
> > > On Wed, Aug 21, 2019 at 09:39:04AM -0400, Brian Foster wrote:
> > > > On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> > > Yes, because we really should wait for chained bios to complete
> > > before returning. we can only do that by waiting on completion for
> > > the entire chain, and the simplest way to do that is submit the
> > > bio...
> > > 
> > 
> > Ok. I guess that makes sense here because these are sync I/Os and this
> > appears to be the only way to wait on a partially constructed chain
> > (that I can tell).
> 
> Same here. I can't find any code that manually waits for a partially
> submitted chain to complete...
> 
> > > > If we're already seeing
> > > > weird behavior for bios with unaligned data memory, this seems like a
> > > > recipe for similar weirdness. We'd also end up doing a partial write in
> > > > scenarios where we already know we're returning an error.
> > > 
> > > THe partial write will happen anyway on a chained bio, something we
> > > know already happens from the bug in bio chaining that was found in
> > > this code earlier in the cycle.
> > > 
> > 
> > Sure, a failure in a bio chain is essentially a partial write similar to
> > if one such bio in the chain had failed. What about the non-chaining
> > case? What happens on submission of an empty read/write bio? Does
> > behavior depend on the underlying storage stack?
> 
> Seems to work just fine when I tested the code without the aligned
> allocation patch. I do note that the block layer does allow zero
> length (non-data) bios for things like flush requests.
> Unsurprisingly, theres no checks to disallow completely empty bios
> from being submitted, so I'm asumming that it's handled just fine
> given the code worked when tested...
> 

Yeah, I'm aware of flush requests and such. That's why I was asking
about empty read/write bios, since an empty bio by itself is not some
blatant error.

What exactly does "it works" mean here, though? Does an empty read/write
bio return with an error or just behave as a serialization no-op? It
might not matter for this particular sync I/O case, but it could make a
difference for others.

> > Note that I see this patch is likely going away. I'm just saying I think
> > we should have some confirmation on how the block layer behaves in these
> > situations before we take an approach that relies on it. Is an empty bio
> > essentially a no-op that serves as a serialization mechanism? Does the
> > block layer return an error? Etc.
> 
> submit_bio_wait() didn't return an error, and even if it did it gets
> overridden by the error from misalignment.
> 

I wrote "block layer," not submit_bio_wait() specifically. I'm asking
whether the bio might complete with an error or not.

> > > > Perhaps we
> > > > should create an error path or use a check similar to what is already in
> > > > xfs_buf_ioapply_map() (though I'm not a fan of submitting a partial I/O
> > > > when we already know we're going to return an error) to call bio_endio()
> > > > to undo any chaining.
> 
> bio_endio() doesn't wait for chaining. If there's children chained,
> it just returns immediately and that leads to the use-after-free
> situation I described to Christoph...
> 
> bio chaining just seems incredibly fragile in the face of errors
> during chain building...
> 

(The above is from two mails ago. Not sure why you're replying here..?)
Do note that bio_remaining_done() has special case chaining logic,
though. It's not just a no-op as implied. It looks to me that it behaves
as if the I/O were submitted and completed (async), but I could easily
be missing something as I'm not familiar with the code..
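
For reference, bio_remaining_done() in block/bio.c looks roughly like:

static inline bool bio_remaining_done(struct bio *bio)
{
	/*
	 * If we're not chaining, then ->__bi_remaining is always 1 and
	 * we always end io on the first invocation.
	 */
	if (!bio_flagged(bio, BIO_CHAIN))
		return true;

	BUG_ON(atomic_read(&bio->__bi_remaining) <= 0);

	if (atomic_dec_and_test(&bio->__bi_remaining)) {
		bio_clear_flag(bio, BIO_CHAIN);
		return true;
	}

	return false;
}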

> > > > > --- a/fs/xfs/xfs_buf.c
> > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > @@ -1294,7 +1294,7 @@ xfs_buf_ioapply_map(
> > > > >  		if (nbytes > size)
> > > > >  			nbytes = size;
> > > > >  
> > > > > -		rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes,
> > > > > +		rbytes = xfs_bio_add_page(bio, bp->b_pages[page_index], nbytes,
> > > > >  				      offset);
> > > > 
> > > > Similar concern here. The higher level code seems written under the
> > > > assumption that bio_add_page() returns success or zero. In this case the
> > > > error is basically tossed so we can also attempt to split/chain an empty
> > > > bio, or even better, submit a partial write and continue operating as if
> > > > nothing happened (outside of the warning). The latter case should be
> > > > handled as a log I/O error one way or another.
> > > 
> > > See above - it does result in a failure when offset/nbytes is not
> > > aligned to the underlying device...
> > > 
> > 
> > Oops, this comment was misplaced. I meant to write this in response to
> > the changes to xlog_map_iclog_data(), not xfs_buf_ioapply_map() (hence
> > the note about handling log I/O errors above).
> > 
> > The former has no such error checks that I can see. AFAICT, if we
> > construct a partial or empty bio here, we'd warn and return.
> > xlog_map_iclog_data() is a void function so the caller has no indication
> > anything went wrong and submits the bio. If an empty (?) or partial bio
> > doesn't return an error on bio completion, we've completely failed to
> > account for the fact that we've written a partial iclog to disk.
> 
> This is more a "make sure we don't get stuck in an endless loop"
> situation. If we've got unaligned iclogs, then a large, multi-page
> allocation (32-256k) has failed to be page aligned. This is not
> memory that comes from the slab, so alignment problems here indicate
> a major problem with the page allocation and/or vmalloc code. It
> just should never happen and if it does, then log writes failing are
> going to be the least of our worries.
> 

There is no infinite loop vector here. That aside, the iclog data buffer
is not a multi-page allocation on all systems. And FWIW, I don't think
all multi-page allocations are non-slab (slab/slub/slob appear to have
different levels of multi-page support, though that probably doesn't
matter on 4k systems for our supported iclog sizes). Otherwise, we're
just making assumptions on current and future failure semantics of an
external subsystem that's out of our control, and doing so to justify
lazy error handling logic that can be trivially fixed. All that's needed
here is to ensure that the new error path eventually translates into a
shutdown like all other log I/O related errors.
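
E.g. something like this (rough sketch only -- the caller shape and
names are from memory):

static int
xlog_map_iclog_data(
	struct bio		*bio,
	void			*data,
	size_t			count)
{
	do {
		struct page	*page = kmem_to_page(data);
		unsigned int	off = offset_in_page(data);
		size_t		len = min_t(size_t, count, PAGE_SIZE - off);
		int		ret;

		ret = xfs_bio_add_page(bio, page, len, off);
		if (ret != len)
			return ret < 0 ? ret : -EIO;

		data += len;
		count -= len;
	} while (count);

	return 0;
}

	/* ... and in the caller (xlog_write_iclog(), from memory): */
	if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) {
		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
		return;
	}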

My understanding is that this patch is being dropped for now, so this
feedback is just for reference should this logic re-emerge when we have
a bio_add_page() variant with new error semantics...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Patch

diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
index e2148f2d5d6b..fbaea643c000 100644
--- a/fs/xfs/xfs_bio_io.c
+++ b/fs/xfs/xfs_bio_io.c
@@ -9,6 +9,27 @@  static inline unsigned int bio_max_vecs(unsigned int count)
 	return min_t(unsigned, howmany(count, PAGE_SIZE), BIO_MAX_PAGES);
 }
 
+int
+xfs_bio_add_page(
+	struct bio	*bio,
+	struct page	*page,
+	unsigned int	len,
+	unsigned int	offset)
+{
+	struct request_queue	*q = bio->bi_disk->queue;
+	bool		same_page = false;
+
+	if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
+		return -EIO;
+
+	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
+		if (bio_full(bio, len))
+			return 0;
+		__bio_add_page(bio, page, len, offset);
+	}
+	return len;
+}
+
 int
 xfs_rw_bdev(
 	struct block_device	*bdev,
@@ -20,7 +41,7 @@  xfs_rw_bdev(
 {
 	unsigned int		is_vmalloc = is_vmalloc_addr(data);
 	unsigned int		left = count;
-	int			error;
+	int			error, ret = 0;
 	struct bio		*bio;
 
 	if (is_vmalloc && op == REQ_OP_WRITE)
@@ -36,9 +57,12 @@  xfs_rw_bdev(
 		unsigned int	off = offset_in_page(data);
 		unsigned int	len = min_t(unsigned, left, PAGE_SIZE - off);
 
-		while (bio_add_page(bio, page, len, off) != len) {
+		while ((ret = xfs_bio_add_page(bio, page, len, off)) != len) {
 			struct bio	*prev = bio;
 
+			if (ret < 0)
+				goto submit;
+
 			bio = bio_alloc(GFP_KERNEL, bio_max_vecs(left));
 			bio_copy_dev(bio, prev);
 			bio->bi_iter.bi_sector = bio_end_sector(prev);
@@ -48,14 +72,16 @@  xfs_rw_bdev(
 			submit_bio(prev);
 		}
 
+		ret = 0;
 		data += len;
 		left -= len;
 	} while (left > 0);
 
+submit:
 	error = submit_bio_wait(bio);
 	bio_put(bio);
 
 	if (is_vmalloc && op == REQ_OP_READ)
 		invalidate_kernel_vmap_range(data, count);
-	return error;
+	return ret ? ret : error;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 7bd1f31febfc..a2d499baee9c 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1294,7 +1294,7 @@  xfs_buf_ioapply_map(
 		if (nbytes > size)
 			nbytes = size;
 
-		rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes,
+		rbytes = xfs_bio_add_page(bio, bp->b_pages[page_index], nbytes,
 				      offset);
 		if (rbytes < nbytes)
 			break;
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index ca15105681ca..e71c7bf3e714 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -145,6 +145,9 @@  static inline void delay(long ticks)
 	schedule_timeout_uninterruptible(ticks);
 }
 
+int xfs_bio_add_page(struct bio *bio, struct page *page, unsigned int len,
+			unsigned int offset);
+
 /*
  * XFS wrapper structure for sysfs support. It depends on external data
  * structures and is embedded in various internal data structures to implement
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 1830d185d7fc..52f7d840d09e 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1673,8 +1673,12 @@  xlog_map_iclog_data(
 		struct page	*page = kmem_to_page(data);
 		unsigned int	off = offset_in_page(data);
 		size_t		len = min_t(size_t, count, PAGE_SIZE - off);
+		int		ret;
 
-		WARN_ON_ONCE(bio_add_page(bio, page, len, off) != len);
+		ret = xfs_bio_add_page(bio, page, len, off);
+		WARN_ON_ONCE(ret != len);
+		if (ret < 0)
+			break;
 
 		data += len;
 		count -= len;