diff mbox series

[05/20] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset

Message ID 20230712211115.2174650-6-kent.overstreet@linux.dev (mailing list archive)
State New, archived
Headers show
Series None | expand

Commit Message

Kent Overstreet July 12, 2023, 9:11 p.m. UTC
bio_iov_iter_get_pages() trims the IO based on the block size of the
block device the IO will be issued to.

However, bcachefs is a multi device filesystem; when we're creating the
bio we don't yet know which block device the bio will be submitted to -
we have to handle the alignment checks elsewhere.

Thus this is needed to avoid a null ptr deref.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
---
 block/bio.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Comments

Christoph Hellwig July 24, 2023, 5:34 p.m. UTC | #1
On Wed, Jul 12, 2023 at 05:11:00PM -0400, Kent Overstreet wrote:
> bio_iov_iter_get_pages() trims the IO based on the block size of the
> block device the IO will be issued to.
> 
> However, bcachefs is a multi device filesystem; when we're creating the
> bio we don't yet know which block device the bio will be submitted to -
> we have to handle the alignment checks elsewhere.

So, we've been trying really hard to always make sure to pass a bdev
to anything that allocates a bio, mostly due due the fact that we
actually derive information like the blk-cgroup associations from it.

The whole blk-cgroup stuff is actually a problem for non-trivial
multi-device setups.  XFS gets away fine because each file just
sits on either the main or RT device and no user I/O goes to the
log device, and btrfs papers over it in a weird way by always
associating with the last added device, which is in many ways gross
and wrong, but at least satisfies the assumptions made in blk-cgroup.

How do you plan to deal with this?  Because I really don't want folks
just to go ahead and ignore the issues, we need to actually sort this
out.
Kent Overstreet July 25, 2023, 2:43 a.m. UTC | #2
On Mon, Jul 24, 2023 at 10:34:20AM -0700, Christoph Hellwig wrote:
> On Wed, Jul 12, 2023 at 05:11:00PM -0400, Kent Overstreet wrote:
> > bio_iov_iter_get_pages() trims the IO based on the block size of the
> > block device the IO will be issued to.
> > 
> > However, bcachefs is a multi device filesystem; when we're creating the
> > bio we don't yet know which block device the bio will be submitted to -
> > we have to handle the alignment checks elsewhere.
> 
> So, we've been trying really hard to always make sure to pass a bdev
> to anything that allocates a bio, mostly due due the fact that we
> actually derive information like the blk-cgroup associations from it.
> 
> The whole blk-cgroup stuff is actually a problem for non-trivial
> multi-device setups.  XFS gets away fine because each file just
> sits on either the main or RT device and no user I/O goes to the
> log device, and btrfs papers over it in a weird way by always
> associating with the last added device, which is in many ways gross
> and wrong, but at least satisfies the assumptions made in blk-cgroup.
> 
> How do you plan to deal with this?  Because I really don't want folks
> just to go ahead and ignore the issues, we need to actually sort this
> out.

Doing the blk-cgroup association at bio alloc time sounds broken to me,
because of stacking block devices - why was the association not done at
generic_make_request() time?
Christoph Hellwig July 26, 2023, 1:23 p.m. UTC | #3
On Mon, Jul 24, 2023 at 10:43:12PM -0400, Kent Overstreet wrote:
> Doing the blk-cgroup association at bio alloc time sounds broken to me,
> because of stacking block devices - why was the association not done at
> generic_make_request() time?

Because blk-cgroup not only works at the lowest level in the stack,
but also for stackable block devices.  It's not a design decision I
particularly agree with, but it's been there forever.

So we need to assign it when creating the bio (we used to do it at
submission time, but the way it was done was horribly ineffcient,
that's why I moved it to allocation time), and then when hitting a
stacked device it get reassinged (which still is horribly inefficient).
Kent Overstreet Aug. 1, 2023, 7:04 p.m. UTC | #4
On Wed, Jul 26, 2023 at 06:23:05AM -0700, Christoph Hellwig wrote:
> On Mon, Jul 24, 2023 at 10:43:12PM -0400, Kent Overstreet wrote:
> > Doing the blk-cgroup association at bio alloc time sounds broken to me,
> > because of stacking block devices - why was the association not done at
> > generic_make_request() time?
> 
> Because blk-cgroup not only works at the lowest level in the stack,
> but also for stackable block devices.  It's not a design decision I
> particularly agree with, but it's been there forever.

You're setting the association only to the highest block device in the
stack - how on earth is it supposed to work with anything lower?

And looking at bio_associate_blkg(), this code looks completely broken.
It's checking bio->bi_blkg, but that's just been set to NULL in
bio_init().

And this is your code, so I think you need to go over this again.

Anyways, bio_associate_blkg() is also called by bio_set_dev(), which
means passing the block device to bio_init() was a completely pointless
change. bcachefs uses bio_set_dev(), so everything is fine.
Christoph Hellwig Aug. 2, 2023, 11:47 a.m. UTC | #5
On Tue, Aug 01, 2023 at 03:04:50PM -0400, Kent Overstreet wrote:
> > Because blk-cgroup not only works at the lowest level in the stack,
> > but also for stackable block devices.  It's not a design decision I
> > particularly agree with, but it's been there forever.
> 
> You're setting the association only to the highest block device in the
> stack - how on earth is it supposed to work with anything lower?

Hey, ask the cgroup folks as they come up with it.  I'm not going to
defend the logic here.

> And looking at bio_associate_blkg(), this code looks completely broken.
> It's checking bio->bi_blkg, but that's just been set to NULL in
> bio_init().

It's checking bi_blkg because it can also be called from bio_set_dev.

> And this is your code, so I think you need to go over this again.

It's "my code" in the sene of that I did one big round of unwinding
the even bigger mess that was there.  There is another few rounds needed
for the code to vaguely make sense.
Kent Overstreet Aug. 2, 2023, 4:44 p.m. UTC | #6
On Wed, Aug 02, 2023 at 04:47:18AM -0700, Christoph Hellwig wrote:
> On Tue, Aug 01, 2023 at 03:04:50PM -0400, Kent Overstreet wrote:
> > > Because blk-cgroup not only works at the lowest level in the stack,
> > > but also for stackable block devices.  It's not a design decision I
> > > particularly agree with, but it's been there forever.
> > 
> > You're setting the association only to the highest block device in the
> > stack - how on earth is it supposed to work with anything lower?
> 
> Hey, ask the cgroup folks as they come up with it.  I'm not going to
> defend the logic here.
> 
> > And looking at bio_associate_blkg(), this code looks completely broken.
> > It's checking bio->bi_blkg, but that's just been set to NULL in
> > bio_init().
> 
> It's checking bi_blkg because it can also be called from bio_set_dev.

So bio_set_dev() has subtly different behaviour than passing the block
device to bio_init()?

That's just broken.

> 
> > And this is your code, so I think you need to go over this again.
> 
> It's "my code" in the sene of that I did one big round of unwinding
> the even bigger mess that was there.  There is another few rounds needed
> for the code to vaguely make sense.

Well, I'll watch for those patches then...
diff mbox series

Patch

diff --git a/block/bio.c b/block/bio.c
index 1e75840d17..e74a04ea14 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1245,7 +1245,7 @@  static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	struct page **pages = (struct page **)bv;
 	ssize_t size, left;
 	unsigned len, i = 0;
-	size_t offset, trim;
+	size_t offset;
 	int ret = 0;
 
 	/*
@@ -1274,10 +1274,12 @@  static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
 
-	trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
-	iov_iter_revert(iter, trim);
+	if (bio->bi_bdev) {
+		size_t trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
+		iov_iter_revert(iter, trim);
+		size -= trim;
+	}
 
-	size -= trim;
 	if (unlikely(!size)) {
 		ret = -EFAULT;
 		goto out;