diff mbox series

[vfs.all,22/26] block: stash a bdev_file to read/write raw blcok_device

Message ID 20240406090930.2252838-23-yukuai1@huaweicloud.com (mailing list archive)
State New
Headers show
Series fs & block: remove bdev->bd_inode | expand

Commit Message

Yu Kuai April 6, 2024, 9:09 a.m. UTC
From: Yu Kuai <yukuai3@huawei.com>

So that iomap and bffer_head can convert to use bdev_file in following
patches.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/bdev.c              | 137 +++++++++++++++++++++++++++++---------
 include/linux/blk_types.h |   1 +
 2 files changed, 107 insertions(+), 31 deletions(-)

Comments

Al Viro April 6, 2024, 7:42 p.m. UTC | #1
On Sat, Apr 06, 2024 at 05:09:26PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> So that iomap and bffer_head can convert to use bdev_file in following
> patches.

Let me see if I got it straight.  You introduce dummy struct file instances
(no methods, nothing).  The *ONLY* purpose they serve is to correspond to
opened instances of struct bdev.  No other use is possible.

You shove them into ->i_private of bdevfs inodes.  Lifetime rules are...
odd.

In bdev_open() you arrange for such beast to be present.  You never
return it anywhere, they only get accessed via ->i_private, exposing
it at least to fs/buffer.c.  Reference to those suckers get stored
(without grabbing refcount) into buffer_head instances.

And all of that is for... what, exactly?
Al Viro April 6, 2024, 8:29 p.m. UTC | #2
On Sat, Apr 06, 2024 at 08:42:06PM +0100, Al Viro wrote:
> On Sat, Apr 06, 2024 at 05:09:26PM +0800, Yu Kuai wrote:
> > From: Yu Kuai <yukuai3@huawei.com>
> > 
> > So that iomap and bffer_head can convert to use bdev_file in following
> > patches.
> 
> Let me see if I got it straight.  You introduce dummy struct file instances
> (no methods, nothing).  The *ONLY* purpose they serve is to correspond to
> opened instances of struct bdev.  No other use is possible.
> 
> You shove them into ->i_private of bdevfs inodes.  Lifetime rules are...
> odd.
> 
> In bdev_open() you arrange for such beast to be present.  You never
> return it anywhere, they only get accessed via ->i_private, exposing
> it at least to fs/buffer.c.  Reference to those suckers get stored
> (without grabbing refcount) into buffer_head instances.
> 
> And all of that is for... what, exactly?

Put another way, what's the endgame here?  Are you going to try and
propagate those beasts down into bio_alloc()?  Because if you do not,
you need to keep struct block_device * around anyway.

We use ->b_bdev for several things:
	* passing to bio_alloc() (quite a few places)
	* %pg in debugging printks
	* (rare) passing to write_boundary_block().
	* (twice) passing to clean_bdev_aliases().
	* (once) passing to __find_get_block().
	* one irregular use as a key in lookup_bh_lru()

IDGI...
Yu Kuai April 7, 2024, 1:18 a.m. UTC | #3
Hi,

在 2024/04/07 4:29, Al Viro 写道:
> On Sat, Apr 06, 2024 at 08:42:06PM +0100, Al Viro wrote:
>> On Sat, Apr 06, 2024 at 05:09:26PM +0800, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> So that iomap and bffer_head can convert to use bdev_file in following
>>> patches.
>>
>> Let me see if I got it straight.  You introduce dummy struct file instances
>> (no methods, nothing).  The *ONLY* purpose they serve is to correspond to
>> opened instances of struct bdev.  No other use is possible.

Yes, this is the only purpose.
>>
>> You shove them into ->i_private of bdevfs inodes.  Lifetime rules are...
>> odd.
>>
>> In bdev_open() you arrange for such beast to be present.  You never
>> return it anywhere, they only get accessed via ->i_private, exposing
>> it at least to fs/buffer.c.  Reference to those suckers get stored
>> (without grabbing refcount) into buffer_head instances.
>>
>> And all of that is for... what, exactly?
> 
> Put another way, what's the endgame here?  Are you going to try and
> propagate those beasts down into bio_alloc()?  Because if you do not,
> you need to keep struct block_device * around anyway.

Yes, patch 23-26 already do the work to remove the field block_device
and convert to use bdev_file for iomap and buffer_head.

Or maybe you prefer the idea from last version to keep the block_device
field in iomap/buffer_head, and use it for raw block_device fops?

Thanks,
Kuai

> 
> We use ->b_bdev for several things:
> 	* passing to bio_alloc() (quite a few places)
> 	* %pg in debugging printks
> 	* (rare) passing to write_boundary_block().
> 	* (twice) passing to clean_bdev_aliases().
> 	* (once) passing to __find_get_block().
> 	* one irregular use as a key in lookup_bh_lru()
> 
> IDGI...
> .
>
Al Viro April 7, 2024, 1:51 a.m. UTC | #4
On Sun, Apr 07, 2024 at 09:18:20AM +0800, Yu Kuai wrote:

> Yes, patch 23-26 already do the work to remove the field block_device
> and convert to use bdev_file for iomap and buffer_head.

What for?  I mean, what makes that dummy struct file * any better than
struct block_device *?  What's the point?

I agree that keeping an opened struct file for a block device is
a good idea - certainly better than weird crap used to carry the
"how had it been opened" along with bdev.  But that does *not*
mean not keeping ->s_bdev around; we might or might not find that
convenient, but it's not "struct block_device is Evil(tm), let's
exorcise".

Why do we care to do anything to struct buffer_head?  Or to
struct bio, for that matter...

I'm not saying that parts of the patchset do not make sense on
their own, but I don't understand what the last part is all
about.

Al, still going through that series...
Yu Kuai April 7, 2024, 2:34 a.m. UTC | #5
Hi,

在 2024/04/07 9:51, Al Viro 写道:
> On Sun, Apr 07, 2024 at 09:18:20AM +0800, Yu Kuai wrote:
> 
>> Yes, patch 23-26 already do the work to remove the field block_device
>> and convert to use bdev_file for iomap and buffer_head.
> 
> What for?  I mean, what makes that dummy struct file * any better than
> struct block_device *?  What's the point?
> 
> I agree that keeping an opened struct file for a block device is
> a good idea - certainly better than weird crap used to carry the
> "how had it been opened" along with bdev.  But that does *not*
> mean not keeping ->s_bdev around; we might or might not find that
> convenient, but it's not "struct block_device is Evil(tm), let's
> exorcise".
> 
> Why do we care to do anything to struct buffer_head?  Or to
> struct bio, for that matter...

Other than raw block_device fops, other filesystems can use the opened
bdev_file directly for iomap and buffer_head, and they actually don't
need to reference block_device anymore. The point here is that whether
we want to keep a special handling for block_device fops or not. There
are two proposes now:

- one is from Christian to keep using block_device for block_device
fops, in order to do that, a new flag and some special handling is added
to iomap and buffer_head. See the patch from last version [1].

- one is from this patchset, allocate a *dummy* bdev_file just for iomap
and buffer_head to access bdev and bd_inode.

I personally prefer the later one, that's why there is a new version,
however, what do I know? That will depend on how people think.

[1] 
https://lore.kernel.org/all/20240222124555.2049140-20-yukuai1@huaweicloud.com/
Thanks,
Kuai

> 
> I'm not saying that parts of the patchset do not make sense on
> their own, but I don't understand what the last part is all
> about.
> 
> Al, still going through that series...
> .
>
Al Viro April 7, 2024, 3:06 a.m. UTC | #6
On Sun, Apr 07, 2024 at 10:34:56AM +0800, Yu Kuai wrote:

> Other than raw block_device fops, other filesystems can use the opened
> bdev_file directly for iomap and buffer_head, and they actually don't
> need to reference block_device anymore. The point here is that whether

What do you mean, "reference"?  The counting reference is to opened
file; ->s_bdev is a cached pointer to associated struct block_device,
and neither it nor pointers in buffer_head are valid past the moment
when you close the file.  Storing (non-counting) pointers to struct
file in struct buffer_head is not different in that respect - they
are *still* only valid while the "master" reference is held.

Again, what's the point of storing struct file * in struct buffer_head
or struct iomap?  In any instances of those structures?

There is a good reason to have it in places that keep a reference to
opened block device - the kind that _keeps_ the device opened.  Namely,
there's state that need to be carried from the place where we'd opened
the sucker to the place where we close it, and that state is better
carried by opened file.

But neither iomap nor buffer_head contain anything of that sort -
the lifetime management of the opened device is not in their
competence.  As the matter of fact, the logics around closing
those opened devices (bdev_release()) makes sure that no
instances of buffer_head (or iomap) will outlive them.
And they don't care about any extra state - everything
they use is in block_device and coallocated inode.

I could've easily missed something in one of the threads around
the earlier iterations of the patchset; if that's the case,
could somebody restate the rationale for that part and/or
post relevant lore.kernel.org links?  Christian?  hch?
What am I missing here?
Yu Kuai April 7, 2024, 3:21 a.m. UTC | #7
Hi,

在 2024/04/07 11:06, Al Viro 写道:
> On Sun, Apr 07, 2024 at 10:34:56AM +0800, Yu Kuai wrote:
> 
>> Other than raw block_device fops, other filesystems can use the opened
>> bdev_file directly for iomap and buffer_head, and they actually don't
>> need to reference block_device anymore. The point here is that whether
> 
> What do you mean, "reference"?  The counting reference is to opened
> file; ->s_bdev is a cached pointer to associated struct block_device,
> and neither it nor pointers in buffer_head are valid past the moment
> when you close the file.  Storing (non-counting) pointers to struct
> file in struct buffer_head is not different in that respect - they
> are *still* only valid while the "master" reference is held.
> 
> Again, what's the point of storing struct file * in struct buffer_head
> or struct iomap?  In any instances of those structures?

Perhaps this is what you missed, like the title of this set, in order to
remove direct acceess of bdev->bd_inode from fs/buffer, we must store
bdev_file in buffer_head and iomap, and 'bdev->bd_inode' is replaced
with 'file_inode(bdev)' now.

Some history of previous discussions:

[1] https://lore.kernel.org/all/ZWRDeQ4K8BiYnV+X@infradead.org/
[2] 
https://lore.kernel.org/all/28237ec3-c3c1-1f0c-5250-04a88845d4a6@huaweicloud.com/
[3] 
https://lore.kernel.org/all/20240129-vfs-bdev-file-bd_inode-v1-0-42eb9eea96cf@kernel.org/

Thanks,
Kuai

> 
> There is a good reason to have it in places that keep a reference to
> opened block device - the kind that _keeps_ the device opened.  Namely,
> there's state that need to be carried from the place where we'd opened
> the sucker to the place where we close it, and that state is better
> carried by opened file.
> 
> But neither iomap nor buffer_head contain anything of that sort -
> the lifetime management of the opened device is not in their
> competence.  As the matter of fact, the logics around closing
> those opened devices (bdev_release()) makes sure that no
> instances of buffer_head (or iomap) will outlive them.
> And they don't care about any extra state - everything
> they use is in block_device and coallocated inode.
> 
> I could've easily missed something in one of the threads around
> the earlier iterations of the patchset; if that's the case,
> could somebody restate the rationale for that part and/or
> post relevant lore.kernel.org links?  Christian?  hch?
> What am I missing here?
> .
>
Al Viro April 7, 2024, 4:57 a.m. UTC | #8
On Sun, Apr 07, 2024 at 11:21:56AM +0800, Yu Kuai wrote:

> Perhaps this is what you missed, like the title of this set, in order to
> remove direct acceess of bdev->bd_inode from fs/buffer, we must store
> bdev_file in buffer_head and iomap, and 'bdev->bd_inode' is replaced
> with 'file_inode(bdev)' now.

TBH, that looks like a very massive overkill that doesn't address
the real issues - you are still poking in that coallocated struct
inode, only instead of fetching it from pointer in struct block_device
you get it from a pointer in a dummy struct file.  How is that an
improvement?  After all, grepping for '\->[ 	]*bd_inode\>' and
looking through the few remaining users in e.g. fs/buffer.c is
*much* easier than grepping for file_inode callers.

AFAICS, Christoph's objections had been about the need to use saner
APIs instead of getting to inode in some way and poking in it.
And I agree that quite a few things in that series do just
that.  The final part doesn't.

IMO, those dummy struct file (used as convenient storage for pointer
to address_space *and* to the damn inode, with all its guts hanging
out) are simply wrong.

To reiterate:
	* we need to reduce the number of uses of those inodes
	* we need to find out what *is* getting used and sort out
the sane set of primitives; that's hard to do when we still have
a lot of noise.
	* we need convert to those saner primitives
	* we need to prevent reintroduction of noise, or at least
make such reintroduced noise easy to catch and whack.

->bd_inode is a problem because it's an attractive nuisance.  Removing
it would be fine, if there wasn't a harder to spot alternative way to
get the same pointer.  Try to grep for file_inode and bd_inode resp.
and compare the hit counts.  Seriously reduced set of bd_inode users is
fine - my impression is that after this series without the final part
we'd be down to 20 users or so.  In the meanwhile, there's about 1.4e3
users of file_inode(), most of them completely unrelated to block
devices.

PS: in grow_dev_folio() we probably want
	struct address_space *mapping = bdev->bd_inode->i_mapping;
instead of
	struct inode *inode = bdev->bd_inode;
as one of the preliminary chunks.
FWIW, it really looks like address_space (== page cache of block device,
not an unreasonably candidate for primitive) and block size (well,
logarithm thereof) cover the majority of what remains, with device
size possibly being (remote) third...
Al Viro April 7, 2024, 5:11 a.m. UTC | #9
On Sun, Apr 07, 2024 at 05:57:58AM +0100, Al Viro wrote:

> PS: in grow_dev_folio() we probably want
> 	struct address_space *mapping = bdev->bd_inode->i_mapping;
> instead of
> 	struct inode *inode = bdev->bd_inode;
> as one of the preliminary chunks.
> FWIW, it really looks like address_space (== page cache of block device,
> not an unreasonably candidate for primitive) and block size (well,
> logarithm thereof) cover the majority of what remains, with device
> size possibly being (remote) third...

Incidentally, how painful would it be to switch __bread_gfp() and __bread()
to passing *logarithm* of block size instead of block size?  And possibly
supply the same to clean_bdev_aliases()...

That would reduce fs/buffer.c uses to just "give me the address_space of
that block device"...
Al Viro April 7, 2024, 5:21 a.m. UTC | #10
On Sun, Apr 07, 2024 at 06:11:19AM +0100, Al Viro wrote:
> On Sun, Apr 07, 2024 at 05:57:58AM +0100, Al Viro wrote:
> 
> > PS: in grow_dev_folio() we probably want
> > 	struct address_space *mapping = bdev->bd_inode->i_mapping;
> > instead of
> > 	struct inode *inode = bdev->bd_inode;
> > as one of the preliminary chunks.
> > FWIW, it really looks like address_space (== page cache of block device,
> > not an unreasonably candidate for primitive) and block size (well,
> > logarithm thereof) cover the majority of what remains, with device
> > size possibly being (remote) third...
> 
> Incidentally, how painful would it be to switch __bread_gfp() and __bread()
> to passing *logarithm* of block size instead of block size?  And possibly
> supply the same to clean_bdev_aliases()...
> 
> That would reduce fs/buffer.c uses to just "give me the address_space of
> that block device"...

... and from what I've seen in your series, it very much looks like after
that we could replace ->bd_inode with ->bd_mapping, turning your bdev_mapping()
into an inline and (hopefully) leaving the few remaining uses of bdev_inode()
outside of block/bdev.c _not_ on hot paths.  If nothing else, it would
make it much easier to grep for remaining odd stuff.

Might trim the btrfs parts of the series, at that - a lot of that seems to
be "how do we propagate opened file instead of just bdev, so that we could
get to its ->f_mapping deep in call chain"...

Again, all of that is only if __bread...() conversion to log(size) is feasible
without a massive PITA - there might be dragons...
Al Viro April 9, 2024, 4:26 a.m. UTC | #11
On Sun, Apr 07, 2024 at 11:21:56AM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2024/04/07 11:06, Al Viro 写道:
> > On Sun, Apr 07, 2024 at 10:34:56AM +0800, Yu Kuai wrote:
> > 
> > > Other than raw block_device fops, other filesystems can use the opened
> > > bdev_file directly for iomap and buffer_head, and they actually don't
> > > need to reference block_device anymore. The point here is that whether
> > 
> > What do you mean, "reference"?  The counting reference is to opened
> > file; ->s_bdev is a cached pointer to associated struct block_device,
> > and neither it nor pointers in buffer_head are valid past the moment
> > when you close the file.  Storing (non-counting) pointers to struct
> > file in struct buffer_head is not different in that respect - they
> > are *still* only valid while the "master" reference is held.
> > 
> > Again, what's the point of storing struct file * in struct buffer_head
> > or struct iomap?  In any instances of those structures?
> 
> Perhaps this is what you missed, like the title of this set, in order to
> remove direct acceess of bdev->bd_inode from fs/buffer, we must store
> bdev_file in buffer_head and iomap, and 'bdev->bd_inode' is replaced
> with 'file_inode(bdev)' now.

BTW, what does that have to do with iomap?  All it passes ->bdev to is
	1) bio_alloc()
	2) bio_alloc_bioset()
	3) bio_init()
	4) bdev_logical_block_size()
	5) bdev_iter_is_aligned()
	6) bdev_fua() 
	7) bdev_write_cache()

None of those goes anywhere near fs/buffer.c or uses ->bd_inode, AFAICS.

Again, what's the point?  It feels like you are trying to replace *all*
uses of struct block_device with struct file, just because.

If that's what's going on, please don't.  Using struct file instead
of that bdev_handle crap - sure, makes perfect sense.  But shoving it
down into struct bio really, really does not.

I'd suggest to start with adding ->bd_mapping as the first step and
converting the places where mapping is all we want to using that.
Right at the beginning of your series.  Then let's see what gets
left.

And leave ->bd_inode there for now; don't blindly replace it with
->bd_mapping->host everywhere.  It's much easier to grep for.
The point of the exercise is to find what do we really need ->bd_inode
for and what primitives are missing, not getting rid of a bad word...
Al Viro April 9, 2024, 4:53 a.m. UTC | #12
On Tue, Apr 09, 2024 at 05:26:43AM +0100, Al Viro wrote:
> On Sun, Apr 07, 2024 at 11:21:56AM +0800, Yu Kuai wrote:
> > Hi,
> > 
> > 在 2024/04/07 11:06, Al Viro 写道:
> > > On Sun, Apr 07, 2024 at 10:34:56AM +0800, Yu Kuai wrote:
> > > 
> > > > Other than raw block_device fops, other filesystems can use the opened
> > > > bdev_file directly for iomap and buffer_head, and they actually don't
> > > > need to reference block_device anymore. The point here is that whether
> > > 
> > > What do you mean, "reference"?  The counting reference is to opened
> > > file; ->s_bdev is a cached pointer to associated struct block_device,
> > > and neither it nor pointers in buffer_head are valid past the moment
> > > when you close the file.  Storing (non-counting) pointers to struct
> > > file in struct buffer_head is not different in that respect - they
> > > are *still* only valid while the "master" reference is held.
> > > 
> > > Again, what's the point of storing struct file * in struct buffer_head
> > > or struct iomap?  In any instances of those structures?
> > 
> > Perhaps this is what you missed, like the title of this set, in order to
> > remove direct acceess of bdev->bd_inode from fs/buffer, we must store
> > bdev_file in buffer_head and iomap, and 'bdev->bd_inode' is replaced
> > with 'file_inode(bdev)' now.
> 
> BTW, what does that have to do with iomap?  All it passes ->bdev to is
> 	1) bio_alloc()
> 	2) bio_alloc_bioset()
> 	3) bio_init()
> 	4) bdev_logical_block_size()
> 	5) bdev_iter_is_aligned()
> 	6) bdev_fua() 
> 	7) bdev_write_cache()
> 
> None of those goes anywhere near fs/buffer.c or uses ->bd_inode, AFAICS.

Note that callers of iomap stuff in block/fops.c *do* have struct file *,
so there's no problem with getting to inode - there the use of ->f_mapping->host
is normal for ->write_iter()/->read_iter() instances.  Same for filemap_read()
and iomap_file_buffered_write().

As the matter of fact, the only use of ->bd_inode in block/fops.c is easily
killable, as discussed upthread.
Yu Kuai April 9, 2024, 6:22 a.m. UTC | #13
Hi,

在 2024/04/09 12:26, Al Viro 写道:
> On Sun, Apr 07, 2024 at 11:21:56AM +0800, Yu Kuai wrote:
>> Hi,
>>
>> 在 2024/04/07 11:06, Al Viro 写道:
>>> On Sun, Apr 07, 2024 at 10:34:56AM +0800, Yu Kuai wrote:
>>>
>>>> Other than raw block_device fops, other filesystems can use the opened
>>>> bdev_file directly for iomap and buffer_head, and they actually don't
>>>> need to reference block_device anymore. The point here is that whether
>>>
>>> What do you mean, "reference"?  The counting reference is to opened
>>> file; ->s_bdev is a cached pointer to associated struct block_device,
>>> and neither it nor pointers in buffer_head are valid past the moment
>>> when you close the file.  Storing (non-counting) pointers to struct
>>> file in struct buffer_head is not different in that respect - they
>>> are *still* only valid while the "master" reference is held.
>>>
>>> Again, what's the point of storing struct file * in struct buffer_head
>>> or struct iomap?  In any instances of those structures?
>>
>> Perhaps this is what you missed, like the title of this set, in order to
>> remove direct acceess of bdev->bd_inode from fs/buffer, we must store
>> bdev_file in buffer_head and iomap, and 'bdev->bd_inode' is replaced
>> with 'file_inode(bdev)' now.
> 
> BTW, what does that have to do with iomap?  All it passes ->bdev to is
> 	1) bio_alloc()
> 	2) bio_alloc_bioset()
> 	3) bio_init()
> 	4) bdev_logical_block_size()
> 	5) bdev_iter_is_aligned()
> 	6) bdev_fua()
> 	7) bdev_write_cache()
> 
> None of those goes anywhere near fs/buffer.c or uses ->bd_inode, AFAICS.
> 
> Again, what's the point?  It feels like you are trying to replace *all*
> uses of struct block_device with struct file, just because.
> 
> If that's what's going on, please don't.  Using struct file instead
> of that bdev_handle crap - sure, makes perfect sense.  But shoving it
> down into struct bio really, really does not.
> 
> I'd suggest to start with adding ->bd_mapping as the first step and
> converting the places where mapping is all we want to using that.
> Right at the beginning of your series.  Then let's see what gets
> left.

Thanks so much for your advice, in fact, I totally agree with this that
adding a 'bd_mapping' or expose the helper bdev_mapping().

However, I will let Christoph and Jan to make the decision, when they
get time to take a look at this.

Thanks!
Kuai

> 
> And leave ->bd_inode there for now; don't blindly replace it with
> ->bd_mapping->host everywhere.  It's much easier to grep for.
> The point of the exercise is to find what do we really need ->bd_inode
> for and what primitives are missing, not getting rid of a bad word...
> .
>
Christian Brauner April 9, 2024, 9 a.m. UTC | #14
On Sun, Apr 07, 2024 at 04:06:10AM +0100, Al Viro wrote:
> On Sun, Apr 07, 2024 at 10:34:56AM +0800, Yu Kuai wrote:
> 
> > Other than raw block_device fops, other filesystems can use the opened
> > bdev_file directly for iomap and buffer_head, and they actually don't
> > need to reference block_device anymore. The point here is that whether
> 
> What do you mean, "reference"?  The counting reference is to opened
> file; ->s_bdev is a cached pointer to associated struct block_device,
> and neither it nor pointers in buffer_head are valid past the moment
> when you close the file.  Storing (non-counting) pointers to struct
> file in struct buffer_head is not different in that respect - they
> are *still* only valid while the "master" reference is held.
> 
> Again, what's the point of storing struct file * in struct buffer_head
> or struct iomap?  In any instances of those structures?
> 
> There is a good reason to have it in places that keep a reference to
> opened block device - the kind that _keeps_ the device opened.  Namely,
> there's state that need to be carried from the place where we'd opened
> the sucker to the place where we close it, and that state is better
> carried by opened file.
> 
> But neither iomap nor buffer_head contain anything of that sort -
> the lifetime management of the opened device is not in their
> competence.  As the matter of fact, the logics around closing
> those opened devices (bdev_release()) makes sure that no
> instances of buffer_head (or iomap) will outlive them.
> And they don't care about any extra state - everything
> they use is in block_device and coallocated inode.
> 
> I could've easily missed something in one of the threads around
> the earlier iterations of the patchset; if that's the case,
> could somebody restate the rationale for that part and/or
> post relevant lore.kernel.org links?  Christian?  hch?
> What am I missing here?

The original series was a simple RFC/POC to show that struct file could
be used to remove bd_inode access in a wide variety of situations. But
as I've mentioned in that thread I wasn't happy with various aspects of
the approach which is why I never pushed forward with it. The part where
we pushed struct file into buffer_header was the most obvious one.
Christian Brauner April 9, 2024, 10:23 a.m. UTC | #15
> +static int __stash_bdev_file(struct block_device *bdev)

I've said that on the previous version. I think that this is really
error prone and seems overall like an unpleasant solution. I would
really like to avoid going down that route.

I think a chunk of this series is good though specicially simple
conversions of individual filesystems where file_inode() or f_mapping
makes sense. There's a few exceptions where we might be better of
replacing the current apis with something else (I think Al touched on
that somewhere further down the thread.).

I'd suggest the straightforward bd_inode removals into a separate series
that I can take.

Thanks for working on all of this. It's certainly a contentious area.
Yu Kuai April 9, 2024, 11:53 a.m. UTC | #16
Hi,

在 2024/04/09 18:23, Christian Brauner 写道:
>> +static int __stash_bdev_file(struct block_device *bdev)
> 
> I've said that on the previous version. I think that this is really
> error prone and seems overall like an unpleasant solution. I would
> really like to avoid going down that route.

Yes, I see your point, and it's indeed reasonable.

> 
> I think a chunk of this series is good though specicially simple
> conversions of individual filesystems where file_inode() or f_mapping
> makes sense. There's a few exceptions where we might be better of
> replacing the current apis with something else (I think Al touched on
> that somewhere further down the thread.).
> 
> I'd suggest the straightforward bd_inode removals into a separate series
> that I can take.
> 
> Thanks for working on all of this. It's certainly a contentious area.

How about following simple patch to expose bdev_mapping() for
fs/buffer.c for now?

Thanks,
Kuai

diff --git a/block/blk.h b/block/blk.h
index a34bb590cce6..f8bcb43a12c6 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -428,7 +428,6 @@ static inline int blkdev_zone_mgmt_ioctl(struct 
block_device *bdev,
  #endif /* CONFIG_BLK_DEV_ZONED */

  struct inode *bdev_inode(struct block_device *bdev);
-struct address_space *bdev_mapping(struct block_device *bdev);
  struct block_device *bdev_alloc(struct gendisk *disk, u8 partno);
  void bdev_add(struct block_device *bdev, dev_t dev);

diff --git a/fs/buffer.c b/fs/buffer.c
index 4f73d23c2c46..e2bd19e3fe48 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -189,8 +189,8 @@ EXPORT_SYMBOL(end_buffer_write_sync);
  static struct buffer_head *
  __find_get_block_slow(struct block_device *bdev, sector_t block)
  {
-       struct inode *bd_inode = bdev->bd_inode;
-       struct address_space *bd_mapping = bd_inode->i_mapping;
+       struct address_space *bd_mapping = bdev_mapping(bdev);
+       struct inode *bd_inode = bd_mapping->host;
         struct buffer_head *ret = NULL;
         pgoff_t index;
         struct buffer_head *bh;
@@ -1034,12 +1034,12 @@ static sector_t folio_init_buffers(struct folio 
*folio,
  static bool grow_dev_folio(struct block_device *bdev, sector_t block,
                 pgoff_t index, unsigned size, gfp_t gfp)
  {
-       struct inode *inode = bdev->bd_inode;
+       struct address_space *bd_mapping = bdev_mapping(bdev);
         struct folio *folio;
         struct buffer_head *bh;
         sector_t end_block = 0;

-       folio = __filemap_get_folio(inode->i_mapping, index,
+       folio = __filemap_get_folio(bd_mapping, index,
                         FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp);
         if (IS_ERR(folio))
                 return false;
@@ -1073,10 +1073,10 @@ static bool grow_dev_folio(struct block_device 
*bdev, sector_t block,
          * lock to be atomic wrt __find_get_block(), which does not
          * run under the folio lock.
          */
-       spin_lock(&inode->i_mapping->i_private_lock);
+       spin_lock(&bd_mapping->i_private_lock);
         link_dev_buffers(folio, bh);
         end_block = folio_init_buffers(folio, bdev, size);
-       spin_unlock(&inode->i_mapping->i_private_lock);
+       spin_unlock(&bd_mapping->i_private_lock);
  unlock:
         folio_unlock(folio);
         folio_put(folio);
@@ -1463,7 +1463,7 @@ __bread_gfp(struct block_device *bdev, sector_t block,
  {
         struct buffer_head *bh;

-       gfp |= mapping_gfp_constraint(bdev->bd_inode->i_mapping, ~__GFP_FS);
+       gfp |= mapping_gfp_constraint(bdev_mapping(bdev), ~__GFP_FS);

         /*
          * Prefer looping in the allocator rather than here, at least that
@@ -1696,8 +1696,8 @@ EXPORT_SYMBOL(create_empty_buffers);
   */
  void clean_bdev_aliases(struct block_device *bdev, sector_t block, 
sector_t len)
  {
-       struct inode *bd_inode = bdev->bd_inode;
-       struct address_space *bd_mapping = bd_inode->i_mapping;
+       struct address_space *bd_mapping = bdev_mapping(bdev);
+       struct inode *bd_inode = bd_mapping->host;
         struct folio_batch fbatch;
         pgoff_t index = ((loff_t)block << bd_inode->i_blkbits) / PAGE_SIZE;
         pgoff_t end;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bc840e0fb6e5..bbae55535d53 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1527,6 +1527,7 @@ void blkdev_put_no_open(struct block_device *bdev);

  struct block_device *I_BDEV(struct inode *inode);
  struct block_device *file_bdev(struct file *bdev_file);
+struct address_space *bdev_mapping(struct block_device *bdev);
  bool disk_live(struct gendisk *disk);
  unsigned int block_size(struct block_device *bdev);

> .
>
Jan Kara April 10, 2024, 10:59 a.m. UTC | #17
On Tue 09-04-24 14:22:37, Yu Kuai wrote:
> Hi,
> 
> 在 2024/04/09 12:26, Al Viro 写道:
> > On Sun, Apr 07, 2024 at 11:21:56AM +0800, Yu Kuai wrote:
> > > Hi,
> > > 
> > > 在 2024/04/07 11:06, Al Viro 写道:
> > > > On Sun, Apr 07, 2024 at 10:34:56AM +0800, Yu Kuai wrote:
> > > > 
> > > > > Other than raw block_device fops, other filesystems can use the opened
> > > > > bdev_file directly for iomap and buffer_head, and they actually don't
> > > > > need to reference block_device anymore. The point here is that whether
> > > > 
> > > > What do you mean, "reference"?  The counting reference is to opened
> > > > file; ->s_bdev is a cached pointer to associated struct block_device,
> > > > and neither it nor pointers in buffer_head are valid past the moment
> > > > when you close the file.  Storing (non-counting) pointers to struct
> > > > file in struct buffer_head is not different in that respect - they
> > > > are *still* only valid while the "master" reference is held.
> > > > 
> > > > Again, what's the point of storing struct file * in struct buffer_head
> > > > or struct iomap?  In any instances of those structures?
> > > 
> > > Perhaps this is what you missed, like the title of this set, in order to
> > > remove direct acceess of bdev->bd_inode from fs/buffer, we must store
> > > bdev_file in buffer_head and iomap, and 'bdev->bd_inode' is replaced
> > > with 'file_inode(bdev)' now.
> > 
> > BTW, what does that have to do with iomap?  All it passes ->bdev to is
> > 	1) bio_alloc()
> > 	2) bio_alloc_bioset()
> > 	3) bio_init()
> > 	4) bdev_logical_block_size()
> > 	5) bdev_iter_is_aligned()
> > 	6) bdev_fua()
> > 	7) bdev_write_cache()
> > 
> > None of those goes anywhere near fs/buffer.c or uses ->bd_inode, AFAICS.
> > 
> > Again, what's the point?  It feels like you are trying to replace *all*
> > uses of struct block_device with struct file, just because.
> > 
> > If that's what's going on, please don't.  Using struct file instead
> > of that bdev_handle crap - sure, makes perfect sense.  But shoving it
> > down into struct bio really, really does not.
> > 
> > I'd suggest to start with adding ->bd_mapping as the first step and
> > converting the places where mapping is all we want to using that.
> > Right at the beginning of your series.  Then let's see what gets
> > left.
> 
> Thanks so much for your advice, in fact, I totally agree with this that
> adding a 'bd_mapping' or expose the helper bdev_mapping().
> 
> However, I will let Christoph and Jan to make the decision, when they
> get time to take a look at this.

I agree with Christian and Al - and I think I've expressed that already in
the previous version of the series [1] but I guess I was not explicit
enough :). I think the initial part of the series (upto patch 21, perhaps
excluding patch 20) is a nice cleanup but the latter part playing with
stashing struct file is not an improvement and seems pointless to me. So
I'd separate the initial part cleaning up the obvious places and let
Christian merge it and then we can figure out what (if anything) to do with
remaining bd_inode uses in fs/buffer.c etc. E.g. what Al suggests with
bd_mapping makes sense to me but I didn't check what's left after your
initial patches...

								Honza

[1] https://lore.kernel.org/all/20240322125750.jov4f3alsrkmqnq7@quack3
Al Viro April 10, 2024, 10:34 p.m. UTC | #18
On Wed, Apr 10, 2024 at 12:59:11PM +0200, Jan Kara wrote:

> I agree with Christian and Al - and I think I've expressed that already in
> the previous version of the series [1] but I guess I was not explicit
> enough :). I think the initial part of the series (upto patch 21, perhaps
> excluding patch 20) is a nice cleanup but the latter part playing with
> stashing struct file is not an improvement and seems pointless to me. So
> I'd separate the initial part cleaning up the obvious places and let
> Christian merge it and then we can figure out what (if anything) to do with
> remaining bd_inode uses in fs/buffer.c etc. E.g. what Al suggests with
> bd_mapping makes sense to me but I didn't check what's left after your
> initial patches...

FWIW, experimental on top of -next:
Al Viro (7):
      block_device: add a pointer to struct address_space (page cache of bdev)
      use ->bd_mapping instead of ->bd_inode->i_mapping
      grow_dev_folio(): we only want ->bd_inode->i_mapping there
      gfs2: more obvious initializations of mapping->host
      blkdev_write_iter(): saner way to get inode and bdev
      blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here...
      dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode)

Yu Kuai (4):
      ext4: remove block_device_ejected()
      block: move two helpers into bdev.c
      bcachefs: remove dead function bdev_sectors()
      block2mtd: prevent direct access of bd_inode	[slightly modified]

leaves only this:
block/bdev.c:60:        struct inode *inode = bdev->bd_inode;
block/bdev.c:137:       loff_t size = i_size_read(bdev->bd_inode);
block/bdev.c:144:       bdev->bd_inode->i_blkbits = blksize_bits(bsize);
block/bdev.c:158:       if (bdev->bd_inode->i_blkbits != blksize_bits(size)) {
block/bdev.c:160:               bdev->bd_inode->i_blkbits = blksize_bits(size);
block/bdev.c:415:       bdev->bd_inode = inode;
block/bdev.c:434:       i_size_write(bdev->bd_inode, (loff_t)sectors << SECTOR_SHIFT);
block/bdev.c:444:       bdev->bd_inode->i_rdev = dev;
block/bdev.c:445:       bdev->bd_inode->i_ino = dev;
block/bdev.c:446:       insert_inode_hash(bdev->bd_inode);
block/bdev.c:974:       bdev_file = alloc_file_pseudo_noaccount(bdev->bd_inode,
block/bdev.c:980:       ihold(bdev->bd_inode);
block/bdev.c:1257:      return !inode_unhashed(disk->part0->bd_inode);
block/bdev.c:1263:      return 1 << bdev->bd_inode->i_blkbits;
block/genhd.c:659:              remove_inode_hash(part->bd_inode);
block/genhd.c:1194:     iput(disk->part0->bd_inode);    /* frees the disk */
block/genhd.c:1384:     iput(disk->part0->bd_inode);
block/partitions/core.c:246:    iput(dev_to_bdev(dev)->bd_inode);
block/partitions/core.c:472:    remove_inode_hash(part->bd_inode);
block/partitions/core.c:658:            remove_inode_hash(part->bd_inode);
drivers/s390/block/dasd_ioctl.c:218:            block->gdp->part0->bd_inode->i_blkbits =
fs/buffer.c:192:        struct inode *bd_inode = bdev->bd_inode;
fs/buffer.c:1699:       struct inode *bd_inode = bdev->bd_inode;
fs/erofs/data.c:73:             buf->inode = sb->s_bdev->bd_inode;
fs/nilfs2/segment.c:2793:       inode_attach_wb(nilfs->ns_bdev->bd_inode, NULL);

I've got erofs patches that get rid of that instance; bdev.c is obviously priveleged
since it sees coallocated inode directly.  Other than those we have
	* 3 callers of remove_inode_hash()
	* 3 callers of iput()
	* one caller of inode_attach_wb() (nilfs2)
	* weird shit in DASD (redundant, that; incidentally, I don't see anything
	  that might prevent DASD format requested with mounted partitions on that
	  disk - and won't that be fun and joy for an admin to step into...)
	* two places in fs/buffer.c that want to convert block numbers to positions
	  in bytes.  Either the function itself or its caller has the block size
	  as argument; replacing that to passing block _shift_ instead of size
	  would reduce those two to ->bd_mapping.
And that's it.  iput() and remove_inode_hash() are obvious candidates for
helpers (internal to block/*; no exporting those, it's private to bdev.c,
genhd.c and paritions/core.c).

fs/buffer.c ones need a bit more code audit (not quite done with that), but
it looks at least plausible.  Which would leave us with whatever nilfs2 is
doing and that weirdness in dasd_format() (why set ->i_blkbits but not
->i_blocksize?  why not use set_blocksize(), for that matter?  where the
hell is check for exclusive open?)
Christian Brauner April 11, 2024, 11:56 a.m. UTC | #19
On Wed, Apr 10, 2024 at 11:34:43PM +0100, Al Viro wrote:
> On Wed, Apr 10, 2024 at 12:59:11PM +0200, Jan Kara wrote:
> 
> > I agree with Christian and Al - and I think I've expressed that already in
> > the previous version of the series [1] but I guess I was not explicit
> > enough :). I think the initial part of the series (upto patch 21, perhaps
> > excluding patch 20) is a nice cleanup but the latter part playing with
> > stashing struct file is not an improvement and seems pointless to me. So
> > I'd separate the initial part cleaning up the obvious places and let
> > Christian merge it and then we can figure out what (if anything) to do with
> > remaining bd_inode uses in fs/buffer.c etc. E.g. what Al suggests with
> > bd_mapping makes sense to me but I didn't check what's left after your
> > initial patches...
> 
> FWIW, experimental on top of -next:

Ok, let's move forward with this. I've applied the first 19 patches.
Patch 20 is the start of what we all disliked. 21 is clearly a bugfix
for current code so that'll go separately from the rest. I've replaced
open-code f_mapping access with file_mapping(). The symmetry between
file_inode() and file_mapping() is quite nice.

Al, your idea to switch erofs away from buf->inode can go on top of what
Yu did imho. There's no real reason to throw it away imho.

I've exported bdev_mapping() because it really makes the btrfs change a
lot slimmer and we don't need to care about messing with a lot of that
code. I didn't care about making it static inline because that might've
meant we need to move other stuff into the header as well. Imho, it's
not that important but if it's a big deal to any of you just do the
changes on top of it, please.

Pushed to
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.super

If I hear no objections that'll show up in -next tomorrow. Al, would be
nice if you could do your changes on top of this, please.
Al Viro April 11, 2024, 2:04 p.m. UTC | #20
On Thu, Apr 11, 2024 at 01:56:03PM +0200, Christian Brauner wrote:
> On Wed, Apr 10, 2024 at 11:34:43PM +0100, Al Viro wrote:
> > On Wed, Apr 10, 2024 at 12:59:11PM +0200, Jan Kara wrote:
> > 
> > > I agree with Christian and Al - and I think I've expressed that already in
> > > the previous version of the series [1] but I guess I was not explicit
> > > enough :). I think the initial part of the series (upto patch 21, perhaps
> > > excluding patch 20) is a nice cleanup but the latter part playing with
> > > stashing struct file is not an improvement and seems pointless to me. So
> > > I'd separate the initial part cleaning up the obvious places and let
> > > Christian merge it and then we can figure out what (if anything) to do with
> > > remaining bd_inode uses in fs/buffer.c etc. E.g. what Al suggests with
> > > bd_mapping makes sense to me but I didn't check what's left after your
> > > initial patches...
> > 
> > FWIW, experimental on top of -next:
> 
> Ok, let's move forward with this. I've applied the first 19 patches.
> Patch 20 is the start of what we all disliked. 21 is clearly a bugfix
> for current code so that'll go separately from the rest. I've replaced
> open-code f_mapping access with file_mapping(). The symmetry between
> file_inode() and file_mapping() is quite nice.
> 
> Al, your idea to switch erofs away from buf->inode can go on top of what
> Yu did imho. There's no real reason to throw it away imho.
> 
> I've exported bdev_mapping() because it really makes the btrfs change a
> lot slimmer and we don't need to care about messing with a lot of that
> code. I didn't care about making it static inline because that might've
> meant we need to move other stuff into the header as well. Imho, it's
> not that important but if it's a big deal to any of you just do the
> changes on top of it, please.
> 
> Pushed to
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.super
> 
> If I hear no objections that'll show up in -next tomorrow. Al, would be
> nice if you could do your changes on top of this, please.

Objection: start with adding bdev->bd_mapping, next convert the really
obvious instances to it and most of this series becomes not needed at
all.

Really.  There is no need whatsoever to push struct file down all those
paths.

And yes, erofs and buffer.c stuff belongs on top of that, no arguments here.
Al Viro April 11, 2024, 2:49 p.m. UTC | #21
On Thu, Apr 11, 2024 at 03:04:09PM +0100, Al Viro wrote:
> > lot slimmer and we don't need to care about messing with a lot of that
> > code. I didn't care about making it static inline because that might've
> > meant we need to move other stuff into the header as well. Imho, it's
> > not that important but if it's a big deal to any of you just do the
> > changes on top of it, please.
> > 
> > Pushed to
> > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.super
> > 
> > If I hear no objections that'll show up in -next tomorrow. Al, would be
> > nice if you could do your changes on top of this, please.
> 
> Objection: start with adding bdev->bd_mapping, next convert the really
> obvious instances to it and most of this series becomes not needed at
> all.
> 
> Really.  There is no need whatsoever to push struct file down all those
> paths.
> 
> And yes, erofs and buffer.c stuff belongs on top of that, no arguments here.

FWIW, here's what you get if this is done in such order:

block/bdev.c                           | 31 ++++++++++++++++++++++---------
block/blk-zoned.c                      |  4 ++--
block/fops.c                           |  4 ++--
block/genhd.c                          |  2 +-
block/ioctl.c                          | 14 ++++++--------
block/partitions/core.c                |  2 +-
drivers/md/bcache/super.c              |  2 +-
drivers/md/dm-vdo/dm-vdo-target.c      |  4 ++--
drivers/md/dm-vdo/indexer/io-factory.c |  2 +-
drivers/mtd/devices/block2mtd.c        |  6 ++++--
drivers/scsi/scsicam.c                 |  2 +-
fs/bcachefs/util.h                     |  5 -----
fs/btrfs/disk-io.c                     |  6 +++---
fs/btrfs/volumes.c                     |  2 +-
fs/btrfs/zoned.c                       |  2 +-
fs/buffer.c                            | 10 +++++-----
fs/cramfs/inode.c                      |  2 +-
fs/ext4/dir.c                          |  2 +-
fs/ext4/ext4_jbd2.c                    |  2 +-
fs/ext4/super.c                        | 24 +++---------------------
fs/gfs2/glock.c                        |  2 +-
fs/gfs2/ops_fstype.c                   |  2 +-
fs/jbd2/journal.c                      |  2 +-
include/linux/blk_types.h              |  1 +
include/linux/blkdev.h                 | 12 ++----------
include/linux/buffer_head.h            |  4 ++--
include/linux/jbd2.h                   |  4 ++--
27 files changed, 69 insertions(+), 86 deletions(-)

The bulk of the changes is straight replacements of foo->bd_inode->i_mapping
with foo->bd_mapping.  That's completely mechanical and that takes out most
of the bd_inode uses.  Anyway, patches in followups
Matthew Wilcox April 11, 2024, 3:22 p.m. UTC | #22
On Sun, Apr 07, 2024 at 06:11:19AM +0100, Al Viro wrote:
> On Sun, Apr 07, 2024 at 05:57:58AM +0100, Al Viro wrote:
> 
> > PS: in grow_dev_folio() we probably want
> > 	struct address_space *mapping = bdev->bd_inode->i_mapping;
> > instead of
> > 	struct inode *inode = bdev->bd_inode;
> > as one of the preliminary chunks.
> > FWIW, it really looks like address_space (== page cache of block device,
> > not an unreasonably candidate for primitive) and block size (well,
> > logarithm thereof) cover the majority of what remains, with device
> > size possibly being (remote) third...
> 
> Incidentally, how painful would it be to switch __bread_gfp() and __bread()
> to passing *logarithm* of block size instead of block size?  And possibly
> supply the same to clean_bdev_aliases()...

I've looked at it because blksize_bits() was pretty horrid.  But I got
scared because I couldn't figure out how to make unconverted places
fail to compile, without doing something ugly like

-__bread(struct block_device *bdev, sector_t block, unsigned size)
+__bread(unsigned shift, struct block_device *bdev, sector_t block)

I assume you're not talking about changing bh->b_size, just passing in
the log and comparing bh->b_size to 1<<shift?
Yu Kuai April 12, 2024, 1:38 a.m. UTC | #23
Hi,

在 2024/04/11 22:49, Al Viro 写道:
> On Thu, Apr 11, 2024 at 03:04:09PM +0100, Al Viro wrote:
>>> lot slimmer and we don't need to care about messing with a lot of that
>>> code. I didn't care about making it static inline because that might've
>>> meant we need to move other stuff into the header as well. Imho, it's
>>> not that important but if it's a big deal to any of you just do the
>>> changes on top of it, please.
>>>
>>> Pushed to
>>> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.super
>>>
>>> If I hear no objections that'll show up in -next tomorrow. Al, would be
>>> nice if you could do your changes on top of this, please.
>>
>> Objection: start with adding bdev->bd_mapping, next convert the really
>> obvious instances to it and most of this series becomes not needed at
>> all.
>>
>> Really.  There is no need whatsoever to push struct file down all those
>> paths.
>>

There really is a long history here. The beginning of the attempt to try
removing the filed 'bd_inode' is that I want to make a room from the
first cacheline(64 bytes) for a new 'unsigned long flags' field because
we keep adding new 'bool xxx' field [1]. And adding a new 'bd_mapping'
field will make that impossible.

I do like the idea of passing 'bd_mapping' here, however, will it be
considered to expose bdev_mapping() for slow path, or to pass in bd_file
and get it by 'f_mapping' for fast path? So that a new field in the
first cacheline will still be possible, other than that there will be
more code change, I don't see any difference for performance.

Thanks,
Kuai

[1] 
https://lore.kernel.org/all/20231122103103.1104589-3-yukuai1@huaweicloud.com/
>> And yes, erofs and buffer.c stuff belongs on top of that, no arguments here.
> 
> FWIW, here's what you get if this is done in such order:
> 
> block/bdev.c                           | 31 ++++++++++++++++++++++---------
> block/blk-zoned.c                      |  4 ++--
> block/fops.c                           |  4 ++--
> block/genhd.c                          |  2 +-
> block/ioctl.c                          | 14 ++++++--------
> block/partitions/core.c                |  2 +-
> drivers/md/bcache/super.c              |  2 +-
> drivers/md/dm-vdo/dm-vdo-target.c      |  4 ++--
> drivers/md/dm-vdo/indexer/io-factory.c |  2 +-
> drivers/mtd/devices/block2mtd.c        |  6 ++++--
> drivers/scsi/scsicam.c                 |  2 +-
> fs/bcachefs/util.h                     |  5 -----
> fs/btrfs/disk-io.c                     |  6 +++---
> fs/btrfs/volumes.c                     |  2 +-
> fs/btrfs/zoned.c                       |  2 +-
> fs/buffer.c                            | 10 +++++-----
> fs/cramfs/inode.c                      |  2 +-
> fs/ext4/dir.c                          |  2 +-
> fs/ext4/ext4_jbd2.c                    |  2 +-
> fs/ext4/super.c                        | 24 +++---------------------
> fs/gfs2/glock.c                        |  2 +-
> fs/gfs2/ops_fstype.c                   |  2 +-
> fs/jbd2/journal.c                      |  2 +-
> include/linux/blk_types.h              |  1 +
> include/linux/blkdev.h                 | 12 ++----------
> include/linux/buffer_head.h            |  4 ++--
> include/linux/jbd2.h                   |  4 ++--
> 27 files changed, 69 insertions(+), 86 deletions(-)
> 
> The bulk of the changes is straight replacements of foo->bd_inode->i_mapping
> with foo->bd_mapping.  That's completely mechanical and that takes out most
> of the bd_inode uses.  Anyway, patches in followups
> 
> .
>
Al Viro April 12, 2024, 2:59 a.m. UTC | #24
On Fri, Apr 12, 2024 at 09:38:16AM +0800, Yu Kuai wrote:

> There really is a long history here. The beginning of the attempt to try
> removing the filed 'bd_inode' is that I want to make a room from the
> first cacheline(64 bytes) for a new 'unsigned long flags' field because
> we keep adding new 'bool xxx' field [1]. And adding a new 'bd_mapping'
> field will make that impossible.

Why does it need to be unsigned long?  dev_t is 32bit; what you need
is to keep this
        bool                    bd_read_only;   /* read-only policy */
	u8                      bd_partno;
	bool                    bd_write_holder;
	bool                    bd_has_submit_bio;

from blowing past u32.  Sure, you can't use test_bit() et.al. with u16,
but what's wrong with explicit bitwise operations?  You need some protection
for multiple writers, but you need it anyway - e.g. this
        if (bdev->bd_disk->fops->set_read_only) {
		ret = bdev->bd_disk->fops->set_read_only(bdev, n);
		if (ret)
			return ret;
	}
	bdev->bd_read_only = n;
will need the exclusion over the entire "call ->set_read_only() and set
the flag", not just for setting the flag itself.

And yes, it's a real-world bug - two threads calling BLKROSET on the
same opened file can race, with inconsistency between the flag and
whatever state ->set_read_only() modifies.

AFAICS, ->bd_write_holder is (apparently) relying upon ->open_mutex.
Whether it would be a good solution for ->bd_read_only is a question
to block folks, but some exclusion is obviously needed.

Let's sort that out, rather than papering it over with set_bit() et.al.
Al Viro April 12, 2024, 4:41 a.m. UTC | #25
On Fri, Apr 12, 2024 at 03:59:10AM +0100, Al Viro wrote:
> for multiple writers, but you need it anyway - e.g. this
>         if (bdev->bd_disk->fops->set_read_only) {
> 		ret = bdev->bd_disk->fops->set_read_only(bdev, n);
> 		if (ret)
> 			return ret;
> 	}
> 	bdev->bd_read_only = n;
> will need the exclusion over the entire "call ->set_read_only() and set
> the flag", not just for setting the flag itself.
> 
> And yes, it's a real-world bug - two threads calling BLKROSET on the
> same opened file can race, with inconsistency between the flag and
> whatever state ->set_read_only() modifies.

BLKROSET is CAP_SYS_ADMIN-only, so it's not a CVE fodder; the sky
is not falling.  The bug is real, though.

I see Christoph's postings in that thread; the thing is, it's not
just the flags that need to be protected.  If we end up deciding
that serialization for different flags should not be tied to the
same thing (which is reasonable - e.g. md_set_read_only() is not
something you want to shove under existing lock), I would still
suggest something along the lines of

	u32 __bd_flags;		// partno and flags

static inline u8 bd_partno(struct block_device *bdev)
{
	return bdev->__bd_flags & 0xff;
}

static void bd_set_flag(struct block_device *bdev, int flag)
{
	u32 v = bdev->__bd_flags;

	for (;;) {
		u32 w = cmpxchg(&bdev->__bd_flags, v, v | (1 << (flag + 8)));
		if (w == v)
			return;
		v = w;
	}
}

and similar for bd_clear_flag().  Changes of ->bd_partno never
happen - we set it at allocation time and never modify the sucker.

This is orthogonal to BLKROSET/BLKROSET exclusion, converting
->bd_inode accesses, etc.

Christoph, do you have any problems with that approach?

COMPLETELY UNTESTED patch along those lines follows; if it works,
it would need to be carved up.  And I would probably switch the
places where we do if (bdev->bd_partno) to if (bdev_is_partition(bdev)),
for better readability.

I'd converted only those 3 flags; again, this is just an untested
illustration to the above.

diff --git a/block/bdev.c b/block/bdev.c
index 7a5f611c3d2e..9aa23620fe92 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -411,13 +411,11 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
 	mutex_init(&bdev->bd_fsfreeze_mutex);
 	spin_lock_init(&bdev->bd_size_lock);
 	mutex_init(&bdev->bd_holder_lock);
-	bdev->bd_partno = partno;
+	bdev->__bd_flags = partno;
 	bdev->bd_inode = inode;
 	bdev->bd_queue = disk->queue;
-	if (partno)
-		bdev->bd_has_submit_bio = disk->part0->bd_has_submit_bio;
-	else
-		bdev->bd_has_submit_bio = false;
+	if (partno && bdev_test_flag(disk->part0, BD_HAS_SUBMIT_BIO))
+		bdev_set_flag(bdev, BD_HAS_SUBMIT_BIO);
 	bdev->bd_stats = alloc_percpu(struct disk_stats);
 	if (!bdev->bd_stats) {
 		iput(inode);
@@ -624,7 +622,7 @@ static void bd_end_claim(struct block_device *bdev, void *holder)
 		bdev->bd_holder = NULL;
 		bdev->bd_holder_ops = NULL;
 		mutex_unlock(&bdev->bd_holder_lock);
-		if (bdev->bd_write_holder)
+		if (bdev_test_flag(bdev, BD_WRITE_HOLDER))
 			unblock = true;
 	}
 	if (!whole->bd_holders)
@@ -640,7 +638,7 @@ static void bd_end_claim(struct block_device *bdev, void *holder)
 	 */
 	if (unblock) {
 		disk_unblock_events(bdev->bd_disk);
-		bdev->bd_write_holder = false;
+		bdev_clear_flag(bdev, BD_WRITE_HOLDER);
 	}
 }
 
@@ -892,9 +890,10 @@ int bdev_open(struct block_device *bdev, blk_mode_t mode, void *holder,
 		 * writeable reference is too fragile given the way @mode is
 		 * used in blkdev_get/put().
 		 */
-		if ((mode & BLK_OPEN_WRITE) && !bdev->bd_write_holder &&
+		if ((mode & BLK_OPEN_WRITE) &&
+		    !bdev_test_flag(bdev, BD_WRITE_HOLDER) &&
 		    (disk->event_flags & DISK_EVENT_FLAG_BLOCK_ON_EXCL_WRITE)) {
-			bdev->bd_write_holder = true;
+			bdev_set_flag(bdev, BD_WRITE_HOLDER);
 			unblock_events = false;
 		}
 	}
diff --git a/block/blk-core.c b/block/blk-core.c
index a16b5abdbbf5..6a28b6b7062a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -615,7 +615,7 @@ static void __submit_bio(struct bio *bio)
 	if (unlikely(!blk_crypto_bio_prep(&bio)))
 		return;
 
-	if (!bio->bi_bdev->bd_has_submit_bio) {
+	if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
 		blk_mq_submit_bio(bio);
 	} else if (likely(bio_queue_enter(bio) == 0)) {
 		struct gendisk *disk = bio->bi_bdev->bd_disk;
@@ -723,7 +723,7 @@ void submit_bio_noacct_nocheck(struct bio *bio)
 	 */
 	if (current->bio_list)
 		bio_list_add(&current->bio_list[0], bio);
-	else if (!bio->bi_bdev->bd_has_submit_bio)
+	else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
 		__submit_bio_noacct_mq(bio);
 	else
 		__submit_bio_noacct(bio);
@@ -759,7 +759,7 @@ void submit_bio_noacct(struct bio *bio)
 	if (!bio_flagged(bio, BIO_REMAPPED)) {
 		if (unlikely(bio_check_eod(bio)))
 			goto end_io;
-		if (bdev->bd_partno && unlikely(blk_partition_remap(bio)))
+		if (bdev_partno(bdev) && unlikely(blk_partition_remap(bio)))
 			goto end_io;
 	}
 
@@ -989,7 +989,7 @@ void update_io_ticks(struct block_device *part, unsigned long now, bool end)
 		if (likely(try_cmpxchg(&part->bd_stamp, &stamp, now)))
 			__part_stat_add(part, io_ticks, end ? now - stamp : 1);
 	}
-	if (part->bd_partno) {
+	if (bdev_partno(part)) {
 		part = bdev_whole(part);
 		goto again;
 	}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 32afb87efbd0..1c4bd891fd6d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -92,7 +92,7 @@ static bool blk_mq_check_inflight(struct request *rq, void *priv)
 	struct mq_inflight *mi = priv;
 
 	if (rq->part && blk_do_io_stat(rq) &&
-	    (!mi->part->bd_partno || rq->part == mi->part) &&
+	    (!bdev_partno(mi->part) || rq->part == mi->part) &&
 	    blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
 		mi->inflight[rq_data_dir(rq)]++;
 
diff --git a/block/early-lookup.c b/block/early-lookup.c
index 3effbd0d35e9..3fb57f7d2b12 100644
--- a/block/early-lookup.c
+++ b/block/early-lookup.c
@@ -78,7 +78,7 @@ static int __init devt_from_partuuid(const char *uuid_str, dev_t *devt)
 		 * to the partition number found by UUID.
 		 */
 		*devt = part_devt(dev_to_disk(dev),
-				  dev_to_bdev(dev)->bd_partno + offset);
+				  bdev_partno(dev_to_bdev(dev)) + offset);
 	} else {
 		*devt = dev->devt;
 	}
diff --git a/block/genhd.c b/block/genhd.c
index bb29a68e1d67..19cd1a31fa80 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -413,7 +413,8 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
 	elevator_init_mq(disk->queue);
 
 	/* Mark bdev as having a submit_bio, if needed */
-	disk->part0->bd_has_submit_bio = disk->fops->submit_bio != NULL;
+	if (disk->fops->submit_bio)
+		bdev_set_flag(disk->part0, BD_HAS_SUBMIT_BIO);
 
 	/*
 	 * If the driver provides an explicit major number it also must provide
diff --git a/block/ioctl.c b/block/ioctl.c
index 0c76137adcaa..be173e4ff43d 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -402,7 +402,10 @@ static int blkdev_roset(struct block_device *bdev, unsigned cmd,
 		if (ret)
 			return ret;
 	}
-	bdev->bd_read_only = n;
+	if (n)
+		bdev_set_flag(bdev, BD_READ_ONLY);
+	else
+		bdev_clear_flag(bdev, BD_READ_ONLY);
 	return 0;
 }
 
diff --git a/block/partitions/core.c b/block/partitions/core.c
index b11e88c82c8c..edd5309dc4ba 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -173,7 +173,7 @@ static struct parsed_partitions *check_partition(struct gendisk *hd)
 static ssize_t part_partition_show(struct device *dev,
 				   struct device_attribute *attr, char *buf)
 {
-	return sprintf(buf, "%d\n", dev_to_bdev(dev)->bd_partno);
+	return sprintf(buf, "%d\n", bdev_partno(dev_to_bdev(dev)));
 }
 
 static ssize_t part_start_show(struct device *dev,
@@ -250,7 +250,7 @@ static int part_uevent(const struct device *dev, struct kobj_uevent_env *env)
 {
 	const struct block_device *part = dev_to_bdev(dev);
 
-	add_uevent_var(env, "PARTN=%u", part->bd_partno);
+	add_uevent_var(env, "PARTN=%u", bdev_partno(part));
 	if (part->bd_meta_info && part->bd_meta_info->volname[0])
 		add_uevent_var(env, "PARTNAME=%s", part->bd_meta_info->volname);
 	return 0;
@@ -267,7 +267,7 @@ void drop_partition(struct block_device *part)
 {
 	lockdep_assert_held(&part->bd_disk->open_mutex);
 
-	xa_erase(&part->bd_disk->part_tbl, part->bd_partno);
+	xa_erase(&part->bd_disk->part_tbl, bdev_partno(part));
 	kobject_put(part->bd_holder_dir);
 
 	device_del(&part->bd_device);
@@ -338,8 +338,8 @@ static struct block_device *add_partition(struct gendisk *disk, int partno,
 	pdev->parent = ddev;
 
 	/* in consecutive minor range? */
-	if (bdev->bd_partno < disk->minors) {
-		devt = MKDEV(disk->major, disk->first_minor + bdev->bd_partno);
+	if (bdev_partno(bdev) < disk->minors) {
+		devt = MKDEV(disk->major, disk->first_minor + bdev_partno(bdev));
 	} else {
 		err = blk_alloc_ext_minor();
 		if (err < 0)
@@ -404,7 +404,7 @@ static bool partition_overlaps(struct gendisk *disk, sector_t start,
 
 	rcu_read_lock();
 	xa_for_each_start(&disk->part_tbl, idx, part, 1) {
-		if (part->bd_partno != skip_partno &&
+		if (bdev_partno(part) != skip_partno &&
 		    start < part->bd_start_sect + bdev_nr_sectors(part) &&
 		    start + length > part->bd_start_sect) {
 			overlap = true;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cb1526ec44b5..bbbcbb36fb6e 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -45,10 +45,7 @@ struct block_device {
 	struct request_queue *	bd_queue;
 	struct disk_stats __percpu *bd_stats;
 	unsigned long		bd_stamp;
-	bool			bd_read_only;	/* read-only policy */
-	u8			bd_partno;
-	bool			bd_write_holder;
-	bool			bd_has_submit_bio;
+	u32			__bd_flags;	// partition number + flags
 	dev_t			bd_dev;
 	struct inode		*bd_inode;	/* will die */
 
@@ -86,6 +83,12 @@ struct block_device {
 #define bdev_kobj(_bdev) \
 	(&((_bdev)->bd_device.kobj))
 
+enum {
+	BD_READ_ONLY,		// read-only policy
+	BD_WRITE_HOLDER,
+	BD_HAS_SUBMIT_BIO
+};
+
 /*
  * Block error status values.  See block/blk-core:blk_errors for the details.
  * Alpha cannot write a byte atomically, so we need to use 32-bit value.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c3e8f7cf96be..d556cec9224b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -720,15 +720,51 @@ void invalidate_disk(struct gendisk *disk);
 void set_disk_ro(struct gendisk *disk, bool read_only);
 void disk_uevent(struct gendisk *disk, enum kobject_action action);
 
+static inline u8 bdev_partno(const struct block_device *bdev)
+{
+	return bdev->__bd_flags & 0xff;
+}
+
+static inline bool bdev_test_flag(const struct block_device *bdev, int flag)
+{
+	return bdev->__bd_flags & (1 << (flag + 8));
+}
+
+static inline void bdev_set_flag(struct block_device *bdev, int flag)
+{
+	u32 v = bdev->__bd_flags;
+
+	for (;;) {
+		u32 w = cmpxchg(&bdev->__bd_flags, v, v | (1 << (flag + 8)));
+
+		if (v == w)
+			return;
+		v = w;
+	}
+}
+
+static inline void bdev_clear_flag(struct block_device *bdev, int flag)
+{
+	u32 v = bdev->__bd_flags;
+
+	for (;;) {
+		u32 w = cmpxchg(&bdev->__bd_flags, v, v & ~(1 << (flag + 8)));
+
+		if (v == w)
+			return;
+		v = w;
+	}
+}
+
 static inline int get_disk_ro(struct gendisk *disk)
 {
-	return disk->part0->bd_read_only ||
+	return bdev_test_flag(disk->part0, BD_READ_ONLY) ||
 		test_bit(GD_READ_ONLY, &disk->state);
 }
 
 static inline int bdev_read_only(struct block_device *bdev)
 {
-	return bdev->bd_read_only || get_disk_ro(bdev->bd_disk);
+	return bdev_test_flag(bdev, BD_READ_ONLY) || get_disk_ro(bdev->bd_disk);
 }
 
 bool set_capacity_and_notify(struct gendisk *disk, sector_t size);
@@ -1095,7 +1131,7 @@ static inline int sb_issue_zeroout(struct super_block *sb, sector_t block,
 
 static inline bool bdev_is_partition(struct block_device *bdev)
 {
-	return bdev->bd_partno;
+	return bdev_partno(bdev) != 0;
 }
 
 enum blk_default_limits {
diff --git a/include/linux/part_stat.h b/include/linux/part_stat.h
index abeba356bc3f..ec7eb365b152 100644
--- a/include/linux/part_stat.h
+++ b/include/linux/part_stat.h
@@ -59,7 +59,7 @@ static inline void part_stat_set_all(struct block_device *part, int value)
 
 #define part_stat_add(part, field, addnd)	do {			\
 	__part_stat_add((part), field, addnd);				\
-	if ((part)->bd_partno)						\
+	if (bdev_partno(part))						\
 		__part_stat_add(bdev_whole(part), field, addnd);	\
 } while (0)
 
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 552738f14275..e05583e54fa5 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -966,13 +966,13 @@ char *bdev_name(char *buf, char *end, struct block_device *bdev,
 
 	hd = bdev->bd_disk;
 	buf = string(buf, end, hd->disk_name, spec);
-	if (bdev->bd_partno) {
+	if (bdev_partno(bdev)) {
 		if (isdigit(hd->disk_name[strlen(hd->disk_name)-1])) {
 			if (buf < end)
 				*buf = 'p';
 			buf++;
 		}
-		buf = number(buf, end, bdev->bd_partno, spec);
+		buf = number(buf, end, bdev_partno(bdev), spec);
 	}
 	return buf;
 }
Al Viro April 12, 2024, 7:13 a.m. UTC | #26
On Fri, Apr 12, 2024 at 05:41:16AM +0100, Al Viro wrote:

> Christoph, do you have any problems with that approach?
> 
> COMPLETELY UNTESTED patch along those lines follows; if it works,
> it would need to be carved up.  And I would probably switch the
> places where we do if (bdev->bd_partno) to if (bdev_is_partition(bdev)),
> for better readability.

See 
git://git.kernel.org:/pub/scm/linux/kernel/git/viro/vfs.git bd_flags
for a carve-up
Christian Brauner April 12, 2024, 9:21 a.m. UTC | #27
On Thu, Apr 11, 2024 at 03:04:09PM +0100, Al Viro wrote:
> On Thu, Apr 11, 2024 at 01:56:03PM +0200, Christian Brauner wrote:
> > On Wed, Apr 10, 2024 at 11:34:43PM +0100, Al Viro wrote:
> > > On Wed, Apr 10, 2024 at 12:59:11PM +0200, Jan Kara wrote:
> > > 
> > > > I agree with Christian and Al - and I think I've expressed that already in
> > > > the previous version of the series [1] but I guess I was not explicit
> > > > enough :). I think the initial part of the series (upto patch 21, perhaps
> > > > excluding patch 20) is a nice cleanup but the latter part playing with
> > > > stashing struct file is not an improvement and seems pointless to me. So
> > > > I'd separate the initial part cleaning up the obvious places and let
> > > > Christian merge it and then we can figure out what (if anything) to do with
> > > > remaining bd_inode uses in fs/buffer.c etc. E.g. what Al suggests with
> > > > bd_mapping makes sense to me but I didn't check what's left after your
> > > > initial patches...
> > > 
> > > FWIW, experimental on top of -next:
> > 
> > Ok, let's move forward with this. I've applied the first 19 patches.
> > Patch 20 is the start of what we all disliked. 21 is clearly a bugfix
> > for current code so that'll go separately from the rest. I've replaced
> > open-code f_mapping access with file_mapping(). The symmetry between
> > file_inode() and file_mapping() is quite nice.
> > 
> > Al, your idea to switch erofs away from buf->inode can go on top of what
> > Yu did imho. There's no real reason to throw it away imho.
> > 
> > I've exported bdev_mapping() because it really makes the btrfs change a
> > lot slimmer and we don't need to care about messing with a lot of that
> > code. I didn't care about making it static inline because that might've
> > meant we need to move other stuff into the header as well. Imho, it's
> > not that important but if it's a big deal to any of you just do the
> > changes on top of it, please.
> > 
> > Pushed to
> > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.super
> > 
> > If I hear no objections that'll show up in -next tomorrow. Al, would be
> > nice if you could do your changes on top of this, please.
> 
> Objection: start with adding bdev->bd_mapping, next convert the really
> obvious instances to it and most of this series becomes not needed at
> all.
> 
> Really.  There is no need whatsoever to push struct file down all those
> paths.

Your series just replaces bd_inode in struct block_device with
bd_mapping. In a lot of places we do have immediate access to the bdev
file without changing any calling conventions whatsoever. IMO it's
perfectly fine to just use file_mapping() there. Sure, let's use
bdev_mapping() in instances like btrfs where we'd otherwise have to
change function signatures I'm not opposed to that. But there's no good
reason to just replace everything with bdev->bd_mapping access. And
really, why keep that thing in struct block_device when we can avoid it.

> 
> And yes, erofs and buffer.c stuff belongs on top of that, no arguments here.
Al Viro April 12, 2024, 11:29 a.m. UTC | #28
On Fri, Apr 12, 2024 at 11:21:08AM +0200, Christian Brauner wrote:

> Your series just replaces bd_inode in struct block_device with
> bd_mapping. In a lot of places we do have immediate access to the bdev
> file without changing any calling conventions whatsoever. IMO it's
> perfectly fine to just use file_mapping() there. Sure, let's use
> bdev_mapping() in instances like btrfs where we'd otherwise have to
> change function signatures I'm not opposed to that. But there's no good
> reason to just replace everything with bdev->bd_mapping access. And
> really, why keep that thing in struct block_device when we can avoid it.

Because having to have struct file around in the places where we want to
get to page cache of block device fast is often inconvenient (see fs/buffer.c,
if nothing else).

It also simplifies the hell out of the patch series - it's one obviously
safe automatic change in a single commit.

And AFAICS the flags-related rationale can be dealt with in a much simpler
way - see #bf_flags in my tree.
Christian Brauner April 13, 2024, 3:25 p.m. UTC | #29
On Fri, Apr 12, 2024 at 12:29:19PM +0100, Al Viro wrote:
> On Fri, Apr 12, 2024 at 11:21:08AM +0200, Christian Brauner wrote:
> 
> > Your series just replaces bd_inode in struct block_device with
> > bd_mapping. In a lot of places we do have immediate access to the bdev
> > file without changing any calling conventions whatsoever. IMO it's
> > perfectly fine to just use file_mapping() there. Sure, let's use
> > bdev_mapping() in instances like btrfs where we'd otherwise have to
> > change function signatures I'm not opposed to that. But there's no good
> > reason to just replace everything with bdev->bd_mapping access. And
> > really, why keep that thing in struct block_device when we can avoid it.
> 
> Because having to have struct file around in the places where we want to
> get to page cache of block device fast is often inconvenient (see fs/buffer.c,
> if nothing else).

Yes, agreed. But my point is why can't we expose bdev_mapping() for
exactly that purpose without having to have that bd_mapping member in
struct block_device? We don't want to trade bd_inode for bd_mapping in
that struct imho. IOW, if we can avoid bloating struct block device with
additional members then we should do that. Is there some performance
concern that I'm missing and if so are there numbers to back this?

> It also simplifies the hell out of the patch series - it's one obviously
> safe automatic change in a single commit.

It's trivial to fold the simple file_mapping() conversion into a single
patch as well. It's a pure artifact of splitting the patches per
subsystem/driver. That's just because people have wildly different
opinions on how to do such conversion. But really, that can be trivially
dealt with.

> And AFAICS the flags-related rationale can be dealt with in a much simpler
> way - see #bf_flags in my tree.

That's certainly worth doing independent of this discussion.
Al Viro April 15, 2024, 8:45 p.m. UTC | #30
On Sat, Apr 13, 2024 at 05:25:01PM +0200, Christian Brauner wrote:

> > It also simplifies the hell out of the patch series - it's one obviously
> > safe automatic change in a single commit.
> 
> It's trivial to fold the simple file_mapping() conversion into a single
> patch as well.

... after a bunch of patches that propagate struct file to places where
it has no business being.  Compared to a variant that doesn't need those
patches at all.

> It's a pure artifact of splitting the patches per
> subsystem/driver.

No, it is not.  ->bd_mapping conversion can be done without any
preliminaries.  Note that it doesn't need messing with bdev_read_folio(),
it doesn't need this journal->j_fs_dev_file thing, etc.

One thing I believe is completely wrong in this series is bdev_inode()
existence.  It (and equivalent use of file_inode() on struct file is
even worse) is papering over the real interface deficiencies.  And
extra file_inode() uses are just about impossible to catch ;-/

IMO we should *never* use file_inode() on opened block devices.
At all.  It's brittle, it's asking for trouble as soon as somebody
passes a normally opened struct file to one of the functions using it
and it papers over the missing primitives.

As for the space concerns...  With struct device embedded into those
things, it's not even funny.  Space within the first cacheline - sure,
but we can have a pointer in there just fine.
Al Viro April 16, 2024, 6:32 a.m. UTC | #31
On Mon, Apr 15, 2024 at 09:45:11PM +0100, Al Viro wrote:
> On Sat, Apr 13, 2024 at 05:25:01PM +0200, Christian Brauner wrote:
> 
> > > It also simplifies the hell out of the patch series - it's one obviously
> > > safe automatic change in a single commit.
> > 
> > It's trivial to fold the simple file_mapping() conversion into a single
> > patch as well.
> 
> ... after a bunch of patches that propagate struct file to places where
> it has no business being.  Compared to a variant that doesn't need those
> patches at all.
> 
> > It's a pure artifact of splitting the patches per
> > subsystem/driver.
> 
> No, it is not.  ->bd_mapping conversion can be done without any
> preliminaries.  Note that it doesn't need messing with bdev_read_folio(),
> it doesn't need this journal->j_fs_dev_file thing, etc.
> 
> One thing I believe is completely wrong in this series is bdev_inode()
> existence.  It (and equivalent use of file_inode() on struct file is
> even worse) is papering over the real interface deficiencies.  And
> extra file_inode() uses are just about impossible to catch ;-/
> 
> IMO we should *never* use file_inode() on opened block devices.
> At all.  It's brittle, it's asking for trouble as soon as somebody
> passes a normally opened struct file to one of the functions using it
> and it papers over the missing primitives.

BTW, speaking of the things where opened struct file would be a good
idea - set_blocksize() should take an opened struct file, and it should
have non-NULL ->private_data.

Changing block size under e.g. a mounted filesystem should never happen;
doing that is asking for serious breakage.

Looking through the current callers (mainline), most are OK (and easy
to switch).  However,
	
drivers/block/pktcdvd.c:2285:           set_blocksize(disk->part0, CD_FRAMESIZE);
drivers/block/pktcdvd.c:2529:   set_blocksize(file_bdev(bdev_file), CD_FRAMESIZE);
	Might be broken; pktcdvd.c being what it is...

drivers/md/bcache/super.c:2558: if (set_blocksize(file_bdev(bdev_file), 4096))
	Almost certainly broken; hit register_bcache() with pathname of a mounted
block device, and if the block size on filesystem in question is not 4K, the things
will get interesting.

fs/btrfs/volumes.c:485: ret = set_blocksize(bdev, BTRFS_BDEV_BLOCKSIZE);
	Some of the callers do not bother with exclusive open;
in particular, if btrfs_get_dev_args_from_path() ever gets a pathname
of a mounted device with something other than btrfs on it, it won't
be pretty.

kernel/power/swap.c:371:        res = set_blocksize(file_bdev(hib_resume_bdev_file), PAGE_SIZE);
kernel/power/swap.c:1577:               set_blocksize(file_bdev(hib_resume_bdev_file), PAGE_SIZE);
	Special cases (for obvious reasons); said that, why do we bother
with set_blocksize() on those anyway?
Jan Kara April 17, 2024, 1:43 p.m. UTC | #32
On Tue 16-04-24 07:32:53, Al Viro wrote:
> On Mon, Apr 15, 2024 at 09:45:11PM +0100, Al Viro wrote:
> > On Sat, Apr 13, 2024 at 05:25:01PM +0200, Christian Brauner wrote:
> > 
> > > > It also simplifies the hell out of the patch series - it's one obviously
> > > > safe automatic change in a single commit.
> > > 
> > > It's trivial to fold the simple file_mapping() conversion into a single
> > > patch as well.
> > 
> > ... after a bunch of patches that propagate struct file to places where
> > it has no business being.  Compared to a variant that doesn't need those
> > patches at all.
> > 
> > > It's a pure artifact of splitting the patches per
> > > subsystem/driver.
> > 
> > No, it is not.  ->bd_mapping conversion can be done without any
> > preliminaries.  Note that it doesn't need messing with bdev_read_folio(),
> > it doesn't need this journal->j_fs_dev_file thing, etc.
> > 
> > One thing I believe is completely wrong in this series is bdev_inode()
> > existence.  It (and equivalent use of file_inode() on struct file is
> > even worse) is papering over the real interface deficiencies.  And
> > extra file_inode() uses are just about impossible to catch ;-/
> > 
> > IMO we should *never* use file_inode() on opened block devices.
> > At all.  It's brittle, it's asking for trouble as soon as somebody
> > passes a normally opened struct file to one of the functions using it
> > and it papers over the missing primitives.
> 
> BTW, speaking of the things where opened struct file would be a good
> idea - set_blocksize() should take an opened struct file, and it should
> have non-NULL ->private_data.
> 
> Changing block size under e.g. a mounted filesystem should never happen;
> doing that is asking for serious breakage.
> 
> Looking through the current callers (mainline), most are OK (and easy
> to switch).  However,
> 	
> drivers/block/pktcdvd.c:2285:           set_blocksize(disk->part0, CD_FRAMESIZE);
> drivers/block/pktcdvd.c:2529:   set_blocksize(file_bdev(bdev_file), CD_FRAMESIZE);
> 	Might be broken; pktcdvd.c being what it is...
> 
> drivers/md/bcache/super.c:2558: if (set_blocksize(file_bdev(bdev_file), 4096))
> 	Almost certainly broken; hit register_bcache() with pathname of a mounted
> block device, and if the block size on filesystem in question is not 4K, the things
> will get interesting.

Agreed. Furthermore that set_blocksize() seems to be completely pointless
these days AFAICT because we use read_cache_page_gfp() to read in the data
from the device. Sure we may be creating more bhs per page than necessary
but who cares?

> fs/btrfs/volumes.c:485: ret = set_blocksize(bdev, BTRFS_BDEV_BLOCKSIZE);
> 	Some of the callers do not bother with exclusive open;
> in particular, if btrfs_get_dev_args_from_path() ever gets a pathname
> of a mounted device with something other than btrfs on it, it won't
> be pretty.

Yeah and frankly reading through btrfs_read_dev_super() I'm not sure which
code needs the block size set either. We use read_cache_page_gfp() for the
IO there as well.

								Honza
Al Viro April 17, 2024, 3:23 p.m. UTC | #33
On Wed, Apr 17, 2024 at 03:43:12PM +0200, Jan Kara wrote:

> > fs/btrfs/volumes.c:485: ret = set_blocksize(bdev, BTRFS_BDEV_BLOCKSIZE);
> > 	Some of the callers do not bother with exclusive open;
> > in particular, if btrfs_get_dev_args_from_path() ever gets a pathname
> > of a mounted device with something other than btrfs on it, it won't
> > be pretty.
> 
> Yeah and frankly reading through btrfs_read_dev_super() I'm not sure which
> code needs the block size set either. We use read_cache_page_gfp() for the
> IO there as well.

FWIW, I don't understand the use of invalidate_bdev() in btrfs_get_bdev_and_sb(),
especially when called from btrfs_get_dev_args_from_path() - what's the point
of evicting page cache before reading the on-disk superblock, when all we are
going to do with the data we get is scan through internal list of opened devices
for uuid, etc.  matches?

Could btrfs folks comment on that one?
Al Viro April 17, 2024, 8:45 p.m. UTC | #34
On Tue, Apr 16, 2024 at 07:32:53AM +0100, Al Viro wrote:

> kernel/power/swap.c:371:        res = set_blocksize(file_bdev(hib_resume_bdev_file), PAGE_SIZE);
> kernel/power/swap.c:1577:               set_blocksize(file_bdev(hib_resume_bdev_file), PAGE_SIZE);
> 	Special cases (for obvious reasons); said that, why do we bother
> with set_blocksize() on those anyway?

AFAICS, we really don't need either - all IO is done via hib_submit_io(),
which sets a single-page bio and feeds it to submit_bio{,_wait}()
directly.  We are *not* using the page cache of the block device
in question, let alone any buffer_head instances.

Could swsusp folks comment?
diff mbox series

Patch

diff --git a/block/bdev.c b/block/bdev.c
index 86db97b0709e..3d300823da6b 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -846,6 +846,101 @@  static void init_bdev_file(struct file *bdev_file, struct block_device *bdev,
 	bdev_file->private_data = holder;
 }
 
+/*
+ * If BLK_OPEN_WRITE_IOCTL is set then this is a historical quirk
+ * associated with the floppy driver where it has allowed ioctls if the
+ * file was opened for writing, but does not allow reads or writes.
+ * Make sure that this quirk is reflected in @f_flags.
+ *
+ * It can also happen if a block device is opened as O_RDWR | O_WRONLY.
+ */
+static unsigned blk_to_file_flags(blk_mode_t mode)
+{
+	unsigned int flags = 0;
+
+	if ((mode & (BLK_OPEN_READ | BLK_OPEN_WRITE)) ==
+	    (BLK_OPEN_READ | BLK_OPEN_WRITE))
+		flags |= O_RDWR;
+	else if (mode & BLK_OPEN_WRITE_IOCTL)
+		flags |= O_RDWR | O_WRONLY;
+	else if (mode & BLK_OPEN_WRITE)
+		flags |= O_WRONLY;
+	else if (mode & BLK_OPEN_READ)
+		flags |= O_RDONLY; /* homeopathic, because O_RDONLY is 0 */
+	else
+		WARN_ON_ONCE(true);
+
+	if (mode & BLK_OPEN_NDELAY)
+		flags |= O_NDELAY;
+
+	return flags;
+}
+
+static int __stash_bdev_file(struct block_device *bdev)
+{
+	struct inode *inode = bdev_inode(bdev);
+	unsigned int flags = blk_to_file_flags(BLK_OPEN_READ | BLK_OPEN_WRITE);
+	struct file *file;
+	static struct file_operations stash_fops;
+
+	file = inode->i_private;
+	if (!file) {
+		/*
+		 * This file is used for iomap/buffer_head for raw block_device
+		 * read/write operations to access block_device.
+		 */
+		file = alloc_file_pseudo_noaccount(bdev_inode(bdev),
+				blockdev_mnt, "", flags | O_LARGEFILE,
+				&stash_fops);
+
+		if (IS_ERR(file))
+			return PTR_ERR(file);
+
+		ihold(inode);
+		init_bdev_file(file, bdev, 0, NULL);
+
+		inode->i_private = file;
+		WARN_ON_ONCE(bdev->stash_count != 0);
+	}
+
+	bdev->stash_count++;
+	return 0;
+}
+
+static void __unstash_bdev_file(struct block_device *bdev)
+{
+
+	WARN_ON_ONCE(bdev->stash_count <= 0);
+	if (--bdev->stash_count == 0) {
+		struct inode *inode = bdev_inode(bdev);
+		struct file *file = inode->i_private;
+
+		inode->i_private = NULL;
+		fput(file);
+	}
+}
+
+static int stash_bdev_file(struct block_device *bdev)
+{
+	int ret = __stash_bdev_file(bdev);
+
+	if (ret || !bdev_is_partition(bdev))
+		return ret;
+
+	ret = __stash_bdev_file(bdev_whole(bdev));
+	if (ret)
+		__unstash_bdev_file(bdev);
+
+	return ret;
+}
+
+static void unstash_bdev_file(struct block_device *bdev)
+{
+	__unstash_bdev_file(bdev);
+	if (bdev_is_partition(bdev))
+		__unstash_bdev_file(bdev_whole(bdev));
+}
+
 /**
  * bdev_open - open a block device
  * @bdev: block device to open
@@ -891,12 +986,17 @@  int bdev_open(struct block_device *bdev, blk_mode_t mode, void *holder,
 	ret = -EBUSY;
 	if (!bdev_may_open(bdev, mode))
 		goto put_module;
+
+	ret = stash_bdev_file(bdev);
+	if (ret)
+		goto put_module;
+
 	if (bdev_is_partition(bdev))
 		ret = blkdev_get_part(bdev, mode);
 	else
 		ret = blkdev_get_whole(bdev, mode);
 	if (ret)
-		goto put_module;
+		goto unstash_bdev_file;
 	bdev_claim_write_access(bdev, mode);
 	if (holder) {
 		bd_finish_claiming(bdev, holder, hops);
@@ -922,6 +1022,9 @@  int bdev_open(struct block_device *bdev, blk_mode_t mode, void *holder,
 	init_bdev_file(bdev_file, bdev, mode, holder);
 
 	return 0;
+
+unstash_bdev_file:
+	unstash_bdev_file(bdev);
 put_module:
 	module_put(disk->fops->owner);
 abort_claiming:
@@ -932,36 +1035,6 @@  int bdev_open(struct block_device *bdev, blk_mode_t mode, void *holder,
 	return ret;
 }
 
-/*
- * If BLK_OPEN_WRITE_IOCTL is set then this is a historical quirk
- * associated with the floppy driver where it has allowed ioctls if the
- * file was opened for writing, but does not allow reads or writes.
- * Make sure that this quirk is reflected in @f_flags.
- *
- * It can also happen if a block device is opened as O_RDWR | O_WRONLY.
- */
-static unsigned blk_to_file_flags(blk_mode_t mode)
-{
-	unsigned int flags = 0;
-
-	if ((mode & (BLK_OPEN_READ | BLK_OPEN_WRITE)) ==
-	    (BLK_OPEN_READ | BLK_OPEN_WRITE))
-		flags |= O_RDWR;
-	else if (mode & BLK_OPEN_WRITE_IOCTL)
-		flags |= O_RDWR | O_WRONLY;
-	else if (mode & BLK_OPEN_WRITE)
-		flags |= O_WRONLY;
-	else if (mode & BLK_OPEN_READ)
-		flags |= O_RDONLY; /* homeopathic, because O_RDONLY is 0 */
-	else
-		WARN_ON_ONCE(true);
-
-	if (mode & BLK_OPEN_NDELAY)
-		flags |= O_NDELAY;
-
-	return flags;
-}
-
 struct file *bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
 				   const struct blk_holder_ops *hops)
 {
@@ -1073,6 +1146,8 @@  void bdev_release(struct file *bdev_file)
 		blkdev_put_part(bdev);
 	else
 		blkdev_put_whole(bdev);
+
+	unstash_bdev_file(bdev);
 	mutex_unlock(&disk->open_mutex);
 
 	module_put(disk->fops->owner);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cb1526ec44b5..22f736908cbe 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -70,6 +70,7 @@  struct block_device {
 #endif
 	bool			bd_ro_warned;
 	int			bd_writers;
+	int			stash_count;
 	/*
 	 * keep this out-of-line as it's both big and not needed in the fast
 	 * path