Message ID | 20240406090930.2252838-23-yukuai1@huaweicloud.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fs & block: remove bdev->bd_inode | expand |
On Sat, Apr 06, 2024 at 05:09:26PM +0800, Yu Kuai wrote: > From: Yu Kuai <yukuai3@huawei.com> > > So that iomap and bffer_head can convert to use bdev_file in following > patches. Let me see if I got it straight. You introduce dummy struct file instances (no methods, nothing). The *ONLY* purpose they serve is to correspond to opened instances of struct bdev. No other use is possible. You shove them into ->i_private of bdevfs inodes. Lifetime rules are... odd. In bdev_open() you arrange for such beast to be present. You never return it anywhere, they only get accessed via ->i_private, exposing it at least to fs/buffer.c. Reference to those suckers get stored (without grabbing refcount) into buffer_head instances. And all of that is for... what, exactly?
On Sat, Apr 06, 2024 at 08:42:06PM +0100, Al Viro wrote: > On Sat, Apr 06, 2024 at 05:09:26PM +0800, Yu Kuai wrote: > > From: Yu Kuai <yukuai3@huawei.com> > > > > So that iomap and bffer_head can convert to use bdev_file in following > > patches. > > Let me see if I got it straight. You introduce dummy struct file instances > (no methods, nothing). The *ONLY* purpose they serve is to correspond to > opened instances of struct bdev. No other use is possible. > > You shove them into ->i_private of bdevfs inodes. Lifetime rules are... > odd. > > In bdev_open() you arrange for such beast to be present. You never > return it anywhere, they only get accessed via ->i_private, exposing > it at least to fs/buffer.c. Reference to those suckers get stored > (without grabbing refcount) into buffer_head instances. > > And all of that is for... what, exactly? Put another way, what's the endgame here? Are you going to try and propagate those beasts down into bio_alloc()? Because if you do not, you need to keep struct block_device * around anyway. We use ->b_bdev for several things: * passing to bio_alloc() (quite a few places) * %pg in debugging printks * (rare) passing to write_boundary_block(). * (twice) passing to clean_bdev_aliases(). * (once) passing to __find_get_block(). * one irregular use as a key in lookup_bh_lru() IDGI...
Hi, 在 2024/04/07 4:29, Al Viro 写道: > On Sat, Apr 06, 2024 at 08:42:06PM +0100, Al Viro wrote: >> On Sat, Apr 06, 2024 at 05:09:26PM +0800, Yu Kuai wrote: >>> From: Yu Kuai <yukuai3@huawei.com> >>> >>> So that iomap and bffer_head can convert to use bdev_file in following >>> patches. >> >> Let me see if I got it straight. You introduce dummy struct file instances >> (no methods, nothing). The *ONLY* purpose they serve is to correspond to >> opened instances of struct bdev. No other use is possible. Yes, this is the only purpose. >> >> You shove them into ->i_private of bdevfs inodes. Lifetime rules are... >> odd. >> >> In bdev_open() you arrange for such beast to be present. You never >> return it anywhere, they only get accessed via ->i_private, exposing >> it at least to fs/buffer.c. Reference to those suckers get stored >> (without grabbing refcount) into buffer_head instances. >> >> And all of that is for... what, exactly? > > Put another way, what's the endgame here? Are you going to try and > propagate those beasts down into bio_alloc()? Because if you do not, > you need to keep struct block_device * around anyway. Yes, patch 23-26 already do the work to remove the field block_device and convert to use bdev_file for iomap and buffer_head. Or maybe you prefer the idea from last version to keep the block_device field in iomap/buffer_head, and use it for raw block_device fops? Thanks, Kuai > > We use ->b_bdev for several things: > * passing to bio_alloc() (quite a few places) > * %pg in debugging printks > * (rare) passing to write_boundary_block(). > * (twice) passing to clean_bdev_aliases(). > * (once) passing to __find_get_block(). > * one irregular use as a key in lookup_bh_lru() > > IDGI... > . >
On Sun, Apr 07, 2024 at 09:18:20AM +0800, Yu Kuai wrote: > Yes, patch 23-26 already do the work to remove the field block_device > and convert to use bdev_file for iomap and buffer_head. What for? I mean, what makes that dummy struct file * any better than struct block_device *? What's the point? I agree that keeping an opened struct file for a block device is a good idea - certainly better than weird crap used to carry the "how had it been opened" along with bdev. But that does *not* mean not keeping ->s_bdev around; we might or might not find that convenient, but it's not "struct block_device is Evil(tm), let's exorcise". Why do we care to do anything to struct buffer_head? Or to struct bio, for that matter... I'm not saying that parts of the patchset do not make sense on their own, but I don't understand what the last part is all about. Al, still going through that series...
Hi, 在 2024/04/07 9:51, Al Viro 写道: > On Sun, Apr 07, 2024 at 09:18:20AM +0800, Yu Kuai wrote: > >> Yes, patch 23-26 already do the work to remove the field block_device >> and convert to use bdev_file for iomap and buffer_head. > > What for? I mean, what makes that dummy struct file * any better than > struct block_device *? What's the point? > > I agree that keeping an opened struct file for a block device is > a good idea - certainly better than weird crap used to carry the > "how had it been opened" along with bdev. But that does *not* > mean not keeping ->s_bdev around; we might or might not find that > convenient, but it's not "struct block_device is Evil(tm), let's > exorcise". > > Why do we care to do anything to struct buffer_head? Or to > struct bio, for that matter... Other than raw block_device fops, other filesystems can use the opened bdev_file directly for iomap and buffer_head, and they actually don't need to reference block_device anymore. The point here is that whether we want to keep a special handling for block_device fops or not. There are two proposes now: - one is from Christian to keep using block_device for block_device fops, in order to do that, a new flag and some special handling is added to iomap and buffer_head. See the patch from last version [1]. - one is from this patchset, allocate a *dummy* bdev_file just for iomap and buffer_head to access bdev and bd_inode. I personally prefer the later one, that's why there is a new version, however, what do I know? That will depend on how people think. [1] https://lore.kernel.org/all/20240222124555.2049140-20-yukuai1@huaweicloud.com/ Thanks, Kuai > > I'm not saying that parts of the patchset do not make sense on > their own, but I don't understand what the last part is all > about. > > Al, still going through that series... > . >
On Sun, Apr 07, 2024 at 10:34:56AM +0800, Yu Kuai wrote: > Other than raw block_device fops, other filesystems can use the opened > bdev_file directly for iomap and buffer_head, and they actually don't > need to reference block_device anymore. The point here is that whether What do you mean, "reference"? The counting reference is to opened file; ->s_bdev is a cached pointer to associated struct block_device, and neither it nor pointers in buffer_head are valid past the moment when you close the file. Storing (non-counting) pointers to struct file in struct buffer_head is not different in that respect - they are *still* only valid while the "master" reference is held. Again, what's the point of storing struct file * in struct buffer_head or struct iomap? In any instances of those structures? There is a good reason to have it in places that keep a reference to opened block device - the kind that _keeps_ the device opened. Namely, there's state that need to be carried from the place where we'd opened the sucker to the place where we close it, and that state is better carried by opened file. But neither iomap nor buffer_head contain anything of that sort - the lifetime management of the opened device is not in their competence. As the matter of fact, the logics around closing those opened devices (bdev_release()) makes sure that no instances of buffer_head (or iomap) will outlive them. And they don't care about any extra state - everything they use is in block_device and coallocated inode. I could've easily missed something in one of the threads around the earlier iterations of the patchset; if that's the case, could somebody restate the rationale for that part and/or post relevant lore.kernel.org links? Christian? hch? What am I missing here?
Hi, 在 2024/04/07 11:06, Al Viro 写道: > On Sun, Apr 07, 2024 at 10:34:56AM +0800, Yu Kuai wrote: > >> Other than raw block_device fops, other filesystems can use the opened >> bdev_file directly for iomap and buffer_head, and they actually don't >> need to reference block_device anymore. The point here is that whether > > What do you mean, "reference"? The counting reference is to opened > file; ->s_bdev is a cached pointer to associated struct block_device, > and neither it nor pointers in buffer_head are valid past the moment > when you close the file. Storing (non-counting) pointers to struct > file in struct buffer_head is not different in that respect - they > are *still* only valid while the "master" reference is held. > > Again, what's the point of storing struct file * in struct buffer_head > or struct iomap? In any instances of those structures? Perhaps this is what you missed, like the title of this set, in order to remove direct acceess of bdev->bd_inode from fs/buffer, we must store bdev_file in buffer_head and iomap, and 'bdev->bd_inode' is replaced with 'file_inode(bdev)' now. Some history of previous discussions: [1] https://lore.kernel.org/all/ZWRDeQ4K8BiYnV+X@infradead.org/ [2] https://lore.kernel.org/all/28237ec3-c3c1-1f0c-5250-04a88845d4a6@huaweicloud.com/ [3] https://lore.kernel.org/all/20240129-vfs-bdev-file-bd_inode-v1-0-42eb9eea96cf@kernel.org/ Thanks, Kuai > > There is a good reason to have it in places that keep a reference to > opened block device - the kind that _keeps_ the device opened. Namely, > there's state that need to be carried from the place where we'd opened > the sucker to the place where we close it, and that state is better > carried by opened file. > > But neither iomap nor buffer_head contain anything of that sort - > the lifetime management of the opened device is not in their > competence. As the matter of fact, the logics around closing > those opened devices (bdev_release()) makes sure that no > instances of buffer_head (or iomap) will outlive them. > And they don't care about any extra state - everything > they use is in block_device and coallocated inode. > > I could've easily missed something in one of the threads around > the earlier iterations of the patchset; if that's the case, > could somebody restate the rationale for that part and/or > post relevant lore.kernel.org links? Christian? hch? > What am I missing here? > . >
On Sun, Apr 07, 2024 at 11:21:56AM +0800, Yu Kuai wrote: > Perhaps this is what you missed, like the title of this set, in order to > remove direct acceess of bdev->bd_inode from fs/buffer, we must store > bdev_file in buffer_head and iomap, and 'bdev->bd_inode' is replaced > with 'file_inode(bdev)' now. TBH, that looks like a very massive overkill that doesn't address the real issues - you are still poking in that coallocated struct inode, only instead of fetching it from pointer in struct block_device you get it from a pointer in a dummy struct file. How is that an improvement? After all, grepping for '\->[ ]*bd_inode\>' and looking through the few remaining users in e.g. fs/buffer.c is *much* easier than grepping for file_inode callers. AFAICS, Christoph's objections had been about the need to use saner APIs instead of getting to inode in some way and poking in it. And I agree that quite a few things in that series do just that. The final part doesn't. IMO, those dummy struct file (used as convenient storage for pointer to address_space *and* to the damn inode, with all its guts hanging out) are simply wrong. To reiterate: * we need to reduce the number of uses of those inodes * we need to find out what *is* getting used and sort out the sane set of primitives; that's hard to do when we still have a lot of noise. * we need convert to those saner primitives * we need to prevent reintroduction of noise, or at least make such reintroduced noise easy to catch and whack. ->bd_inode is a problem because it's an attractive nuisance. Removing it would be fine, if there wasn't a harder to spot alternative way to get the same pointer. Try to grep for file_inode and bd_inode resp. and compare the hit counts. Seriously reduced set of bd_inode users is fine - my impression is that after this series without the final part we'd be down to 20 users or so. In the meanwhile, there's about 1.4e3 users of file_inode(), most of them completely unrelated to block devices. PS: in grow_dev_folio() we probably want struct address_space *mapping = bdev->bd_inode->i_mapping; instead of struct inode *inode = bdev->bd_inode; as one of the preliminary chunks. FWIW, it really looks like address_space (== page cache of block device, not an unreasonably candidate for primitive) and block size (well, logarithm thereof) cover the majority of what remains, with device size possibly being (remote) third...
On Sun, Apr 07, 2024 at 05:57:58AM +0100, Al Viro wrote: > PS: in grow_dev_folio() we probably want > struct address_space *mapping = bdev->bd_inode->i_mapping; > instead of > struct inode *inode = bdev->bd_inode; > as one of the preliminary chunks. > FWIW, it really looks like address_space (== page cache of block device, > not an unreasonably candidate for primitive) and block size (well, > logarithm thereof) cover the majority of what remains, with device > size possibly being (remote) third... Incidentally, how painful would it be to switch __bread_gfp() and __bread() to passing *logarithm* of block size instead of block size? And possibly supply the same to clean_bdev_aliases()... That would reduce fs/buffer.c uses to just "give me the address_space of that block device"...
On Sun, Apr 07, 2024 at 06:11:19AM +0100, Al Viro wrote: > On Sun, Apr 07, 2024 at 05:57:58AM +0100, Al Viro wrote: > > > PS: in grow_dev_folio() we probably want > > struct address_space *mapping = bdev->bd_inode->i_mapping; > > instead of > > struct inode *inode = bdev->bd_inode; > > as one of the preliminary chunks. > > FWIW, it really looks like address_space (== page cache of block device, > > not an unreasonably candidate for primitive) and block size (well, > > logarithm thereof) cover the majority of what remains, with device > > size possibly being (remote) third... > > Incidentally, how painful would it be to switch __bread_gfp() and __bread() > to passing *logarithm* of block size instead of block size? And possibly > supply the same to clean_bdev_aliases()... > > That would reduce fs/buffer.c uses to just "give me the address_space of > that block device"... ... and from what I've seen in your series, it very much looks like after that we could replace ->bd_inode with ->bd_mapping, turning your bdev_mapping() into an inline and (hopefully) leaving the few remaining uses of bdev_inode() outside of block/bdev.c _not_ on hot paths. If nothing else, it would make it much easier to grep for remaining odd stuff. Might trim the btrfs parts of the series, at that - a lot of that seems to be "how do we propagate opened file instead of just bdev, so that we could get to its ->f_mapping deep in call chain"... Again, all of that is only if __bread...() conversion to log(size) is feasible without a massive PITA - there might be dragons...
On Sun, Apr 07, 2024 at 11:21:56AM +0800, Yu Kuai wrote: > Hi, > > 在 2024/04/07 11:06, Al Viro 写道: > > On Sun, Apr 07, 2024 at 10:34:56AM +0800, Yu Kuai wrote: > > > > > Other than raw block_device fops, other filesystems can use the opened > > > bdev_file directly for iomap and buffer_head, and they actually don't > > > need to reference block_device anymore. The point here is that whether > > > > What do you mean, "reference"? The counting reference is to opened > > file; ->s_bdev is a cached pointer to associated struct block_device, > > and neither it nor pointers in buffer_head are valid past the moment > > when you close the file. Storing (non-counting) pointers to struct > > file in struct buffer_head is not different in that respect - they > > are *still* only valid while the "master" reference is held. > > > > Again, what's the point of storing struct file * in struct buffer_head > > or struct iomap? In any instances of those structures? > > Perhaps this is what you missed, like the title of this set, in order to > remove direct acceess of bdev->bd_inode from fs/buffer, we must store > bdev_file in buffer_head and iomap, and 'bdev->bd_inode' is replaced > with 'file_inode(bdev)' now. BTW, what does that have to do with iomap? All it passes ->bdev to is 1) bio_alloc() 2) bio_alloc_bioset() 3) bio_init() 4) bdev_logical_block_size() 5) bdev_iter_is_aligned() 6) bdev_fua() 7) bdev_write_cache() None of those goes anywhere near fs/buffer.c or uses ->bd_inode, AFAICS. Again, what's the point? It feels like you are trying to replace *all* uses of struct block_device with struct file, just because. If that's what's going on, please don't. Using struct file instead of that bdev_handle crap - sure, makes perfect sense. But shoving it down into struct bio really, really does not. I'd suggest to start with adding ->bd_mapping as the first step and converting the places where mapping is all we want to using that. Right at the beginning of your series. Then let's see what gets left. And leave ->bd_inode there for now; don't blindly replace it with ->bd_mapping->host everywhere. It's much easier to grep for. The point of the exercise is to find what do we really need ->bd_inode for and what primitives are missing, not getting rid of a bad word...
On Tue, Apr 09, 2024 at 05:26:43AM +0100, Al Viro wrote: > On Sun, Apr 07, 2024 at 11:21:56AM +0800, Yu Kuai wrote: > > Hi, > > > > 在 2024/04/07 11:06, Al Viro 写道: > > > On Sun, Apr 07, 2024 at 10:34:56AM +0800, Yu Kuai wrote: > > > > > > > Other than raw block_device fops, other filesystems can use the opened > > > > bdev_file directly for iomap and buffer_head, and they actually don't > > > > need to reference block_device anymore. The point here is that whether > > > > > > What do you mean, "reference"? The counting reference is to opened > > > file; ->s_bdev is a cached pointer to associated struct block_device, > > > and neither it nor pointers in buffer_head are valid past the moment > > > when you close the file. Storing (non-counting) pointers to struct > > > file in struct buffer_head is not different in that respect - they > > > are *still* only valid while the "master" reference is held. > > > > > > Again, what's the point of storing struct file * in struct buffer_head > > > or struct iomap? In any instances of those structures? > > > > Perhaps this is what you missed, like the title of this set, in order to > > remove direct acceess of bdev->bd_inode from fs/buffer, we must store > > bdev_file in buffer_head and iomap, and 'bdev->bd_inode' is replaced > > with 'file_inode(bdev)' now. > > BTW, what does that have to do with iomap? All it passes ->bdev to is > 1) bio_alloc() > 2) bio_alloc_bioset() > 3) bio_init() > 4) bdev_logical_block_size() > 5) bdev_iter_is_aligned() > 6) bdev_fua() > 7) bdev_write_cache() > > None of those goes anywhere near fs/buffer.c or uses ->bd_inode, AFAICS. Note that callers of iomap stuff in block/fops.c *do* have struct file *, so there's no problem with getting to inode - there the use of ->f_mapping->host is normal for ->write_iter()/->read_iter() instances. Same for filemap_read() and iomap_file_buffered_write(). As the matter of fact, the only use of ->bd_inode in block/fops.c is easily killable, as discussed upthread.
Hi, 在 2024/04/09 12:26, Al Viro 写道: > On Sun, Apr 07, 2024 at 11:21:56AM +0800, Yu Kuai wrote: >> Hi, >> >> 在 2024/04/07 11:06, Al Viro 写道: >>> On Sun, Apr 07, 2024 at 10:34:56AM +0800, Yu Kuai wrote: >>> >>>> Other than raw block_device fops, other filesystems can use the opened >>>> bdev_file directly for iomap and buffer_head, and they actually don't >>>> need to reference block_device anymore. The point here is that whether >>> >>> What do you mean, "reference"? The counting reference is to opened >>> file; ->s_bdev is a cached pointer to associated struct block_device, >>> and neither it nor pointers in buffer_head are valid past the moment >>> when you close the file. Storing (non-counting) pointers to struct >>> file in struct buffer_head is not different in that respect - they >>> are *still* only valid while the "master" reference is held. >>> >>> Again, what's the point of storing struct file * in struct buffer_head >>> or struct iomap? In any instances of those structures? >> >> Perhaps this is what you missed, like the title of this set, in order to >> remove direct acceess of bdev->bd_inode from fs/buffer, we must store >> bdev_file in buffer_head and iomap, and 'bdev->bd_inode' is replaced >> with 'file_inode(bdev)' now. > > BTW, what does that have to do with iomap? All it passes ->bdev to is > 1) bio_alloc() > 2) bio_alloc_bioset() > 3) bio_init() > 4) bdev_logical_block_size() > 5) bdev_iter_is_aligned() > 6) bdev_fua() > 7) bdev_write_cache() > > None of those goes anywhere near fs/buffer.c or uses ->bd_inode, AFAICS. > > Again, what's the point? It feels like you are trying to replace *all* > uses of struct block_device with struct file, just because. > > If that's what's going on, please don't. Using struct file instead > of that bdev_handle crap - sure, makes perfect sense. But shoving it > down into struct bio really, really does not. > > I'd suggest to start with adding ->bd_mapping as the first step and > converting the places where mapping is all we want to using that. > Right at the beginning of your series. Then let's see what gets > left. Thanks so much for your advice, in fact, I totally agree with this that adding a 'bd_mapping' or expose the helper bdev_mapping(). However, I will let Christoph and Jan to make the decision, when they get time to take a look at this. Thanks! Kuai > > And leave ->bd_inode there for now; don't blindly replace it with > ->bd_mapping->host everywhere. It's much easier to grep for. > The point of the exercise is to find what do we really need ->bd_inode > for and what primitives are missing, not getting rid of a bad word... > . >
On Sun, Apr 07, 2024 at 04:06:10AM +0100, Al Viro wrote: > On Sun, Apr 07, 2024 at 10:34:56AM +0800, Yu Kuai wrote: > > > Other than raw block_device fops, other filesystems can use the opened > > bdev_file directly for iomap and buffer_head, and they actually don't > > need to reference block_device anymore. The point here is that whether > > What do you mean, "reference"? The counting reference is to opened > file; ->s_bdev is a cached pointer to associated struct block_device, > and neither it nor pointers in buffer_head are valid past the moment > when you close the file. Storing (non-counting) pointers to struct > file in struct buffer_head is not different in that respect - they > are *still* only valid while the "master" reference is held. > > Again, what's the point of storing struct file * in struct buffer_head > or struct iomap? In any instances of those structures? > > There is a good reason to have it in places that keep a reference to > opened block device - the kind that _keeps_ the device opened. Namely, > there's state that need to be carried from the place where we'd opened > the sucker to the place where we close it, and that state is better > carried by opened file. > > But neither iomap nor buffer_head contain anything of that sort - > the lifetime management of the opened device is not in their > competence. As the matter of fact, the logics around closing > those opened devices (bdev_release()) makes sure that no > instances of buffer_head (or iomap) will outlive them. > And they don't care about any extra state - everything > they use is in block_device and coallocated inode. > > I could've easily missed something in one of the threads around > the earlier iterations of the patchset; if that's the case, > could somebody restate the rationale for that part and/or > post relevant lore.kernel.org links? Christian? hch? > What am I missing here? The original series was a simple RFC/POC to show that struct file could be used to remove bd_inode access in a wide variety of situations. But as I've mentioned in that thread I wasn't happy with various aspects of the approach which is why I never pushed forward with it. The part where we pushed struct file into buffer_header was the most obvious one.
> +static int __stash_bdev_file(struct block_device *bdev)
I've said that on the previous version. I think that this is really
error prone and seems overall like an unpleasant solution. I would
really like to avoid going down that route.
I think a chunk of this series is good though specicially simple
conversions of individual filesystems where file_inode() or f_mapping
makes sense. There's a few exceptions where we might be better of
replacing the current apis with something else (I think Al touched on
that somewhere further down the thread.).
I'd suggest the straightforward bd_inode removals into a separate series
that I can take.
Thanks for working on all of this. It's certainly a contentious area.
Hi, 在 2024/04/09 18:23, Christian Brauner 写道: >> +static int __stash_bdev_file(struct block_device *bdev) > > I've said that on the previous version. I think that this is really > error prone and seems overall like an unpleasant solution. I would > really like to avoid going down that route. Yes, I see your point, and it's indeed reasonable. > > I think a chunk of this series is good though specicially simple > conversions of individual filesystems where file_inode() or f_mapping > makes sense. There's a few exceptions where we might be better of > replacing the current apis with something else (I think Al touched on > that somewhere further down the thread.). > > I'd suggest the straightforward bd_inode removals into a separate series > that I can take. > > Thanks for working on all of this. It's certainly a contentious area. How about following simple patch to expose bdev_mapping() for fs/buffer.c for now? Thanks, Kuai diff --git a/block/blk.h b/block/blk.h index a34bb590cce6..f8bcb43a12c6 100644 --- a/block/blk.h +++ b/block/blk.h @@ -428,7 +428,6 @@ static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev, #endif /* CONFIG_BLK_DEV_ZONED */ struct inode *bdev_inode(struct block_device *bdev); -struct address_space *bdev_mapping(struct block_device *bdev); struct block_device *bdev_alloc(struct gendisk *disk, u8 partno); void bdev_add(struct block_device *bdev, dev_t dev); diff --git a/fs/buffer.c b/fs/buffer.c index 4f73d23c2c46..e2bd19e3fe48 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -189,8 +189,8 @@ EXPORT_SYMBOL(end_buffer_write_sync); static struct buffer_head * __find_get_block_slow(struct block_device *bdev, sector_t block) { - struct inode *bd_inode = bdev->bd_inode; - struct address_space *bd_mapping = bd_inode->i_mapping; + struct address_space *bd_mapping = bdev_mapping(bdev); + struct inode *bd_inode = bd_mapping->host; struct buffer_head *ret = NULL; pgoff_t index; struct buffer_head *bh; @@ -1034,12 +1034,12 @@ static sector_t folio_init_buffers(struct folio *folio, static bool grow_dev_folio(struct block_device *bdev, sector_t block, pgoff_t index, unsigned size, gfp_t gfp) { - struct inode *inode = bdev->bd_inode; + struct address_space *bd_mapping = bdev_mapping(bdev); struct folio *folio; struct buffer_head *bh; sector_t end_block = 0; - folio = __filemap_get_folio(inode->i_mapping, index, + folio = __filemap_get_folio(bd_mapping, index, FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp); if (IS_ERR(folio)) return false; @@ -1073,10 +1073,10 @@ static bool grow_dev_folio(struct block_device *bdev, sector_t block, * lock to be atomic wrt __find_get_block(), which does not * run under the folio lock. */ - spin_lock(&inode->i_mapping->i_private_lock); + spin_lock(&bd_mapping->i_private_lock); link_dev_buffers(folio, bh); end_block = folio_init_buffers(folio, bdev, size); - spin_unlock(&inode->i_mapping->i_private_lock); + spin_unlock(&bd_mapping->i_private_lock); unlock: folio_unlock(folio); folio_put(folio); @@ -1463,7 +1463,7 @@ __bread_gfp(struct block_device *bdev, sector_t block, { struct buffer_head *bh; - gfp |= mapping_gfp_constraint(bdev->bd_inode->i_mapping, ~__GFP_FS); + gfp |= mapping_gfp_constraint(bdev_mapping(bdev), ~__GFP_FS); /* * Prefer looping in the allocator rather than here, at least that @@ -1696,8 +1696,8 @@ EXPORT_SYMBOL(create_empty_buffers); */ void clean_bdev_aliases(struct block_device *bdev, sector_t block, sector_t len) { - struct inode *bd_inode = bdev->bd_inode; - struct address_space *bd_mapping = bd_inode->i_mapping; + struct address_space *bd_mapping = bdev_mapping(bdev); + struct inode *bd_inode = bd_mapping->host; struct folio_batch fbatch; pgoff_t index = ((loff_t)block << bd_inode->i_blkbits) / PAGE_SIZE; pgoff_t end; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index bc840e0fb6e5..bbae55535d53 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1527,6 +1527,7 @@ void blkdev_put_no_open(struct block_device *bdev); struct block_device *I_BDEV(struct inode *inode); struct block_device *file_bdev(struct file *bdev_file); +struct address_space *bdev_mapping(struct block_device *bdev); bool disk_live(struct gendisk *disk); unsigned int block_size(struct block_device *bdev); > . >
On Tue 09-04-24 14:22:37, Yu Kuai wrote: > Hi, > > 在 2024/04/09 12:26, Al Viro 写道: > > On Sun, Apr 07, 2024 at 11:21:56AM +0800, Yu Kuai wrote: > > > Hi, > > > > > > 在 2024/04/07 11:06, Al Viro 写道: > > > > On Sun, Apr 07, 2024 at 10:34:56AM +0800, Yu Kuai wrote: > > > > > > > > > Other than raw block_device fops, other filesystems can use the opened > > > > > bdev_file directly for iomap and buffer_head, and they actually don't > > > > > need to reference block_device anymore. The point here is that whether > > > > > > > > What do you mean, "reference"? The counting reference is to opened > > > > file; ->s_bdev is a cached pointer to associated struct block_device, > > > > and neither it nor pointers in buffer_head are valid past the moment > > > > when you close the file. Storing (non-counting) pointers to struct > > > > file in struct buffer_head is not different in that respect - they > > > > are *still* only valid while the "master" reference is held. > > > > > > > > Again, what's the point of storing struct file * in struct buffer_head > > > > or struct iomap? In any instances of those structures? > > > > > > Perhaps this is what you missed, like the title of this set, in order to > > > remove direct acceess of bdev->bd_inode from fs/buffer, we must store > > > bdev_file in buffer_head and iomap, and 'bdev->bd_inode' is replaced > > > with 'file_inode(bdev)' now. > > > > BTW, what does that have to do with iomap? All it passes ->bdev to is > > 1) bio_alloc() > > 2) bio_alloc_bioset() > > 3) bio_init() > > 4) bdev_logical_block_size() > > 5) bdev_iter_is_aligned() > > 6) bdev_fua() > > 7) bdev_write_cache() > > > > None of those goes anywhere near fs/buffer.c or uses ->bd_inode, AFAICS. > > > > Again, what's the point? It feels like you are trying to replace *all* > > uses of struct block_device with struct file, just because. > > > > If that's what's going on, please don't. Using struct file instead > > of that bdev_handle crap - sure, makes perfect sense. But shoving it > > down into struct bio really, really does not. > > > > I'd suggest to start with adding ->bd_mapping as the first step and > > converting the places where mapping is all we want to using that. > > Right at the beginning of your series. Then let's see what gets > > left. > > Thanks so much for your advice, in fact, I totally agree with this that > adding a 'bd_mapping' or expose the helper bdev_mapping(). > > However, I will let Christoph and Jan to make the decision, when they > get time to take a look at this. I agree with Christian and Al - and I think I've expressed that already in the previous version of the series [1] but I guess I was not explicit enough :). I think the initial part of the series (upto patch 21, perhaps excluding patch 20) is a nice cleanup but the latter part playing with stashing struct file is not an improvement and seems pointless to me. So I'd separate the initial part cleaning up the obvious places and let Christian merge it and then we can figure out what (if anything) to do with remaining bd_inode uses in fs/buffer.c etc. E.g. what Al suggests with bd_mapping makes sense to me but I didn't check what's left after your initial patches... Honza [1] https://lore.kernel.org/all/20240322125750.jov4f3alsrkmqnq7@quack3
On Wed, Apr 10, 2024 at 12:59:11PM +0200, Jan Kara wrote: > I agree with Christian and Al - and I think I've expressed that already in > the previous version of the series [1] but I guess I was not explicit > enough :). I think the initial part of the series (upto patch 21, perhaps > excluding patch 20) is a nice cleanup but the latter part playing with > stashing struct file is not an improvement and seems pointless to me. So > I'd separate the initial part cleaning up the obvious places and let > Christian merge it and then we can figure out what (if anything) to do with > remaining bd_inode uses in fs/buffer.c etc. E.g. what Al suggests with > bd_mapping makes sense to me but I didn't check what's left after your > initial patches... FWIW, experimental on top of -next: Al Viro (7): block_device: add a pointer to struct address_space (page cache of bdev) use ->bd_mapping instead of ->bd_inode->i_mapping grow_dev_folio(): we only want ->bd_inode->i_mapping there gfs2: more obvious initializations of mapping->host blkdev_write_iter(): saner way to get inode and bdev blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here... dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode) Yu Kuai (4): ext4: remove block_device_ejected() block: move two helpers into bdev.c bcachefs: remove dead function bdev_sectors() block2mtd: prevent direct access of bd_inode [slightly modified] leaves only this: block/bdev.c:60: struct inode *inode = bdev->bd_inode; block/bdev.c:137: loff_t size = i_size_read(bdev->bd_inode); block/bdev.c:144: bdev->bd_inode->i_blkbits = blksize_bits(bsize); block/bdev.c:158: if (bdev->bd_inode->i_blkbits != blksize_bits(size)) { block/bdev.c:160: bdev->bd_inode->i_blkbits = blksize_bits(size); block/bdev.c:415: bdev->bd_inode = inode; block/bdev.c:434: i_size_write(bdev->bd_inode, (loff_t)sectors << SECTOR_SHIFT); block/bdev.c:444: bdev->bd_inode->i_rdev = dev; block/bdev.c:445: bdev->bd_inode->i_ino = dev; block/bdev.c:446: insert_inode_hash(bdev->bd_inode); block/bdev.c:974: bdev_file = alloc_file_pseudo_noaccount(bdev->bd_inode, block/bdev.c:980: ihold(bdev->bd_inode); block/bdev.c:1257: return !inode_unhashed(disk->part0->bd_inode); block/bdev.c:1263: return 1 << bdev->bd_inode->i_blkbits; block/genhd.c:659: remove_inode_hash(part->bd_inode); block/genhd.c:1194: iput(disk->part0->bd_inode); /* frees the disk */ block/genhd.c:1384: iput(disk->part0->bd_inode); block/partitions/core.c:246: iput(dev_to_bdev(dev)->bd_inode); block/partitions/core.c:472: remove_inode_hash(part->bd_inode); block/partitions/core.c:658: remove_inode_hash(part->bd_inode); drivers/s390/block/dasd_ioctl.c:218: block->gdp->part0->bd_inode->i_blkbits = fs/buffer.c:192: struct inode *bd_inode = bdev->bd_inode; fs/buffer.c:1699: struct inode *bd_inode = bdev->bd_inode; fs/erofs/data.c:73: buf->inode = sb->s_bdev->bd_inode; fs/nilfs2/segment.c:2793: inode_attach_wb(nilfs->ns_bdev->bd_inode, NULL); I've got erofs patches that get rid of that instance; bdev.c is obviously priveleged since it sees coallocated inode directly. Other than those we have * 3 callers of remove_inode_hash() * 3 callers of iput() * one caller of inode_attach_wb() (nilfs2) * weird shit in DASD (redundant, that; incidentally, I don't see anything that might prevent DASD format requested with mounted partitions on that disk - and won't that be fun and joy for an admin to step into...) * two places in fs/buffer.c that want to convert block numbers to positions in bytes. Either the function itself or its caller has the block size as argument; replacing that to passing block _shift_ instead of size would reduce those two to ->bd_mapping. And that's it. iput() and remove_inode_hash() are obvious candidates for helpers (internal to block/*; no exporting those, it's private to bdev.c, genhd.c and paritions/core.c). fs/buffer.c ones need a bit more code audit (not quite done with that), but it looks at least plausible. Which would leave us with whatever nilfs2 is doing and that weirdness in dasd_format() (why set ->i_blkbits but not ->i_blocksize? why not use set_blocksize(), for that matter? where the hell is check for exclusive open?)
On Wed, Apr 10, 2024 at 11:34:43PM +0100, Al Viro wrote: > On Wed, Apr 10, 2024 at 12:59:11PM +0200, Jan Kara wrote: > > > I agree with Christian and Al - and I think I've expressed that already in > > the previous version of the series [1] but I guess I was not explicit > > enough :). I think the initial part of the series (upto patch 21, perhaps > > excluding patch 20) is a nice cleanup but the latter part playing with > > stashing struct file is not an improvement and seems pointless to me. So > > I'd separate the initial part cleaning up the obvious places and let > > Christian merge it and then we can figure out what (if anything) to do with > > remaining bd_inode uses in fs/buffer.c etc. E.g. what Al suggests with > > bd_mapping makes sense to me but I didn't check what's left after your > > initial patches... > > FWIW, experimental on top of -next: Ok, let's move forward with this. I've applied the first 19 patches. Patch 20 is the start of what we all disliked. 21 is clearly a bugfix for current code so that'll go separately from the rest. I've replaced open-code f_mapping access with file_mapping(). The symmetry between file_inode() and file_mapping() is quite nice. Al, your idea to switch erofs away from buf->inode can go on top of what Yu did imho. There's no real reason to throw it away imho. I've exported bdev_mapping() because it really makes the btrfs change a lot slimmer and we don't need to care about messing with a lot of that code. I didn't care about making it static inline because that might've meant we need to move other stuff into the header as well. Imho, it's not that important but if it's a big deal to any of you just do the changes on top of it, please. Pushed to https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.super If I hear no objections that'll show up in -next tomorrow. Al, would be nice if you could do your changes on top of this, please.
On Thu, Apr 11, 2024 at 01:56:03PM +0200, Christian Brauner wrote: > On Wed, Apr 10, 2024 at 11:34:43PM +0100, Al Viro wrote: > > On Wed, Apr 10, 2024 at 12:59:11PM +0200, Jan Kara wrote: > > > > > I agree with Christian and Al - and I think I've expressed that already in > > > the previous version of the series [1] but I guess I was not explicit > > > enough :). I think the initial part of the series (upto patch 21, perhaps > > > excluding patch 20) is a nice cleanup but the latter part playing with > > > stashing struct file is not an improvement and seems pointless to me. So > > > I'd separate the initial part cleaning up the obvious places and let > > > Christian merge it and then we can figure out what (if anything) to do with > > > remaining bd_inode uses in fs/buffer.c etc. E.g. what Al suggests with > > > bd_mapping makes sense to me but I didn't check what's left after your > > > initial patches... > > > > FWIW, experimental on top of -next: > > Ok, let's move forward with this. I've applied the first 19 patches. > Patch 20 is the start of what we all disliked. 21 is clearly a bugfix > for current code so that'll go separately from the rest. I've replaced > open-code f_mapping access with file_mapping(). The symmetry between > file_inode() and file_mapping() is quite nice. > > Al, your idea to switch erofs away from buf->inode can go on top of what > Yu did imho. There's no real reason to throw it away imho. > > I've exported bdev_mapping() because it really makes the btrfs change a > lot slimmer and we don't need to care about messing with a lot of that > code. I didn't care about making it static inline because that might've > meant we need to move other stuff into the header as well. Imho, it's > not that important but if it's a big deal to any of you just do the > changes on top of it, please. > > Pushed to > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.super > > If I hear no objections that'll show up in -next tomorrow. Al, would be > nice if you could do your changes on top of this, please. Objection: start with adding bdev->bd_mapping, next convert the really obvious instances to it and most of this series becomes not needed at all. Really. There is no need whatsoever to push struct file down all those paths. And yes, erofs and buffer.c stuff belongs on top of that, no arguments here.
On Thu, Apr 11, 2024 at 03:04:09PM +0100, Al Viro wrote: > > lot slimmer and we don't need to care about messing with a lot of that > > code. I didn't care about making it static inline because that might've > > meant we need to move other stuff into the header as well. Imho, it's > > not that important but if it's a big deal to any of you just do the > > changes on top of it, please. > > > > Pushed to > > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.super > > > > If I hear no objections that'll show up in -next tomorrow. Al, would be > > nice if you could do your changes on top of this, please. > > Objection: start with adding bdev->bd_mapping, next convert the really > obvious instances to it and most of this series becomes not needed at > all. > > Really. There is no need whatsoever to push struct file down all those > paths. > > And yes, erofs and buffer.c stuff belongs on top of that, no arguments here. FWIW, here's what you get if this is done in such order: block/bdev.c | 31 ++++++++++++++++++++++--------- block/blk-zoned.c | 4 ++-- block/fops.c | 4 ++-- block/genhd.c | 2 +- block/ioctl.c | 14 ++++++-------- block/partitions/core.c | 2 +- drivers/md/bcache/super.c | 2 +- drivers/md/dm-vdo/dm-vdo-target.c | 4 ++-- drivers/md/dm-vdo/indexer/io-factory.c | 2 +- drivers/mtd/devices/block2mtd.c | 6 ++++-- drivers/scsi/scsicam.c | 2 +- fs/bcachefs/util.h | 5 ----- fs/btrfs/disk-io.c | 6 +++--- fs/btrfs/volumes.c | 2 +- fs/btrfs/zoned.c | 2 +- fs/buffer.c | 10 +++++----- fs/cramfs/inode.c | 2 +- fs/ext4/dir.c | 2 +- fs/ext4/ext4_jbd2.c | 2 +- fs/ext4/super.c | 24 +++--------------------- fs/gfs2/glock.c | 2 +- fs/gfs2/ops_fstype.c | 2 +- fs/jbd2/journal.c | 2 +- include/linux/blk_types.h | 1 + include/linux/blkdev.h | 12 ++---------- include/linux/buffer_head.h | 4 ++-- include/linux/jbd2.h | 4 ++-- 27 files changed, 69 insertions(+), 86 deletions(-) The bulk of the changes is straight replacements of foo->bd_inode->i_mapping with foo->bd_mapping. That's completely mechanical and that takes out most of the bd_inode uses. Anyway, patches in followups
On Sun, Apr 07, 2024 at 06:11:19AM +0100, Al Viro wrote: > On Sun, Apr 07, 2024 at 05:57:58AM +0100, Al Viro wrote: > > > PS: in grow_dev_folio() we probably want > > struct address_space *mapping = bdev->bd_inode->i_mapping; > > instead of > > struct inode *inode = bdev->bd_inode; > > as one of the preliminary chunks. > > FWIW, it really looks like address_space (== page cache of block device, > > not an unreasonably candidate for primitive) and block size (well, > > logarithm thereof) cover the majority of what remains, with device > > size possibly being (remote) third... > > Incidentally, how painful would it be to switch __bread_gfp() and __bread() > to passing *logarithm* of block size instead of block size? And possibly > supply the same to clean_bdev_aliases()... I've looked at it because blksize_bits() was pretty horrid. But I got scared because I couldn't figure out how to make unconverted places fail to compile, without doing something ugly like -__bread(struct block_device *bdev, sector_t block, unsigned size) +__bread(unsigned shift, struct block_device *bdev, sector_t block) I assume you're not talking about changing bh->b_size, just passing in the log and comparing bh->b_size to 1<<shift?
Hi, 在 2024/04/11 22:49, Al Viro 写道: > On Thu, Apr 11, 2024 at 03:04:09PM +0100, Al Viro wrote: >>> lot slimmer and we don't need to care about messing with a lot of that >>> code. I didn't care about making it static inline because that might've >>> meant we need to move other stuff into the header as well. Imho, it's >>> not that important but if it's a big deal to any of you just do the >>> changes on top of it, please. >>> >>> Pushed to >>> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.super >>> >>> If I hear no objections that'll show up in -next tomorrow. Al, would be >>> nice if you could do your changes on top of this, please. >> >> Objection: start with adding bdev->bd_mapping, next convert the really >> obvious instances to it and most of this series becomes not needed at >> all. >> >> Really. There is no need whatsoever to push struct file down all those >> paths. >> There really is a long history here. The beginning of the attempt to try removing the filed 'bd_inode' is that I want to make a room from the first cacheline(64 bytes) for a new 'unsigned long flags' field because we keep adding new 'bool xxx' field [1]. And adding a new 'bd_mapping' field will make that impossible. I do like the idea of passing 'bd_mapping' here, however, will it be considered to expose bdev_mapping() for slow path, or to pass in bd_file and get it by 'f_mapping' for fast path? So that a new field in the first cacheline will still be possible, other than that there will be more code change, I don't see any difference for performance. Thanks, Kuai [1] https://lore.kernel.org/all/20231122103103.1104589-3-yukuai1@huaweicloud.com/ >> And yes, erofs and buffer.c stuff belongs on top of that, no arguments here. > > FWIW, here's what you get if this is done in such order: > > block/bdev.c | 31 ++++++++++++++++++++++--------- > block/blk-zoned.c | 4 ++-- > block/fops.c | 4 ++-- > block/genhd.c | 2 +- > block/ioctl.c | 14 ++++++-------- > block/partitions/core.c | 2 +- > drivers/md/bcache/super.c | 2 +- > drivers/md/dm-vdo/dm-vdo-target.c | 4 ++-- > drivers/md/dm-vdo/indexer/io-factory.c | 2 +- > drivers/mtd/devices/block2mtd.c | 6 ++++-- > drivers/scsi/scsicam.c | 2 +- > fs/bcachefs/util.h | 5 ----- > fs/btrfs/disk-io.c | 6 +++--- > fs/btrfs/volumes.c | 2 +- > fs/btrfs/zoned.c | 2 +- > fs/buffer.c | 10 +++++----- > fs/cramfs/inode.c | 2 +- > fs/ext4/dir.c | 2 +- > fs/ext4/ext4_jbd2.c | 2 +- > fs/ext4/super.c | 24 +++--------------------- > fs/gfs2/glock.c | 2 +- > fs/gfs2/ops_fstype.c | 2 +- > fs/jbd2/journal.c | 2 +- > include/linux/blk_types.h | 1 + > include/linux/blkdev.h | 12 ++---------- > include/linux/buffer_head.h | 4 ++-- > include/linux/jbd2.h | 4 ++-- > 27 files changed, 69 insertions(+), 86 deletions(-) > > The bulk of the changes is straight replacements of foo->bd_inode->i_mapping > with foo->bd_mapping. That's completely mechanical and that takes out most > of the bd_inode uses. Anyway, patches in followups > > . >
On Fri, Apr 12, 2024 at 09:38:16AM +0800, Yu Kuai wrote: > There really is a long history here. The beginning of the attempt to try > removing the filed 'bd_inode' is that I want to make a room from the > first cacheline(64 bytes) for a new 'unsigned long flags' field because > we keep adding new 'bool xxx' field [1]. And adding a new 'bd_mapping' > field will make that impossible. Why does it need to be unsigned long? dev_t is 32bit; what you need is to keep this bool bd_read_only; /* read-only policy */ u8 bd_partno; bool bd_write_holder; bool bd_has_submit_bio; from blowing past u32. Sure, you can't use test_bit() et.al. with u16, but what's wrong with explicit bitwise operations? You need some protection for multiple writers, but you need it anyway - e.g. this if (bdev->bd_disk->fops->set_read_only) { ret = bdev->bd_disk->fops->set_read_only(bdev, n); if (ret) return ret; } bdev->bd_read_only = n; will need the exclusion over the entire "call ->set_read_only() and set the flag", not just for setting the flag itself. And yes, it's a real-world bug - two threads calling BLKROSET on the same opened file can race, with inconsistency between the flag and whatever state ->set_read_only() modifies. AFAICS, ->bd_write_holder is (apparently) relying upon ->open_mutex. Whether it would be a good solution for ->bd_read_only is a question to block folks, but some exclusion is obviously needed. Let's sort that out, rather than papering it over with set_bit() et.al.
On Fri, Apr 12, 2024 at 03:59:10AM +0100, Al Viro wrote: > for multiple writers, but you need it anyway - e.g. this > if (bdev->bd_disk->fops->set_read_only) { > ret = bdev->bd_disk->fops->set_read_only(bdev, n); > if (ret) > return ret; > } > bdev->bd_read_only = n; > will need the exclusion over the entire "call ->set_read_only() and set > the flag", not just for setting the flag itself. > > And yes, it's a real-world bug - two threads calling BLKROSET on the > same opened file can race, with inconsistency between the flag and > whatever state ->set_read_only() modifies. BLKROSET is CAP_SYS_ADMIN-only, so it's not a CVE fodder; the sky is not falling. The bug is real, though. I see Christoph's postings in that thread; the thing is, it's not just the flags that need to be protected. If we end up deciding that serialization for different flags should not be tied to the same thing (which is reasonable - e.g. md_set_read_only() is not something you want to shove under existing lock), I would still suggest something along the lines of u32 __bd_flags; // partno and flags static inline u8 bd_partno(struct block_device *bdev) { return bdev->__bd_flags & 0xff; } static void bd_set_flag(struct block_device *bdev, int flag) { u32 v = bdev->__bd_flags; for (;;) { u32 w = cmpxchg(&bdev->__bd_flags, v, v | (1 << (flag + 8))); if (w == v) return; v = w; } } and similar for bd_clear_flag(). Changes of ->bd_partno never happen - we set it at allocation time and never modify the sucker. This is orthogonal to BLKROSET/BLKROSET exclusion, converting ->bd_inode accesses, etc. Christoph, do you have any problems with that approach? COMPLETELY UNTESTED patch along those lines follows; if it works, it would need to be carved up. And I would probably switch the places where we do if (bdev->bd_partno) to if (bdev_is_partition(bdev)), for better readability. I'd converted only those 3 flags; again, this is just an untested illustration to the above. diff --git a/block/bdev.c b/block/bdev.c index 7a5f611c3d2e..9aa23620fe92 100644 --- a/block/bdev.c +++ b/block/bdev.c @@ -411,13 +411,11 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno) mutex_init(&bdev->bd_fsfreeze_mutex); spin_lock_init(&bdev->bd_size_lock); mutex_init(&bdev->bd_holder_lock); - bdev->bd_partno = partno; + bdev->__bd_flags = partno; bdev->bd_inode = inode; bdev->bd_queue = disk->queue; - if (partno) - bdev->bd_has_submit_bio = disk->part0->bd_has_submit_bio; - else - bdev->bd_has_submit_bio = false; + if (partno && bdev_test_flag(disk->part0, BD_HAS_SUBMIT_BIO)) + bdev_set_flag(bdev, BD_HAS_SUBMIT_BIO); bdev->bd_stats = alloc_percpu(struct disk_stats); if (!bdev->bd_stats) { iput(inode); @@ -624,7 +622,7 @@ static void bd_end_claim(struct block_device *bdev, void *holder) bdev->bd_holder = NULL; bdev->bd_holder_ops = NULL; mutex_unlock(&bdev->bd_holder_lock); - if (bdev->bd_write_holder) + if (bdev_test_flag(bdev, BD_WRITE_HOLDER)) unblock = true; } if (!whole->bd_holders) @@ -640,7 +638,7 @@ static void bd_end_claim(struct block_device *bdev, void *holder) */ if (unblock) { disk_unblock_events(bdev->bd_disk); - bdev->bd_write_holder = false; + bdev_clear_flag(bdev, BD_WRITE_HOLDER); } } @@ -892,9 +890,10 @@ int bdev_open(struct block_device *bdev, blk_mode_t mode, void *holder, * writeable reference is too fragile given the way @mode is * used in blkdev_get/put(). */ - if ((mode & BLK_OPEN_WRITE) && !bdev->bd_write_holder && + if ((mode & BLK_OPEN_WRITE) && + !bdev_test_flag(bdev, BD_WRITE_HOLDER) && (disk->event_flags & DISK_EVENT_FLAG_BLOCK_ON_EXCL_WRITE)) { - bdev->bd_write_holder = true; + bdev_set_flag(bdev, BD_WRITE_HOLDER); unblock_events = false; } } diff --git a/block/blk-core.c b/block/blk-core.c index a16b5abdbbf5..6a28b6b7062a 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -615,7 +615,7 @@ static void __submit_bio(struct bio *bio) if (unlikely(!blk_crypto_bio_prep(&bio))) return; - if (!bio->bi_bdev->bd_has_submit_bio) { + if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) { blk_mq_submit_bio(bio); } else if (likely(bio_queue_enter(bio) == 0)) { struct gendisk *disk = bio->bi_bdev->bd_disk; @@ -723,7 +723,7 @@ void submit_bio_noacct_nocheck(struct bio *bio) */ if (current->bio_list) bio_list_add(¤t->bio_list[0], bio); - else if (!bio->bi_bdev->bd_has_submit_bio) + else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) __submit_bio_noacct_mq(bio); else __submit_bio_noacct(bio); @@ -759,7 +759,7 @@ void submit_bio_noacct(struct bio *bio) if (!bio_flagged(bio, BIO_REMAPPED)) { if (unlikely(bio_check_eod(bio))) goto end_io; - if (bdev->bd_partno && unlikely(blk_partition_remap(bio))) + if (bdev_partno(bdev) && unlikely(blk_partition_remap(bio))) goto end_io; } @@ -989,7 +989,7 @@ void update_io_ticks(struct block_device *part, unsigned long now, bool end) if (likely(try_cmpxchg(&part->bd_stamp, &stamp, now))) __part_stat_add(part, io_ticks, end ? now - stamp : 1); } - if (part->bd_partno) { + if (bdev_partno(part)) { part = bdev_whole(part); goto again; } diff --git a/block/blk-mq.c b/block/blk-mq.c index 32afb87efbd0..1c4bd891fd6d 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -92,7 +92,7 @@ static bool blk_mq_check_inflight(struct request *rq, void *priv) struct mq_inflight *mi = priv; if (rq->part && blk_do_io_stat(rq) && - (!mi->part->bd_partno || rq->part == mi->part) && + (!bdev_partno(mi->part) || rq->part == mi->part) && blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT) mi->inflight[rq_data_dir(rq)]++; diff --git a/block/early-lookup.c b/block/early-lookup.c index 3effbd0d35e9..3fb57f7d2b12 100644 --- a/block/early-lookup.c +++ b/block/early-lookup.c @@ -78,7 +78,7 @@ static int __init devt_from_partuuid(const char *uuid_str, dev_t *devt) * to the partition number found by UUID. */ *devt = part_devt(dev_to_disk(dev), - dev_to_bdev(dev)->bd_partno + offset); + bdev_partno(dev_to_bdev(dev)) + offset); } else { *devt = dev->devt; } diff --git a/block/genhd.c b/block/genhd.c index bb29a68e1d67..19cd1a31fa80 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -413,7 +413,8 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk, elevator_init_mq(disk->queue); /* Mark bdev as having a submit_bio, if needed */ - disk->part0->bd_has_submit_bio = disk->fops->submit_bio != NULL; + if (disk->fops->submit_bio) + bdev_set_flag(disk->part0, BD_HAS_SUBMIT_BIO); /* * If the driver provides an explicit major number it also must provide diff --git a/block/ioctl.c b/block/ioctl.c index 0c76137adcaa..be173e4ff43d 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -402,7 +402,10 @@ static int blkdev_roset(struct block_device *bdev, unsigned cmd, if (ret) return ret; } - bdev->bd_read_only = n; + if (n) + bdev_set_flag(bdev, BD_READ_ONLY); + else + bdev_clear_flag(bdev, BD_READ_ONLY); return 0; } diff --git a/block/partitions/core.c b/block/partitions/core.c index b11e88c82c8c..edd5309dc4ba 100644 --- a/block/partitions/core.c +++ b/block/partitions/core.c @@ -173,7 +173,7 @@ static struct parsed_partitions *check_partition(struct gendisk *hd) static ssize_t part_partition_show(struct device *dev, struct device_attribute *attr, char *buf) { - return sprintf(buf, "%d\n", dev_to_bdev(dev)->bd_partno); + return sprintf(buf, "%d\n", bdev_partno(dev_to_bdev(dev))); } static ssize_t part_start_show(struct device *dev, @@ -250,7 +250,7 @@ static int part_uevent(const struct device *dev, struct kobj_uevent_env *env) { const struct block_device *part = dev_to_bdev(dev); - add_uevent_var(env, "PARTN=%u", part->bd_partno); + add_uevent_var(env, "PARTN=%u", bdev_partno(part)); if (part->bd_meta_info && part->bd_meta_info->volname[0]) add_uevent_var(env, "PARTNAME=%s", part->bd_meta_info->volname); return 0; @@ -267,7 +267,7 @@ void drop_partition(struct block_device *part) { lockdep_assert_held(&part->bd_disk->open_mutex); - xa_erase(&part->bd_disk->part_tbl, part->bd_partno); + xa_erase(&part->bd_disk->part_tbl, bdev_partno(part)); kobject_put(part->bd_holder_dir); device_del(&part->bd_device); @@ -338,8 +338,8 @@ static struct block_device *add_partition(struct gendisk *disk, int partno, pdev->parent = ddev; /* in consecutive minor range? */ - if (bdev->bd_partno < disk->minors) { - devt = MKDEV(disk->major, disk->first_minor + bdev->bd_partno); + if (bdev_partno(bdev) < disk->minors) { + devt = MKDEV(disk->major, disk->first_minor + bdev_partno(bdev)); } else { err = blk_alloc_ext_minor(); if (err < 0) @@ -404,7 +404,7 @@ static bool partition_overlaps(struct gendisk *disk, sector_t start, rcu_read_lock(); xa_for_each_start(&disk->part_tbl, idx, part, 1) { - if (part->bd_partno != skip_partno && + if (bdev_partno(part) != skip_partno && start < part->bd_start_sect + bdev_nr_sectors(part) && start + length > part->bd_start_sect) { overlap = true; diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index cb1526ec44b5..bbbcbb36fb6e 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -45,10 +45,7 @@ struct block_device { struct request_queue * bd_queue; struct disk_stats __percpu *bd_stats; unsigned long bd_stamp; - bool bd_read_only; /* read-only policy */ - u8 bd_partno; - bool bd_write_holder; - bool bd_has_submit_bio; + u32 __bd_flags; // partition number + flags dev_t bd_dev; struct inode *bd_inode; /* will die */ @@ -86,6 +83,12 @@ struct block_device { #define bdev_kobj(_bdev) \ (&((_bdev)->bd_device.kobj)) +enum { + BD_READ_ONLY, // read-only policy + BD_WRITE_HOLDER, + BD_HAS_SUBMIT_BIO +}; + /* * Block error status values. See block/blk-core:blk_errors for the details. * Alpha cannot write a byte atomically, so we need to use 32-bit value. diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index c3e8f7cf96be..d556cec9224b 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -720,15 +720,51 @@ void invalidate_disk(struct gendisk *disk); void set_disk_ro(struct gendisk *disk, bool read_only); void disk_uevent(struct gendisk *disk, enum kobject_action action); +static inline u8 bdev_partno(const struct block_device *bdev) +{ + return bdev->__bd_flags & 0xff; +} + +static inline bool bdev_test_flag(const struct block_device *bdev, int flag) +{ + return bdev->__bd_flags & (1 << (flag + 8)); +} + +static inline void bdev_set_flag(struct block_device *bdev, int flag) +{ + u32 v = bdev->__bd_flags; + + for (;;) { + u32 w = cmpxchg(&bdev->__bd_flags, v, v | (1 << (flag + 8))); + + if (v == w) + return; + v = w; + } +} + +static inline void bdev_clear_flag(struct block_device *bdev, int flag) +{ + u32 v = bdev->__bd_flags; + + for (;;) { + u32 w = cmpxchg(&bdev->__bd_flags, v, v & ~(1 << (flag + 8))); + + if (v == w) + return; + v = w; + } +} + static inline int get_disk_ro(struct gendisk *disk) { - return disk->part0->bd_read_only || + return bdev_test_flag(disk->part0, BD_READ_ONLY) || test_bit(GD_READ_ONLY, &disk->state); } static inline int bdev_read_only(struct block_device *bdev) { - return bdev->bd_read_only || get_disk_ro(bdev->bd_disk); + return bdev_test_flag(bdev, BD_READ_ONLY) || get_disk_ro(bdev->bd_disk); } bool set_capacity_and_notify(struct gendisk *disk, sector_t size); @@ -1095,7 +1131,7 @@ static inline int sb_issue_zeroout(struct super_block *sb, sector_t block, static inline bool bdev_is_partition(struct block_device *bdev) { - return bdev->bd_partno; + return bdev_partno(bdev) != 0; } enum blk_default_limits { diff --git a/include/linux/part_stat.h b/include/linux/part_stat.h index abeba356bc3f..ec7eb365b152 100644 --- a/include/linux/part_stat.h +++ b/include/linux/part_stat.h @@ -59,7 +59,7 @@ static inline void part_stat_set_all(struct block_device *part, int value) #define part_stat_add(part, field, addnd) do { \ __part_stat_add((part), field, addnd); \ - if ((part)->bd_partno) \ + if (bdev_partno(part)) \ __part_stat_add(bdev_whole(part), field, addnd); \ } while (0) diff --git a/lib/vsprintf.c b/lib/vsprintf.c index 552738f14275..e05583e54fa5 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -966,13 +966,13 @@ char *bdev_name(char *buf, char *end, struct block_device *bdev, hd = bdev->bd_disk; buf = string(buf, end, hd->disk_name, spec); - if (bdev->bd_partno) { + if (bdev_partno(bdev)) { if (isdigit(hd->disk_name[strlen(hd->disk_name)-1])) { if (buf < end) *buf = 'p'; buf++; } - buf = number(buf, end, bdev->bd_partno, spec); + buf = number(buf, end, bdev_partno(bdev), spec); } return buf; }
On Fri, Apr 12, 2024 at 05:41:16AM +0100, Al Viro wrote: > Christoph, do you have any problems with that approach? > > COMPLETELY UNTESTED patch along those lines follows; if it works, > it would need to be carved up. And I would probably switch the > places where we do if (bdev->bd_partno) to if (bdev_is_partition(bdev)), > for better readability. See git://git.kernel.org:/pub/scm/linux/kernel/git/viro/vfs.git bd_flags for a carve-up
On Thu, Apr 11, 2024 at 03:04:09PM +0100, Al Viro wrote: > On Thu, Apr 11, 2024 at 01:56:03PM +0200, Christian Brauner wrote: > > On Wed, Apr 10, 2024 at 11:34:43PM +0100, Al Viro wrote: > > > On Wed, Apr 10, 2024 at 12:59:11PM +0200, Jan Kara wrote: > > > > > > > I agree with Christian and Al - and I think I've expressed that already in > > > > the previous version of the series [1] but I guess I was not explicit > > > > enough :). I think the initial part of the series (upto patch 21, perhaps > > > > excluding patch 20) is a nice cleanup but the latter part playing with > > > > stashing struct file is not an improvement and seems pointless to me. So > > > > I'd separate the initial part cleaning up the obvious places and let > > > > Christian merge it and then we can figure out what (if anything) to do with > > > > remaining bd_inode uses in fs/buffer.c etc. E.g. what Al suggests with > > > > bd_mapping makes sense to me but I didn't check what's left after your > > > > initial patches... > > > > > > FWIW, experimental on top of -next: > > > > Ok, let's move forward with this. I've applied the first 19 patches. > > Patch 20 is the start of what we all disliked. 21 is clearly a bugfix > > for current code so that'll go separately from the rest. I've replaced > > open-code f_mapping access with file_mapping(). The symmetry between > > file_inode() and file_mapping() is quite nice. > > > > Al, your idea to switch erofs away from buf->inode can go on top of what > > Yu did imho. There's no real reason to throw it away imho. > > > > I've exported bdev_mapping() because it really makes the btrfs change a > > lot slimmer and we don't need to care about messing with a lot of that > > code. I didn't care about making it static inline because that might've > > meant we need to move other stuff into the header as well. Imho, it's > > not that important but if it's a big deal to any of you just do the > > changes on top of it, please. > > > > Pushed to > > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.super > > > > If I hear no objections that'll show up in -next tomorrow. Al, would be > > nice if you could do your changes on top of this, please. > > Objection: start with adding bdev->bd_mapping, next convert the really > obvious instances to it and most of this series becomes not needed at > all. > > Really. There is no need whatsoever to push struct file down all those > paths. Your series just replaces bd_inode in struct block_device with bd_mapping. In a lot of places we do have immediate access to the bdev file without changing any calling conventions whatsoever. IMO it's perfectly fine to just use file_mapping() there. Sure, let's use bdev_mapping() in instances like btrfs where we'd otherwise have to change function signatures I'm not opposed to that. But there's no good reason to just replace everything with bdev->bd_mapping access. And really, why keep that thing in struct block_device when we can avoid it. > > And yes, erofs and buffer.c stuff belongs on top of that, no arguments here.
On Fri, Apr 12, 2024 at 11:21:08AM +0200, Christian Brauner wrote: > Your series just replaces bd_inode in struct block_device with > bd_mapping. In a lot of places we do have immediate access to the bdev > file without changing any calling conventions whatsoever. IMO it's > perfectly fine to just use file_mapping() there. Sure, let's use > bdev_mapping() in instances like btrfs where we'd otherwise have to > change function signatures I'm not opposed to that. But there's no good > reason to just replace everything with bdev->bd_mapping access. And > really, why keep that thing in struct block_device when we can avoid it. Because having to have struct file around in the places where we want to get to page cache of block device fast is often inconvenient (see fs/buffer.c, if nothing else). It also simplifies the hell out of the patch series - it's one obviously safe automatic change in a single commit. And AFAICS the flags-related rationale can be dealt with in a much simpler way - see #bf_flags in my tree.
On Fri, Apr 12, 2024 at 12:29:19PM +0100, Al Viro wrote: > On Fri, Apr 12, 2024 at 11:21:08AM +0200, Christian Brauner wrote: > > > Your series just replaces bd_inode in struct block_device with > > bd_mapping. In a lot of places we do have immediate access to the bdev > > file without changing any calling conventions whatsoever. IMO it's > > perfectly fine to just use file_mapping() there. Sure, let's use > > bdev_mapping() in instances like btrfs where we'd otherwise have to > > change function signatures I'm not opposed to that. But there's no good > > reason to just replace everything with bdev->bd_mapping access. And > > really, why keep that thing in struct block_device when we can avoid it. > > Because having to have struct file around in the places where we want to > get to page cache of block device fast is often inconvenient (see fs/buffer.c, > if nothing else). Yes, agreed. But my point is why can't we expose bdev_mapping() for exactly that purpose without having to have that bd_mapping member in struct block_device? We don't want to trade bd_inode for bd_mapping in that struct imho. IOW, if we can avoid bloating struct block device with additional members then we should do that. Is there some performance concern that I'm missing and if so are there numbers to back this? > It also simplifies the hell out of the patch series - it's one obviously > safe automatic change in a single commit. It's trivial to fold the simple file_mapping() conversion into a single patch as well. It's a pure artifact of splitting the patches per subsystem/driver. That's just because people have wildly different opinions on how to do such conversion. But really, that can be trivially dealt with. > And AFAICS the flags-related rationale can be dealt with in a much simpler > way - see #bf_flags in my tree. That's certainly worth doing independent of this discussion.
On Sat, Apr 13, 2024 at 05:25:01PM +0200, Christian Brauner wrote: > > It also simplifies the hell out of the patch series - it's one obviously > > safe automatic change in a single commit. > > It's trivial to fold the simple file_mapping() conversion into a single > patch as well. ... after a bunch of patches that propagate struct file to places where it has no business being. Compared to a variant that doesn't need those patches at all. > It's a pure artifact of splitting the patches per > subsystem/driver. No, it is not. ->bd_mapping conversion can be done without any preliminaries. Note that it doesn't need messing with bdev_read_folio(), it doesn't need this journal->j_fs_dev_file thing, etc. One thing I believe is completely wrong in this series is bdev_inode() existence. It (and equivalent use of file_inode() on struct file is even worse) is papering over the real interface deficiencies. And extra file_inode() uses are just about impossible to catch ;-/ IMO we should *never* use file_inode() on opened block devices. At all. It's brittle, it's asking for trouble as soon as somebody passes a normally opened struct file to one of the functions using it and it papers over the missing primitives. As for the space concerns... With struct device embedded into those things, it's not even funny. Space within the first cacheline - sure, but we can have a pointer in there just fine.
On Mon, Apr 15, 2024 at 09:45:11PM +0100, Al Viro wrote: > On Sat, Apr 13, 2024 at 05:25:01PM +0200, Christian Brauner wrote: > > > > It also simplifies the hell out of the patch series - it's one obviously > > > safe automatic change in a single commit. > > > > It's trivial to fold the simple file_mapping() conversion into a single > > patch as well. > > ... after a bunch of patches that propagate struct file to places where > it has no business being. Compared to a variant that doesn't need those > patches at all. > > > It's a pure artifact of splitting the patches per > > subsystem/driver. > > No, it is not. ->bd_mapping conversion can be done without any > preliminaries. Note that it doesn't need messing with bdev_read_folio(), > it doesn't need this journal->j_fs_dev_file thing, etc. > > One thing I believe is completely wrong in this series is bdev_inode() > existence. It (and equivalent use of file_inode() on struct file is > even worse) is papering over the real interface deficiencies. And > extra file_inode() uses are just about impossible to catch ;-/ > > IMO we should *never* use file_inode() on opened block devices. > At all. It's brittle, it's asking for trouble as soon as somebody > passes a normally opened struct file to one of the functions using it > and it papers over the missing primitives. BTW, speaking of the things where opened struct file would be a good idea - set_blocksize() should take an opened struct file, and it should have non-NULL ->private_data. Changing block size under e.g. a mounted filesystem should never happen; doing that is asking for serious breakage. Looking through the current callers (mainline), most are OK (and easy to switch). However, drivers/block/pktcdvd.c:2285: set_blocksize(disk->part0, CD_FRAMESIZE); drivers/block/pktcdvd.c:2529: set_blocksize(file_bdev(bdev_file), CD_FRAMESIZE); Might be broken; pktcdvd.c being what it is... drivers/md/bcache/super.c:2558: if (set_blocksize(file_bdev(bdev_file), 4096)) Almost certainly broken; hit register_bcache() with pathname of a mounted block device, and if the block size on filesystem in question is not 4K, the things will get interesting. fs/btrfs/volumes.c:485: ret = set_blocksize(bdev, BTRFS_BDEV_BLOCKSIZE); Some of the callers do not bother with exclusive open; in particular, if btrfs_get_dev_args_from_path() ever gets a pathname of a mounted device with something other than btrfs on it, it won't be pretty. kernel/power/swap.c:371: res = set_blocksize(file_bdev(hib_resume_bdev_file), PAGE_SIZE); kernel/power/swap.c:1577: set_blocksize(file_bdev(hib_resume_bdev_file), PAGE_SIZE); Special cases (for obvious reasons); said that, why do we bother with set_blocksize() on those anyway?
On Tue 16-04-24 07:32:53, Al Viro wrote: > On Mon, Apr 15, 2024 at 09:45:11PM +0100, Al Viro wrote: > > On Sat, Apr 13, 2024 at 05:25:01PM +0200, Christian Brauner wrote: > > > > > > It also simplifies the hell out of the patch series - it's one obviously > > > > safe automatic change in a single commit. > > > > > > It's trivial to fold the simple file_mapping() conversion into a single > > > patch as well. > > > > ... after a bunch of patches that propagate struct file to places where > > it has no business being. Compared to a variant that doesn't need those > > patches at all. > > > > > It's a pure artifact of splitting the patches per > > > subsystem/driver. > > > > No, it is not. ->bd_mapping conversion can be done without any > > preliminaries. Note that it doesn't need messing with bdev_read_folio(), > > it doesn't need this journal->j_fs_dev_file thing, etc. > > > > One thing I believe is completely wrong in this series is bdev_inode() > > existence. It (and equivalent use of file_inode() on struct file is > > even worse) is papering over the real interface deficiencies. And > > extra file_inode() uses are just about impossible to catch ;-/ > > > > IMO we should *never* use file_inode() on opened block devices. > > At all. It's brittle, it's asking for trouble as soon as somebody > > passes a normally opened struct file to one of the functions using it > > and it papers over the missing primitives. > > BTW, speaking of the things where opened struct file would be a good > idea - set_blocksize() should take an opened struct file, and it should > have non-NULL ->private_data. > > Changing block size under e.g. a mounted filesystem should never happen; > doing that is asking for serious breakage. > > Looking through the current callers (mainline), most are OK (and easy > to switch). However, > > drivers/block/pktcdvd.c:2285: set_blocksize(disk->part0, CD_FRAMESIZE); > drivers/block/pktcdvd.c:2529: set_blocksize(file_bdev(bdev_file), CD_FRAMESIZE); > Might be broken; pktcdvd.c being what it is... > > drivers/md/bcache/super.c:2558: if (set_blocksize(file_bdev(bdev_file), 4096)) > Almost certainly broken; hit register_bcache() with pathname of a mounted > block device, and if the block size on filesystem in question is not 4K, the things > will get interesting. Agreed. Furthermore that set_blocksize() seems to be completely pointless these days AFAICT because we use read_cache_page_gfp() to read in the data from the device. Sure we may be creating more bhs per page than necessary but who cares? > fs/btrfs/volumes.c:485: ret = set_blocksize(bdev, BTRFS_BDEV_BLOCKSIZE); > Some of the callers do not bother with exclusive open; > in particular, if btrfs_get_dev_args_from_path() ever gets a pathname > of a mounted device with something other than btrfs on it, it won't > be pretty. Yeah and frankly reading through btrfs_read_dev_super() I'm not sure which code needs the block size set either. We use read_cache_page_gfp() for the IO there as well. Honza
On Wed, Apr 17, 2024 at 03:43:12PM +0200, Jan Kara wrote: > > fs/btrfs/volumes.c:485: ret = set_blocksize(bdev, BTRFS_BDEV_BLOCKSIZE); > > Some of the callers do not bother with exclusive open; > > in particular, if btrfs_get_dev_args_from_path() ever gets a pathname > > of a mounted device with something other than btrfs on it, it won't > > be pretty. > > Yeah and frankly reading through btrfs_read_dev_super() I'm not sure which > code needs the block size set either. We use read_cache_page_gfp() for the > IO there as well. FWIW, I don't understand the use of invalidate_bdev() in btrfs_get_bdev_and_sb(), especially when called from btrfs_get_dev_args_from_path() - what's the point of evicting page cache before reading the on-disk superblock, when all we are going to do with the data we get is scan through internal list of opened devices for uuid, etc. matches? Could btrfs folks comment on that one?
On Tue, Apr 16, 2024 at 07:32:53AM +0100, Al Viro wrote: > kernel/power/swap.c:371: res = set_blocksize(file_bdev(hib_resume_bdev_file), PAGE_SIZE); > kernel/power/swap.c:1577: set_blocksize(file_bdev(hib_resume_bdev_file), PAGE_SIZE); > Special cases (for obvious reasons); said that, why do we bother > with set_blocksize() on those anyway? AFAICS, we really don't need either - all IO is done via hib_submit_io(), which sets a single-page bio and feeds it to submit_bio{,_wait}() directly. We are *not* using the page cache of the block device in question, let alone any buffer_head instances. Could swsusp folks comment?
diff --git a/block/bdev.c b/block/bdev.c index 86db97b0709e..3d300823da6b 100644 --- a/block/bdev.c +++ b/block/bdev.c @@ -846,6 +846,101 @@ static void init_bdev_file(struct file *bdev_file, struct block_device *bdev, bdev_file->private_data = holder; } +/* + * If BLK_OPEN_WRITE_IOCTL is set then this is a historical quirk + * associated with the floppy driver where it has allowed ioctls if the + * file was opened for writing, but does not allow reads or writes. + * Make sure that this quirk is reflected in @f_flags. + * + * It can also happen if a block device is opened as O_RDWR | O_WRONLY. + */ +static unsigned blk_to_file_flags(blk_mode_t mode) +{ + unsigned int flags = 0; + + if ((mode & (BLK_OPEN_READ | BLK_OPEN_WRITE)) == + (BLK_OPEN_READ | BLK_OPEN_WRITE)) + flags |= O_RDWR; + else if (mode & BLK_OPEN_WRITE_IOCTL) + flags |= O_RDWR | O_WRONLY; + else if (mode & BLK_OPEN_WRITE) + flags |= O_WRONLY; + else if (mode & BLK_OPEN_READ) + flags |= O_RDONLY; /* homeopathic, because O_RDONLY is 0 */ + else + WARN_ON_ONCE(true); + + if (mode & BLK_OPEN_NDELAY) + flags |= O_NDELAY; + + return flags; +} + +static int __stash_bdev_file(struct block_device *bdev) +{ + struct inode *inode = bdev_inode(bdev); + unsigned int flags = blk_to_file_flags(BLK_OPEN_READ | BLK_OPEN_WRITE); + struct file *file; + static struct file_operations stash_fops; + + file = inode->i_private; + if (!file) { + /* + * This file is used for iomap/buffer_head for raw block_device + * read/write operations to access block_device. + */ + file = alloc_file_pseudo_noaccount(bdev_inode(bdev), + blockdev_mnt, "", flags | O_LARGEFILE, + &stash_fops); + + if (IS_ERR(file)) + return PTR_ERR(file); + + ihold(inode); + init_bdev_file(file, bdev, 0, NULL); + + inode->i_private = file; + WARN_ON_ONCE(bdev->stash_count != 0); + } + + bdev->stash_count++; + return 0; +} + +static void __unstash_bdev_file(struct block_device *bdev) +{ + + WARN_ON_ONCE(bdev->stash_count <= 0); + if (--bdev->stash_count == 0) { + struct inode *inode = bdev_inode(bdev); + struct file *file = inode->i_private; + + inode->i_private = NULL; + fput(file); + } +} + +static int stash_bdev_file(struct block_device *bdev) +{ + int ret = __stash_bdev_file(bdev); + + if (ret || !bdev_is_partition(bdev)) + return ret; + + ret = __stash_bdev_file(bdev_whole(bdev)); + if (ret) + __unstash_bdev_file(bdev); + + return ret; +} + +static void unstash_bdev_file(struct block_device *bdev) +{ + __unstash_bdev_file(bdev); + if (bdev_is_partition(bdev)) + __unstash_bdev_file(bdev_whole(bdev)); +} + /** * bdev_open - open a block device * @bdev: block device to open @@ -891,12 +986,17 @@ int bdev_open(struct block_device *bdev, blk_mode_t mode, void *holder, ret = -EBUSY; if (!bdev_may_open(bdev, mode)) goto put_module; + + ret = stash_bdev_file(bdev); + if (ret) + goto put_module; + if (bdev_is_partition(bdev)) ret = blkdev_get_part(bdev, mode); else ret = blkdev_get_whole(bdev, mode); if (ret) - goto put_module; + goto unstash_bdev_file; bdev_claim_write_access(bdev, mode); if (holder) { bd_finish_claiming(bdev, holder, hops); @@ -922,6 +1022,9 @@ int bdev_open(struct block_device *bdev, blk_mode_t mode, void *holder, init_bdev_file(bdev_file, bdev, mode, holder); return 0; + +unstash_bdev_file: + unstash_bdev_file(bdev); put_module: module_put(disk->fops->owner); abort_claiming: @@ -932,36 +1035,6 @@ int bdev_open(struct block_device *bdev, blk_mode_t mode, void *holder, return ret; } -/* - * If BLK_OPEN_WRITE_IOCTL is set then this is a historical quirk - * associated with the floppy driver where it has allowed ioctls if the - * file was opened for writing, but does not allow reads or writes. - * Make sure that this quirk is reflected in @f_flags. - * - * It can also happen if a block device is opened as O_RDWR | O_WRONLY. - */ -static unsigned blk_to_file_flags(blk_mode_t mode) -{ - unsigned int flags = 0; - - if ((mode & (BLK_OPEN_READ | BLK_OPEN_WRITE)) == - (BLK_OPEN_READ | BLK_OPEN_WRITE)) - flags |= O_RDWR; - else if (mode & BLK_OPEN_WRITE_IOCTL) - flags |= O_RDWR | O_WRONLY; - else if (mode & BLK_OPEN_WRITE) - flags |= O_WRONLY; - else if (mode & BLK_OPEN_READ) - flags |= O_RDONLY; /* homeopathic, because O_RDONLY is 0 */ - else - WARN_ON_ONCE(true); - - if (mode & BLK_OPEN_NDELAY) - flags |= O_NDELAY; - - return flags; -} - struct file *bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder, const struct blk_holder_ops *hops) { @@ -1073,6 +1146,8 @@ void bdev_release(struct file *bdev_file) blkdev_put_part(bdev); else blkdev_put_whole(bdev); + + unstash_bdev_file(bdev); mutex_unlock(&disk->open_mutex); module_put(disk->fops->owner); diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index cb1526ec44b5..22f736908cbe 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -70,6 +70,7 @@ struct block_device { #endif bool bd_ro_warned; int bd_writers; + int stash_count; /* * keep this out-of-line as it's both big and not needed in the fast * path