diff mbox series

[v2,01/34] bdev: open block device as files

Message ID 20240123-vfs-bdev-file-v2-1-adbd023e19cc@kernel.org (mailing list archive)
State New, archived
Headers show
Series Open block devices as files | expand

Commit Message

Christian Brauner Jan. 23, 2024, 1:26 p.m. UTC
Add two new helpers to allow opening block devices as files.
This is not the final infrastructure. This still opens the block device
before opening a struct a file. Until we have removed all references to
struct bdev_handle we can't switch the order:

* Introduce blk_to_file_flags() to translate from block specific to
  flags usable to pen a new file.
* Introduce bdev_file_open_by_{dev,path}().
* Introduce temporary sb_bdev_handle() helper to retrieve a struct
  bdev_handle from a block device file and update places that directly
  reference struct bdev_handle to rely on it.
* Don't count block device openes against the number of open files. A
  bdev_file_open_by_{dev,path}() file is never installed into any
  file descriptor table.

One idea that came to mind was to use kernel_tmpfile_open() which
would require us to pass a path and it would then call do_dentry_open()
going through the regular fops->open::blkdev_open() path. But that has
drawbacks that I consider fatal. We're back to the problem of
routing block specific flags such as BLK_OPEN_RESTRICT_WRITES through
the open path and would have to waste FMODE_* flags every time we add a
new one. Second, it would prohibit us from later using custom fops to
indicate that this is a restricted write operation as well. Overall, we
can avoid using an fmode flag and we have way more leeway in how we open
block devices from bdev_open_by_{dev,path}().

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 block/bdev.c           | 104 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/cramfs/inode.c      |   2 +-
 fs/f2fs/super.c        |   2 +-
 fs/file_table.c        |  36 +++++++++++++----
 fs/jfs/jfs_logmgr.c    |   2 +-
 fs/romfs/super.c       |   2 +-
 fs/super.c             |  18 ++++-----
 fs/xfs/xfs_super.c     |   2 +-
 include/linux/blkdev.h |   7 ++++
 include/linux/file.h   |   2 +
 include/linux/fs.h     |  10 ++++-
 11 files changed, 160 insertions(+), 27 deletions(-)

Comments

Christoph Hellwig Jan. 29, 2024, 4:02 p.m. UTC | #1
> +static unsigned blk_to_file_flags(blk_mode_t mode)
> +{
> +	unsigned int flags = 0;
> +

...

> +	/*
> +	 * O_EXCL is one of those flags that the VFS clears once it's done with
> +	 * the operation. So don't raise it here either.
> +	 */
> +	if (mode & BLK_OPEN_NDELAY)
> +		flags |= O_NDELAY;

O_EXCL isn't dealt with in this helper at all.

> +	/*
> +	 * If BLK_OPEN_WRITE_IOCTL is set then this is a historical quirk
> +	 * associated with the floppy driver where it has allowed ioctls if the
> +	 * file was opened for writing, but does not allow reads or writes.
> +	 * Make sure that this quirk is reflected in @f_flags.
> +	 */
> +	if (mode & BLK_OPEN_WRITE_IOCTL)
> +		flags |= O_RDWR | O_WRONLY;

.. and BLK_OPEN_WRITE_IOCTL will never be passed to it.  It only comes
from open block devices nodes.

That being said, passing BLK_OPEN_* to bdev_file_open_by_* actually
feels wrong.  They deal with files and should just take normal
O_* flags instead of translating from BLK_OPEN_* to O_* back to
BLK_OPEN_* for the driver (where they make sense as the driver
flags are pretty different from what is passed to open).

Now of course changing that would make a mess of the whole series,
so maybe that can go into a new patch at the end?

> + * @noaccount: whether this is an internal open that shouldn't be counted
>   */
>  static struct file *alloc_file(const struct path *path, int flags,
> -		const struct file_operations *fop)
> +		const struct file_operations *fop, bool noaccount)

Just a suggestion as you are the maintainer here, but I always find
it hard to follow when infrastructure in subsystem A is changed in
a patch primarily changing subsystem B.  Can the file_table.c
changes go into a separate patch or patches with commit logs
documenting their semantics?

And while we're at the semantics I find this area already a bit of a
a mess and this doesn't make it any better..

How about the following:

 - alloc_file loses the actual file allocation and gets a new name
   (unfortunatel init_file is already taken), callers call
   alloc_empty_file_noaccount or alloc_empty_file plus the
   new helper.
 - similarly __alloc_file_pseudo is split into a helper creating
   a path for mnt and inode, and callers call that plus the
   file allocation

?

> +extern struct file *alloc_file_pseudo_noaccount(struct inode *, struct vfsmount *,

no need for the extern here.

> +	struct block_device	*s_bdev;	/* can go away once we use an accessor for @s_bdev_file */

can you put the comment into a separate line to make it readable.

But I'm not even sure it should go away.  s_bdev is used all over the
data and metadata I/O path, so caching it and avoiding multiple levels
of pointer chasing would seem useful.
Christian Brauner Feb. 1, 2024, 5:08 p.m. UTC | #2
On Mon, Jan 29, 2024 at 05:02:41PM +0100, Christoph Hellwig wrote:
> > +static unsigned blk_to_file_flags(blk_mode_t mode)
> > +{
> > +	unsigned int flags = 0;
> > +
> 
> ...
> 
> > +	/*
> > +	 * O_EXCL is one of those flags that the VFS clears once it's done with
> > +	 * the operation. So don't raise it here either.
> > +	 */
> > +	if (mode & BLK_OPEN_NDELAY)
> > +		flags |= O_NDELAY;
> 
> O_EXCL isn't dealt with in this helper at all.

Yeah, on purpose was my point bc we can just rely on @holder and passing
_EXCL without holder is invalid. But I could add it.

> 
> > +	/*
> > +	 * If BLK_OPEN_WRITE_IOCTL is set then this is a historical quirk
> > +	 * associated with the floppy driver where it has allowed ioctls if the
> > +	 * file was opened for writing, but does not allow reads or writes.
> > +	 * Make sure that this quirk is reflected in @f_flags.
> > +	 */
> > +	if (mode & BLK_OPEN_WRITE_IOCTL)
> > +		flags |= O_RDWR | O_WRONLY;
> 
> .. and BLK_OPEN_WRITE_IOCTL will never be passed to it.  It only comes
> from open block devices nodes.
> 
> That being said, passing BLK_OPEN_* to bdev_file_open_by_* actually
> feels wrong.  They deal with files and should just take normal
> O_* flags instead of translating from BLK_OPEN_* to O_* back to
> BLK_OPEN_* for the driver (where they make sense as the driver
> flags are pretty different from what is passed to open).
> 
> Now of course changing that would make a mess of the whole series,
> so maybe that can go into a new patch at the end?

Yes, I had considered that and it would work but there's the issue that
we need to figure out how to handle BLK_OPEN_RESTRICT_WRITES. It has no
corresponding O_* flag that would let us indicate this. So I had
considered:

1/ Expose bdev_file_open_excl() so callers don't need to pass any
   specific flags. Nearly all filesystems would effectively use this
   helper as sb_open_mode() adds it implicitly. That would have the
   side-effect of introducing another open helper ofc; possibly two if
   we take _by_dev() and _by_path() into account.

2/ Abuse an O_* flag to mean BLK_OPEN_RESTRICT_WRITES. For example,
   O_TRUNC or O_NOCTTY which is pretty yucky.

3/ Introduce an internal O_* flag which is also ugly. Vomitorious and my
   co-maintainers would likely chop off my hands so I can't go near a
   computer again.

3/ Make O_EXCL when passed together with bdev_file_open_by_*() always
   imply BLK_OPEN_RESTRICT_WRITES.

The 3/ option would probably be the cleanest one and I think that all
filesystems now pass at least a holder and holder ops so this _should_
work.

Thoughts?

> 
> > + * @noaccount: whether this is an internal open that shouldn't be counted
> >   */
> >  static struct file *alloc_file(const struct path *path, int flags,
> > -		const struct file_operations *fop)
> > +		const struct file_operations *fop, bool noaccount)
> 
> Just a suggestion as you are the maintainer here, but I always find
> it hard to follow when infrastructure in subsystem A is changed in
> a patch primarily changing subsystem B.  Can the file_table.c
> changes go into a separate patch or patches with commit logs
> documenting their semantics?
> 
> And while we're at the semantics I find this area already a bit of a
> a mess and this doesn't make it any better..
> 
> How about the following:
> 
>  - alloc_file loses the actual file allocation and gets a new name
>    (unfortunatel init_file is already taken), callers call
>    alloc_empty_file_noaccount or alloc_empty_file plus the
>    new helper.
>  - similarly __alloc_file_pseudo is split into a helper creating
>    a path for mnt and inode, and callers call that plus the
>    file allocation
> 
> ?

Ok, let me see how far I get.

> 
> > +extern struct file *alloc_file_pseudo_noaccount(struct inode *, struct vfsmount *,
> 
> no need for the extern here.
> 
> > +	struct block_device	*s_bdev;	/* can go away once we use an accessor for @s_bdev_file */
> 
> can you put the comment into a separate line to make it readable.
> 
> But I'm not even sure it should go away.  s_bdev is used all over the
> data and metadata I/O path, so caching it and avoiding multiple levels
> of pointer chasing would seem useful.

Fair.
Christoph Hellwig Feb. 2, 2024, 6:43 a.m. UTC | #3
On Thu, Feb 01, 2024 at 06:08:29PM +0100, Christian Brauner wrote:
> > > +	/*
> > > +	 * O_EXCL is one of those flags that the VFS clears once it's done with
> > > +	 * the operation. So don't raise it here either.
> > > +	 */
> > > +	if (mode & BLK_OPEN_NDELAY)
> > > +		flags |= O_NDELAY;
> > 
> > O_EXCL isn't dealt with in this helper at all.
> 
> Yeah, on purpose was my point bc we can just rely on @holder and passing
> _EXCL without holder is invalid. But I could add it.

Ok.  I found it weird to have the comment next to BLK_OPEN_NDELAY
as it looked like it sneaked through.  Especially as BLK_OPEN_EXCL
has literally nothing to do with O_EXCL at all as the latter is a
namespace operation flag.  So even if the comment was intentional
I think we're probably better off without it.

> Yes, I had considered that and it would work but there's the issue that
> we need to figure out how to handle BLK_OPEN_RESTRICT_WRITES. It has no
> corresponding O_* flag that would let us indicate this.

Oh, indeed.

>
> So I had
> considered:
> 
> 1/ Expose bdev_file_open_excl() so callers don't need to pass any
>    specific flags. Nearly all filesystems would effectively use this
>    helper as sb_open_mode() adds it implicitly. That would have the
>    side-effect of introducing another open helper ofc; possibly two if
>    we take _by_dev() and _by_path() into account.
> 
> 2/ Abuse an O_* flag to mean BLK_OPEN_RESTRICT_WRITES. For example,
>    O_TRUNC or O_NOCTTY which is pretty yucky.
> 
> 3/ Introduce an internal O_* flag which is also ugly. Vomitorious and my
>    co-maintainers would likely chop off my hands so I can't go near a
>    computer again.
> 
> 3/ Make O_EXCL when passed together with bdev_file_open_by_*() always
>    imply BLK_OPEN_RESTRICT_WRITES.
> 
> The 3/ option would probably be the cleanest one and I think that all
> filesystems now pass at least a holder and holder ops so this _should_
> work.

2 and 3 sound pretty horrible.  3 would work and look clean for the
block side, but O_ flags are mess so I wouldn't go there.

Maybe  variant of 1 that allows for a non-exclusive open and clearly
marks that?

Or just leave it as-is for now and look into that later.
Christian Brauner Feb. 2, 2024, 11:46 a.m. UTC | #4
On Fri, Feb 02, 2024 at 07:43:24AM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2024 at 06:08:29PM +0100, Christian Brauner wrote:
> > > > +	/*
> > > > +	 * O_EXCL is one of those flags that the VFS clears once it's done with
> > > > +	 * the operation. So don't raise it here either.
> > > > +	 */
> > > > +	if (mode & BLK_OPEN_NDELAY)
> > > > +		flags |= O_NDELAY;
> > > 
> > > O_EXCL isn't dealt with in this helper at all.
> > 
> > Yeah, on purpose was my point bc we can just rely on @holder and passing
> > _EXCL without holder is invalid. But I could add it.
> 
> Ok.  I found it weird to have the comment next to BLK_OPEN_NDELAY
> as it looked like it sneaked through.  Especially as BLK_OPEN_EXCL
> has literally nothing to do with O_EXCL at all as the latter is a
> namespace operation flag.  So even if the comment was intentional
> I think we're probably better off without it.
> 
> > Yes, I had considered that and it would work but there's the issue that
> > we need to figure out how to handle BLK_OPEN_RESTRICT_WRITES. It has no
> > corresponding O_* flag that would let us indicate this.
> 
> Oh, indeed.
> 
> >
> > So I had
> > considered:
> > 
> > 1/ Expose bdev_file_open_excl() so callers don't need to pass any
> >    specific flags. Nearly all filesystems would effectively use this
> >    helper as sb_open_mode() adds it implicitly. That would have the
> >    side-effect of introducing another open helper ofc; possibly two if
> >    we take _by_dev() and _by_path() into account.
> > 
> > 2/ Abuse an O_* flag to mean BLK_OPEN_RESTRICT_WRITES. For example,
> >    O_TRUNC or O_NOCTTY which is pretty yucky.
> > 
> > 3/ Introduce an internal O_* flag which is also ugly. Vomitorious and my
> >    co-maintainers would likely chop off my hands so I can't go near a
> >    computer again.
> > 
> > 3/ Make O_EXCL when passed together with bdev_file_open_by_*() always
> >    imply BLK_OPEN_RESTRICT_WRITES.
> > 
> > The 3/ option would probably be the cleanest one and I think that all
> > filesystems now pass at least a holder and holder ops so this _should_
> > work.
> 
> 2 and 3 sound pretty horrible.  3 would work and look clean for the
> block side, but O_ flags are mess so I wouldn't go there.

My numbering is obviously wrong btw. That last point should obviously be 4/

> 
> Maybe  variant of 1 that allows for a non-exclusive open and clearly
> marks that?
> 
> Or just leave it as-is for now and look into that later.

Ok.
Christian Brauner Feb. 9, 2024, 11:39 a.m. UTC | #5
> > How about the following:
> > 
> >  - alloc_file loses the actual file allocation and gets a new name
> >    (unfortunatel init_file is already taken), callers call
> >    alloc_empty_file_noaccount or alloc_empty_file plus the
> >    new helper.
> >  - similarly __alloc_file_pseudo is split into a helper creating
> >    a path for mnt and inode, and callers call that plus the
> >    file allocation
> > 
> > ?
> 
> Ok, let me see how far I get.

Ok, it's in vfs.super.
Christoph Hellwig March 13, 2024, 2:32 a.m. UTC | #6
Now that this is in mainline it seems to cause blktests to crash
nbd/003 with a rather non-obvious oops for me:

nbd/003 (mount/unmount concurrently with NBD_CLEAR_SOCK)    
runtime  8.139s  ...
[  802.941685] run blktests nbd/003 at 2024-03-12 14:51:20
[  803.171958] nbd0: detected capacity change from 0 to 20971520
[  803.183725] block nbd0: shutting down sockets
[  803.184645] I/O error, dev nbd0, sector 2 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
[  803.185156] EXT4-fs (nbd0): unable to read superblock
[  803.186214] I/O error, dev nbd0, sector 20968432 op 0x0:(READ) flags
 0x80700 phys_seg 1 prio clas0
[  803.186770] I/O error, dev nbd0, sector 20968432 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  803.187026] Buffer I/O error on dev nbd0, logical block 10484216, async page read
[  803.187335] I/O error, dev nbd0, sector 20968434 op 0x0:(READ) flags 0x0 phys_seg 3 prio class 0
[  803.187593] Buffer I/O error on dev nbd0, logical block 10484217, async page read
[  803.187809] Buffer I/O error on dev nbd0, logical block 10484218, async page read
[  803.188027] Buffer I/O error on dev nbd0, logical block 10484219, async page read
[  803.194147] I/O error, dev nbd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  803.194400] Buffer I/O error on dev nbd0, logical block 0, async page read
[  803.194634] I/O error, dev nbd0, sector 2 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  803.194880] Buffer I/O error on dev nbd0, logical block 1, async page read
[  803.195296] I/O error, dev nbd0, sector 4 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  803.195548] Buffer I/O error on dev nbd0, logical block 2, async page read
[  803.195781] I/O error, dev nbd0, sector 6 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  803.196015] Buffer I/O error on dev nbd0, logical block 3, async page read
[  803.196246] Buffer I/O error on dev nbd0, logical block 0, async page read
[  803.196743] ldm_validate_partition_table(): Disk read failed.
[  803.197375] Dev nbd0: unable to read RDB block 0
[  803.198007]  nbd0: unable to read partition table
[  803.198467] EXT4-fs (nbd0): unable to read superblock
[  803.199444] ldm_validate_partition_table(): Disk read failed.
[  803.200487] Dev nbd0: unable to read RDB block 0
[  803.201046]  nbd0: unable to read partition table
[  803.207369] (udev-worker): attempt to access beyond end of device
[  803.207369] nbd0: rw=0, sector=2, nr_sectors = 2 limit=0
[  803.208100] (udev-worker): attempt to access beyond end of device
[  803.208100] nbd0: rw=0, sector=4, nr_sectors = 2 limit=0
[  803.208807] (udev-worker): attempt to access beyond end of device
[  803.208807] nbd0: rw=0, sector=6, nr_sectors = 2 limit=0
[  803.209197] ldm_validate_partition_table(): Disk read failed.
[  803.209365] Dev nbd0: unable to read RDB block 0
[  803.209506]  nbd0: unable to read partition table
[  803.209679] nbd0: partition table beyond EOD, truncated
[  803.210132] mount_clear_soc: attempt to access beyond end of device
[  803.210132] nbd0: rw=4096, sector=2, nr_sectors = 2 limit=0
[  803.210896] EXT4-fs (nbd0): unable to read superblock
[  803.212990] mount_clear_soc: attempt to access beyond end of device
[  803.212990] nbd0: rw=4096, sector=2, nr_sectors = 2 limit=0
[  803.213374] EXT4-fs (nbd0): unable to read superblock
[  803.221502] mount_clear_soc: attempt to access beyond end of device
[  803.221502] nbd0: rw=4096, sector=2, nr_sectors = 2 limit=0
[  803.221868] EXT4-fs (nbd0): unable to read superblock
[  803.223990] mount_clear_soc: attempt to access beyond end of device
[  803.223990] nbd0: rw=4096, sector=2, nr_sectors = 2 limit=0
[  803.224334] EXT4-fs (nbd0): unable to read superblock
[  803.229317] mount_clear_soc: attempt to access beyond end of device
[  803.229317] nbd0: rw=4096, sector=2, nr_sectors = 2 limit=0
[  803.229665] EXT4-fs (nbd0): unable to read superblock
[  803.231698] mount_clear_soc: attempt to access beyond end of device
[  803.231698] nbd0: rw=4096, sector=2, nr_sectors = 2 limit=0
[  803.232068] EXT4-fs (nbd0): unable to read superblock
[  803.233702] mount_clear_soc: attempt to access beyond end of device
[  803.233702] nbd0: rw=4096, sector=2, nr_sectors = 2 limit=0
[  803.234049] EXT4-fs (nbd0): unable to read superblock
[  803.235263] general protection fault, maybe for address 0x0: 0000 [#1] PREEMPT SMP NOPTI
[  803.235505] CPU: 1 PID: 53579 Comm: mount_clear_soc Not tainted 6.8.0+ #2288
[  803.235712] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/014
[  803.235973] RIP: 0010:srso_alias_safe_ret+0x5/0x7
[  803.236118] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[  803.236637] RSP: 0018:ffffc900000d4ef8 EFLAGS: 00010202
[  803.236789] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
[  803.236991] RDX: 6b6b6b6b6b6b6b6b RSI: ffffffff833adc6a RDI: ffff888112581870
[  803.237200] RBP: ffffffff8124ae64 R08: 00000000ffffffff R09: 00000000ffffffff
[  803.237402] R10: 0000000000000002 R11: ffff8881130cb458 R12: ffff8881130caa40
[  803.237605] R13: 0000000000000000 R14: 0000000000031688 R15: ffffffff8124adde
[  803.237811] FS:  0000000000000000(0000) GS:ffff8881f9d00000(0000) knlGS:0000000000000000
[  803.238038] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  803.238206] CR2: 00007f2bb5507740 CR3: 0000000110dd2000 CR4: 0000000000750ef0
[  803.238412] PKRU: 55555554
[  803.238496] Call Trace:
[  803.238571]  <IRQ>
[  803.238634]  ? die_addr+0x31/0x80
[  803.238740]  ? exc_general_protection+0x24a/0x480
[  803.238886]  ? asm_exc_general_protection+0x26/0x30
[  803.239028]  ? rcu_core+0x34e/0xae0
[  803.239141]  ? rcu_core+0x3d4/0xae0
[  803.239255]  ? srso_alias_safe_ret+0x5/0x7
[  803.239379]  ? srso_alias_return_thunk+0x5/0xfbef5
[  803.239523]  ? rcu_core+0x3d9/0xae0
[  803.239636]  ? __do_softirq+0xec/0x481
[  803.239757]  ? __irq_exit_rcu+0x89/0xe0
[  803.239874]  ? irq_exit_rcu+0x9/0x30
[  803.239982]  ? sysvec_apic_timer_interrupt+0xa1/0xd0
[  803.240130]  </IRQ>
[  803.240197]  <TASK>
[  803.240265]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  803.240423]  ? __memcg_slab_free_hook+0x11e/0x230
[  803.240566]  ? __memcg_slab_free_hook+0x68/0x230
[  803.240705]  ? unlink_anon_vmas+0x78/0x200
[  803.240833]  ? kmem_cache_free+0x2ca/0x310
[  803.240960]  ? unlink_anon_vmas+0x78/0x200
[  803.241086]  ? free_pgtables+0x141/0x260
[  803.241218]  ? exit_mmap+0x194/0x440
[  803.241337]  ? __mmput+0x3a/0x110
[  803.241445]  ? do_exit+0x2bf/0xb10
[  803.241553]  ? do_group_exit+0x31/0x90
[  803.241669]  ? __x64_sys_exit_group+0x13/0x20
[  803.241801]  ? do_syscall_64+0x75/0x150
[  803.241921]  ? entry_SYSCALL_64_after_hwframe+0x6c/0x74
Christian Brauner March 14, 2024, 11:10 a.m. UTC | #7
On Tue, Mar 12, 2024 at 07:32:35PM -0700, Christoph Hellwig wrote:
> Now that this is in mainline it seems to cause blktests to crash
> nbd/003 with a rather non-obvious oops for me:

Ok, will be looking into that next.
Christian Brauner March 14, 2024, 2:47 p.m. UTC | #8
On Thu, Mar 14, 2024 at 12:10:59PM +0100, Christian Brauner wrote:
> On Tue, Mar 12, 2024 at 07:32:35PM -0700, Christoph Hellwig wrote:
> > Now that this is in mainline it seems to cause blktests to crash
> > nbd/003 with a rather non-obvious oops for me:
> 
> Ok, will be looking into that next.

Ok, I know what's going on. Basically, fput() on the block device runs
asynchronously which means that bdev->bd_holder can still be set to @sb
after it has already been freed. Let me illustrate what I mean:

P1                                                 P2
mount(sb)                                          fd = open("/dev/nbd", ...)
-> file = bdev_file_open_by_dev(..., sb, ...)
   bdev->bd_holder = sb;

// Later on:

umount(sb)
->kill_block_super(sb)
|----> [fput() -> deferred via task work]
-> put_super(sb) -> frees the sb via rcu
|
|                                                 nbd_ioctl(NBD_CLEAR_SOCK)
|                                                 -> disk_force_media_change()
|                                                    -> bdev_mark_dead()
|                                                       -> fs_mark_dead()
|                                                          // Finds bdev->bd_holder == sb
|-> file->release::bdev_release()                          // Tries to get reference to it but it's freed; frees it again
   bdev->bd_holder = NULL;

Two solutions that come to my mind:

[1] Indicate to fput() that this is an internal block devices open and
    thus just close it synchronously. This is fine. First, because the block
    device superblock is never unmounted or anything so there's no risk
    from hanging there for any reason. Second, bdev_release() also ran
    synchronously so any issue that we'd see from calling
    file->f_op->release::bdev_release() we would have seen from
    bdev_release() itself. See [1.1] for a patch.

(2) Take a temporary reference to the holder when opening the block
    device. This is also fine afaict because we differentiate between
    passive and active references on superblocks and what we already do
    in fs_bdev_mark_dead() and friends. This mean we don't have to mess
    around with fput(). See [1.2] for a patch.

(3) Revert and cry. No patch.

Personally, I think (2) is more desirable because we don't lose the
async property of fput() and we don't need to have a new FMODE_* flag.
I'd like to do some more testing with this. Thoughts?

[1.1]:
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 block/bdev.c       | 1 +
 fs/file_table.c    | 5 +++++
 include/linux/fs.h | 3 +++
 3 files changed, 9 insertions(+)

diff --git a/block/bdev.c b/block/bdev.c
index e7adaaf1c219..d0c208a04b04 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -969,6 +969,7 @@ struct file *bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
 		return bdev_file;
 	}
 	ihold(bdev->bd_inode);
+	bdev_file->f_mode |= FMODE_BLOCKDEV;
 
 	ret = bdev_open(bdev, mode, holder, hops, bdev_file);
 	if (ret) {
diff --git a/fs/file_table.c b/fs/file_table.c
index 4f03beed4737..48d35dd67020 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -473,6 +473,11 @@ void fput(struct file *file)
 	if (atomic_long_dec_and_test(&file->f_count)) {
 		struct task_struct *task = current;
 
+		if (unlikely((file->f_mode & FMODE_BLOCKDEV))) {
+			__fput(file);
+			return;
+		}
+
 		if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) {
 			file_free(file);
 			return;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d5d5a4ee24f0..ceac9c0316a6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -121,6 +121,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define FMODE_PWRITE		((__force fmode_t)0x10)
 /* File is opened for execution with sys_execve / sys_uselib */
 #define FMODE_EXEC		((__force fmode_t)0x20)
+
+/* File is opened as block device. */
+#define FMODE_BLOCKDEV		((__force fmode_t)0x100)
 /* 32bit hashes as llseek() offset (for directories) */
 #define FMODE_32BITHASH         ((__force fmode_t)0x200)
 /* 64bit hashes as llseek() offset (for directories) */
Christian Brauner March 14, 2024, 4:45 p.m. UTC | #9
On Thu, Mar 14, 2024 at 03:47:52PM +0100, Christian Brauner wrote:
> On Thu, Mar 14, 2024 at 12:10:59PM +0100, Christian Brauner wrote:
> > On Tue, Mar 12, 2024 at 07:32:35PM -0700, Christoph Hellwig wrote:
> > > Now that this is in mainline it seems to cause blktests to crash
> > > nbd/003 with a rather non-obvious oops for me:
> > 
> > Ok, will be looking into that next.
> 
> Ok, I know what's going on. Basically, fput() on the block device runs
> asynchronously which means that bdev->bd_holder can still be set to @sb
> after it has already been freed. Let me illustrate what I mean:
> 
> P1                                                 P2
> mount(sb)                                          fd = open("/dev/nbd", ...)
> -> file = bdev_file_open_by_dev(..., sb, ...)
>    bdev->bd_holder = sb;
> 
> // Later on:
> 
> umount(sb)
> ->kill_block_super(sb)
> |----> [fput() -> deferred via task work]
> -> put_super(sb) -> frees the sb via rcu
> |
> |                                                 nbd_ioctl(NBD_CLEAR_SOCK)
> |                                                 -> disk_force_media_change()
> |                                                    -> bdev_mark_dead()
> |                                                       -> fs_mark_dead()
> |                                                          // Finds bdev->bd_holder == sb
> |-> file->release::bdev_release()                          // Tries to get reference to it but it's freed; frees it again
>    bdev->bd_holder = NULL;
> 
> Two solutions that come to my mind:
> 
> [1] Indicate to fput() that this is an internal block devices open and
>     thus just close it synchronously. This is fine. First, because the block
>     device superblock is never unmounted or anything so there's no risk
>     from hanging there for any reason. Second, bdev_release() also ran
>     synchronously so any issue that we'd see from calling
>     file->f_op->release::bdev_release() we would have seen from
>     bdev_release() itself. See [1.1] for a patch.
> 
> (2) Take a temporary reference to the holder when opening the block
>     device. This is also fine afaict because we differentiate between
>     passive and active references on superblocks and what we already do
>     in fs_bdev_mark_dead() and friends. This mean we don't have to mess
>     around with fput(). See [1.2] for a patch.
> 
> (3) Revert and cry. No patch.
> 
> Personally, I think (2) is more desirable because we don't lose the
> async property of fput() and we don't need to have a new FMODE_* flag.
> I'd like to do some more testing with this. Thoughts?
> 
> [1.1]:
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  block/bdev.c       | 1 +
>  fs/file_table.c    | 5 +++++
>  include/linux/fs.h | 3 +++
>  3 files changed, 9 insertions(+)
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index e7adaaf1c219..d0c208a04b04 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -969,6 +969,7 @@ struct file *bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
>  		return bdev_file;
>  	}
>  	ihold(bdev->bd_inode);
> +	bdev_file->f_mode |= FMODE_BLOCKDEV;
>  
>  	ret = bdev_open(bdev, mode, holder, hops, bdev_file);
>  	if (ret) {
> diff --git a/fs/file_table.c b/fs/file_table.c
> index 4f03beed4737..48d35dd67020 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -473,6 +473,11 @@ void fput(struct file *file)
>  	if (atomic_long_dec_and_test(&file->f_count)) {
>  		struct task_struct *task = current;
>  
> +		if (unlikely((file->f_mode & FMODE_BLOCKDEV))) {
> +			__fput(file);
> +			return;
> +		}
> +
>  		if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) {
>  			file_free(file);
>  			return;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index d5d5a4ee24f0..ceac9c0316a6 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -121,6 +121,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  #define FMODE_PWRITE		((__force fmode_t)0x10)
>  /* File is opened for execution with sys_execve / sys_uselib */
>  #define FMODE_EXEC		((__force fmode_t)0x20)
> +
> +/* File is opened as block device. */
> +#define FMODE_BLOCKDEV		((__force fmode_t)0x100)
>  /* 32bit hashes as llseek() offset (for directories) */
>  #define FMODE_32BITHASH         ((__force fmode_t)0x200)
>  /* 64bit hashes as llseek() offset (for directories) */
> -- 
> 2.43.0
> 
> [1.2]:
> Sketched-by: Christian Brauner <brauner@kernel.org>
> ---
>  block/bdev.c           |  4 ++++
>  fs/super.c             | 17 +++++++++++++++++
>  include/linux/blkdev.h |  3 +++
>  3 files changed, 24 insertions(+)
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index e7adaaf1c219..a0d5960dc2b9 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -627,6 +627,8 @@ static void bd_end_claim(struct block_device *bdev, void *holder)
>  		whole->bd_holder = NULL;
>  	mutex_unlock(&bdev_lock);
>  
> +	if (bdev->bd_holder_ops && bdev->bd_holder_ops->put_holder)
> +		bdev->bd_holder_ops->put_holder(holder);

That should move into bdev_release() obviously so it mirrors
bdev_open(). Plus, currently it's a nop because we just NULLed
bdev->bd_holder_ops above. But you get the idea, I hope.

>  	/*
>  	 * If this was the last claim, remove holder link and unblock evpoll if
>  	 * it was a write holder.
> @@ -902,6 +904,8 @@ int bdev_open(struct block_device *bdev, blk_mode_t mode, void *holder,
>  		bdev_file->f_mode |= FMODE_NOWAIT;
>  	bdev_file->f_mapping = bdev->bd_inode->i_mapping;
>  	bdev_file->f_wb_err = filemap_sample_wb_err(bdev_file->f_mapping);
> +	if (hops && hops->get_holder)
> +		hops->get_holder(holder);
>  	bdev_file->private_data = holder;
>  
>  	return 0;
> diff --git a/fs/super.c b/fs/super.c
> index ee05ab6b37e7..64dbbdbed93a 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1515,11 +1515,28 @@ static int fs_bdev_thaw(struct block_device *bdev)
>  	return error;
>  }
>  
> +static void fs_bdev_super_get(void *data)
> +{
> +	struct super_block *sb = data;
> +
> +	spin_lock(&sb_lock);
> +	sb->s_count++;
> +	spin_unlock(&sb_lock);
> +}
> +
> +static void fs_bdev_super_put(void *data)
> +{
> +	struct super_block *sb = data;
> +	put_super(sb);
> +}
> +
>  const struct blk_holder_ops fs_holder_ops = {
>  	.mark_dead		= fs_bdev_mark_dead,
>  	.sync			= fs_bdev_sync,
>  	.freeze			= fs_bdev_freeze,
>  	.thaw			= fs_bdev_thaw,
> +	.get_holder		= fs_bdev_super_get,
> +	.put_holder		= fs_bdev_super_put,
>  };
>  EXPORT_SYMBOL_GPL(fs_holder_ops);
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index f9b87c39cab0..d919e8bcb2c1 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1505,6 +1505,9 @@ struct blk_holder_ops {
>  	 * Thaw the file system mounted on the block device.
>  	 */
>  	int (*thaw)(struct block_device *bdev);
> +
> +	void (*get_holder)(void *holder);
> +	void (*put_holder)(void *holder);
>  };
>  
>  /*
> -- 
> 2.43.0
>
Jan Kara March 14, 2024, 4:58 p.m. UTC | #10
On Thu 14-03-24 15:47:52, Christian Brauner wrote:
> On Thu, Mar 14, 2024 at 12:10:59PM +0100, Christian Brauner wrote:
> > On Tue, Mar 12, 2024 at 07:32:35PM -0700, Christoph Hellwig wrote:
> > > Now that this is in mainline it seems to cause blktests to crash
> > > nbd/003 with a rather non-obvious oops for me:
> > 
> > Ok, will be looking into that next.
> 
> Ok, I know what's going on. Basically, fput() on the block device runs
> asynchronously which means that bdev->bd_holder can still be set to @sb
> after it has already been freed. Let me illustrate what I mean:
> 
> P1                                                 P2
> mount(sb)                                          fd = open("/dev/nbd", ...)
> -> file = bdev_file_open_by_dev(..., sb, ...)
>    bdev->bd_holder = sb;
> 
> // Later on:
> 
> umount(sb)
> ->kill_block_super(sb)
> |----> [fput() -> deferred via task work]
> -> put_super(sb) -> frees the sb via rcu
> |
> |                                                 nbd_ioctl(NBD_CLEAR_SOCK)
> |                                                 -> disk_force_media_change()
> |                                                    -> bdev_mark_dead()
> |                                                       -> fs_mark_dead()
> |                                                          // Finds bdev->bd_holder == sb
> |-> file->release::bdev_release()                          // Tries to get reference to it but it's freed; frees it again
>    bdev->bd_holder = NULL;
> 
> Two solutions that come to my mind:
> 
> [1] Indicate to fput() that this is an internal block devices open and
>     thus just close it synchronously. This is fine. First, because the block
>     device superblock is never unmounted or anything so there's no risk
>     from hanging there for any reason. Second, bdev_release() also ran
>     synchronously so any issue that we'd see from calling
>     file->f_op->release::bdev_release() we would have seen from
>     bdev_release() itself. See [1.1] for a patch.
> 
> (2) Take a temporary reference to the holder when opening the block
>     device. This is also fine afaict because we differentiate between
>     passive and active references on superblocks and what we already do
>     in fs_bdev_mark_dead() and friends. This mean we don't have to mess
>     around with fput(). See [1.2] for a patch.
> 
> (3) Revert and cry. No patch.
> 
> Personally, I think (2) is more desirable because we don't lose the
> async property of fput() and we don't need to have a new FMODE_* flag.
> I'd like to do some more testing with this. Thoughts?

Yeah, 2) looks like the least painful solution to me as well.

								Honza
diff mbox series

Patch

diff --git a/block/bdev.c b/block/bdev.c
index e9f1b12bd75c..4246a57a7c5a 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -49,6 +49,13 @@  struct block_device *I_BDEV(struct inode *inode)
 }
 EXPORT_SYMBOL(I_BDEV);
 
+struct block_device *file_bdev(struct file *bdev_file)
+{
+	struct bdev_handle *handle = bdev_file->private_data;
+	return handle->bdev;
+}
+EXPORT_SYMBOL(file_bdev);
+
 static void bdev_write_inode(struct block_device *bdev)
 {
 	struct inode *inode = bdev->bd_inode;
@@ -368,12 +375,12 @@  static struct file_system_type bd_type = {
 };
 
 struct super_block *blockdev_superblock __ro_after_init;
+struct vfsmount *blockdev_mnt __ro_after_init;
 EXPORT_SYMBOL_GPL(blockdev_superblock);
 
 void __init bdev_cache_init(void)
 {
 	int err;
-	static struct vfsmount *bd_mnt __ro_after_init;
 
 	bdev_cachep = kmem_cache_create("bdev_cache", sizeof(struct bdev_inode),
 			0, (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|
@@ -382,10 +389,10 @@  void __init bdev_cache_init(void)
 	err = register_filesystem(&bd_type);
 	if (err)
 		panic("Cannot register bdev pseudo-fs");
-	bd_mnt = kern_mount(&bd_type);
-	if (IS_ERR(bd_mnt))
+	blockdev_mnt = kern_mount(&bd_type);
+	if (IS_ERR(blockdev_mnt))
 		panic("Cannot create bdev pseudo-fs");
-	blockdev_superblock = bd_mnt->mnt_sb;   /* For writeback */
+	blockdev_superblock = blockdev_mnt->mnt_sb;   /* For writeback */
 }
 
 struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
@@ -911,6 +918,95 @@  struct bdev_handle *bdev_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
 }
 EXPORT_SYMBOL(bdev_open_by_dev);
 
+static unsigned blk_to_file_flags(blk_mode_t mode)
+{
+	unsigned int flags = 0;
+
+	if ((mode & (BLK_OPEN_READ | BLK_OPEN_WRITE)) ==
+	    (BLK_OPEN_READ | BLK_OPEN_WRITE))
+		flags |= O_RDWR;
+	else if (mode & BLK_OPEN_WRITE)
+		flags |= O_WRONLY;
+	else if (mode & BLK_OPEN_READ)
+		flags |= O_RDONLY;
+	else /* Neither read nor write for a block device requested? */
+		WARN_ON_ONCE(true);
+
+	/*
+	 * O_EXCL is one of those flags that the VFS clears once it's done with
+	 * the operation. So don't raise it here either.
+	 */
+	if (mode & BLK_OPEN_NDELAY)
+		flags |= O_NDELAY;
+
+	/*
+	 * If BLK_OPEN_WRITE_IOCTL is set then this is a historical quirk
+	 * associated with the floppy driver where it has allowed ioctls if the
+	 * file was opened for writing, but does not allow reads or writes.
+	 * Make sure that this quirk is reflected in @f_flags.
+	 */
+	if (mode & BLK_OPEN_WRITE_IOCTL)
+		flags |= O_RDWR | O_WRONLY;
+
+	return flags;
+}
+
+struct file *bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
+				   const struct blk_holder_ops *hops)
+{
+	struct file *bdev_file;
+	struct bdev_handle *handle;
+	unsigned int flags;
+
+	handle = bdev_open_by_dev(dev, mode, holder, hops);
+	if (IS_ERR(handle))
+		return ERR_CAST(handle);
+
+	flags = blk_to_file_flags(mode);
+	bdev_file = alloc_file_pseudo_noaccount(handle->bdev->bd_inode,
+			blockdev_mnt, "", flags | O_LARGEFILE, &def_blk_fops);
+	if (IS_ERR(bdev_file)) {
+		bdev_release(handle);
+		return bdev_file;
+	}
+	ihold(handle->bdev->bd_inode);
+
+	bdev_file->f_mode |= FMODE_BUF_RASYNC | FMODE_CAN_ODIRECT;
+	if (bdev_nowait(handle->bdev))
+		bdev_file->f_mode |= FMODE_NOWAIT;
+
+	bdev_file->f_mapping = handle->bdev->bd_inode->i_mapping;
+	bdev_file->f_wb_err = filemap_sample_wb_err(bdev_file->f_mapping);
+	bdev_file->private_data = handle;
+	return bdev_file;
+}
+EXPORT_SYMBOL(bdev_file_open_by_dev);
+
+struct file *bdev_file_open_by_path(const char *path, blk_mode_t mode,
+				    void *holder,
+				    const struct blk_holder_ops *hops)
+{
+	struct file *bdev_file;
+	dev_t dev;
+	int error;
+
+	error = lookup_bdev(path, &dev);
+	if (error)
+		return ERR_PTR(error);
+
+	bdev_file = bdev_file_open_by_dev(dev, mode, holder, hops);
+	if (!IS_ERR(bdev_file) && (mode & BLK_OPEN_WRITE)) {
+		struct bdev_handle *handle = bdev_file->private_data;
+		if (bdev_read_only(handle->bdev)) {
+			fput(bdev_file);
+			bdev_file = ERR_PTR(-EACCES);
+		}
+	}
+
+	return bdev_file;
+}
+EXPORT_SYMBOL(bdev_file_open_by_path);
+
 /**
  * bdev_open_by_path - open a block device by name
  * @path: path to the block device to open
diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index 60dbfa0f8805..39e75131fd5a 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -495,7 +495,7 @@  static void cramfs_kill_sb(struct super_block *sb)
 		sb->s_mtd = NULL;
 	} else if (IS_ENABLED(CONFIG_CRAMFS_BLOCKDEV) && sb->s_bdev) {
 		sync_blockdev(sb->s_bdev);
-		bdev_release(sb->s_bdev_handle);
+		fput(sb->s_bdev_file);
 	}
 	kfree(sbi);
 }
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index d45ab0992ae5..ea94c148fee5 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -4247,7 +4247,7 @@  static int f2fs_scan_devices(struct f2fs_sb_info *sbi)
 
 	for (i = 0; i < max_devices; i++) {
 		if (i == 0)
-			FDEV(0).bdev_handle = sbi->sb->s_bdev_handle;
+			FDEV(0).bdev_handle = sb_bdev_handle(sbi->sb);
 		else if (!RDEV(i).path[0])
 			break;
 
diff --git a/fs/file_table.c b/fs/file_table.c
index b991f90571b4..f61e212b99f4 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -281,13 +281,17 @@  struct file *alloc_empty_backing_file(int flags, const struct cred *cred)
  * @path: the (dentry, vfsmount) pair for the new file
  * @flags: O_... flags with which the new file will be opened
  * @fop: the 'struct file_operations' for the new file
+ * @noaccount: whether this is an internal open that shouldn't be counted
  */
 static struct file *alloc_file(const struct path *path, int flags,
-		const struct file_operations *fop)
+		const struct file_operations *fop, bool noaccount)
 {
 	struct file *file;
 
-	file = alloc_empty_file(flags, current_cred());
+	if (noaccount)
+		file = alloc_empty_file_noaccount(flags, current_cred());
+	else
+		file = alloc_empty_file(flags, current_cred());
 	if (IS_ERR(file))
 		return file;
 
@@ -312,9 +316,11 @@  static struct file *alloc_file(const struct path *path, int flags,
 	return file;
 }
 
-struct file *alloc_file_pseudo(struct inode *inode, struct vfsmount *mnt,
-				const char *name, int flags,
-				const struct file_operations *fops)
+static struct file *__alloc_file_pseudo(struct inode *inode,
+					struct vfsmount *mnt, const char *name,
+					int flags,
+					const struct file_operations *fops,
+					bool noaccount)
 {
 	struct qstr this = QSTR_INIT(name, strlen(name));
 	struct path path;
@@ -325,19 +331,35 @@  struct file *alloc_file_pseudo(struct inode *inode, struct vfsmount *mnt,
 		return ERR_PTR(-ENOMEM);
 	path.mnt = mntget(mnt);
 	d_instantiate(path.dentry, inode);
-	file = alloc_file(&path, flags, fops);
+	file = alloc_file(&path, flags, fops, noaccount);
 	if (IS_ERR(file)) {
 		ihold(inode);
 		path_put(&path);
 	}
 	return file;
 }
+
+struct file *alloc_file_pseudo(struct inode *inode, struct vfsmount *mnt,
+				const char *name, int flags,
+				const struct file_operations *fops)
+{
+	return __alloc_file_pseudo(inode, mnt, name, flags, fops, false);
+}
 EXPORT_SYMBOL(alloc_file_pseudo);
 
+struct file *alloc_file_pseudo_noaccount(struct inode *inode,
+					 struct vfsmount *mnt, const char *name,
+					 int flags,
+					 const struct file_operations *fops)
+{
+	return __alloc_file_pseudo(inode, mnt, name, flags, fops, true);
+}
+EXPORT_SYMBOL_GPL(alloc_file_pseudo_noaccount);
+
 struct file *alloc_file_clone(struct file *base, int flags,
 				const struct file_operations *fops)
 {
-	struct file *f = alloc_file(&base->f_path, flags, fops);
+	struct file *f = alloc_file(&base->f_path, flags, fops, false);
 	if (!IS_ERR(f)) {
 		path_get(&f->f_path);
 		f->f_mapping = base->f_mapping;
diff --git a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c
index cb6d1fda66a7..8691463956d1 100644
--- a/fs/jfs/jfs_logmgr.c
+++ b/fs/jfs/jfs_logmgr.c
@@ -1162,7 +1162,7 @@  static int open_inline_log(struct super_block *sb)
 	init_waitqueue_head(&log->syncwait);
 
 	set_bit(log_INLINELOG, &log->flag);
-	log->bdev_handle = sb->s_bdev_handle;
+	log->bdev_handle = sb_bdev_handle(sb);
 	log->base = addressPXD(&JFS_SBI(sb)->logpxd);
 	log->size = lengthPXD(&JFS_SBI(sb)->logpxd) >>
 	    (L2LOGPSIZE - sb->s_blocksize_bits);
diff --git a/fs/romfs/super.c b/fs/romfs/super.c
index 545ad44f96b8..1ed468c03557 100644
--- a/fs/romfs/super.c
+++ b/fs/romfs/super.c
@@ -594,7 +594,7 @@  static void romfs_kill_sb(struct super_block *sb)
 #ifdef CONFIG_ROMFS_ON_BLOCK
 	if (sb->s_bdev) {
 		sync_blockdev(sb->s_bdev);
-		bdev_release(sb->s_bdev_handle);
+		fput(sb->s_bdev_file);
 	}
 #endif
 }
diff --git a/fs/super.c b/fs/super.c
index d35e85295489..08dcc3371aa0 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1532,16 +1532,16 @@  int setup_bdev_super(struct super_block *sb, int sb_flags,
 		struct fs_context *fc)
 {
 	blk_mode_t mode = sb_open_mode(sb_flags);
-	struct bdev_handle *bdev_handle;
+	struct file *bdev_file;
 	struct block_device *bdev;
 
-	bdev_handle = bdev_open_by_dev(sb->s_dev, mode, sb, &fs_holder_ops);
-	if (IS_ERR(bdev_handle)) {
+	bdev_file = bdev_file_open_by_dev(sb->s_dev, mode, sb, &fs_holder_ops);
+	if (IS_ERR(bdev_file)) {
 		if (fc)
 			errorf(fc, "%s: Can't open blockdev", fc->source);
-		return PTR_ERR(bdev_handle);
+		return PTR_ERR(bdev_file);
 	}
-	bdev = bdev_handle->bdev;
+	bdev = file_bdev(bdev_file);
 
 	/*
 	 * This really should be in blkdev_get_by_dev, but right now can't due
@@ -1549,7 +1549,7 @@  int setup_bdev_super(struct super_block *sb, int sb_flags,
 	 * writable from userspace even for a read-only block device.
 	 */
 	if ((mode & BLK_OPEN_WRITE) && bdev_read_only(bdev)) {
-		bdev_release(bdev_handle);
+		fput(bdev_file);
 		return -EACCES;
 	}
 
@@ -1560,11 +1560,11 @@  int setup_bdev_super(struct super_block *sb, int sb_flags,
 	if (atomic_read(&bdev->bd_fsfreeze_count) > 0) {
 		if (fc)
 			warnf(fc, "%pg: Can't mount, blockdev is frozen", bdev);
-		bdev_release(bdev_handle);
+		fput(bdev_file);
 		return -EBUSY;
 	}
 	spin_lock(&sb_lock);
-	sb->s_bdev_handle = bdev_handle;
+	sb->s_bdev_file = bdev_file;
 	sb->s_bdev = bdev;
 	sb->s_bdi = bdi_get(bdev->bd_disk->bdi);
 	if (bdev_stable_writes(bdev))
@@ -1680,7 +1680,7 @@  void kill_block_super(struct super_block *sb)
 	generic_shutdown_super(sb);
 	if (bdev) {
 		sync_blockdev(bdev);
-		bdev_release(sb->s_bdev_handle);
+		fput(sb->s_bdev_file);
 	}
 }
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index aff20ddd4a9f..e5ac0e59ede9 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -467,7 +467,7 @@  xfs_open_devices(
 	 * Setup xfs_mount buffer target pointers
 	 */
 	error = -ENOMEM;
-	mp->m_ddev_targp = xfs_alloc_buftarg(mp, sb->s_bdev_handle);
+	mp->m_ddev_targp = xfs_alloc_buftarg(mp, sb_bdev_handle(sb));
 	if (!mp->m_ddev_targp)
 		goto out_close_rtdev;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 99e4f5e72213..76706aa47316 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -24,6 +24,7 @@ 
 #include <linux/sbitmap.h>
 #include <linux/uuid.h>
 #include <linux/xarray.h>
+#include <linux/file.h>
 
 struct module;
 struct request_queue;
@@ -1474,6 +1475,7 @@  extern const struct blk_holder_ops fs_holder_ops;
 	(BLK_OPEN_READ | BLK_OPEN_RESTRICT_WRITES | \
 	 (((flags) & SB_RDONLY) ? 0 : BLK_OPEN_WRITE))
 
+/* @bdev_handle will be removed soon. */
 struct bdev_handle {
 	struct block_device *bdev;
 	void *holder;
@@ -1484,6 +1486,10 @@  struct bdev_handle *bdev_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
 		const struct blk_holder_ops *hops);
 struct bdev_handle *bdev_open_by_path(const char *path, blk_mode_t mode,
 		void *holder, const struct blk_holder_ops *hops);
+struct file *bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
+		const struct blk_holder_ops *hops);
+struct file *bdev_file_open_by_path(const char *path, blk_mode_t mode,
+		void *holder, const struct blk_holder_ops *hops);
 int bd_prepare_to_claim(struct block_device *bdev, void *holder,
 		const struct blk_holder_ops *hops);
 void bd_abort_claiming(struct block_device *bdev, void *holder);
@@ -1494,6 +1500,7 @@  struct block_device *blkdev_get_no_open(dev_t dev);
 void blkdev_put_no_open(struct block_device *bdev);
 
 struct block_device *I_BDEV(struct inode *inode);
+struct block_device *file_bdev(struct file *bdev_file);
 
 #ifdef CONFIG_BLOCK
 void invalidate_bdev(struct block_device *bdev);
diff --git a/include/linux/file.h b/include/linux/file.h
index 6834a29338c4..169692cb1906 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -24,6 +24,8 @@  struct inode;
 struct path;
 extern struct file *alloc_file_pseudo(struct inode *, struct vfsmount *,
 	const char *, int flags, const struct file_operations *);
+extern struct file *alloc_file_pseudo_noaccount(struct inode *, struct vfsmount *,
+	const char *, int flags, const struct file_operations *);
 extern struct file *alloc_file_clone(struct file *, int flags,
 	const struct file_operations *);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ed5966a70495..e9291e27cc47 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1228,8 +1228,8 @@  struct super_block {
 #endif
 	struct hlist_bl_head	s_roots;	/* alternate root dentries for NFS */
 	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
-	struct block_device	*s_bdev;
-	struct bdev_handle	*s_bdev_handle;
+	struct block_device	*s_bdev;	/* can go away once we use an accessor for @s_bdev_file */
+	struct file		*s_bdev_file;
 	struct backing_dev_info *s_bdi;
 	struct mtd_info		*s_mtd;
 	struct hlist_node	s_instances;
@@ -1327,6 +1327,12 @@  struct super_block {
 	struct list_head	s_inodes_wb;	/* writeback inodes */
 } __randomize_layout;
 
+/* Temporary helper that will go away. */
+static inline struct bdev_handle *sb_bdev_handle(struct super_block *sb)
+{
+	return sb->s_bdev_file->private_data;
+}
+
 static inline struct user_namespace *i_user_ns(const struct inode *inode)
 {
 	return inode->i_sb->s_user_ns;