diff mbox series

[4/7] swapon(2): open swap with O_EXCL

Message ID 20240427211128.GD1495312@ZenIV (mailing list archive)
State New
Headers show
Series [1/7] bcache_register(): don't bother with set_blocksize() | expand

Commit Message

Al Viro April 27, 2024, 9:11 p.m. UTC
... eliminating the need to reopen block devices so they could be
exclusively held.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 include/linux/swap.h |  1 -
 mm/swapfile.c        | 19 ++-----------------
 2 files changed, 2 insertions(+), 18 deletions(-)

Comments

Linus Torvalds April 27, 2024, 9:40 p.m. UTC | #1
On Sat, 27 Apr 2024 at 14:11, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> ... eliminating the need to reopen block devices so they could be
> exclusively held.

This looks like a good change, but it raises the question of why we
did it this odd way to begin with?

Is it just because O_EXCL without O_CREAT is kind of odd, and only has
meaning for block devices?

Or is it just that before we used fiel pointers for block devices, the
old model made more sense?

Anyway, I like it, it just makes me go "why didn't we do it that way
originally?"

                Linus
Al Viro April 27, 2024, 11:46 p.m. UTC | #2
On Sat, Apr 27, 2024 at 02:40:22PM -0700, Linus Torvalds wrote:
> On Sat, 27 Apr 2024 at 14:11, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > ... eliminating the need to reopen block devices so they could be
> > exclusively held.
> 
> This looks like a good change, but it raises the question of why we
> did it this odd way to begin with?
> 
> Is it just because O_EXCL without O_CREAT is kind of odd, and only has
> meaning for block devices?
> 
> Or is it just that before we used fiel pointers for block devices, the
> old model made more sense?
> 
> Anyway, I like it, it just makes me go "why didn't we do it that way
> originally?"

Exclusion for swap partitions:

commit 75e9c9e1bffbe4a1767172855296b94ccba28f71
Author: Alexander Viro <viro@math.psu.edu>
Date:   Mon Mar 4 22:56:47 2002 -0800

    [PATCH] death of is_mounted() and aother fixes


O_EXCL for block devices:

commit c366082d9ed0a0d3c46441d1b3fdf895d8e55ca9
Author: Andrew Morton <akpm@osdl.org>
Date:   Wed Aug 20 10:26:57 2003 -0700

    [PATCH] Allow O_EXCL on a block device to claim exclusive use.

IOW, O_EXCL hadn't been available at the time - it had been implemented
on top of bd_claim()/bd_release() introduced in the same earlier commit.

Switching swap exclusion to O_EXCL could've been done back in 2003 or
at any later point; it's just that swapon(2)/swapoff(2) is something that
rarely gets a look...
Al Viro April 28, 2024, 1:25 a.m. UTC | #3
On Sun, Apr 28, 2024 at 12:46:23AM +0100, Al Viro wrote:

> Switching swap exclusion to O_EXCL could've been done back in 2003 or
> at any later point; it's just that swapon(2)/swapoff(2) is something that
> rarely gets a look...

BTW, a fun archaeological question: at which point has this
                /*
                 * Retrying may succeed; for example the folio may finish   
                 * writeback, or buffers may be cleaned.  This should not  
                 * happen very often; maybe we have old buffers attached to
                 * this blockdev's page cache and we're trying to change
                 * the block size?
                 */
                if (!try_to_free_buffers(folio)) {
                        end_block = ~0ULL;
                        goto unlock;
                }

in grow_dev_folio() (grow_dev_page() in earlier kernels) become unreachable?
I _think_ it was
commit fbc139f54fdb7edfec470421c2cc885d3796dfcd
Author: Linus Torvalds <torvalds@athlon.transmeta.com>
Date:   Mon Feb 4 20:19:55 2002 -0800

    v2.4.10.0.2 -> v2.4.10.0.3

      - more buffers-in-pagecache coherency

when set_blocksize() started to do
	sync_buffers(dev, 2);
	...
	invalidate_bdev(bdev, 1);
	truncate_inode_pages(bdev->bd_inode->i_mapping, 0);

at which point the "what if we'd found a page with attached buffers of the
wrong size?" should've become impossible.

Am I misreading that?
Al Viro April 28, 2024, 6:19 p.m. UTC | #4
On Sun, Apr 28, 2024 at 12:46:23AM +0100, Al Viro wrote:
> On Sat, Apr 27, 2024 at 02:40:22PM -0700, Linus Torvalds wrote:
> > On Sat, 27 Apr 2024 at 14:11, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > ... eliminating the need to reopen block devices so they could be
> > > exclusively held.
> > 
> > This looks like a good change, but it raises the question of why we
> > did it this odd way to begin with?
> > 
> > Is it just because O_EXCL without O_CREAT is kind of odd, and only has
> > meaning for block devices?
> > 
> > Or is it just that before we used fiel pointers for block devices, the
> > old model made more sense?
> > 
> > Anyway, I like it, it just makes me go "why didn't we do it that way
> > originally?"
> 
> Exclusion for swap partitions:
> 
> commit 75e9c9e1bffbe4a1767172855296b94ccba28f71
> Author: Alexander Viro <viro@math.psu.edu>
> Date:   Mon Mar 4 22:56:47 2002 -0800
> 
>     [PATCH] death of is_mounted() and aother fixes
> 
> 
> O_EXCL for block devices:
> 
> commit c366082d9ed0a0d3c46441d1b3fdf895d8e55ca9
> Author: Andrew Morton <akpm@osdl.org>
> Date:   Wed Aug 20 10:26:57 2003 -0700
> 
>     [PATCH] Allow O_EXCL on a block device to claim exclusive use.
> 
> IOW, O_EXCL hadn't been available at the time - it had been implemented
> on top of bd_claim()/bd_release() introduced in the same earlier commit.
> 
> Switching swap exclusion to O_EXCL could've been done back in 2003 or
> at any later point; it's just that swapon(2)/swapoff(2) is something that
> rarely gets a look...

FWIW, pretty much the same can be done with zram - open with O_EXCL and to
hell with reopening.  Guys, are there any objections to that?

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f0639df6cd18..d882a0c7b522 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -426,11 +426,10 @@ static void reset_bdev(struct zram *zram)
 	if (!zram->backing_dev)
 		return;
 
-	fput(zram->bdev_file);
 	/* hope filp_close flush all of IO */
 	filp_close(zram->backing_dev, NULL);
 	zram->backing_dev = NULL;
-	zram->bdev_file = NULL;
+	zram->bdev = NULL;
 	zram->disk->fops = &zram_devops;
 	kvfree(zram->bitmap);
 	zram->bitmap = NULL;
@@ -473,10 +472,8 @@ static ssize_t backing_dev_store(struct device *dev,
 	size_t sz;
 	struct file *backing_dev = NULL;
 	struct inode *inode;
-	struct address_space *mapping;
 	unsigned int bitmap_sz;
 	unsigned long nr_pages, *bitmap = NULL;
-	struct file *bdev_file = NULL;
 	int err;
 	struct zram *zram = dev_to_zram(dev);
 
@@ -497,15 +494,14 @@ static ssize_t backing_dev_store(struct device *dev,
 	if (sz > 0 && file_name[sz - 1] == '\n')
 		file_name[sz - 1] = 0x00;
 
-	backing_dev = filp_open(file_name, O_RDWR|O_LARGEFILE, 0);
+	backing_dev = filp_open(file_name, O_RDWR|O_LARGEFILE|O_EXCL, 0);
 	if (IS_ERR(backing_dev)) {
 		err = PTR_ERR(backing_dev);
 		backing_dev = NULL;
 		goto out;
 	}
 
-	mapping = backing_dev->f_mapping;
-	inode = mapping->host;
+	inode = backing_dev->f_mapping->host;
 
 	/* Support only block device in this moment */
 	if (!S_ISBLK(inode->i_mode)) {
@@ -513,14 +509,6 @@ static ssize_t backing_dev_store(struct device *dev,
 		goto out;
 	}
 
-	bdev_file = bdev_file_open_by_dev(inode->i_rdev,
-				BLK_OPEN_READ | BLK_OPEN_WRITE, zram, NULL);
-	if (IS_ERR(bdev_file)) {
-		err = PTR_ERR(bdev_file);
-		bdev_file = NULL;
-		goto out;
-	}
-
 	nr_pages = i_size_read(inode) >> PAGE_SHIFT;
 	bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
 	bitmap = kvzalloc(bitmap_sz, GFP_KERNEL);
@@ -531,7 +519,7 @@ static ssize_t backing_dev_store(struct device *dev,
 
 	reset_bdev(zram);
 
-	zram->bdev_file = bdev_file;
+	zram->bdev = I_BDEV(inode);
 	zram->backing_dev = backing_dev;
 	zram->bitmap = bitmap;
 	zram->nr_pages = nr_pages;
@@ -544,9 +532,6 @@ static ssize_t backing_dev_store(struct device *dev,
 out:
 	kvfree(bitmap);
 
-	if (bdev_file)
-		fput(bdev_file);
-
 	if (backing_dev)
 		filp_close(backing_dev, NULL);
 
@@ -587,7 +572,7 @@ static void read_from_bdev_async(struct zram *zram, struct page *page,
 {
 	struct bio *bio;
 
-	bio = bio_alloc(file_bdev(zram->bdev_file), 1, parent->bi_opf, GFP_NOIO);
+	bio = bio_alloc(zram->bdev, 1, parent->bi_opf, GFP_NOIO);
 	bio->bi_iter.bi_sector = entry * (PAGE_SIZE >> 9);
 	__bio_add_page(bio, page, PAGE_SIZE, 0);
 	bio_chain(bio, parent);
@@ -703,7 +688,7 @@ static ssize_t writeback_store(struct device *dev,
 			continue;
 		}
 
-		bio_init(&bio, file_bdev(zram->bdev_file), &bio_vec, 1,
+		bio_init(&bio, zram->bdev, &bio_vec, 1,
 			 REQ_OP_WRITE | REQ_SYNC);
 		bio.bi_iter.bi_sector = blk_idx * (PAGE_SIZE >> 9);
 		__bio_add_page(&bio, page, PAGE_SIZE, 0);
@@ -785,7 +770,7 @@ static void zram_sync_read(struct work_struct *work)
 	struct bio_vec bv;
 	struct bio bio;
 
-	bio_init(&bio, file_bdev(zw->zram->bdev_file), &bv, 1, REQ_OP_READ);
+	bio_init(&bio, zw->zram->bdev, &bv, 1, REQ_OP_READ);
 	bio.bi_iter.bi_sector = zw->entry * (PAGE_SIZE >> 9);
 	__bio_add_page(&bio, zw->page, PAGE_SIZE, 0);
 	zw->error = submit_bio_wait(&bio);
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 37bf29f34d26..35e322144629 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -132,7 +132,7 @@ struct zram {
 	spinlock_t wb_limit_lock;
 	bool wb_limit_enable;
 	u64 bd_wb_limit;
-	struct file *bdev_file;
+	struct block_device *bdev;
 	unsigned long *bitmap;
 	unsigned long nr_pages;
 #endif
Linus Torvalds April 28, 2024, 6:46 p.m. UTC | #5
On Sun, 28 Apr 2024 at 11:19, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> FWIW, pretty much the same can be done with zram - open with O_EXCL and to
> hell with reopening.  Guys, are there any objections to that?

Please do. The fewer of these strange "re-open block device" things we
have, the better.

I particularly dislike our "holder" logic, and this re-opening is one
source of nasty confusion, and if we could replace them all with just
the "O_EXCL uses the file itself as the holder", that would be
absolutely _lovely_.

                Linus
Al Viro April 28, 2024, 7:07 p.m. UTC | #6
On Sun, Apr 28, 2024 at 11:46:05AM -0700, Linus Torvalds wrote:
> On Sun, 28 Apr 2024 at 11:19, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > FWIW, pretty much the same can be done with zram - open with O_EXCL and to
> > hell with reopening.  Guys, are there any objections to that?
> 
> Please do. The fewer of these strange "re-open block device" things we
> have, the better.
> 
> I particularly dislike our "holder" logic, and this re-opening is one
> source of nasty confusion, and if we could replace them all with just
> the "O_EXCL uses the file itself as the holder", that would be
> absolutely _lovely_.

The tricky part is blk_holder_ops, and I'm no fonder of it than you are.

Christoph, do you have any plans along those lines for swap devices?
If they are not going to grow holder_ops, I'd say we should switch
to O_EXCL open and be done with that; zram is in the same situation,
AFAICS.

Might be worth a topic at LSF, actually...
Christoph Hellwig April 29, 2024, 5:09 a.m. UTC | #7
> -	swap_file = file_open_name(name, O_RDWR|O_LARGEFILE, 0);
> +	swap_file = file_open_name(name, O_RDWR|O_LARGEFILE|O_EXCL, 0);

Can you add the proper whitespaces here while you touch it?

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig April 29, 2024, 5:10 a.m. UTC | #8
On Sun, Apr 28, 2024 at 08:07:09PM +0100, Al Viro wrote:
> Christoph, do you have any plans along those lines for swap devices?
> If they are not going to grow holder_ops, I'd say we should switch
> to O_EXCL open and be done with that; zram is in the same situation,
> AFAICS.

holder_ops right now are used for fs freezing, fs syncing and dead
device notification.  None of this is useful for swap, as is the
resize notification I plan to add eventually.
Christian Brauner April 29, 2024, 8:39 a.m. UTC | #9
On Sat, Apr 27, 2024 at 10:11:28PM +0100, Al Viro wrote:
> ... eliminating the need to reopen block devices so they could be
> exclusively held.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Looks good to me,
Reviewed-by: Christian Brauner <brauner@kernel.org>
diff mbox series

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a5b640cca459..7e61a4aef2fc 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -298,7 +298,6 @@  struct swap_info_struct {
 	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
-	struct file *bdev_file;		/* open handle of the bdev */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
 	struct file *swap_file;		/* seldom referenced */
 	struct completion comp;		/* seldom referenced */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 304f74d039f3..71cb76a2d0ce 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2550,10 +2550,6 @@  SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	exit_swap_address_space(p->type);
 
 	inode = mapping->host;
-	if (p->bdev_file) {
-		fput(p->bdev_file);
-		p->bdev_file = NULL;
-	}
 
 	inode_lock(inode);
 	inode->i_flags &= ~S_SWAPFILE;
@@ -2780,14 +2776,7 @@  static struct swap_info_struct *alloc_swap_info(void)
 static int claim_swapfile(struct swap_info_struct *p, struct inode *inode)
 {
 	if (S_ISBLK(inode->i_mode)) {
-		p->bdev_file = bdev_file_open_by_dev(inode->i_rdev,
-				BLK_OPEN_READ | BLK_OPEN_WRITE, p, NULL);
-		if (IS_ERR(p->bdev_file)) {
-			int error = PTR_ERR(p->bdev_file);
-			p->bdev_file = NULL;
-			return error;
-		}
-		p->bdev = file_bdev(p->bdev_file);
+		p->bdev = I_BDEV(inode);
 		/*
 		 * Zoned block devices contain zones that have a sequential
 		 * write only restriction.  Hence zoned block devices are not
@@ -3028,7 +3017,7 @@  SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		name = NULL;
 		goto bad_swap;
 	}
-	swap_file = file_open_name(name, O_RDWR|O_LARGEFILE, 0);
+	swap_file = file_open_name(name, O_RDWR|O_LARGEFILE|O_EXCL, 0);
 	if (IS_ERR(swap_file)) {
 		error = PTR_ERR(swap_file);
 		swap_file = NULL;
@@ -3225,10 +3214,6 @@  SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	p->percpu_cluster = NULL;
 	free_percpu(p->cluster_next_cpu);
 	p->cluster_next_cpu = NULL;
-	if (p->bdev_file) {
-		fput(p->bdev_file);
-		p->bdev_file = NULL;
-	}
 	inode = NULL;
 	destroy_swap_extents(p);
 	swap_cgroup_swapoff(p->type);