
[RFC,1/2] mm: introduce bmap_walk()

Message ID 149766212976.22552.11210067224152823950.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State New, archived

Commit Message

Dan Williams June 17, 2017, 1:15 a.m. UTC
Refactor the core of generic_swapfile_activate() into bmap_walk() so
that it can be used by a new daxfile_activate() helper (to be added).

There should be no functional differences as a result of this change,
although it does add the capability to perform the bmap with a given
page-size. This is in support of daxfile users that want to ensure huge
page usage.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/page_io.c |   86 +++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 67 insertions(+), 19 deletions(-)

Comments

Christoph Hellwig June 17, 2017, 5:22 a.m. UTC | #1
On Fri, Jun 16, 2017 at 06:15:29PM -0700, Dan Williams wrote:
> Refactor the core of generic_swapfile_activate() into bmap_walk() so
> that it can be used by a new daxfile_activate() helper (to be added).

No way in hell!  generic_swapfile_activate needs to die and no new users
of ->bmap over my dead body.  It's guaranteed to fuck up your data left,
right and center.
Dan Williams June 17, 2017, 12:29 p.m. UTC | #2
On Fri, Jun 16, 2017 at 10:22 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Fri, Jun 16, 2017 at 06:15:29PM -0700, Dan Williams wrote:
>> Refactor the core of generic_swapfile_activate() into bmap_walk() so
>> that it can be used by a new daxfile_activate() helper (to be added).
>
> No way in hell!  generic_swapfile_activate needs to die and no new users
> of ->bmap over my dead body.  It's guaranteed to fuck up your data left,
> right and center.

Certainly you're not saying that existing swapfiles are broken, so I
wonder what bugs you're talking about?

Unless you had plans to go remove bmap() I don't see how this gets in
your way at all. That said, I think "please don't add a new bmap()
user, use iomap instead" is a fair comment. You know me well enough to
know that would be all it takes to redirect my work, I can do without
the bluster.
Christoph Hellwig June 18, 2017, 7:51 a.m. UTC | #3
On Sat, Jun 17, 2017 at 05:29:23AM -0700, Dan Williams wrote:
> On Fri, Jun 16, 2017 at 10:22 PM, Christoph Hellwig <hch@lst.de> wrote:
> > On Fri, Jun 16, 2017 at 06:15:29PM -0700, Dan Williams wrote:
> >> Refactor the core of generic_swapfile_activate() into bmap_walk() so
> >> that it can be used by a new daxfile_activate() helper (to be added).
> >
> > No way in hell!  generic_swapfile_activate needs to die and no new users
> > of ->bmap over my dead body.  It's guaranteed to fuck up your data left,
> > right and center.
> 
> Certainly you're not saying that existing swapfiles are broken, so I
> wonder what bugs you're talking about?

They are somewhat broken, but we manage to paper over the fact.

And in fact if you plan to use a method marked:

	/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
	sector_t (*bmap)(struct address_space *, sector_t);

I'd expect a little research.

By its signature alone ->bmap can't do a whole lot - it can try to
translate the _current_ mapping of a relative block number to a physical
one, and do extremely crude error reporting.

Notice what it can't do:

 a) provide any guarantee that the block mapping doesn't change any time
    after it returned
 b) deal with the fact that there might not be anything like a physical block
 c) put the physical block into any sort of context, that is, explain what
    device it actually is relative to

So yes, swap files are broken.  They sort of work by:

 a) ensuring that ->bmap is not implemented for anything fancy (btrfs), or
    living with it doing I/O into random places (XFS RT subvolumes, *cough*)
 b) doing extremely heavy handed locking to ensure things don't change at all
    (S_SWAPFILE).  This might kinda sorta work for swapfiles which are
    part of the system and require privileges, but an absolute no-go
    for anything else
 c) simply not using these hare-brained systems - see the swap over NFS
    support, or the WIP swap over btrfs patches.

> Unless you had plans to go remove bmap() I don't see how this gets in
> your way at all.

I'm not talking about getting in my way.  I'm talking about you doing
something incredibly stupid.  Don't do that.

> That said, I think "please don't add a new bmap()
> user, use iomap instead" is a fair comment. You know me well enough to
> know that would be all it takes to redirect my work, I can do without
> the bluster.

But that's not the point.  The point is that ->bmap() semantics simply
do not work in practice because they don't make sense.
Darrick J. Wong June 19, 2017, 4:18 p.m. UTC | #4
On Sun, Jun 18, 2017 at 09:51:52AM +0200, Christoph Hellwig wrote:
> On Sat, Jun 17, 2017 at 05:29:23AM -0700, Dan Williams wrote:
> > On Fri, Jun 16, 2017 at 10:22 PM, Christoph Hellwig <hch@lst.de> wrote:
> > > On Fri, Jun 16, 2017 at 06:15:29PM -0700, Dan Williams wrote:
> > >> Refactor the core of generic_swapfile_activate() into bmap_walk() so
> > >> that it can be used by a new daxfile_activate() helper (to be added).
> > >
> > > No way in hell!  generic_swapfile_activate needs to die and no new users
> > > of ->bmap over my dead body.  It's guaranteed to fuck up your data left,
> > > right and center.
> > 
> > Certainly you're not saying that existing swapfiles are broken, so I
> > wonder what bugs you're talking about?
> 
> They are somewhat broken, but we manage to paper over the fact.
> 
> And in fact if you plan to use a method marked:
> 
> 	/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
> 	sector_t (*bmap)(struct address_space *, sector_t);
> 
> I'd expect a little research.
> 
> By its signature alone ->bmap can't do a whole lot - it can try to
> translate the _current_ mapping of a relative block number to a physical
> one, and do extremely crude error reporting.
> 
> Notice what it can't do:
> 
>  a) provide any guarantee that the block mapping doesn't change any time
>     after it returned
>  b) deal with the fact that there might not be anything like a physical block
>  c) put the physical block into any sort of context, that is, explain what
>     device it actually is relative to
> 
> So yes, swap files are broken.  They sort of work by:
> 
>  a) ensuring that ->bmap is not implemented for anything fancy (btrfs), or
>     living with it doing I/O into random places (XFS RT subvolumes, *cough*)

Ye $deities, it really /doesn't/ check XFS_IS_REALTIME_INODE(ip)!  AIEEEE!

Uh... patch soon.

>  b) doing extremely heavy handed locking to ensure things don't change at all
>     (S_SWAPFILE).  This might kinda sorta work for swapfiles which are
>     part of the system and require privileges, but an absolute no-go
>     for anything else
>  c) simply not using these hare-brained systems - see the swap over NFS
>     support, or the WIP swap over btrfs patches.
> 
> > Unless you had plans to go remove bmap() I don't see how this gets in
> > your way at all.
> 
> I'm not talking about getting in my way.  I'm talking about you doing
> something incredibly stupid.  Don't do that.
> 
> > That said, I think "please don't add a new bmap()
> > user, use iomap instead" is a fair comment. You know me well enough to
> > know that would be all it takes to redirect my work, I can do without
> > the bluster.
> 
> But that's not the point.  The point is that ->bmap() semantics simply
> do not work in practice because they don't make sense.

Seconded, bmap doesn't coordinate with the filesystem in any way to
guarantee that the mappings are stable, nor does it seem to care about
delayed alloc reservations.  Granted I suspect the dax usage model is
"all the blocks were already allocated" so there are no delalloc reservations,
but still, ugh, bmap. :)

--D

Al Viro June 19, 2017, 6:19 p.m. UTC | #5
On Sun, Jun 18, 2017 at 09:51:52AM +0200, Christoph Hellwig wrote:

> > That said, I think "please don't add a new bmap()
> > user, use iomap instead" is a fair comment. You know me well enough to
> > know that would be all it takes to redirect my work, I can do without
> > the bluster.
> 
> But that's not the point.  The point is that ->bmap() semantics simply
> do not work in practice because they don't make sense.

Speaking of iomap, what's supposed to happen when doing a write into what
used to be a hole?  Suppose we have a file with a megabyte hole in it
and there's some process mmapping that range.  Another process does
write over the entire range.  We call ->iomap_begin() and allocate
disk blocks.  Then we start copying data into those.  In the meanwhile,
the first process attempts to fetch from an address in the middle of that
hole.  What should happen?

Should the blocks we'd allocated in ->iomap_begin() be immediately linked
into whatever indirect blocks/btree/whatnot we are using?  That would
require zeroing all of them first - otherwise that readpage will read
an uninitialized block.  Another variant would be to delay linking them
in until ->iomap_end(), but...  Suppose we get the page evicted by
memory pressure after the writer is finished with it.  If ->readpage()
comes before ->iomap_end(), we'll need to somehow figure out that it's
not a hole anymore, or we'll end up with an uptodate page full of zeroes
observed by reads after successful write().

The comment you've got in linux/iomap.h would seem to suggest the second
interpretation, but neither it nor anything in Documentation discusses the
relations with readpage/writepage...
Christoph Hellwig June 20, 2017, 7:34 a.m. UTC | #6
On Mon, Jun 19, 2017 at 07:19:57PM +0100, Al Viro wrote:
> Speaking of iomap, what's supposed to happen when doing a write into what
> used to be a hole?  Suppose we have a file with a megabyte hole in it
> and there's some process mmapping that range.  Another process does
> write over the entire range.  We call ->iomap_begin() and allocate
> disk blocks.  Then we start copying data into those.  In the meanwhile,
> the first process attempts to fetch from an address in the middle of that
> hole.  What should happen?

Right now the buffered iomap code expects delayed allocations.
So ->iomap_begin will only reserve blocks in memory, and not even
mark the blocks as allocated in the page / buffer_head.  The fact
that the block is allocated is only propagated into the page buffer_head
on a page-by-page basis in the actor.

> Should the blocks we'd allocated in ->iomap_begin() be immediately linked
> into whatever indirect blocks/btree/whatnot we are using?  That would
> require zeroing all of them first - otherwise that readpage will read
> an uninitialized block.  Another variant would be to delay linking them
> in until ->iomap_end(), but...  Suppose we get the page evicted by
> memory pressure after the writer is finished with it.  If ->readpage()
> comes before ->iomap_end(), we'll need to somehow figure out that it's
> not a hole anymore, or we'll end up with an uptodate page full of zeroes
> observed by reads after successful write().

Delayed blocks are ignored by the read code, so it will read 'through'
them.

> The comment you've got in linux/iomap.h would seem to suggest the second
> interpretation, but neither it nor anything in Documentation discusses the
> relations with readpage/writepage...

I'll see if I can come up with some better documentation.

Patch

diff --git a/mm/page_io.c b/mm/page_io.c
index 23f6d0d3470f..5cec9a3d49f2 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -135,11 +135,22 @@  static void end_swap_bio_read(struct bio *bio)
 	bio_put(bio);
 }
 
-int generic_swapfile_activate(struct swap_info_struct *sis,
-				struct file *swap_file,
-				sector_t *span)
+enum bmap_check {
+	BMAP_WALK_UNALIGNED,
+	BMAP_WALK_DISCONTIG,
+	BMAP_WALK_FULLPAGE,
+	BMAP_WALK_DONE,
+};
+
+typedef int (*bmap_check_fn)(sector_t block, unsigned long page_no,
+		enum bmap_check type, void *ctx);
+
+static int bmap_walk(struct file *file, const unsigned page_size,
+		const unsigned long page_max, sector_t *span,
+		bmap_check_fn check, void *ctx)
 {
-	struct address_space *mapping = swap_file->f_mapping;
+	struct address_space *mapping = file->f_mapping;
+	const unsigned page_shift = ilog2(page_size);
 	struct inode *inode = mapping->host;
 	unsigned blocks_per_page;
 	unsigned long page_no;
@@ -152,7 +163,7 @@  int generic_swapfile_activate(struct swap_info_struct *sis,
 	int ret;
 
 	blkbits = inode->i_blkbits;
-	blocks_per_page = PAGE_SIZE >> blkbits;
+	blocks_per_page = page_size >> blkbits;
 
 	/*
 	 * Map all the blocks into the extent list.  This code doesn't try
@@ -162,7 +173,7 @@  int generic_swapfile_activate(struct swap_info_struct *sis,
 	page_no = 0;
 	last_block = i_size_read(inode) >> blkbits;
 	while ((probe_block + blocks_per_page) <= last_block &&
-			page_no < sis->max) {
+			page_no < page_max) {
 		unsigned block_in_page;
 		sector_t first_block;
 
@@ -173,11 +184,15 @@  int generic_swapfile_activate(struct swap_info_struct *sis,
 			goto bad_bmap;
 
 		/*
-		 * It must be PAGE_SIZE aligned on-disk
+		 * It must be @page_size aligned on-disk
 		 */
 		if (first_block & (blocks_per_page - 1)) {
 			probe_block++;
-			goto reprobe;
+			ret = check(first_block, page_no,
+					BMAP_WALK_UNALIGNED, ctx);
+			if (ret == -EAGAIN)
+				goto reprobe;
+			goto bad_bmap;
 		}
 
 		for (block_in_page = 1; block_in_page < blocks_per_page;
@@ -190,11 +205,15 @@  int generic_swapfile_activate(struct swap_info_struct *sis,
 			if (block != first_block + block_in_page) {
 				/* Discontiguity */
 				probe_block++;
-				goto reprobe;
+				ret = check(first_block, page_no,
+						BMAP_WALK_DISCONTIG, ctx);
+				if (ret == -EAGAIN)
+					goto reprobe;
+				goto bad_bmap;
 			}
 		}
 
-		first_block >>= (PAGE_SHIFT - blkbits);
+		first_block >>= (page_shift - blkbits);
 		if (page_no) {	/* exclude the header page */
 			if (first_block < lowest_block)
 				lowest_block = first_block;
@@ -203,9 +222,9 @@  int generic_swapfile_activate(struct swap_info_struct *sis,
 		}
 
 		/*
-		 * We found a PAGE_SIZE-length, PAGE_SIZE-aligned run of blocks
+		 * We found a @page_size-{length,aligned} run of blocks
 		 */
-		ret = add_swap_extent(sis, page_no, 1, first_block);
+		ret = check(first_block, page_no, BMAP_WALK_FULLPAGE, ctx);
 		if (ret < 0)
 			goto out;
 		nr_extents += ret;
@@ -215,20 +234,49 @@  int generic_swapfile_activate(struct swap_info_struct *sis,
 		continue;
 	}
 	ret = nr_extents;
-	*span = 1 + highest_block - lowest_block;
-	if (page_no == 0)
-		page_no = 1;	/* force Empty message */
-	sis->max = page_no;
-	sis->pages = page_no - 1;
-	sis->highest_bit = page_no - 1;
+	if (span)
+		*span = 1 + highest_block - lowest_block;
+	check(highest_block, page_no, BMAP_WALK_DONE, ctx);
 out:
 	return ret;
 bad_bmap:
-	pr_err("swapon: swapfile has holes\n");
 	ret = -EINVAL;
 	goto out;
 }
 
+static int swapfile_check(sector_t block, unsigned long page_no,
+		enum bmap_check type, void *_sis)
+{
+	struct swap_info_struct *sis = _sis;
+
+	if (type == BMAP_WALK_DONE) {
+		if (page_no == 0)
+			page_no = 1;	/* force Empty message */
+		sis->max = page_no;
+		sis->pages = page_no - 1;
+		sis->highest_bit = page_no - 1;
+		return 0;
+	}
+
+	if (type != BMAP_WALK_FULLPAGE)
+		return -EAGAIN;
+
+	return add_swap_extent(sis, page_no, 1, block);
+}
+
+int generic_swapfile_activate(struct swap_info_struct *sis,
+				struct file *swap_file,
+				sector_t *span)
+{
+	int rc = bmap_walk(swap_file, PAGE_SIZE, sis->max, span,
+			swapfile_check, sis);
+
+	if (rc < 0)
+		pr_err("swapon: swapfile has holes\n");
+
+	return rc;
+}
+
 /*
  * We may have stale swap cache pages in memory: notice
  * them here and get rid of the unnecessary final write.