diff mbox

[PATCH/RFC] fscache/cachefiles versus btrfs

Message ID 20150409174916.5a2efef5@notabene.brown (mailing list archive)
State New, archived
Headers show

Commit Message

NeilBrown April 9, 2015, 7:49 a.m. UTC
hi,
 fscache cannot currently be used with btrfs as the backing store for the
 cache (managed by cachefilesd).
 This is because cachefiles needs the ->bmap address_space_operation, and
 btrfs doesn't provide it.

 cachefiles only uses this to find out if a particular page is a 'hole' or
 not.  For btrfs, this can be done with 'SEEK_DATA'.

 Unfortunately it doesn't seem to be possible to query a filesystem or a file
 to see if SEEK_DATA is reliable or not, so we cannot simply use SEEK_DATA
 when reliable, else ->bmap if available.

 The following patch make fscache work for me on btrfs.  It explicitly checks
 for BTRFS_SUPER_MAGIC.  Not really a nice solution, but all I could think of.

 Is there a better way?  Could a better way be created?  Maybe
 SEEK_DATA_RELIABLE ??

 Comments, suggestions welcome.


Also, if you do try to use fscache on btrfs with 3.19, then nothing gets
cached (as expected) and with a heavy load you can lose a race and get an
asserting fail in fscache_enqueue_operation

	ASSERT(fscache_object_is_available(op->object));

It looks like the object is being killed before it is available...

[  859.700765] kernel BUG at ../fs/fscache/operation.c:38!
...
[  859.703124] Call Trace:
[  859.703193]  [<ffffffffa0448044>] fscache_run_op.isra.4+0x34/0x80 [fscache]
[  859.703260]  [<ffffffffa0448160>] fscache_start_operations+0xa0/0xf0 [fscache]
[  859.703388]  [<ffffffffa0446cd8>] fscache_kill_object+0x98/0xc0 [fscache]
[  859.703455]  [<ffffffffa04475c1>] fscache_object_work_func+0x151/0x210 [fscache]
[  859.703578]  [<ffffffff81078b07>] process_one_work+0x147/0x3c0
[  859.703642]  [<ffffffff8107929c>] worker_thread+0x20c/0x470

I haven't figured out the cause of that yet.


Thanks,
NeilBrown

Comments

David Howells April 9, 2015, 9:23 a.m. UTC | #1
NeilBrown <neilb@suse.de> wrote:

>  Is there a better way?  Could a better way be created?  Maybe
>  SEEK_DATA_RELIABLE ??

fiemap() maybe?

> Also, if you do try to use fscache on btrfs with 3.19, then nothing gets
> cached (as expected) and with a heavy load you can lose a race and get an
> asserting fail in fscache_enqueue_operation

Do you have the patches here applied?

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-fixes

David
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner April 9, 2015, 11:52 p.m. UTC | #2
On Thu, Apr 09, 2015 at 10:23:08AM +0100, David Howells wrote:
> NeilBrown <neilb@suse.de> wrote:
> 
> >  Is there a better way?  Could a better way be created?  Maybe
> >  SEEK_DATA_RELIABLE ??
> 
> fiemap() maybe?

fiemap is not reliable for mapping holes - it returns extent info,
not whether there is data in a range. i.e. there can be data over a
hole (e.g. delayed allocation) and fiemap will return it as a hole.
cp made this mistake back when fiemap was first introduced,
resulting in corrupt file copies.

SEEK_HOLE/SEEK_DATA is what you want, as they are page cache
coherent, not extent based operations. And, really if you need it to
really be able to find real holes, then a superblock flag might be a
better way of marking filesystems with the required capability.

Cheers,

Dave.
NeilBrown April 10, 2015, 12:42 a.m. UTC | #3
On Fri, 10 Apr 2015 09:52:34 +1000 Dave Chinner <david@fromorbit.com> wrote:

> On Thu, Apr 09, 2015 at 10:23:08AM +0100, David Howells wrote:
> > NeilBrown <neilb@suse.de> wrote:
> > 
> > >  Is there a better way?  Could a better way be created?  Maybe
> > >  SEEK_DATA_RELIABLE ??
> > 
> > fiemap() maybe?
> 
> fiemap is not reliable for mapping holes - it returns extent info,
> not whether there is data in a range. i.e. there can be data over a
> hole (e.g. delayed allocation) and fiemap will return it as a hole.
> cp made this mistake back when fiemap was first introduced,
> resulting in corrupt file copies.

I assumed from the doco that FIEMAP_EXTENT_DELALLOC would get returned in
this case.  I guess not?

> 
> SEEK_HOLE/SEEK_DATA is what you want, as they are page cache
> coherent, not extent based operations. And, really if you need it to
> really be able to find real holes, then a superblock flag might be a
> better way of marking filesystems with the required capability.

btrfs seems to use the same underlying functionality for both fiemap and
SEEK_HOLE/SEEK_DATA...

But yes: a new fs_flags flag is probably the go.

Thanks,
NeilBrown

> 
> Cheers,
> 
> Dave.
Dave Chinner April 10, 2015, 1:08 a.m. UTC | #4
On Fri, Apr 10, 2015 at 10:42:21AM +1000, NeilBrown wrote:
> On Fri, 10 Apr 2015 09:52:34 +1000 Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Thu, Apr 09, 2015 at 10:23:08AM +0100, David Howells wrote:
> > > NeilBrown <neilb@suse.de> wrote:
> > > 
> > > >  Is there a better way?  Could a better way be created?  Maybe
> > > >  SEEK_DATA_RELIABLE ??
> > > 
> > > fiemap() maybe?
> > 
> > fiemap is not reliable for mapping holes - it returns extent info,
> > not whether there is data in a range. i.e. there can be data over a
> > hole (e.g. delayed allocation) and fiemap will return it as a hole.
> > cp made this mistake back when fiemap was first introduced,
> > resulting in corrupt file copies.
> 
> I assumed from the doco that FIEMAP_EXTENT_DELALLOC would get returned in
> this case.  I guess not?

FIEMAP is advisory and implementing features in it are optional.
It's also not guaranteed to be an accurate representation of the
layout of the file because it can change even before hte FIEMAP
ioctl returns to userspace. it's also a representation of the
physical layout of the data in the file, but it does not need to be
coherent with the current data in the file.  i.e. FIEMAP is not
required to report the current state of data in the files to the
caller, just the working physical layout sitting in memory...

> > SEEK_HOLE/SEEK_DATA is what you want, as they are page cache
> > coherent, not extent based operations. And, really if you need it to
> > really be able to find real holes, then a superblock flag might be a
> > better way of marking filesystems with the required capability.
> 
> btrfs seems to use the same underlying functionality for both fiemap and
> SEEK_HOLE/SEEK_DATA...

That's just an internal implementation detail - it does not
define the behaviour and constraints of the information
FIEMAP or SEEK_HOLE/SEEK_DATA APIs are supposed to provide.

Cheers,

Dave.
NeilBrown April 10, 2015, 1:24 a.m. UTC | #5
On Thu, 09 Apr 2015 10:23:08 +0100 David Howells <dhowells@redhat.com> wrote:

> NeilBrown <neilb@suse.de> wrote:
> 
> >  Is there a better way?  Could a better way be created?  Maybe
> >  SEEK_DATA_RELIABLE ??
> 
> fiemap() maybe?
> 
> > Also, if you do try to use fscache on btrfs with 3.19, then nothing gets
> > cached (as expected) and with a heavy load you can lose a race and get an
> > asserting fail in fscache_enqueue_operation
> 
> Do you have the patches here applied?
> 
> 	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-fixes
> 

Do I don't.  I had looked through them and while they did seem to be
addressing similar sorts of races, nothing seems like an obvious match.
I haven't been able to reproduce the BUG_ON myself.  I only have a report
of it repeatedly affecting someone else:

  https://bugzilla.opensuse.org/show_bug.cgi?id=908706

I'll probably have to be happy with fixing usage on btrfs, and hope the other
bug is fixed already or doesn't become a problem.

Thanks,
NeilBrown
David Howells April 10, 2015, 1:28 p.m. UTC | #6
Dave Chinner <david@fromorbit.com> wrote:

> SEEK_HOLE/SEEK_DATA is what you want, as they are page cache
> coherent, not extent based operations. And, really if you need it to
> really be able to find real holes, then a superblock flag might be a
> better way of marking filesystems with the required capability.

Actually, I wonder if what I want is a kernel_read() that returns ENODATA upon
encountering a hole at the beginning of the area to be read.

Of course, what I really, really want is asynchronous, direct read and write
within the kernel, where the read will notify you of any holes, but will read
all the data it can around those holes.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig April 13, 2015, 5:30 p.m. UTC | #7
On Fri, Apr 10, 2015 at 02:28:16PM +0100, David Howells wrote:
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > SEEK_HOLE/SEEK_DATA is what you want, as they are page cache
> > coherent, not extent based operations. And, really if you need it to
> > really be able to find real holes, then a superblock flag might be a
> > better way of marking filesystems with the required capability.
> 
> Actually, I wonder if what I want is a kernel_read() that returns ENODATA upon
> encountering a hole at the beginning of the area to be read.

NFS READ_PLUS could also make use of this, but someone needs to actually
implement it.

Until we have that lseek SEEK_HOLE/DATA is the way to go, and the
horrible ->bmap hack needs to die ASAP, I can't believe you managed to
sneak that in in the not too distant past.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
NeilBrown April 20, 2015, 4:49 a.m. UTC | #8
On Fri, 10 Apr 2015 11:24:31 +1000 NeilBrown <neilb@suse.de> wrote:

> On Thu, 09 Apr 2015 10:23:08 +0100 David Howells <dhowells@redhat.com> wrote:
> 
> > NeilBrown <neilb@suse.de> wrote:
> > 
> > >  Is there a better way?  Could a better way be created?  Maybe
> > >  SEEK_DATA_RELIABLE ??
> > 
> > fiemap() maybe?
> > 
> > > Also, if you do try to use fscache on btrfs with 3.19, then nothing gets
> > > cached (as expected) and with a heavy load you can lose a race and get an
> > > asserting fail in fscache_enqueue_operation
> > 
> > Do you have the patches here applied?
> > 
> > 	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-fixes
> > 
> 
> Do I don't.  I had looked through them and while they did seem to be
> addressing similar sorts of races, nothing seems like an obvious match.
> I haven't been able to reproduce the BUG_ON myself.  I only have a report
> of it repeatedly affecting someone else:
> 
>   https://bugzilla.opensuse.org/show_bug.cgi?id=908706
> 
> I'll probably have to be happy with fixing usage on btrfs, and hope the other
> bug is fixed already or doesn't become a problem.

I managed to reproduce the bug, and when I applied your patches I cannot any
more.  So it looks like you've fixed it - thanks.

That just leave the bmap issue.  I'll post a patch which causes lseek to be
used when the fs says that is OK.

Thanks,
NeilBrown
David Howells April 20, 2015, 9:27 a.m. UTC | #9
NeilBrown <neilb@suse.de> wrote:

> I managed to reproduce the bug, and when I applied your patches I cannot any
> more.  So it looks like you've fixed it - thanks.

I hope so too.  Now I just hope Linus takes the patches.

> That just leave the bmap issue.  I'll post a patch which causes lseek to be
> used when the fs says that is OK.

Okay, thanks.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index 1e51714eb33e..1389d8483d5d 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -20,6 +20,7 @@ 
 #include <linux/namei.h>
 #include <linux/security.h>
 #include <linux/slab.h>
+#include <linux/magic.h>
 #include "internal.h"
 
 #define CACHEFILES_KEYBUF_SIZE 512
@@ -647,7 +648,8 @@  lookup_again:
 
 			ret = -EPERM;
 			aops = object->dentry->d_inode->i_mapping->a_ops;
-			if (!aops->bmap)
+			if (!aops->bmap &&
+			    object->dentry->d_sb->s_magic != BTRFS_SUPER_MAGIC)
 				goto check_error;
 
 			object->backer = object->dentry;
diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index c6cd8d7a4eef..49fb330c0ab8 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -410,11 +410,11 @@  int cachefiles_read_or_alloc_page(struct fscache_retrieval *op,
 
 	inode = object->backer->d_inode;
 	ASSERT(S_ISREG(inode->i_mode));
-	ASSERT(inode->i_mapping->a_ops->bmap);
 	ASSERT(inode->i_mapping->a_ops->readpages);
 
 	/* calculate the shift required to use bmap */
-	if (inode->i_sb->s_blocksize > PAGE_SIZE)
+	if (inode->i_mapping->a_ops->bmap &&
+	    inode->i_sb->s_blocksize > PAGE_SIZE)
 		goto enobufs;
 
 	shift = PAGE_SHIFT - inode->i_sb->s_blocksize_bits;
@@ -423,20 +423,36 @@  int cachefiles_read_or_alloc_page(struct fscache_retrieval *op,
 	op->op.flags |= FSCACHE_OP_ASYNC;
 	op->op.processor = cachefiles_read_copier;
 
-	/* we assume the absence or presence of the first block is a good
-	 * enough indication for the page as a whole
-	 * - TODO: don't use bmap() for this as it is _not_ actually good
-	 *   enough for this as it doesn't indicate errors, but it's all we've
-	 *   got for the moment
-	 */
-	block0 = page->index;
-	block0 <<= shift;
-
-	block = inode->i_mapping->a_ops->bmap(inode->i_mapping, block0);
-	_debug("%llx -> %llx",
-	       (unsigned long long) block0,
-	       (unsigned long long) block);
+	if (inode->i_mapping->a_ops->bmap) {
+		/* we assume the absence or presence of the first block is a good
+		 * enough indication for the page as a whole
+		 * - TODO: don't use bmap() for this as it is _not_ actually good
+		 *   enough for this as it doesn't indicate errors, but it's all we've
+		 *   got for the moment
+		 */
+		block0 = page->index;
+		block0 <<= shift;
 
+		block = inode->i_mapping->a_ops->bmap(inode->i_mapping, block0);
+		_debug("%llx -> %llx",
+		       (unsigned long long) block0,
+		       (unsigned long long) block);
+	} else {
+		/* Use llseek */
+		struct path path;
+		struct file *file;
+		path.mnt = cache->mnt;
+		path.dentry = object->backer;
+		file = dentry_open(&path, O_RDONLY, cache->cache_cred);
+		if (IS_ERR(file))
+			goto enobufs;
+		block = vfs_llseek(file, page->index << PAGE_SHIFT, SEEK_DATA);
+		filp_close(file, NULL);
+		if (block != page->index << PAGE_SHIFT)
+			block = 0;
+		else
+			block = 1;
+	}
 	if (block) {
 		/* submit the apparently valid page to the backing fs to be
 		 * read from disk */
@@ -707,11 +723,11 @@  int cachefiles_read_or_alloc_pages(struct fscache_retrieval *op,
 
 	inode = object->backer->d_inode;
 	ASSERT(S_ISREG(inode->i_mode));
-	ASSERT(inode->i_mapping->a_ops->bmap);
 	ASSERT(inode->i_mapping->a_ops->readpages);
 
 	/* calculate the shift required to use bmap */
-	if (inode->i_sb->s_blocksize > PAGE_SIZE)
+	if (inode->i_mapping->a_ops->bmap &&
+	    inode->i_sb->s_blocksize > PAGE_SIZE)
 		goto all_enobufs;
 
 	shift = PAGE_SHIFT - inode->i_sb->s_blocksize_bits;
@@ -729,21 +745,38 @@  int cachefiles_read_or_alloc_pages(struct fscache_retrieval *op,
 	list_for_each_entry_safe(page, _n, pages, lru) {
 		sector_t block0, block;
 
-		/* we assume the absence or presence of the first block is a
-		 * good enough indication for the page as a whole
-		 * - TODO: don't use bmap() for this as it is _not_ actually
-		 *   good enough for this as it doesn't indicate errors, but
-		 *   it's all we've got for the moment
-		 */
-		block0 = page->index;
-		block0 <<= shift;
-
-		block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
-						      block0);
-		_debug("%llx -> %llx",
-		       (unsigned long long) block0,
-		       (unsigned long long) block);
+		if (inode->i_mapping->a_ops->bmap) {
+			/* we assume the absence or presence of the first block is a
+			 * good enough indication for the page as a whole
+			 * - TODO: don't use bmap() for this as it is _not_ actually
+			 *   good enough for this as it doesn't indicate errors, but
+			 *   it's all we've got for the moment
+			 */
+			block0 = page->index;
+			block0 <<= shift;
+
+			block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
+							      block0);
+			_debug("%llx -> %llx",
+			       (unsigned long long) block0,
+			       (unsigned long long) block);
 
+		} else {
+			/* Use llseek */
+			struct path path;
+			struct file *file;
+			path.mnt = cache->mnt;
+			path.dentry = object->backer;
+			file = dentry_open(&path, O_RDONLY, cache->cache_cred);
+			if (IS_ERR(file))
+				goto all_enobufs;
+			block = vfs_llseek(file, page->index << PAGE_SHIFT, SEEK_DATA);
+			filp_close(file, NULL);
+			if (block != page->index << PAGE_SHIFT)
+				block = 0;
+			else
+				block = 1;
+		}
 		if (block) {
 			/* we have data - add it to the list to give to the
 			 * backing fs */