diff mbox series

[2/9] cachefiles: drop direct usage of ->bmap method.

Message ID 20190808082744.31405-3-cmaiolino@redhat.com (mailing list archive)
State New, archived
Headers show
Series New ->fiemap infrastructure and ->bmap removal | expand

Commit Message

Carlos Maiolino Aug. 8, 2019, 8:27 a.m. UTC
Replace the direct usage of ->bmap method by a bmap() call.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---

 fs/cachefiles/rdwr.c | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

Comments

Christoph Hellwig Aug. 14, 2019, 11:15 a.m. UTC | #1
On Thu, Aug 08, 2019 at 10:27:37AM +0200, Carlos Maiolino wrote:
> +	block = page->index;
> +	block <<= shift;

Can't this cause overflows?

> +
> +	ret = bmap(inode, &block);
> +	ASSERT(!ret);

I think we want some real error handling here instead of just an
assert..
Carlos Maiolino Aug. 20, 2019, 11:57 a.m. UTC | #2
On Wed, Aug 14, 2019 at 01:15:35PM +0200, Christoph Hellwig wrote:
> On Thu, Aug 08, 2019 at 10:27:37AM +0200, Carlos Maiolino wrote:
> > +	block = page->index;
> > +	block <<= shift;
> 
> Can't this cause overflows?

Hmm, I honestly don't know. I did look at the code, and I couldn't really spot
anything concrete.

Maybe if the block size is much smaller than PAGE_SIZE, but I am really not
sure.

Bear in mind though, I didn't change the logic here at all. I just reused one
variable instead of juggling both (block0 and block) old variables. So, if this
really can overflow, the code is already buggy even without my patch, I'm CC'ing
dhowells just in case.


> 
> > +
> > +	ret = bmap(inode, &block);
> > +	ASSERT(!ret);
> 
> I think we want some real error handling here instead of just an
> assert..

I left this ASSERT() here, to match the current logic. By now, the only error we
can get is -EINVAL, which basically says ->bmap() method does not exist, which
is basically what does happen today with:

ASSERT(inode->i_mapping->a_ops->bmap);


But I do agree, it will be better to provide some sort of error handling here,
maybe I should do something like:

ASSERT(ret == -EINVAL)

to keep the logic exactly the same and do not blow up in the future if/when we
expand possible error values from bmap()

What you think?
David Howells Aug. 20, 2019, 12:50 p.m. UTC | #3
Carlos Maiolino <cmaiolino@redhat.com> wrote:

> > > +	block = page->index;
> > > +	block <<= shift;
> > 
> > Can't this cause overflows?
> 
> Hmm, I honestly don't know. I did look at the code, and I couldn't really spot
> anything concrete.

Maybe, though we'd have to support file sizes over 16 Exabytes for that to be
a problem.

Note that bmap() is *only* used to find out if the page is present in the
cache - and even that I'm not actually doing very well, since I really *ought*
to check every block in the page.

I really want to replace the use of bmap entirely with iov_iter doing DIO.
Cachefiles currently does double buffering because it works through the
pagecache of the underlying to do actual read or write - and this appears to
cause memory management problems.

David
David Howells Aug. 20, 2019, 1:31 p.m. UTC | #4
Christoph Hellwig <hch@lst.de> wrote:

> > +	block = page->index;
> > +	block <<= shift;
> 
> Can't this cause overflows?

No, not unless the netfs allows files >16EiB in size and as long as block
(type sector_t) is a 64-bit integer.

A 16EiB-1 (0xffffffffffffffff) file would have 4P-1 (0xfffffffffffff) pages
assuming a 4K page size.

At a block size of 1 (and a shift therefore of 12), the maximum block number
calculated would be 0xfffffffffffff000.

David
Christoph Hellwig Aug. 29, 2019, 7:13 a.m. UTC | #5
On Tue, Aug 20, 2019 at 01:50:30PM +0100, David Howells wrote:
> Carlos Maiolino <cmaiolino@redhat.com> wrote:
> 
> > > > +	block = page->index;
> > > > +	block <<= shift;
> > > 
> > > Can't this cause overflows?
> > 
> > Hmm, I honestly don't know. I did look at the code, and I couldn't really spot
> > anything concrete.
> 
> Maybe, though we'd have to support file sizes over 16 Exabytes for that to be
> a problem.

On 32-bit sysems page->index is a 32-bit value, so you'd overflow at
pretty normal sizes of a few TB.

> Note that bmap() is *only* used to find out if the page is present in the
> cache - and even that I'm not actually doing very well, since I really *ought*
> to check every block in the page.
> 
> I really want to replace the use of bmap entirely with iov_iter doing DIO.
> Cachefiles currently does double buffering because it works through the
> pagecache of the underlying to do actual read or write - and this appears to
> cause memory management problems.

Not related to this patch, but using iov_iter with dio is trivial, what
is the blocker therere?
David Howells Aug. 30, 2019, 4:17 p.m. UTC | #6
Christoph Hellwig <hch@lst.de> wrote:

> Not related to this patch, but using iov_iter with dio is trivial, what
> is the blocker therere?

The usual: time.

The change as a whole is not actually trivial since it will involve completely
overhauling the fscache data API and how the filesystems use it - and then
having cachefiles perform the DIO asynchronously as per Trond's requirements
for using fscache with NFS.

I also need to work out how I'm going to do data/hole detection.  Can I set,
say, O_NOREADHOLE and then expect the DIO to stop early with a short read?  Or
do I need to use SEEK_DATA/SEEK_HOLE in advance to define the occupied
regions?

Maybe a better way would be to take a leaf out of the book of OpenAFS and
suchlike and keep a parallel file that tracks the occupancy of a cache object
(eg. a bitmap with 1 bit per 64k block) - but that the synchronisation and
performance issues.

David
Christoph Hellwig Aug. 30, 2019, 4:59 p.m. UTC | #7
On Fri, Aug 30, 2019 at 05:17:51PM +0100, David Howells wrote:
> Christoph Hellwig <hch@lst.de> wrote:
> 
> > Not related to this patch, but using iov_iter with dio is trivial, what
> > is the blocker therere?
> 
> The usual: time.
> 
> The change as a whole is not actually trivial since it will involve completely
> overhauling the fscache data API and how the filesystems use it - and then
> having cachefiles perform the DIO asynchronously as per Trond's requirements
> for using fscache with NFS.

Well, doing in-kernel async I/O is actually rather trivial these days.
Take a look at the loop driver for an example.

> I also need to work out how I'm going to do data/hole detection.  Can I set,
> say, O_NOREADHOLE and then expect the DIO to stop early with a short read?  Or
> do I need to use SEEK_DATA/SEEK_HOLE in advance to define the occupied
> regions?

We'll you'd need to implement a IOCB_NOHOLE, but that wouldn't be all
that hard.  Having a READ_PLUS like interface that actually tells you
how large the hole is might be hard.
David Howells Aug. 31, 2019, 12:45 a.m. UTC | #8
Christoph Hellwig <hch@lst.de> wrote:

> We'll you'd need to implement a IOCB_NOHOLE, but that wouldn't be all
> that hard.  Having a READ_PLUS like interface that actually tells you
> how large the hole is might be hard.

Actually, that raises another couple of questions:

 (1) Is a filesystem allowed to join up two disjoint blocks of data with a
     block of zeros to make a single extent?  If this happens, I'll see the
     inserted block of zeros to be valid data.

 (2) Is a filesystem allowed to punch out a block of valid zero data to make a
     hole?  This would cause me to refetch the data.

David
Dave Chinner Sept. 5, 2019, 10:44 p.m. UTC | #9
On Sat, Aug 31, 2019 at 01:45:57AM +0100, David Howells wrote:
> Christoph Hellwig <hch@lst.de> wrote:
> 
> > We'll you'd need to implement a IOCB_NOHOLE, but that wouldn't be all
> > that hard.  Having a READ_PLUS like interface that actually tells you
> > how large the hole is might be hard.
> 
> Actually, that raises another couple of questions:
> 
>  (1) Is a filesystem allowed to join up two disjoint blocks of data with a
>      block of zeros to make a single extent?  If this happens, I'll see the
>      inserted block of zeros to be valid data.

Yes.

>  (2) Is a filesystem allowed to punch out a block of valid zero data to make a
>      hole?  This would cause me to refetch the data.

Yes.

Essentially, assumptions that filesystems will not change the file
layout even if the data does not change are invalid. copy-on-write,
deduplication, speculative preallocation, etc mean the file layout
can change even if the file data itself is not directly modified.

If you want to cache physical file layout information and be
notified when they may change, then that's what layout leases are
for....

Cheers,

Dave.
diff mbox series

Patch

diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index 44a3ce1e4ce4..073c14cae6aa 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -396,7 +396,7 @@  int cachefiles_read_or_alloc_page(struct fscache_retrieval *op,
 	struct cachefiles_object *object;
 	struct cachefiles_cache *cache;
 	struct inode *inode;
-	sector_t block0, block;
+	sector_t block;
 	unsigned shift;
 	int ret;
 
@@ -412,7 +412,6 @@  int cachefiles_read_or_alloc_page(struct fscache_retrieval *op,
 
 	inode = d_backing_inode(object->backer);
 	ASSERT(S_ISREG(inode->i_mode));
-	ASSERT(inode->i_mapping->a_ops->bmap);
 	ASSERT(inode->i_mapping->a_ops->readpages);
 
 	/* calculate the shift required to use bmap */
@@ -428,12 +427,14 @@  int cachefiles_read_or_alloc_page(struct fscache_retrieval *op,
 	 *   enough for this as it doesn't indicate errors, but it's all we've
 	 *   got for the moment
 	 */
-	block0 = page->index;
-	block0 <<= shift;
+	block = page->index;
+	block <<= shift;
+
+	ret = bmap(inode, &block);
+	ASSERT(!ret);
 
-	block = inode->i_mapping->a_ops->bmap(inode->i_mapping, block0);
 	_debug("%llx -> %llx",
-	       (unsigned long long) block0,
+	       (unsigned long long) (page->index << shift),
 	       (unsigned long long) block);
 
 	if (block) {
@@ -711,7 +712,6 @@  int cachefiles_read_or_alloc_pages(struct fscache_retrieval *op,
 
 	inode = d_backing_inode(object->backer);
 	ASSERT(S_ISREG(inode->i_mode));
-	ASSERT(inode->i_mapping->a_ops->bmap);
 	ASSERT(inode->i_mapping->a_ops->readpages);
 
 	/* calculate the shift required to use bmap */
@@ -728,7 +728,7 @@  int cachefiles_read_or_alloc_pages(struct fscache_retrieval *op,
 
 	ret = space ? -ENODATA : -ENOBUFS;
 	list_for_each_entry_safe(page, _n, pages, lru) {
-		sector_t block0, block;
+		sector_t block;
 
 		/* we assume the absence or presence of the first block is a
 		 * good enough indication for the page as a whole
@@ -736,13 +736,14 @@  int cachefiles_read_or_alloc_pages(struct fscache_retrieval *op,
 		 *   good enough for this as it doesn't indicate errors, but
 		 *   it's all we've got for the moment
 		 */
-		block0 = page->index;
-		block0 <<= shift;
+		block = page->index;
+		block <<= shift;
+
+		ret = bmap(inode, &block);
+		ASSERT(!ret);
 
-		block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
-						      block0);
 		_debug("%llx -> %llx",
-		       (unsigned long long) block0,
+		       (unsigned long long) (page->index << shift),
 		       (unsigned long long) block);
 
 		if (block) {