
fallocate INSERT_RANGE/COLLAPSE_RANGE is completely broken [PATCH]

Message ID 20160329042546.GA10581@kmo-pixel (mailing list archive)
State New, archived

Commit Message

Kent Overstreet March 29, 2016, 4:25 a.m. UTC
Bit of previous discussion:
http://thread.gmane.org/gmane.linux.file-systems/101201/

The underlying issue is that we have no mechanism for invalidating a range of
the pagecache and then _keeping it invalidated_ while we Do Stuff. 

The fallocate INSERT_RANGE/COLLAPSE_RANGE situation seems likely to be worse
than I initially thought. I've been digging into this in the course of bcachefs
testing - I was hitting assertions that meant state hanging off the page cache
(in this case, allocation information, i.e. whether we needed to reserve space
on write) was inconsistent with the btree in writepages().

Well, bcachefs isn't the only filesystem that hangs additional state off the
pagecache, and the situation today is that an unprivileged user can cause
inconsistencies there by just doing buffered reads concurrently with
INSERT_RANGE/COLLAPSE_RANGE. I highly highly doubt this is an issue of just
"oops, you corrupted your file because you were doing stupid stuff" - who knows
what internal invariants are getting broken here, and I don't particularly care
to find out.

I'm pretty certain that ext4 and xfs are both broken here - I haven't dug into
the btrfs code. The impression I get from reading the code is that whoever wrote
it was cribbing off of truncate - except that truncate relies on the i_size
change to work; without that there's no way this code can possibly work.

That's the bad news. Good news is, I think it's fixable.

My solution is just to add a lock (per address_space) around adding pages to the
pagecache - it's got some funny semantics so as to be usable for dio writes, but
it's just a lock. (It's got two states like an rwlock, except that both states
are shared and only exclusive with the other state; dio writes don't block
other dio writes, and buffered IO adding pages to the pagecache doesn't block
other pages being added).

The only tricky bit is the recursion Dave noted in the dio write path -
get_user_pages() -> fault(). My solution is to just keep track of what address
space we have locked (blocking page adds) in task_struct, then A) don't deadlock
in __add_to_page_cache_locked(), and B) don't do readahead in filemap_fault()
(which means we can delete another hack from the dio write code).

I don't particularly care for where the add side of the lock is taken - it
works, but it means we can't distinguish accidental recursion from intentional.
I'm open to ideas there. Also, if we made better use of the add side of the lock
we could probably delete a lot of fragile code for synchronization with truncate.

Patch below is reliably passing xfstests for me (where previously I was hitting
the aforementioned pagecache inconsistency assertions). I've only hooked up
bcachefs so far, other filesystems should be pretty trivial (but it does need
individual filesystems to be converted).

One issue I haven't dealt with is that dio writes can starve buffered IO/mmapped
IO with this lock - not sure if that's worth dealing with.

Patch below includes bcachefs changes, but is otherwise on 4.5. Available in my
usual git repo:

https://evilpiepirate.org/git/linux-bcache.git/log/?h=bcache-dev-pagecache-lock

-- >8 --
Subject: [PATCH] mm: pagecache add lock

Add a per address space lock around adding pages to the pagecache - making it
possible for fallocate INSERT_RANGE/COLLAPSE_RANGE to work correctly, and also
hopefully making truncate and dio a bit saner.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 drivers/md/bcache/fs-io.c | 197 +++++++++++++++++++++++++---------------------
 fs/inode.c                |   1 +
 include/linux/fs.h        |  23 ++++++
 include/linux/init_task.h |   1 +
 include/linux/sched.h     |   4 +
 mm/filemap.c              |  85 ++++++++++++++++++--
 6 files changed, 217 insertions(+), 94 deletions(-)

Comments

Dave Chinner March 29, 2016, 5:15 a.m. UTC | #1
On Mon, Mar 28, 2016 at 08:25:46PM -0800, Kent Overstreet wrote:
> Bit of previous discussion:
> http://thread.gmane.org/gmane.linux.file-systems/101201/
> 
> The underlying issue is that we have no mechanism for invalidating a range of
> the pagecache and then _keeping it invalidated_ while we Do Stuff. 
> 
> The fallocate INSERT_RANGE/COLLAPSE_RANGE situation seems likely to be worse
> than I initially thought. I've been digging into this in the course of bcachefs
> testing - I was hitting assertions that meant state hanging off the page cache
> (in this case, allocation information, i.e. whether we needed to reserve space
> on write) was inconsistent with the btree in writepages().
> 
> Well, bcachefs isn't the only filesystem that hangs additional state off the
> pagecache, and the situation today is that an unprivileged user can cause
> inconsistencies there by just doing buffered reads concurrently with
> INSERT_RANGE/COLLAPSE_RANGE. I highly highly doubt this is an issue of just
> "oops, you corrupted your file because you were doing stupid stuff" - who knows
> what internal invariants are getting broken here, and I don't particularly care
> to find out.

I'd like to see a test case for this. Concurrent IO and/or page
faults should not run at the same time as fallocate on XFS. Hence I'd
like to see the test cases that demonstrate buffered reads are
causing corruption during insert/collapse range operations. We use
the same locking strategy for fallocate as we use for truncate and
all the other internal extent manipulation operations, so if there's
something wrong, we need to fix it.

Cheers,

Dave.
Kent Overstreet March 29, 2016, 6:04 a.m. UTC | #2
On Tue, Mar 29, 2016 at 04:15:58PM +1100, Dave Chinner wrote:
> On Mon, Mar 28, 2016 at 08:25:46PM -0800, Kent Overstreet wrote:
> > Bit of previous discussion:
> > http://thread.gmane.org/gmane.linux.file-systems/101201/
> > 
> > The underlying issue is that we have no mechanism for invalidating a range of
> > the pagecache and then _keeping it invalidated_ while we Do Stuff. 
> > 
> > The fallocate INSERT_RANGE/COLLAPSE_RANGE situation seems likely to be worse
> > than I initially thought. I've been digging into this in the course of bcachefs
> > testing - I was hitting assertions that meant state hanging off the page cache
> > (in this case, allocation information, i.e. whether we needed to reserve space
> > on write) was inconsistent with the btree in writepages().
> > 
> > Well, bcachefs isn't the only filesystem that hangs additional state off the
> > pagecache, and the situation today is that an unprivileged user can cause
> > inconsistencies there by just doing buffered reads concurrently with
> > INSERT_RANGE/COLLAPSE_RANGE. I highly highly doubt this is an issue of just
> > "oops, you corrupted your file because you were doing stupid stuff" - who knows
> > what internal invariants are getting broken here, and I don't particularly care
> > to find out.
> 
> I'd like to see a test case for this. Concurrent IO and/or page
> faults should not run at the same time as fallocate on XFS. Hence I'd
> like to see the test cases that demonstrate buffered reads are
> causing corruption during insert/collapse range operations. We use
> the same locking strategy for fallocate as we use for truncate and
> all the other internal extent manipulation operations, so if there's
> something wrong, we need to fix it.

It's entirely possible I'm wrong about XFS - your fault path locking looked
correct, and I did see you had extra locking in your buffered read path but I
thought it was a different lock. I'll recheck later, but for the moment I'm just
going to assume I misspoke (and tbh I've always found xfs's locking to be quite
rigorous).

ext4 uses the generic code in all the places you're hooking into though -
.fault, .read_iter, etc.

The scheme I've got in this patch should perform quite a bit better than what
you're doing - only locking in the slow cache miss path, vs. every time you
touch the page cache.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner March 29, 2016, 10:37 p.m. UTC | #3
On Mon, Mar 28, 2016 at 10:04:10PM -0800, Kent Overstreet wrote:
> On Tue, Mar 29, 2016 at 04:15:58PM +1100, Dave Chinner wrote:
> > On Mon, Mar 28, 2016 at 08:25:46PM -0800, Kent Overstreet wrote:
> > > Bit of previous discussion:
> > > http://thread.gmane.org/gmane.linux.file-systems/101201/
> > > 
> > > The underlying issue is that we have no mechanism for invalidating a range of
> > > the pagecache and then _keeping it invalidated_ while we Do Stuff. 
> > > 
> > > The fallocate INSERT_RANGE/COLLAPSE_RANGE situation seems likely to be worse
> > > than I initially thought. I've been digging into this in the course of bcachefs
> > > testing - I was hitting assertions that meant state hanging off the page cache
> > > (in this case, allocation information, i.e. whether we needed to reserve space
> > > on write) was inconsistent with the btree in writepages().
> > > 
> > > Well, bcachefs isn't the only filesystem that hangs additional state off the
> > > pagecache, and the situation today is that an unprivileged user can cause
> > > inconsistencies there by just doing buffered reads concurrently with
> > > INSERT_RANGE/COLLAPSE_RANGE. I highly highly doubt this is an issue of just
> > > "oops, you corrupted your file because you were doing stupid stuff" - who knows
> > > what internal invariants are getting broken here, and I don't particularly care
> > > to find out.
> > 
> > I'd like to see a test case for this. Concurrent IO and/or page
> > faults should not run at the same time as fallocate on XFS. Hence I'd
> > like to see the test cases that demonstrate buffered reads are
> > causing corruption during insert/collapse range operations. We use
> > the same locking strategy for fallocate as we use for truncate and
> > all the other internal extent manipulation operations, so if there's
> > something wrong, we need to fix it.
> 
> It's entirely possible I'm wrong about XFS - your fault path locking looked
> correct, and I did see you had extra locking in your buffered read path but I
> thought it was a different lock. I'll recheck later, but for the moment I'm just
> going to assume I misspoke (and tbh I've always found xfs's locking to be quite
> rigorous).

There are two locks: the XFS_IOLOCK for read/write/splice IO path vs
truncate/fallocate exclusion, and XFS_MMAPLOCK for page fault vs
truncate/fallocate exclusion.

> ext4 uses the generic code in all the places you're hooking into though -
> .fault, .read_iter, etc.
> 
> The scheme I've got in this patch should perform quite a bit better than what
> you're doing - only locking in the slow cache miss path, vs. every time you
> touch the page cache.

I'm not sure I follow - how does this work when a fallocate
operation uses the page cache for, say, zeroing data blocks rather
than invalidating them (e.g.  FALLOC_FL_ZERO_RANGE can validly zero
blocks through the page cache, so can hole punching)?  Won't the
buffered read then return a mix of real and zeroed data, depending on
who wins the race to each underlying page lock?

i.e. if the locking only occurs in the page insert slow path, then
it doesn't provide sufficient exclusion for extent manipulation
operations that use the page cache during their normal operation.
IOWs, other, higher level synchronisation methods for fallocate
are still necessary....

Cheers,

Dave.
Kent Overstreet March 29, 2016, 11:46 p.m. UTC | #4
On Wed, Mar 30, 2016 at 09:37:43AM +1100, Dave Chinner wrote:
> There are two locks: the XFS_IOLOCK for read/write/splice IO path vs
> truncate/fallocate exclusion, and XFS_MMAPLOCK for page fault vs
> truncate/fallocate exclusion.

Ok, that makes sense then.

> > ext4 uses the generic code in all the places you're hooking into though -
> > .fault, .read_iter, etc.
> > 
> > The scheme I've got in this patch should perform quite a bit better than what
> > you're doing - only locking in the slow cache miss path, vs. every time you
> > touch the page cache.
> 
> I'm not sure I follow - how does this work when a fallocate
> operation uses the page cache for, say, zeroing data blocks rather
> than invalidating them (e.g.  FALLOC_FL_ZERO_RANGE can validly zero
> blocks through the page cache, so can hole punching)?  Won't the
> buffered read then return a mix of real and zeroed data, depending
> who wins the race to each underlying page lock?

Yes, but does that matter? I don't know of any requirement that the entire
fallocate be atomic, and writes > PAGE_SIZE aren't generally atomic wrt reads
either (but looking at the code it looks like XFS is special there too...)

> i.e. if the locking only occurs in the page insert slow path, then
> it doesn't provide sufficient exclusion for extent manipulation
> operations that use the page cache during their normal operation.
> IOWs, other, higher level synchronisation methods for fallocate
> are still necessary....

Well, I would argue it _is_ enough for internal consistency; if you want to
provide stronger atomicity guarantees to userspace (not that there's anything
wrong with that), I'd argue that's really a different issue.

Also the situation seems to be that XFS provides stronger guarantees than any
other Linux filesystem, which is sucky from an application point of view because
applications don't want to have to depend on running on XFS for correctness. It seems
like it'd actually be pretty easy to lift your locking scheme to the generic VFS
code, and then possibly even have it be controlled with a per inode flag so that
applications that don't need the locking aren't paying for it.
Dave Chinner March 30, 2016, 11:17 p.m. UTC | #5
On Tue, Mar 29, 2016 at 03:46:45PM -0800, Kent Overstreet wrote:
> On Wed, Mar 30, 2016 at 09:37:43AM +1100, Dave Chinner wrote:
> > There are two locks: the XFS_IOLOCK for read/write/splice IO path vs
> > truncate/fallocate exclusion, and XFS_MMAPLOCK for page fault vs
> > truncate/fallocate exclusion.
> 
> Ok, that makes sense then.
> 
> > > ext4 uses the generic code in all the places you're hooking into though -
> > > .fault, .read_iter, etc.
> > > 
> > > The scheme I've got in this patch should perform quite a bit better than what
> > > you're doing - only locking in the slow cache miss path, vs. every time you
> > > touch the page cache.
> > 
> > I'm not sure I follow - how does this work when a fallocate
> > operation uses the page cache for, say, zeroing data blocks rather
> > than invalidating them (e.g.  FALLOC_FL_ZERO_RANGE can validly zero
> > blocks through the page cache, so can hole punching)?  Won't the
> > buffered read then return a mix of real and zeroed data, depending
> > who wins the race to each underlying page lock?
> 
> Yes, but does that matter?

IMO, yes. Do we really want a read returning partial data that might
be zeros or even data from different offsets in the file as a shift
is in progress?

> I don't know of any requirement that the entire
> fallocate be atomic, and writes > PAGE_SIZE aren't generally atomic wrt reads
> either (but looking at the code it looks like XFS is special there too...)

POSIX says writes should be atomic w.r.t. other data accesses.
i.e. that a concurrent overlapping read should not see data that has
been partially written. And, yes, XFS is the only Linux filesystem
that actually complies with that requirement - page-based locking
does not provide this.

Even though there is no standard to define behaviour, fallocate
should behave the same way with respect to user data accesses. Reads
that return partial data from fallocate are just as bad as reads
that return partial data from write. Imagine if we treated truncate
like that...

And therein lies the issue - much of what fallocate does is
equivalent to truncate - it can extend files, it can remove data
from files, it can move data around in files. This stuff needs to be
atomic w.r.t. other user data accesses precisely because of the way
they manipulate user data in the file.

And there are other implementation based reasons, too. For example:
we cannot modify the underlying storage mappings to data while there
is IO in progress over that range as the moment we free the block,
it could be reallocated to something else. Hence if there is IO in
flight while we are freeing blocks we have a potential
use-after-free race condition. Truncate has this exact same issue,
and it really has nothing to do with page cache level interactions.

IOWs, when we directly manipulate the block mapping of an inode
outside of the IO path, we have to a) stop data IO submission and
page faults, b) wait for existing IO to drain, c) invalidate cached
data over the range to be modified, d) modify the block mapping
of the inode, e) commit the change, and finally f) allow IO to
restart.
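A rough sketch of that sequence in kernel-style pseudocode (hedged:
inode_iolock/inode_mmaplock stand in for whatever per-filesystem locks
play these roles - XFS_IOLOCK/XFS_MMAPLOCK on XFS - and
modify_block_mapping() is a placeholder for the fs-specific extent
manipulation, not a real helper):

```c
/* pseudocode for steps a)-f) above; not code from the patch */
down_write(&inode_iolock);	/* a) stop new data IO submission */
down_write(&inode_mmaplock);	/* a) ... and new page faults */

inode_dio_wait(inode);		/* b) wait for in-flight direct IO */
filemap_write_and_wait_range(mapping, start, end); /* b) flush dirty pages */

truncate_pagecache_range(inode, start, end);	/* c) invalidate cached data */

modify_block_mapping(inode, start, end);	/* d) + e) modify the block
						 * mapping and commit it in
						 * a transaction */

up_write(&inode_mmaplock);	/* f) allow faults and IO to restart */
up_write(&inode_iolock);
```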

AFAICT, your patches do not completely address a) because they only
block page cache insertion (e.g. won't stop direct IO submission)
and they don't do b) at all. Hence it doesn't create an IO barrier
and so I don't see how it can replace the existing IO vs block
map synchronisation mechanisms we already have.

Yes, it will allow us to ensure mmap/direct IO coherency because we
can block page faults while DIO is in progress, but AFAICT it
doesn't prevent us from submitting more DIO on the inode at the same
time...

> > i.e. if the locking only occurs in the page insert slow path, then
> > it doesn't provide sufficient exclusion for extent manipulation
> > operations that use the page cache during their normal operation.
> > IOWs, other, higher level synchronisation methods for fallocate
> > are still necessary....
> 
> Well, I would argue it _is_ enough for internal consistency, if you want to
> provide stronger atomicity guarantees to userspace (not that there's anything
> wrong with that) I think I'd argue that really that's a different issue.
> 
> Also the situation seems to be that XFS provides stronger guarantees than any
> other Linux filesystem, which is sucky from an application point of view because
> they don't want to have to depend on running on XFS for correctness.

Sure, but that's the reality that people have lived with for a long
time, especially if they have an app that needs direct IO
concurrency (e.g. databases) or large amounts of data to be
stored...

> It seems
> like it'd actually be pretty easy to lift your locking scheme to the generic VFS
> code, and then possibly even have it be controlled with a per inode flag so that
> applications that don't need the locking aren't paying for it.

Slowly, slowly starting to happen:

commit 5955102c9984fa081b2d570cfac75c97eecf8f3b
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Fri Jan 22 15:40:57 2016 -0500

    wrappers for ->i_mutex access
    
    parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).
    
    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Cheers,

Dave.

Patch

diff --git a/fs/inode.c b/fs/inode.c
index 69b8b526c1..617c61e070 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -345,6 +345,7 @@  void address_space_init_once(struct address_space *mapping)
 {
 	memset(mapping, 0, sizeof(*mapping));
 	INIT_RADIX_TREE(&mapping->page_tree, GFP_ATOMIC);
+	pagecache_lock_init(&mapping->add_lock);
 	spin_lock_init(&mapping->tree_lock);
 	init_rwsem(&mapping->i_mmap_rwsem);
 	INIT_LIST_HEAD(&mapping->private_list);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ae68100210..a806c09c21 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -424,9 +424,32 @@  int pagecache_write_end(struct file *, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned copied,
 				struct page *page, void *fsdata);
 
+/*
+ * Two-state lock - can be taken for add or block - both states are shared,
+ * like read side of rwsem, but conflict with other state:
+ */
+struct pagecache_lock {
+	atomic_long_t		v;
+	wait_queue_head_t	wait;
+};
+
+static inline void pagecache_lock_init(struct pagecache_lock *lock)
+{
+	atomic_long_set(&lock->v, 0);
+	init_waitqueue_head(&lock->wait);
+}
+
+void pagecache_add_put(struct pagecache_lock *);
+void pagecache_add_get(struct pagecache_lock *);
+void __pagecache_block_put(struct pagecache_lock *);
+void __pagecache_block_get(struct pagecache_lock *);
+void pagecache_block_put(struct pagecache_lock *);
+void pagecache_block_get(struct pagecache_lock *);
+
 struct address_space {
 	struct inode		*host;		/* owner: inode, block_device */
 	struct radix_tree_root	page_tree;	/* radix tree of all pages */
+	struct pagecache_lock	add_lock;	/* protects adding new pages */
 	spinlock_t		tree_lock;	/* and lock protecting it */
 	atomic_t		i_mmap_writable;/* count VM_SHARED mappings */
 	struct rb_root		i_mmap;		/* tree of private and shared mappings */
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index f2cb8d4551..1bded65b89 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -235,6 +235,7 @@  extern struct task_group root_task_group;
 		.signal = {{0}}},					\
 	.blocked	= {{0}},					\
 	.alloc_lock	= __SPIN_LOCK_UNLOCKED(tsk.alloc_lock),		\
+	.pagecache_lock = NULL,						\
 	.journal_info	= NULL,						\
 	.cpu_timers	= INIT_CPU_TIMERS(tsk.cpu_timers),		\
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(tsk.pi_lock),	\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ea5e7c5ce6..48762b6b75 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -134,6 +134,7 @@  struct perf_event_context;
 struct blk_plug;
 struct filename;
 struct nameidata;
+struct pagecache_lock;
 
 #define VMACACHE_BITS 2
 #define VMACACHE_SIZE (1U << VMACACHE_BITS)
@@ -1649,6 +1650,9 @@  struct task_struct {
 	unsigned int in_ubsan;
 #endif
 
+	/* currently held lock, for avoiding recursing in fault path: */
+	struct pagecache_lock *pagecache_lock;
+
 /* journalling filesystem info */
 	void *journal_info;
 
diff --git a/mm/filemap.c b/mm/filemap.c
index eae0d353d1..c9c24ceed9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -110,6 +110,67 @@ 
  *   ->tasklist_lock            (memory_failure, collect_procs_ao)
  */
 
+static void __pagecache_lock_put(struct pagecache_lock *lock, long i)
+{
+	BUG_ON(atomic_long_read(&lock->v) == 0);
+
+	if (atomic_long_sub_return_release(i, &lock->v) == 0)
+		wake_up_all(&lock->wait);
+}
+
+static bool __pagecache_lock_tryget(struct pagecache_lock *lock, long i)
+{
+	long v = atomic_long_read(&lock->v), old;
+
+	do {
+		old = v;
+
+		if (i > 0 ? v < 0 : v > 0)
+			return false;
+	} while ((v = atomic_long_cmpxchg_acquire(&lock->v,
+					old, old + i)) != old);
+	return true;
+}
+
+static void __pagecache_lock_get(struct pagecache_lock *lock, long i)
+{
+	wait_event(lock->wait, __pagecache_lock_tryget(lock, i));
+}
+
+void pagecache_add_put(struct pagecache_lock *lock)
+{
+	__pagecache_lock_put(lock, 1);
+}
+
+void pagecache_add_get(struct pagecache_lock *lock)
+{
+	__pagecache_lock_get(lock, 1);
+}
+
+void __pagecache_block_put(struct pagecache_lock *lock)
+{
+	__pagecache_lock_put(lock, -1);
+}
+
+void __pagecache_block_get(struct pagecache_lock *lock)
+{
+	__pagecache_lock_get(lock, -1);
+}
+
+void pagecache_block_put(struct pagecache_lock *lock)
+{
+	BUG_ON(current->pagecache_lock != lock);
+	current->pagecache_lock = NULL;
+	__pagecache_lock_put(lock, -1);
+}
+
+void pagecache_block_get(struct pagecache_lock *lock)
+{
+	__pagecache_lock_get(lock, -1);
+	BUG_ON(current->pagecache_lock);
+	current->pagecache_lock = lock;
+}
+
 static void page_cache_tree_delete(struct address_space *mapping,
 				   struct page *page, void *shadow)
 {
@@ -631,18 +692,21 @@  static int __add_to_page_cache_locked(struct page *page,
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapBacked(page), page);
 
+	if (current->pagecache_lock != &mapping->add_lock)
+		pagecache_add_get(&mapping->add_lock);
+
 	if (!huge) {
 		error = mem_cgroup_try_charge(page, current->mm,
 					      gfp_mask, &memcg, false);
 		if (error)
-			return error;
+			goto err;
 	}
 
 	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error) {
 		if (!huge)
 			mem_cgroup_cancel_charge(page, memcg, false);
-		return error;
+		goto err;
 	}
 
 	page_cache_get(page);
@@ -662,7 +726,11 @@  static int __add_to_page_cache_locked(struct page *page,
 	if (!huge)
 		mem_cgroup_commit_charge(page, memcg, false, false);
 	trace_mm_filemap_add_to_page_cache(page);
-	return 0;
+err:
+	if (current->pagecache_lock != &mapping->add_lock)
+		pagecache_add_put(&mapping->add_lock);
+
+	return error;
 err_insert:
 	page->mapping = NULL;
 	/* Leave page->index set: truncation relies upon it */
@@ -670,7 +738,7 @@  err_insert:
 	if (!huge)
 		mem_cgroup_cancel_charge(page, memcg, false);
 	page_cache_release(page);
-	return error;
+	goto err;
 }
 
 /**
@@ -1771,7 +1839,14 @@  int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	 * Do we have something in the page cache already?
 	 */
 	page = find_get_page(mapping, offset);
-	if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
+	if (unlikely(current->pagecache_lock == &mapping->add_lock)) {
+		/*
+		 * fault from e.g. dio -> get_user_pages() - _don't_ want to do
+		 * readahead, only read in page we need:
+		 */
+		if (!page)
+			goto no_cached_page;
+	} else if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
 		/*
 		 * We found the page, so try async readahead before
 		 * waiting for the lock.
diff --git a/drivers/md/bcache/fs-io.c b/drivers/md/bcache/fs-io.c
index 24cd1297ef..96642f2ef0 100644
--- a/drivers/md/bcache/fs-io.c
+++ b/drivers/md/bcache/fs-io.c
@@ -25,6 +25,36 @@  struct bio_set *bch_writepage_bioset;
 struct bio_set *bch_dio_read_bioset;
 struct bio_set *bch_dio_write_bioset;
 
+/* pagecache_block must be held */
+static int write_invalidate_inode_pages_range(struct address_space *mapping,
+					      loff_t start, loff_t end)
+{
+	int ret;
+
+	/*
+	 * XXX: the way this is currently implemented, we can spin if a process
+	 * is continually redirtying a specific page
+	 */
+	do {
+		if (!mapping->nrpages &&
+		    !mapping->nrexceptional)
+			return 0;
+
+		ret = filemap_write_and_wait_range(mapping, start, end);
+		if (ret)
+			break;
+
+		if (!mapping->nrpages)
+			return 0;
+
+		ret = invalidate_inode_pages2_range(mapping,
+				start >> PAGE_CACHE_SHIFT,
+				end >> PAGE_CACHE_SHIFT);
+	} while (ret == -EBUSY);
+
+	return ret;
+}
+
 /* i_size updates: */
 
 /*
@@ -1085,13 +1115,16 @@  int bch_write_begin(struct file *file, struct address_space *mapping,
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
 	unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
 	struct page *page;
-	int ret = 0;
+	int ret = -ENOMEM;
 
 	BUG_ON(inode_unhashed(mapping->host));
 
+	/* Not strictly necessary - same reason as mkwrite(): */
+	pagecache_add_get(&mapping->add_lock);
+
 	page = grab_cache_page_write_begin(mapping, index, flags);
 	if (!page)
-		return -ENOMEM;
+		goto err_unlock;
 
 	if (PageUptodate(page))
 		goto out;
@@ -1132,11 +1165,13 @@  out:
 	}
 
 	*pagep = page;
-	return ret;
+	return 0;
 err:
 	unlock_page(page);
 	page_cache_release(page);
 	*pagep = NULL;
+err_unlock:
+	pagecache_add_put(&mapping->add_lock);
 	return ret;
 }
 
@@ -1211,6 +1246,7 @@  int bch_write_end(struct file *filp, struct address_space *mapping,
 out:
 	unlock_page(page);
 	page_cache_release(page);
+	pagecache_add_put(&mapping->add_lock);
 
 	return copied;
 }
@@ -1331,7 +1367,9 @@  out:
 
 static long __bch_dio_write_complete(struct dio_write *dio)
 {
-	struct inode *inode = dio->req->ki_filp->f_inode;
+	struct file *file = dio->req->ki_filp;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = file->f_inode;
 	struct bch_inode_info *ei = to_bch_ei(inode);
 	struct cache_set *c = inode->i_sb->s_fs_info;
 	long ret = dio->error ?: dio->written;
@@ -1340,7 +1378,8 @@  static long __bch_dio_write_complete(struct dio_write *dio)
 
 	i_sectors_dirty_put(ei, &dio->i_sectors_hook);
 
-	inode_dio_end(dio->req->ki_filp->f_inode);
+	__pagecache_block_put(&mapping->add_lock);
+	inode_dio_end(inode);
 
 	if (dio->iovec && dio->iovec != dio->inline_vecs)
 		kfree(dio->iovec);
@@ -1425,12 +1464,17 @@  static void bch_dio_write_loop_async(struct closure *cl)
 {
 	struct dio_write *dio =
 		container_of(cl, struct dio_write, cl);
+	struct address_space *mapping = dio->req->ki_filp->f_mapping;
 
 	bch_dio_write_done(dio);
 
 	if (dio->iter.count && !dio->error) {
 		use_mm(dio->mm);
+		pagecache_block_get(&mapping->add_lock);
+
 		bch_do_direct_IO_write(dio, false);
+
+		pagecache_block_put(&mapping->add_lock);
 		unuse_mm(dio->mm);
 
 		continue_at(&dio->cl,
@@ -1450,6 +1494,7 @@  static int bch_direct_IO_write(struct cache_set *c, struct kiocb *req,
 			       struct file *file, struct inode *inode,
 			       struct iov_iter *iter, loff_t offset)
 {
+	struct address_space *mapping = file->f_mapping;
 	struct bch_inode_info *ei = to_bch_ei(inode);
 	struct dio_write *dio;
 	struct bio *bio;
@@ -1504,6 +1549,7 @@  static int bch_direct_IO_write(struct cache_set *c, struct kiocb *req,
 	closure_init(&dio->cl, NULL);
 
 	inode_dio_begin(inode);
+	__pagecache_block_get(&mapping->add_lock);
 
 	/*
 	 * appends are sync in order to do the i_size update under
@@ -1595,67 +1641,30 @@  ssize_t bch_direct_IO(struct kiocb *req, struct iov_iter *iter, loff_t offset)
 static ssize_t
 bch_direct_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos)
 {
-	struct file	*file = iocb->ki_filp;
+	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
-	ssize_t		written;
-	size_t		write_len;
-	pgoff_t		end;
+	ssize_t	ret;
 
-	write_len = iov_iter_count(from);
-	end = (pos + write_len - 1) >> PAGE_CACHE_SHIFT;
+	pagecache_block_get(&mapping->add_lock);
 
-	written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
-	if (written)
-		goto out;
-
-	/*
-	 * After a write we want buffered reads to be sure to go to disk to get
-	 * the new data.  We invalidate clean cached page from the region we're
-	 * about to write.  We do this *before* the write so that we can return
-	 * without clobbering -EIOCBQUEUED from ->direct_IO().
-	 */
-	if (mapping->nrpages) {
-		written = invalidate_inode_pages2_range(mapping,
-					pos >> PAGE_CACHE_SHIFT, end);
-		/*
-		 * If a page can not be invalidated, return 0 to fall back
-		 * to buffered write.
-		 */
-		if (written) {
-			if (written == -EBUSY)
-				return 0;
-			goto out;
-		}
-	}
+	/* Write and invalidate pagecache range that we're writing to: */
+	ret = write_invalidate_inode_pages_range(file->f_mapping, pos,
+					pos + iov_iter_count(from) - 1);
+	if (ret)
+		goto err;
 
-	written = mapping->a_ops->direct_IO(iocb, from, pos);
+	ret = bch_direct_IO(iocb, from, pos);
+err:
+	pagecache_block_put(&mapping->add_lock);
 
-	/*
-	 * Finally, try again to invalidate clean pages which might have been
-	 * cached by non-direct readahead, or faulted in by get_user_pages()
-	 * if the source of the write was an mmap'ed region of the file
-	 * we're writing.  Either one is a pretty crazy thing to do,
-	 * so we don't support it 100%.  If this invalidation
-	 * fails, tough, the write still worked...
-	 *
-	 * Augh: this makes no sense for async writes - the second invalidate
-	 * has to come after the new data is visible. But, we can't just move it
-	 * to the end of the dio write path - for async writes we don't have
-	 * i_mutex held anymore, 
-	 */
-	if (mapping->nrpages) {
-		invalidate_inode_pages2_range(mapping,
-					      pos >> PAGE_CACHE_SHIFT, end);
-	}
-out:
-	return written;
+	return ret;
 }
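The new write_invalidate_inode_pages_range() helper itself isn't shown in this hunk, but judging by the open-coded retry loop this patch removes from bch_fcollapse() further down, its shape is presumably the familiar write-then-invalidate loop. A user-space model of that loop, with the kernel calls stubbed out (the `_stub` names are invented for the sketch):

```c
#include <errno.h>

/*
 * Model of the write_invalidate_inode_pages_range() retry loop,
 * reconstructed from the open-coded version this patch removes from
 * bch_fcollapse().  The stubs stand in for
 * filemap_write_and_wait_range() and invalidate_inode_pages2_range().
 */

/* stub: pretend invalidation hits busy pages this many times first */
static int invalidate_stub_busy;

static int write_and_wait_stub(long long start, long long end)
{
	return 0;	/* writeback succeeded */
}

static int invalidate_stub(long long start, long long end)
{
	return invalidate_stub_busy-- > 0 ? -EBUSY : 0;
}

static int write_invalidate_range_model(long long start, long long end)
{
	int ret;

	do {
		/* flush dirty pages in the range to disk */
		ret = write_and_wait_stub(start, end);
		if (ret)
			return ret;

		/* drop clean pages; -EBUSY means a racing read re-added
		 * one, so go around again */
		ret = invalidate_stub(start, end);
	} while (ret == -EBUSY);

	return ret;
}
```

Note that without the new add_lock held in block mode, this loop can livelock against concurrent buffered reads re-populating the range - which is exactly why the lock is taken before calling it.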
 
 static ssize_t __bch_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
-	struct address_space * mapping = file->f_mapping;
-	struct inode 	*inode = mapping->host;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
 	ssize_t	ret;
 
 	/* We can write back this queue in page reclaim */
@@ -1713,10 +1722,13 @@  int bch_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	file_update_time(vma->vm_file);
 
 	/*
-	 * i_mutex is required for synchronizing with fcollapse(), O_DIRECT
-	 * writes
+	 * Not strictly necessary, but helps avoid dio writes livelocking in
+	 * write_invalidate_inode_pages_range() - we can drop this if/when we
+	 * get a write_invalidate_inode_pages_range() that doesn't drop the
+	 * page lock before invalidating the page
 	 */
-	mutex_lock(&inode->i_mutex);
+	if (current->pagecache_lock != &mapping->add_lock)
+		pagecache_add_get(&mapping->add_lock);
 
 	lock_page(page);
 	if (page->mapping != mapping ||
@@ -1736,7 +1748,8 @@  int bch_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 		set_page_dirty(page);
 	wait_for_stable_page(page);
 out:
-	mutex_unlock(&inode->i_mutex);
+	if (current->pagecache_lock != &mapping->add_lock)
+		pagecache_add_put(&mapping->add_lock);
 	sb_end_pagefault(inode->i_sb);
 	return ret;
 }
@@ -1744,8 +1757,8 @@  out:
 void bch_invalidatepage(struct page *page, unsigned int offset,
 			unsigned int length)
 {
-	BUG_ON(!PageLocked(page));
-	BUG_ON(PageWriteback(page));
+	EBUG_ON(!PageLocked(page));
+	EBUG_ON(PageWriteback(page));
 
 	if (offset || length < PAGE_CACHE_SIZE)
 		return;
@@ -1755,12 +1768,13 @@  void bch_invalidatepage(struct page *page, unsigned int offset,
 
 int bch_releasepage(struct page *page, gfp_t gfp_mask)
 {
-	BUG_ON(!PageLocked(page));
-	BUG_ON(PageWriteback(page));
-	BUG_ON(PageDirty(page));
+	EBUG_ON(!PageLocked(page));
+	EBUG_ON(PageWriteback(page));
 
-	bch_clear_page_bits(page);
+	if (PageDirty(page))
+		return 0;
 
+	bch_clear_page_bits(page);
 	return 1;
 }
 
@@ -1935,6 +1949,7 @@  static int bch_truncate_page(struct address_space *mapping, loff_t from)
 
 int bch_truncate(struct inode *inode, struct iattr *iattr)
 {
+	struct address_space *mapping = inode->i_mapping;
 	struct bch_inode_info *ei = to_bch_ei(inode);
 	struct cache_set *c = inode->i_sb->s_fs_info;
 	struct i_size_update *u;
@@ -1943,6 +1958,7 @@  int bch_truncate(struct inode *inode, struct iattr *iattr)
 	int ret = 0;
 
 	inode_dio_wait(inode);
+	pagecache_block_get(&mapping->add_lock);
 
 	mutex_lock(&ei->update_lock);
 
@@ -1986,7 +2002,7 @@  int bch_truncate(struct inode *inode, struct iattr *iattr)
 	 * just put it, because it actually is still dirty
 	 */
 	if (unlikely(ret))
-		return ret;
+		goto err;
 
 	/*
 	 * truncate_setsize() does the i_size_write(), can't use
@@ -2007,12 +2023,12 @@  int bch_truncate(struct inode *inode, struct iattr *iattr)
 
 		ret = i_sectors_dirty_get(ei, &i_sectors_hook);
 		if (unlikely(ret))
-			return ret;
+			goto err;
 
 		ret = bch_truncate_page(inode->i_mapping, iattr->ia_size);
 		if (unlikely(ret)) {
 			i_sectors_dirty_put(ei, &i_sectors_hook);
-			return ret;
+			goto err;
 		}
 
 		ret = bch_inode_truncate(c, inode->i_ino,
@@ -2023,18 +2039,24 @@  int bch_truncate(struct inode *inode, struct iattr *iattr)
 		i_sectors_dirty_put(ei, &i_sectors_hook);
 
 		if (unlikely(ret))
-			return ret;
+			goto err;
 	}
 
 	setattr_copy(inode, iattr);
 
+	pagecache_block_put(&mapping->add_lock);
+
 	inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 	i_size_update_put(c, ei, idx, 1);
 	return 0;
+err:
+	pagecache_block_put(&mapping->add_lock);
+	return ret;
 }
 
 static long bch_fpunch(struct inode *inode, loff_t offset, loff_t len)
 {
+	struct address_space *mapping = inode->i_mapping;
 	struct bch_inode_info *ei = to_bch_ei(inode);
 	struct cache_set *c = inode->i_sb->s_fs_info;
 	u64 ino = inode->i_ino;
@@ -2043,6 +2065,9 @@  static long bch_fpunch(struct inode *inode, loff_t offset, loff_t len)
 	int ret = 0;
 
 	mutex_lock(&inode->i_mutex);
+	inode_dio_wait(inode);
+	pagecache_block_get(&mapping->add_lock);
+
 	ret = __bch_truncate_page(inode->i_mapping,
 				  offset >> PAGE_CACHE_SHIFT,
 				  offset, offset + len);
@@ -2078,6 +2103,7 @@  static long bch_fpunch(struct inode *inode, loff_t offset, loff_t len)
 		i_sectors_dirty_put(ei, &i_sectors_hook);
 	}
 out:
+	pagecache_block_put(&mapping->add_lock);
 	mutex_unlock(&inode->i_mutex);
 
 	return ret;
@@ -2085,6 +2111,7 @@  out:
 
 static long bch_fcollapse(struct inode *inode, loff_t offset, loff_t len)
 {
+	struct address_space *mapping = inode->i_mapping;
 	struct bch_inode_info *ei = to_bch_ei(inode);
 	struct cache_set *c = inode->i_sb->s_fs_info;
 	struct btree_iter src;
@@ -2111,11 +2138,10 @@  static long bch_fcollapse(struct inode *inode, loff_t offset, loff_t len)
 	 * btree, and the btree consistent with i_size - we don't need outside
 	 * locking for the extents btree itself, because we're using linked
 	 * iterators
-	 *
-	 * XXX: hmm, need to prevent reads adding things to the pagecache until
-	 * we're done?
 	 */
 	mutex_lock(&inode->i_mutex);
+	inode_dio_wait(inode);
+	pagecache_block_get(&mapping->add_lock);
 
 	ret = -EINVAL;
 	if (offset + len >= inode->i_size)
@@ -2126,19 +2152,8 @@  static long bch_fcollapse(struct inode *inode, loff_t offset, loff_t len)
 
 	new_size = inode->i_size - len;
 
-	inode_dio_wait(inode);
-
-	do {
-		ret = filemap_write_and_wait_range(inode->i_mapping,
-						   offset, LLONG_MAX);
-		if (ret)
-			goto err;
-
-		ret = invalidate_inode_pages2_range(inode->i_mapping,
-					offset >> PAGE_CACHE_SHIFT,
-					ULONG_MAX);
-	} while (ret == -EBUSY);
-
+	ret = write_invalidate_inode_pages_range(inode->i_mapping,
+						 offset, LLONG_MAX);
 	if (ret)
 		goto err;
 
@@ -2227,6 +2242,7 @@  static long bch_fcollapse(struct inode *inode, loff_t offset, loff_t len)
 
 	mutex_unlock(&ei->update_lock);
 
+	pagecache_block_put(&mapping->add_lock);
 	mutex_unlock(&inode->i_mutex);
 
 	return ret;
@@ -2239,6 +2255,7 @@  err_unwind:
 err:
 	bch_btree_iter_unlock(&src);
 	bch_btree_iter_unlock(&dst);
+	pagecache_block_put(&mapping->add_lock);
 	mutex_unlock(&inode->i_mutex);
 	return ret;
 }
@@ -2246,6 +2263,7 @@  err:
 static long bch_fallocate(struct inode *inode, int mode,
 				    loff_t offset, loff_t len)
 {
+	struct address_space *mapping = inode->i_mapping;
 	struct bch_inode_info *ei = to_bch_ei(inode);
 	struct cache_set *c = inode->i_sb->s_fs_info;
 	struct i_sectors_hook i_sectors_hook;
@@ -2261,6 +2279,8 @@  static long bch_fallocate(struct inode *inode, int mode,
 	bch_btree_iter_init_intent(&iter, c, BTREE_ID_EXTENTS, POS_MIN);
 
 	mutex_lock(&inode->i_mutex);
+	inode_dio_wait(inode);
+	pagecache_block_get(&mapping->add_lock);
 
 	if (!(mode & FALLOC_FL_KEEP_SIZE) &&
 	    new_size > inode->i_size) {
@@ -2270,9 +2290,6 @@  static long bch_fallocate(struct inode *inode, int mode,
 	}
 
 	if (mode & FALLOC_FL_ZERO_RANGE) {
-		/* just for __bch_truncate_page(): */
-		inode_dio_wait(inode);
-
 		ret = __bch_truncate_page(inode->i_mapping,
 					  offset >> PAGE_CACHE_SHIFT,
 					  offset, offset + len);
@@ -2374,6 +2391,7 @@  static long bch_fallocate(struct inode *inode, int mode,
 		i_size_update_put(c, ei, idx, 1);
 	}
 
+	pagecache_block_put(&mapping->add_lock);
 	mutex_unlock(&inode->i_mutex);
 
 	return 0;
@@ -2381,6 +2399,7 @@  err_put_sectors_dirty:
 	i_sectors_dirty_put(ei, &i_sectors_hook);
 err:
 	bch_btree_iter_unlock(&iter);
+	pagecache_block_put(&mapping->add_lock);
 	mutex_unlock(&inode->i_mutex);
 	return ret;
 }
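Per the cover letter, the add_lock has two states, both shared internally but exclusive with each other: page-fault and buffered-IO paths take it in "add" mode (pagecache_add_get() in bch_page_mkwrite() above), while dio writes, truncate, and the fallocate operations take it in "block" mode (pagecache_block_get()), so concurrent dio writes don't serialize against each other. A minimal user-space sketch of those semantics - the kernel version presumably uses wait queues rather than pthreads, and the internals here are guesswork, only the interface names mirror the diff:

```c
#include <pthread.h>

/*
 * Sketch of the proposed two-state pagecache lock:
 *   count > 0: held in "add" mode (shared among adders)
 *   count < 0: held in "block" mode (shared among blockers)
 *   count == 0: unheld
 * Each mode is shared with itself but excludes the other.
 */
struct pagecache_lock {
	long		count;
	pthread_mutex_t	mtx;
	pthread_cond_t	wait;
};

#define PAGECACHE_LOCK_INIT \
	{ 0, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER }

static void __pagecache_lock_get(struct pagecache_lock *l, long i)
{
	pthread_mutex_lock(&l->mtx);
	/* wait until the lock is unheld, or held in our mode:
	 * count and i having opposite signs means the other mode holds it */
	while (l->count * i < 0)
		pthread_cond_wait(&l->wait, &l->mtx);
	l->count += i;
	pthread_mutex_unlock(&l->mtx);
}

static void __pagecache_lock_put(struct pagecache_lock *l, long i)
{
	pthread_mutex_lock(&l->mtx);
	l->count -= i;
	if (!l->count)	/* fully released: wake waiters in the other mode */
		pthread_cond_broadcast(&l->wait);
	pthread_mutex_unlock(&l->mtx);
}

/* "add" mode: taken around adding pages to the pagecache */
static void pagecache_add_get(struct pagecache_lock *l)
{
	__pagecache_lock_get(l, 1);
}

static void pagecache_add_put(struct pagecache_lock *l)
{
	__pagecache_lock_put(l, 1);
}

/* "block" mode: keeps a range invalidated (dio write, truncate, fallocate) */
static void pagecache_block_get(struct pagecache_lock *l)
{
	__pagecache_lock_get(l, -1);
}

static void pagecache_block_put(struct pagecache_lock *l)
{
	__pagecache_lock_put(l, -1);
}
```

Tracking the holder in current->pagecache_lock, as bch_page_mkwrite() does above, keeps get_user_pages() faults during a dio write from deadlocking on the lock the write path already holds.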