[2/2] bcachefs: Buffered write path now can avoid the inode lock

Message ID: 20240229063010.68754-3-kent.overstreet@linux.dev (mailing list archive)
State: New
Series: buffered write path without inode lock (for bcachefs)

Commit Message

Kent Overstreet Feb. 29, 2024, 6:30 a.m. UTC
Non-append, non-extending buffered writes can now avoid taking the inode
lock.

To ensure atomicity of writes w.r.t. other writes, we lock every folio
that we'll be writing to, and if this fails we fall back to taking the
inode lock.

Extensive comments are provided covering the corner cases.

Link: https://lore.kernel.org/linux-fsdevel/Zdkxfspq3urnrM6I@bombadil.infradead.org/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/errcode.h        |   3 +-
 fs/bcachefs/fs-io-buffered.c | 145 +++++++++++++++++++++++++----------
 2 files changed, 107 insertions(+), 41 deletions(-)
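
The heart of the fallback, condensed from the diff at the bottom of this
page (a sketch only; error handling and the retry loop are elided):

	/* __bch2_buffered_write(), once the folios for the write are locked */
	if (!inode_locked && folio_end_pos(darray_last(fs)) < end) {
		/*
		 * We couldn't lock folios covering the entire write: tell
		 * the caller to take the inode lock and retry.
		 */
		ret = -BCH_ERR_need_inode_lock;
		goto out;
	}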

Comments

Linus Torvalds Feb. 29, 2024, 7:20 a.m. UTC | #1
On Wed, 28 Feb 2024 at 22:30, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> Non-append, non-extending buffered writes can now avoid taking the inode
> lock.

I think this is buggy.

I think you still need to take the inode lock *shared* for the writes,
because otherwise you can have somebody else truncate the file, and
then you will do a write past the new end of the file. That will
cause a lot of issues.

So it's not an "inode_lock or not" situation. I think it's an
"inode_lock vs inode_lock_shared" situation.

Note that the reading side isn't all that critical - if a read races
with a truncate, at worst it will read some zeroes because we used the
old length and the page cache got cleared in the meantime.

But the writing side ends up having actual consistency issues on disk.
You don't want to have a truncate that removes the pages past the end
of the new size and clears the end of the new last page, and race with
another write that used the old size and *thought* it was writing to
the middle of the file, but is now actually accessing a folio that is
past the end of the whole file and writing to it.

There may be some reason that I'm missing that would make this a
non-issue, but I really think you want to get the inode lock at least
shared for the duration of the write.

Also note that for similar reasons, you can't just look at "will I
extend the file" and take the lock non-shared. No, in order to
actually trust the size, you need to *hold* the lock, so the logic
needs to be something like

 - take the lock exclusively if O_APPEND or if it *looks* like you
might extend the file size.

 - otherwise, take the shared lock, and THEN RE-CHECK. The file size
might have changed, so now you need to double-check that you're really
not going to extend the size of the file, and if you are, you need to
go back and take the inode lock exclusively after all.
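
A minimal sketch of that rule, in terms of the generic VFS helpers
(illustrative only - inode, iocb and iter as in a ->write_iter()
implementation; this is not the actual bcachefs code):

	if ((iocb->ki_flags & IOCB_APPEND) ||
	    iocb->ki_pos + iov_iter_count(iter) > i_size_read(inode)) {
		/* O_APPEND, or it looks like we might extend: exclusive */
		inode_lock(inode);
	} else {
		inode_lock_shared(inode);
		/* now i_size is stable - re-check under the lock */
		if (iocb->ki_pos + iov_iter_count(iter) > i_size_read(inode)) {
			/* the file shrank under us: go exclusive after all */
			inode_unlock_shared(inode);
			inode_lock(inode);
		}
	}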

Again - I haven't thought a ton about this, so maybe there's some
trick to it, but the above is what my naive thinking says the rule has
to be. Writes are different from reads.

              Linus
Linus Torvalds Feb. 29, 2024, 7:27 a.m. UTC | #2
On Wed, 28 Feb 2024 at 23:20, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
>  - take the lock exclusively if O_APPEND or if it *looks* like you
> might extend the file size.
>
>  - otherwise, take the shared lock, and THEN RE-CHECK. The file size
> might have changed, so now you need to double-check that you're really
> not going to extend the size of the file, and if you are, you need to
> go back and take the inode lock exclusively after all.

Same goes for the suid/sgid checks. You need to take the inode lock
shared to even have the i_mode be stable, and at that point you might
decide "oh, I need to clear suid/sgid after all" and have to go drop
the lock and retake it for exclusive access after all.

(And yes, we *really* should have a rwsem_try_upgrade() operation to
at least avoid the "read_unlock -> write_lock" dance, but we don't.
See the comment in mmap_upgrade_trylock() about it. It should be
reasonably easy to add, but we've never really had enough reason to.
Maybe somebody will decide it's worth the effort.)

               Linus
Kent Overstreet Feb. 29, 2024, 7:42 a.m. UTC | #3
On Wed, Feb 28, 2024 at 11:20:44PM -0800, Linus Torvalds wrote:
> On Wed, 28 Feb 2024 at 22:30, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > Non-append, non-extending buffered writes can now avoid taking the inode
> > lock.
> 
> I think this is buggy.
> 
> I think you still need to take the inode lock *shared* for the writes,
> because otherwise you can have somebody else truncate the file, and
> then you will do a write past the new end of the file. That will
> cause a lot of issues.
> 
> So it's not an "inode_lock or not" situation. I think it's an
> "inode_lock vs inode_lock_shared" situation.
> 
> Note that the reading side isn't all that critical - if a read races
> with a truncate, at worst it will read some zeroes because we used the
> old length and the page cache got cleared in the meantime.
> 
> But the writing side ends up having actual consistency issues on disk.
> You don't want to have a truncate that removes the pages past the end
> of the new size and clears the end of the new last page, and race with
> another write that used the old size and *thought* it was writing to
> the middle of the file, but is now actually accessing a folio that is
> past the end of the whole file and writing to it.
> 
> There may be some reason that I'm missing that would make this a
> non-issue, but I really think you want to get the inode lock at least
> shared for the duration of the write.

It's even mentioned in one of the comments - bcachefs's pagecache add
lock guards against that. The rules for that lock are:

 - things that add to the pagecache take the add side of that lock
 - things that remove from the pagecache take the block side of that lock

I added that so that we wouldn't have pagecache inconsistency issues
with dio and mmap'd IO - without it, anything that needs to shoot down
the pagecache while doing IO that bypasses the pagecache (fpunch,
fcollapse...) is buggy.
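
Roughly, in terms of the helpers visible in the diff below (a sketch of
the usage, assuming the block-side counterparts
bch2_pagecache_block_get()/_put(); not complete code):

	/* buffered write path: adds to the pagecache */
	bch2_pagecache_add_get(inode);
	/* ... lock folios, copy data ... */
	bch2_pagecache_add_put(inode);

	/* truncate/fpunch/fcollapse: excludes anything adding pages */
	bch2_pagecache_block_get(inode);
	/* ... shoot down the pagecache, do the operation ... */
	bch2_pagecache_block_put(inode);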

> Also note that for similar reasons, you can't just look at "will I
> extend the file" and take the lock non-shared. No, in order to
> actually trust the size, you need to *hold* the lock, so the logic
> needs to be something like
> 
>  - take the lock exclusively if O_APPEND or if it *looks* like you
> might extend the file size.
> 
>  - otherwise, take the shared lock, and THEN RE-CHECK. The file size
> might have changed, so now you need to double-check that you're really
> not going to extend the size of the file, and if you are, you need to
> go back and take the inode lock exclusively after all.

That one - yes.

pagecache add lock was also supposed to handle that because anything
that changes i_size downward needs pagecache block, but I moved where we
take that lock for lock ordering reasons, and I really didn't need
to...

I'm undecided on that one. I dislike using pagecache add lock to guard
i_size because that's really not what it's for, but I also hate hitting
the inode lock if we don't actually need it.

Kinda waiting for Al to drop in and mention the other super obscure
reason the inode lock actually is needed here...
Kent Overstreet Feb. 29, 2024, 8:06 a.m. UTC | #4
On Wed, Feb 28, 2024 at 11:27:05PM -0800, Linus Torvalds wrote:
> On Wed, 28 Feb 2024 at 23:20, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> >
> >  - take the lock exclusively if O_APPEND or if it *looks* like you
> > might extend the file size.
> >
> >  - otherwise, take the shared lock, and THEN RE-CHECK. The file size
> > might have changed, so now you need to double-check that you're really
> > not going to extend the size of the file, and if you are, you need to
> > go back and take the inode lock exclusively after all.
> 
> Same goes for the suid/sgid checks. You need to take the inode lock
> shared to even have the i_mode be stable, and at that point you might
> decide "oh, I need to clear suid/sgid after all" and have to go drop
> the lock and retake it for exclusive access after all.

Do we though? Yes, i_mode can change mid write, but if userspace issues
a chmod() while we're in a write(), the end result has to be consistent
with either "chmod(); write();" or "write(); chmod();" - meaning as long
as there's been a barrier in the syscall path, so that we can't have
seen a value of i_mode from before the chmod returned, we're good.
Christian Brauner Feb. 29, 2024, 9:46 a.m. UTC | #5
On Wed, Feb 28, 2024 at 11:27:05PM -0800, Linus Torvalds wrote:
> On Wed, 28 Feb 2024 at 23:20, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> >
> >  - take the lock exclusively if O_APPEND or if it *looks* like you
> > might extend the file size.
> >
> >  - otherwise, take the shared lock, and THEN RE-CHECK. The file size
> > might have changed, so now you need to double-check that you're really
> > not going to extend the size of the file, and if you are, you need to
> > go back and take the inode lock exclusively after all.
> 
> Same goes for the suid/sgid checks. You need to take the inode lock
> shared to even have the i_mode be stable, and at that point you might
> decide "oh, I need to clear suid/sgid after all" and have to go drop
> the lock and retake it for exclusive access after all.

I agree. And this is how we've done it in xfs as well. The inode lock is
taken shared, then we check for IS_NOSEC(inode); if that's true we keep
the shared lock, otherwise we upgrade to an exclusive lock. Note that
this whole check also covers filesystem capability xattrs.
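
A sketch of that pattern in generic VFS terms (illustrative, not the
actual xfs code):

	inode_lock_shared(inode);
	if (!IS_NOSEC(inode)) {
		/*
		 * We may need to strip suid/sgid or file capabilities;
		 * rwsems have no upgrade primitive, so drop the shared
		 * lock and retake it exclusive.
		 */
		inode_unlock_shared(inode);
		inode_lock(inode);
		/* re-do the write checks under the exclusive lock */
	}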

I think you'd also get potential security issues, or at least very
subtle behavior. Someone could make a binary setuid while a concurrent
write happens: chmod takes the exclusive lock while there's a concurrent
write going on, changing the contents to something unsafe without being
forced to remove the set*id bits.

Similar for {g,u}id changes. Someone starts a chown() taking the inode
lock while someone issues a concurrent write. The chown changes the
group ownership so that a writer would be forced to drop the sgid bit.
But the writer proceeds, keeping that bit.

So those are all potentially pretty subtle side-effects we'd allow. And
I've removed so many bugs in that area over the last couple of years as
well. So let's please use the shared lock.
Kent Overstreet Feb. 29, 2024, 4:43 p.m. UTC | #6
On Thu, Feb 29, 2024 at 10:46:48AM +0100, Christian Brauner wrote:
> On Wed, Feb 28, 2024 at 11:27:05PM -0800, Linus Torvalds wrote:
> > On Wed, 28 Feb 2024 at 23:20, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > >
> > >  - take the lock exclusively if O_APPEND or if it *looks* like you
> > > might extend the file size.
> > >
> > >  - otherwise, take the shared lock, and THEN RE-CHECK. The file size
> > > might have changed, so now you need to double-check that you're really
> > > not going to extend the size of the file, and if you are, you need to
> > > go back and take the inode lock exclusively after all.
> > 
> > Same goes for the suid/sgid checks. You need to take the inode lock
> > shared to even have the i_mode be stable, and at that point you might
> > decide "oh, I need to clear suid/sgid after all" and have to go drop
> > the lock and retake it for exclusive access after all.
> 
> I agree. And this is how we've done it in xfs as well. The inode lock is
> taken shared, then we check for IS_NOSEC(inode); if that's true we keep
> the shared lock, otherwise we upgrade to an exclusive lock. Note that
> this whole check also covers filesystem capability xattrs.
> 
> I think you'd also get potential security issues, or at least very
> subtle behavior. Someone could make a binary setuid while a concurrent
> write happens: chmod takes the exclusive lock while there's a concurrent
> write going on, changing the contents to something unsafe without being
> forced to remove the set*id bits.
> 
> Similar for {g,u}id changes. Someone starts a chown() taking the inode
> lock while someone issues a concurrent write. The chown changes the
> group ownership so that a writer would be forced to drop the sgid bit.
> But the writer proceeds, keeping that bit.

That's the kind of race that's not actually a race - if the syscalls are
being invoked at the same time, it's fine if the result is as if the
write happened first, then the chown.

For there to be an actual race one of two things would have to be the
case:
 - the chown/chmod path would have to be dependent on the data the
   write path is changing, or
 - there'd have to be a problem with the write path seeing inconsistent
   state mid-modification

But I can't claim 100% that the second case isn't a factor - I don't
know the security side of things as well - and the XFS way sounds nice
and clean.

Patch

diff --git a/fs/bcachefs/errcode.h b/fs/bcachefs/errcode.h
index e960a6eae66a..2791d5127090 100644
--- a/fs/bcachefs/errcode.h
+++ b/fs/bcachefs/errcode.h
@@ -247,7 +247,8 @@ 
 	x(BCH_ERR_nopromote,		nopromote_congested)			\
 	x(BCH_ERR_nopromote,		nopromote_in_flight)			\
 	x(BCH_ERR_nopromote,		nopromote_no_writes)			\
-	x(BCH_ERR_nopromote,		nopromote_enomem)
+	x(BCH_ERR_nopromote,		nopromote_enomem)			\
+	x(0,				need_inode_lock)
 
 enum bch_errcode {
 	BCH_ERR_START		= 2048,
diff --git a/fs/bcachefs/fs-io-buffered.c b/fs/bcachefs/fs-io-buffered.c
index 27710cdd5710..9ce92d1bc3ea 100644
--- a/fs/bcachefs/fs-io-buffered.c
+++ b/fs/bcachefs/fs-io-buffered.c
@@ -810,7 +810,8 @@  static noinline void folios_trunc(folios *fs, struct folio **fi)
 static int __bch2_buffered_write(struct bch_inode_info *inode,
 				 struct address_space *mapping,
 				 struct iov_iter *iter,
-				 loff_t pos, unsigned len)
+				 loff_t pos, unsigned len,
+				 bool inode_locked)
 {
 	struct bch_fs *c = inode->v.i_sb->s_fs_info;
 	struct bch2_folio_reservation res;
@@ -835,6 +836,15 @@  static int __bch2_buffered_write(struct bch_inode_info *inode,
 
 	BUG_ON(!fs.nr);
 
+	/*
+	 * If we're not using the inode lock, we need to lock all the folios for
+	 * atomicity of writes vs. other writes:
+	 */
+	if (!inode_locked && folio_end_pos(darray_last(fs)) < end) {
+		ret = -BCH_ERR_need_inode_lock;
+		goto out;
+	}
+
 	f = darray_first(fs);
 	if (pos != folio_pos(f) && !folio_test_uptodate(f)) {
 		ret = bch2_read_single_folio(f, mapping);
@@ -929,8 +939,10 @@  static int __bch2_buffered_write(struct bch_inode_info *inode,
 	end = pos + copied;
 
 	spin_lock(&inode->v.i_lock);
-	if (end > inode->v.i_size)
+	if (end > inode->v.i_size) {
+		BUG_ON(!inode_locked);
 		i_size_write(&inode->v, end);
+	}
 	spin_unlock(&inode->v.i_lock);
 
 	f_pos = pos;
@@ -974,9 +986,61 @@  static ssize_t bch2_buffered_write(struct kiocb *iocb, struct iov_iter *iter)
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
 	struct bch_inode_info *inode = file_bch_inode(file);
-	loff_t pos = iocb->ki_pos;
-	ssize_t written = 0;
-	int ret = 0;
+	loff_t pos;
+	bool inode_locked = false;
+	ssize_t written = 0, written2 = 0, ret = 0;
+
+	/*
+	 * We don't take the inode lock unless i_size will be changing. Folio
+	 * locks provide exclusion with other writes, and the pagecache add lock
+	 * provides exclusion with truncate and hole punching.
+	 *
+	 * There is one nasty corner case where atomicity would be broken
+	 * without great care: when copying data from userspace to the page
+	 * cache, we do that with faults disabled - a page fault would recurse
+	 * back into the filesystem, taking filesystem locks again, and
+	 * deadlock; instead, we fault in the user buffer when we aren't
+	 * holding locks.
+	 *
+	 * If we do part of the write, but then the pages backing the userspace
+	 * buffer get evicted and are no longer resident, we have to drop our
+	 * folio locks to re-fault them in, breaking write atomicity.
+	 *
+	 * To fix this, we restart the write from the start, if we weren't
+	 * holding the inode lock.
+	 *
+	 * There is another wrinkle after that; if we restart the write from the
+	 * start, and then get an unrecoverable error, we _cannot_ claim to
+	 * userspace that we did not write data we actually did - so we must
+	 * track (written2) the most we ever wrote.
+	 */
+
+	if ((iocb->ki_flags & IOCB_APPEND) ||
+	    (iocb->ki_pos + iov_iter_count(iter) > i_size_read(&inode->v))) {
+		inode_lock(&inode->v);
+		inode_locked = true;
+	}
+
+	ret = generic_write_checks(iocb, iter);
+	if (ret <= 0)
+		goto unlock;
+
+	ret = file_remove_privs_flags(file, !inode_locked ? IOCB_NOWAIT : 0);
+	if (ret) {
+		if (!inode_locked) {
+			inode_lock(&inode->v);
+			inode_locked = true;
+			ret = file_remove_privs_flags(file, 0);
+		}
+		if (ret)
+			goto unlock;
+	}
+
+	ret = file_update_time(file);
+	if (ret)
+		goto unlock;
+
+	pos = iocb->ki_pos;
 
 	bch2_pagecache_add_get(inode);
 
@@ -1004,12 +1068,17 @@  static ssize_t bch2_buffered_write(struct kiocb *iocb, struct iov_iter *iter)
 			}
 		}
 
+		if (unlikely(bytes != iov_iter_count(iter) && !inode_locked))
+			goto get_inode_lock;
+
 		if (unlikely(fatal_signal_pending(current))) {
 			ret = -EINTR;
 			break;
 		}
 
-		ret = __bch2_buffered_write(inode, mapping, iter, pos, bytes);
+		ret = __bch2_buffered_write(inode, mapping, iter, pos, bytes, inode_locked);
+		if (ret == -BCH_ERR_need_inode_lock)
+			goto get_inode_lock;
 		if (unlikely(ret < 0))
 			break;
 
@@ -1030,50 +1099,46 @@  static ssize_t bch2_buffered_write(struct kiocb *iocb, struct iov_iter *iter)
 		}
 		pos += ret;
 		written += ret;
+		written2 = max(written, written2);
+
+		if (ret != bytes && !inode_locked)
+			goto get_inode_lock;
 		ret = 0;
 
 		balance_dirty_pages_ratelimited(mapping);
-	} while (iov_iter_count(iter));
 
+		if (0) {
+get_inode_lock:
+			bch2_pagecache_add_put(inode);
+			inode_lock(&inode->v);
+			inode_locked = true;
+			bch2_pagecache_add_get(inode);
+
+			iov_iter_revert(iter, written);
+			pos -= written;
+			written = 0;
+			ret = 0;
+		}
+	} while (iov_iter_count(iter));
 	bch2_pagecache_add_put(inode);
+unlock:
+	if (inode_locked)
+		inode_unlock(&inode->v);
+
+	iocb->ki_pos += written;
 
-	return written ? written : ret;
+	ret = max(written, written2) ?: ret;
+	if (ret > 0)
+		ret = generic_write_sync(iocb, ret);
+	return ret;
 }
 
-ssize_t bch2_write_iter(struct kiocb *iocb, struct iov_iter *from)
+ssize_t bch2_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
-	struct file *file = iocb->ki_filp;
-	struct bch_inode_info *inode = file_bch_inode(file);
-	ssize_t ret;
-
-	if (iocb->ki_flags & IOCB_DIRECT) {
-		ret = bch2_direct_write(iocb, from);
-		goto out;
-	}
-
-	inode_lock(&inode->v);
-
-	ret = generic_write_checks(iocb, from);
-	if (ret <= 0)
-		goto unlock;
-
-	ret = file_remove_privs(file);
-	if (ret)
-		goto unlock;
-
-	ret = file_update_time(file);
-	if (ret)
-		goto unlock;
-
-	ret = bch2_buffered_write(iocb, from);
-	if (likely(ret > 0))
-		iocb->ki_pos += ret;
-unlock:
-	inode_unlock(&inode->v);
+	ssize_t ret = iocb->ki_flags & IOCB_DIRECT
+		? bch2_direct_write(iocb, iter)
+		: bch2_buffered_write(iocb, iter);
 
-	if (ret > 0)
-		ret = generic_write_sync(iocb, ret);
-out:
 	return bch2_err_class(ret);
 }