
[LSF/MM/BPF,TOPIC] Bcachefs update

Message ID 20191216193852.GA8664@kmo-pixel (mailing list archive)

Commit Message

Kent Overstreet Dec. 16, 2019, 7:38 p.m. UTC
Hi all, I'd like to poke my head up and let everyone know where bcachefs is at,
and talk about finally upstreaming it.

Last LSF (two years ago), bcachefs was coming along nicely and was quite usable,
but there was still some core work unfinished (primarily persistent alloc
info; we still had to walk all metadata at mount time). Additionally, there were
some unresolved discussions around locking for pagecache consistency.

The core features one would expect from a posix filesystem are now done, and
then some. Reflink was finished recently, and I'm now starting to work towards
snapshots.

If there's interest I may talk a bit about the plan for snapshots in bcachefs.

The short version is: all metadata in bcachefs are keys in various btrees
(extents/inodes/dirents/xattrs btrees) indexed by inode:offset; for snapshots we
extend the key so that the low bits are a snapshot ID, i.e.
inode:offset:snapshot. Snapshots form a tree where the root has ID U32_MAX and
children always have smaller IDs than their parents, so to read from a given
snapshot we do a lookup as normal, including the snapshot ID of the snapshot we
want, and then filter out keys from unrelated (non-ancestor) snapshots.

This will give us excellent overall performance when there are many snapshots
that each have only a small number of overwrites; when we end up in a situation
where a given part of the keyspace has many keys from unrelated snapshots we'll
want to arrange metadata differently.

This scheme does get considerably trickier when you add extents; that's what
I've been focusing on recently.

Pagecache consistency:

I recently got rid of my pagecache add lock; that added locking to core paths in
filemap.c, and some people found my locking scheme distasteful (and I never liked
it enough to argue). I've recently switched to something closer to XFS's locking
scheme (top of the IO paths); however, I do still need one patch to the
get_user_pages() path to avoid deadlock via recursive page fault - patch is
below:

(This would probably be better done as a new argument to get_user_pages(); I
didn't do it that way initially because the patch would have been _much_
bigger.)

Yee haw.

commit 20ebb1f34cc9a532a675a43b5bd48d1705477816
Author: Kent Overstreet <kent.overstreet@gmail.com>
Date:   Wed Oct 16 15:03:50 2019 -0400

    mm: Add a mechanism to disable faults for a specific mapping
    
    This will be used to prevent a nasty cache coherency issue for O_DIRECT
    writes; O_DIRECT writes need to shoot down the range of the page cache
    corresponding to the part of the file being written to - but, if the
    file is mapped in, userspace can pass in an address in that mapping to
    pwrite(), causing those pages to be faulted back into the page cache
    in get_user_pages().
    
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

Comments

Jan Kara Dec. 18, 2019, 12:40 p.m. UTC | #1
On Mon 16-12-19 14:38:52, Kent Overstreet wrote:
> Pagecache consistency:
> 
> I recently got rid of my pagecache add lock; that added locking to core paths in
> filemap.c, and some people found my locking scheme distasteful (and I never liked
> it enough to argue). I've recently switched to something closer to XFS's locking
> scheme (top of the IO paths); however, I do still need one patch to the
> get_user_pages() path to avoid deadlock via recursive page fault - patch is
> below:
> 
> (This would probably be better done as a new argument to get_user_pages(); I
> didn't do it that way initially because the patch would have been _much_
> bigger.)
> 
> Yee haw.
> 
> commit 20ebb1f34cc9a532a675a43b5bd48d1705477816
> Author: Kent Overstreet <kent.overstreet@gmail.com>
> Date:   Wed Oct 16 15:03:50 2019 -0400
> 
>     mm: Add a mechanism to disable faults for a specific mapping
>     
>     This will be used to prevent a nasty cache coherency issue for O_DIRECT
>     writes; O_DIRECT writes need to shoot down the range of the page cache
>     corresponding to the part of the file being written to - but, if the
>     file is mapped in, userspace can pass in an address in that mapping to
>     pwrite(), causing those pages to be faulted back into the page cache
>     in get_user_pages().
>     
>     Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

I'm not really sure about the exact nature of the deadlock since the
changelog doesn't explain it but if you need to take some lockA in your
page fault path and you already hold lockA in your DIO code, then this
patch isn't going to cut it. Just think of a malicious scheme with two
tasks: one doing DIO from fileA (protected by lockA) to buffers mapped from
fileB, and the other process the other way around...

								Honza
Kent Overstreet Dec. 18, 2019, 7:11 p.m. UTC | #2
On Wed, Dec 18, 2019 at 01:40:52PM +0100, Jan Kara wrote:
> On Mon 16-12-19 14:38:52, Kent Overstreet wrote:
> > Pagecache consistency:
> > 
> > I recently got rid of my pagecache add lock; that added locking to core paths in
> > filemap.c, and some people found my locking scheme distasteful (and I never liked
> > it enough to argue). I've recently switched to something closer to XFS's locking
> > scheme (top of the IO paths); however, I do still need one patch to the
> > get_user_pages() path to avoid deadlock via recursive page fault - patch is
> > below:
> > 
> > (This would probably be better done as a new argument to get_user_pages(); I
> > didn't do it that way initially because the patch would have been _much_
> > bigger.)
> > 
> > Yee haw.
> > 
> > commit 20ebb1f34cc9a532a675a43b5bd48d1705477816
> > Author: Kent Overstreet <kent.overstreet@gmail.com>
> > Date:   Wed Oct 16 15:03:50 2019 -0400
> > 
> >     mm: Add a mechanism to disable faults for a specific mapping
> >     
> >     This will be used to prevent a nasty cache coherency issue for O_DIRECT
> >     writes; O_DIRECT writes need to shoot down the range of the page cache
> >     corresponding to the part of the file being written to - but, if the
> >     file is mapped in, userspace can pass in an address in that mapping to
> >     pwrite(), causing those pages to be faulted back into the page cache
> >     in get_user_pages().
> >     
> >     Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> 
> I'm not really sure about the exact nature of the deadlock since the
> changelog doesn't explain it but if you need to take some lockA in your
> page fault path and you already hold lockA in your DIO code, then this
> patch isn't going to cut it. Just think of a malicious scheme with two
> tasks one doing DIO from fileA (protected by lockA) to buffers mapped from
> fileB and the other process the other way around...

Ooh, yeah, good catch...

The lock in question is, for the purposes of this discussion, an RW lock (call
it the map lock here): the fault handler and the buffered IO paths take it in
read mode, and the DIO path takes it in write mode to block new pages from being
added to the page cache.

But get_user_pages() -> page fault invokes the fault handler, hence deadlock. My
patch was for avoiding this deadlock when the fault handler tries locking the
same inode's map lock, but as you note this is a more general problem...

This is starting to smell like what wait/wound mutexes were invented for: a
situation where we need deadlock avoidance because lock ordering is under
userspace control.

So for that we need some state describing what locks are held that we can refer
to when taking the next lock of this class - and since it's got to be shared
between the dio write path and then (via gup()) the fault handler, that means
it's pretty much going to have to hang off of task_struct. Then in the fault
handler, when we go to take the map lock we:
 - return -EFAULT if it's the same lock the dio write path took
 - trylock; if that fails and lock ordering is wrong (pointer comparison of the
   locks works here) then we have to do a dance that involves bailing out and
   retrying from the top of the dio write path.

I dunno. On the plus side, if other filesystems don't want this I think this can
all be implemented in bcachefs code with only a pointer added to task_struct to
hang this lock state, but I would much rather either get buy in from the other
filesystem people and make this a general purpose facility or not do it at all.

And I'm not sure if anyone else cares about the page cache consistency issues
inherent to dio writes as much as I do... so I'd like to hear from other people
on that.
Jan Kara Dec. 19, 2019, 11:36 a.m. UTC | #3
On Wed 18-12-19 14:11:14, Kent Overstreet wrote:
> On Wed, Dec 18, 2019 at 01:40:52PM +0100, Jan Kara wrote:
> > On Mon 16-12-19 14:38:52, Kent Overstreet wrote:
> > > Pagecache consistency:
> > > 
> > > I recently got rid of my pagecache add lock; that added locking to core paths in
> > > filemap.c, and some people found my locking scheme distasteful (and I never liked
> > > it enough to argue). I've recently switched to something closer to XFS's locking
> > > scheme (top of the IO paths); however, I do still need one patch to the
> > > get_user_pages() path to avoid deadlock via recursive page fault - patch is
> > > below:
> > > 
> > > (This would probably be better done as a new argument to get_user_pages(); I
> > > didn't do it that way initially because the patch would have been _much_
> > > bigger.)
> > > 
> > > Yee haw.
> > > 
> > > commit 20ebb1f34cc9a532a675a43b5bd48d1705477816
> > > Author: Kent Overstreet <kent.overstreet@gmail.com>
> > > Date:   Wed Oct 16 15:03:50 2019 -0400
> > > 
> > >     mm: Add a mechanism to disable faults for a specific mapping
> > >     
> > >     This will be used to prevent a nasty cache coherency issue for O_DIRECT
> > >     writes; O_DIRECT writes need to shoot down the range of the page cache
> > >     corresponding to the part of the file being written to - but, if the
> > >     file is mapped in, userspace can pass in an address in that mapping to
> > >     pwrite(), causing those pages to be faulted back into the page cache
> > >     in get_user_pages().
> > >     
> > >     Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> > 
> > I'm not really sure about the exact nature of the deadlock since the
> > changelog doesn't explain it but if you need to take some lockA in your
> > page fault path and you already hold lockA in your DIO code, then this
> > patch isn't going to cut it. Just think of a malicious scheme with two
> > tasks one doing DIO from fileA (protected by lockA) to buffers mapped from
> > fileB and the other process the other way around...
> 
> Ooh, yeah, good catch...
> 
> The lock in question is, for the purposes of this discussion, an RW lock (call
> it the map lock here): the fault handler and the buffered IO paths take it in
> read mode, and the DIO path takes it in write mode to block new pages from
> being added to the page cache.
> 
> But get_user_pages() -> page fault invokes the fault handler, hence
> deadlock. My patch was for avoiding this deadlock when the fault handler
> tries locking the same inode's map lock, but as you note this is a more
> general problem...
> 
> This is starting to smell like what wait/wound mutexes were invented
> for: a situation where we need deadlock avoidance because lock
> ordering is under userspace control.
> 
> So for that we need some state describing what locks are held that we can
> refer to when taking the next lock of this class - and since it's got to
> be shared between the dio write path and then (via gup()) the fault
> handler, that means it's pretty much going to have to hang off of task
> struct. Then in the fault handler, when we go to take the map lock we:
>  - return -EFAULT if it's the same lock the dio write path took
>  - trylock; if that fails and lock ordering is wrong (pointer comparison of the
>    locks works here) then we have to do a dance that involves bailing out and
>    retrying from the top of the dio write path.

Well, I don't think the problem is actually limited to your "map lock".
Because by calling get_user_pages() from under "map lock" in DIO code, you
create for example "map lock" (inode A) -> mmap_sem (process P) lock
dependency. And when you do page fault, you create mmap_sem (process Q) ->
"map lock" (inode B) lock dependency. It does not take that much effort to
chain these in a way that will deadlock the system.

And the lock tracking that would be able to detect such locking cycles
would IMHO be too expensive as it would have to cover multiple different
rather hot locks and also be coordinated between multiple processes.

> I dunno. On the plus side, if other filesystems don't want this I think
> this can all be implemented in bcachefs code with only a pointer added to
> task_struct to hang this lock state, but I would much rather either get
> buy in from the other filesystem people and make this a general purpose
> facility or not do it at all.
> 
> And I'm not sure if anyone else cares about the page cache consistency
> issues inherent to dio writes as much as I do... so I'd like to hear from
> other people on that.

I don't think other filesystems care much about DIO cache coherency. But
OTOH I do agree that these page-fault vs DIO vs buffered IO page cache
handling races are making life difficult even for other filesystems because
it is not just about DIO cache coherency but in some nasty corner cases
also about stale data exposure or filesystem corruption. And filesystems
have to jump through hoops to avoid them. So I think in general people
are interested in somehow making the situation simpler - for example, I
think it would simplify life for the range locks Dave is working on for
XFS...

								Honza

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ebfa046b2d..3b4d9689ef 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -740,6 +740,7 @@  struct task_struct {
 
 	struct mm_struct		*mm;
 	struct mm_struct		*active_mm;
+	struct address_space		*faults_disabled_mapping;
 
 	/* Per-thread vma caching: */
 	struct vmacache			vmacache;
diff --git a/init/init_task.c b/init/init_task.c
index ee3d9d29b8..706abd9547 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -77,6 +77,7 @@  struct task_struct init_task
 	.nr_cpus_allowed= NR_CPUS,
 	.mm		= NULL,
 	.active_mm	= &init_mm,
+	.faults_disabled_mapping = NULL,
 	.restart_block	= {
 		.fn = do_no_restart_syscall,
 	},
diff --git a/mm/gup.c b/mm/gup.c
index 98f13ab37b..9cc1479201 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -849,6 +849,13 @@  static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		}
 		cond_resched();
 
+		if (current->faults_disabled_mapping &&
+		    vma->vm_file &&
+		    vma->vm_file->f_mapping == current->faults_disabled_mapping) {
+			ret = -EFAULT;
+			goto out;
+		}
+
 		page = follow_page_mask(vma, start, foll_flags, &ctx);
 		if (!page) {
 			ret = faultin_page(tsk, vma, start, &foll_flags,