Message ID | 20191216193852.GA8664@kmo-pixel (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [LSF/MM/BPF,TOPIC] Bcachefs update | expand |
On Mon 16-12-19 14:38:52, Kent Overstreet wrote: > Pagecache consistency: > > I recently got rid of my pagecache add lock; that added locking to core paths in > filemap.c and some found my locking scheme to be distastefull (and I never liked > it enough to argue). I've recently switched to something closer to XFS's locking > scheme (top of the IO paths); however, I do still need one patch to the > get_user_pages() path to avoid deadlock via recursive page fault - patch is > below: > > (This would probably be better done as a new argument to get_user_pages(); I > didn't do it that way initially because the patch would have been _much_ > bigger.) > > Yee haw. > > commit 20ebb1f34cc9a532a675a43b5bd48d1705477816 > Author: Kent Overstreet <kent.overstreet@gmail.com> > Date: Wed Oct 16 15:03:50 2019 -0400 > > mm: Add a mechanism to disable faults for a specific mapping > > This will be used to prevent a nasty cache coherency issue for O_DIRECT > writes; O_DIRECT writes need to shoot down the range of the page cache > corresponding to the part of the file being written to - but, if the > file is mapped in, userspace can pass in an address in that mapping to > pwrite(), causing those pages to be faulted back into the page cache > in get_user_pages(). > > Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> I'm not really sure about the exact nature of the deadlock since the changelog doesn't explain it but if you need to take some lockA in your page fault path and you already hold lockA in your DIO code, then this patch isn't going to cut it. Just think of a malicious scheme with two tasks one doing DIO from fileA (protected by lockA) to buffers mapped from fileB and the other process the other way around... Honza
On Wed, Dec 18, 2019 at 01:40:52PM +0100, Jan Kara wrote: > On Mon 16-12-19 14:38:52, Kent Overstreet wrote: > > Pagecache consistency: > > > > I recently got rid of my pagecache add lock; that added locking to core paths in > > filemap.c and some found my locking scheme to be distastefull (and I never liked > > it enough to argue). I've recently switched to something closer to XFS's locking > > scheme (top of the IO paths); however, I do still need one patch to the > > get_user_pages() path to avoid deadlock via recursive page fault - patch is > > below: > > > > (This would probably be better done as a new argument to get_user_pages(); I > > didn't do it that way initially because the patch would have been _much_ > > bigger.) > > > > Yee haw. > > > > commit 20ebb1f34cc9a532a675a43b5bd48d1705477816 > > Author: Kent Overstreet <kent.overstreet@gmail.com> > > Date: Wed Oct 16 15:03:50 2019 -0400 > > > > mm: Add a mechanism to disable faults for a specific mapping > > > > This will be used to prevent a nasty cache coherency issue for O_DIRECT > > writes; O_DIRECT writes need to shoot down the range of the page cache > > corresponding to the part of the file being written to - but, if the > > file is mapped in, userspace can pass in an address in that mapping to > > pwrite(), causing those pages to be faulted back into the page cache > > in get_user_pages(). > > > > Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> > > I'm not really sure about the exact nature of the deadlock since the > changelog doesn't explain it but if you need to take some lockA in your > page fault path and you already hold lockA in your DIO code, then this > patch isn't going to cut it. Just think of a malicious scheme with two > tasks one doing DIO from fileA (protected by lockA) to buffers mapped from > fileB and the other process the other way around... Ooh, yeah, good catch... The lock in question is - for the purposes of this discussion, a RW lock (call it map lock here): the fault handler and the buffered IO paths take it it read mode, and the DIO path takes it in write mode to block new pages being added to the page cache. But get_user_pages() -> page fault invokes the fault handler, hence deadlock. My patch was for avoiding this deadlock when the fault handler tries locking the same inode's map lock, but as you note this is a more general problem... This is starting to smell like possibly what wound/wait mutexes were invented for, a situation where we need deadlock avoidance because lock ordering is under userspace control. So for that we need some state describing what locks are held that we can refer to when taking the next lock of this class - and since it's got to be shared between the dio write path and then (via gup()) the fault handler, that means it's pretty much going to have to hang off of task struct. Then in the fault handler, when we go to take the map lock we: - return -EFAULT if it's the same lock the dio write path took - trylock; if that fails and lock ordering is wrong (pointer comparison of the locks works here) then we have to do a dance that involves bailing out and retrying from the top of the dio write path. I dunno. On the plus side, if other filesystems don't want this I think this can all be implemented in bcachefs code with only a pointer added to task_struct to hang this lock state, but I would much rather either get buy in from the other filesystem people and make this a general purpose facility or not do it at all. And I'm not sure if anyone else cares about the page cache consistency issues inherent to dio writes as much as I do... so I'd like to hear from other people on that.
On Wed 18-12-19 14:11:14, Kent Overstreet wrote: > On Wed, Dec 18, 2019 at 01:40:52PM +0100, Jan Kara wrote: > > On Mon 16-12-19 14:38:52, Kent Overstreet wrote: > > > Pagecache consistency: > > > > > > I recently got rid of my pagecache add lock; that added locking to core paths in > > > filemap.c and some found my locking scheme to be distastefull (and I never liked > > > it enough to argue). I've recently switched to something closer to XFS's locking > > > scheme (top of the IO paths); however, I do still need one patch to the > > > get_user_pages() path to avoid deadlock via recursive page fault - patch is > > > below: > > > > > > (This would probably be better done as a new argument to get_user_pages(); I > > > didn't do it that way initially because the patch would have been _much_ > > > bigger.) > > > > > > Yee haw. > > > > > > commit 20ebb1f34cc9a532a675a43b5bd48d1705477816 > > > Author: Kent Overstreet <kent.overstreet@gmail.com> > > > Date: Wed Oct 16 15:03:50 2019 -0400 > > > > > > mm: Add a mechanism to disable faults for a specific mapping > > > > > > This will be used to prevent a nasty cache coherency issue for O_DIRECT > > > writes; O_DIRECT writes need to shoot down the range of the page cache > > > corresponding to the part of the file being written to - but, if the > > > file is mapped in, userspace can pass in an address in that mapping to > > > pwrite(), causing those pages to be faulted back into the page cache > > > in get_user_pages(). > > > > > > Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> > > > > I'm not really sure about the exact nature of the deadlock since the > > changelog doesn't explain it but if you need to take some lockA in your > > page fault path and you already hold lockA in your DIO code, then this > > patch isn't going to cut it. Just think of a malicious scheme with two > > tasks one doing DIO from fileA (protected by lockA) to buffers mapped from > > fileB and the other process the other way around... > > Ooh, yeah, good catch... > > The lock in question is - for the purposes of this discussion, a RW lock (call > it map lock here): the fault handler and the buffered IO paths take it it read > mode, and the DIO path takes it in write mode to block new pages being added to > the page cache. > > But get_user_pages() -> page fault invokes the fault handler, hence > deadlock. My patch was for avoiding this deadlock when the fault handler > tries locking the same inode's map lock, but as you note this is a more > general problem... > > This is starting to smell like possibly what wound/wait mutexes were > invented for, a situation where we need deadlock avoidance because lock > ordering is under userspace control. > > So for that we need some state describing what locks are held that we can > refer to when taking the next lock of this class - and since it's got to > be shared between the dio write path and then (via gup()) the fault > handler, that means it's pretty much going to have to hang off of task > struct. Then in the fault handler, when we go to take the map lock we: > - return -EFAULT if it's the same lock the dio write path took > - trylock; if that fails and lock ordering is wrong (pointer comparison of the > locks works here) then we have to do a dance that involves bailing out and > retrying from the top of the dio write path. Well, I don't think the problem is actually limited to your "map lock". Because by calling get_user_pages() from under "map lock" in DIO code, you create for example "map lock" (inode A) -> mmap_sem (process P) lock dependency. And when you do page fault, you create mmap_sem (process Q) -> "map lock" (inode B) lock dependency. It does not take that much effort to chain these in a way that will deadlock the system. And the lock tracking that would be able to detect such locking cycles would IMHO be too expensive as it would have to cover multiple different rather hot locks and also be coordinated between multiple processes. > I dunno. On the plus side, if other filesystems don't want this I think > this can all be implemented in bcachefs code with only a pointer added to > task_struct to hang this lock state, but I would much rather either get > buy in from the other filesystem people and make this a general purpose > facility or not do it at all. > > And I'm not sure if anyone else cares about the page cache consistency > issues inherent to dio writes as much as I do... so I'd like to hear from > other people on that. I don't think other filesystems care much about DIO cache coherency. But OTOH I do agree that these page-fault vs DIO vs buffered IO page cache handling races are making life difficult even for other filesystems because it is not just about DIO cache coherency but in some nasty corner cases also about stale data exposure or filesystem corruption. And filesystems have to jump through the hoops to avoid them. So I think in general people are interested in somehow making the situation simpler - for example I think it would simplify life for Dave's range lock he is working on for XFS... Honza
diff --git a/include/linux/sched.h b/include/linux/sched.h index ebfa046b2d..3b4d9689ef 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -740,6 +740,7 @@ struct task_struct { struct mm_struct *mm; struct mm_struct *active_mm; + struct address_space *faults_disabled_mapping; /* Per-thread vma caching: */ struct vmacache vmacache; diff --git a/init/init_task.c b/init/init_task.c index ee3d9d29b8..706abd9547 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -77,6 +77,7 @@ struct task_struct init_task .nr_cpus_allowed= NR_CPUS, .mm = NULL, .active_mm = &init_mm, + .faults_disabled_mapping = NULL, .restart_block = { .fn = do_no_restart_syscall, }, diff --git a/mm/gup.c b/mm/gup.c index 98f13ab37b..9cc1479201 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -849,6 +849,13 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, } cond_resched(); + if (current->faults_disabled_mapping && + vma->vm_file && + vma->vm_file->f_mapping == current->faults_disabled_mapping) { + ret = -EFAULT; + goto out; + } + page = follow_page_mask(vma, start, foll_flags, &ctx); if (!page) { ret = faultin_page(tsk, vma, start, &foll_flags,
Hi all, I'd like to poke my head up and let everyone know where bcachefs is at, and talk about finally upstreaming it. Last LSF (two years ago), bcachefs was coming along quite nicely and was quite usable but there was still some core work unfinished (primarily persistent alloc info; we still had to walk all metadata at mount time). Additionally, there some unresolved discussions around locking for pagecache consistency. The core features one would expect from a posix filesystem are now done, and then some. Reflink was finished recently, and I'm now starting to work towards snapshots. If there's interest I may talk a bit about the plan for snapshots in bcachefs. The short version is: all metadata in bcachefs are keys in various btrees (extents/inodes/dirents/xattrs btrees) indexed by inode:offset; for snapshots we extent the key so that the low bits are a snapshot id, i.e. inode:offset:snapshot. Snapshots form a tree where the root has id U32_MAX and children always have smaller IDs than their parent, so to read from a given snapshot we do a lookup as normal, including the snapshot ID of the snapshot we want, and then filter out keys from unrelated (non ancestor) snapshots. This will give us excellent overall performance when there are many snapshots that each have only a small number of overwrites; when we end up in a situation where a given part of the keyspace has many keys from unrelated snapshots we'll want to arrange metadata differently. This scheme does get considerably trickier when you add extents; that's what I've been focusing on recently. Pagecache consistency: I recently got rid of my pagecache add lock; that added locking to core paths in filemap.c and some found my locking scheme to be distastefull (and I never liked it enough to argue). I've recently switched to something closer to XFS's locking scheme (top of the IO paths); however, I do still need one patch to the get_user_pages() path to avoid deadlock via recursive page fault - patch is below: (This would probably be better done as a new argument to get_user_pages(); I didn't do it that way initially because the patch would have been _much_ bigger.) Yee haw. commit 20ebb1f34cc9a532a675a43b5bd48d1705477816 Author: Kent Overstreet <kent.overstreet@gmail.com> Date: Wed Oct 16 15:03:50 2019 -0400 mm: Add a mechanism to disable faults for a specific mapping This will be used to prevent a nasty cache coherency issue for O_DIRECT writes; O_DIRECT writes need to shoot down the range of the page cache corresponding to the part of the file being written to - but, if the file is mapped in, userspace can pass in an address in that mapping to pwrite(), causing those pages to be faulted back into the page cache in get_user_pages(). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>