Message ID | 20210511140113.1225981-1-agruenba@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] Trigger retry from fault vm operation | expand |
On Tue, May 11, 2021 at 04:01:13PM +0200, Andreas Gruenbacher wrote: > we have a locking problem in gfs2 that I don't have a proper solution for, so > I'm looking for suggestions. > > What's happening is that a page fault triggers during a read or write > operation, while we're holding a glock (the cluster-wide gfs2 inode > lock), and the page fault requires another glock. We can recognize and > handle the case when both glocks are the same, but when the page fault requires > another glock, there is a chance that taking that other glock would deadlock. So we're looking at something like one file on a gfs2 filesystem being mmaped() and then doing read() or write() to another gfs2 file with the mmaped address being the passed to read()/write()? Have you looked at iov_iter_fault_in_readable() as a solution to your locking order? That way, you bring the mmaped page in first (see generic_perform_write()). > When we realize that we may not be able to take the other glock in gfs2_fault, > we need to communicate that to the read or write operation, which will then > drop and re-acquire the "outer" glock and retry. However, there doesn't seem > to be a good way to do that; we can only indicate that a page fault should fail > by returning VM_FAULT_SIGBUS or similar; that will then be mapped to -EFAULT. > We'd need something like VM_FAULT_RESTART that can be mapped to -EBUSY so that > we can tell the retry case apart from genuine -EFAULT errors. We do have VM_FAULT_RETRY ... does that retry at the wrong level?
On Tue, May 11, 2021 at 4:34 PM Matthew Wilcox <willy@infradead.org> wrote: > On Tue, May 11, 2021 at 04:01:13PM +0200, Andreas Gruenbacher wrote: > > we have a locking problem in gfs2 that I don't have a proper solution for, so > > I'm looking for suggestions. > > > > What's happening is that a page fault triggers during a read or write > > operation, while we're holding a glock (the cluster-wide gfs2 inode > > lock), and the page fault requires another glock. We can recognize and > > handle the case when both glocks are the same, but when the page fault requires > > another glock, there is a chance that taking that other glock would deadlock. > > So we're looking at something like one file on a gfs2 filesystem being > mmaped() and then doing read() or write() to another gfs2 file with the > mmaped address being the passed to read()/write()? Yes, those kinds of scenarios. Here's an example that Jan Kara came up with: Two independent processes P1, P2. Two files F1, F2, and two mappings M1, M2 where M1 is a mapping of F1, M2 is a mapping of F2. Now P1 does DIO to F1 with M2 as a buffer, P2 does DIO to F2 with M1 as a buffer. They can race like: P1 P2 read() read() gfs2_file_read_iter() gfs2_file_read_iter() gfs2_file_direct_read() gfs2_file_direct_read() locks glock of F1 locks glock of F2 iomap_dio_rw() iomap_dio_rw() bio_iov_iter_get_pages() bio_iov_iter_get_pages() <fault in M2> <fault in M1> gfs2_fault() gfs2_fault() tries to grab glock of F2 tries to grab glock of F1 With cluster-wide locks, we can obviously end up with distributed deadlock scenarios as well, of course. > Have you looked at iov_iter_fault_in_readable() as a solution to > your locking order? That way, you bring the mmaped page in first > (see generic_perform_write()). Yes. The problem there is that we need to hold the inode glock from ->iomap_begin to ->iomap_end; that's what guarantees that the mapping returned by ->iomap_begin remains valid. > > When we realize that we may not be able to take the other glock in gfs2_fault, > > we need to communicate that to the read or write operation, which will then > > drop and re-acquire the "outer" glock and retry. However, there doesn't seem > > to be a good way to do that; we can only indicate that a page fault should fail > > by returning VM_FAULT_SIGBUS or similar; that will then be mapped to -EFAULT. > > We'd need something like VM_FAULT_RESTART that can be mapped to -EBUSY so that > > we can tell the retry case apart from genuine -EFAULT errors. > > We do have VM_FAULT_RETRY ... does that retry at the wrong level? There's also VM_FAULT_NOPAGE, but that only triggers a retry at the VM level and doesn't propagate out far enough. My impression is that VM_FAULT_RETRY is similar to VM_FAULT_NOPAGE except that it allows the lock dropping optimization implemented in maybe_unlock_mmap_for_io(). That error code can also only be used when FAULT_FLAG_ALLOW_RETRY is set it seems. Correct me if I'm getting this wrong. Thanks, Andreas
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c index 658fed79d65a..253e720f2df0 100644 --- a/fs/gfs2/file.c +++ b/fs/gfs2/file.c @@ -543,10 +543,15 @@ static vm_fault_t gfs2_fault(struct vm_fault *vmf) vm_fault_t ret; int err; - gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh); + gfs2_holder_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_TRY, &gh); if (likely(!recursive)) { err = gfs2_glock_nq(&gh); if (err) { + if (err == GLR_TRYFAILED) { + current->restart_hack = 1; + ret = VM_FAULT_SIGBUS; + goto out_uninit; + } ret = block_page_mkwrite_return(err); goto out_uninit; } @@ -773,12 +778,16 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to, return 0; /* skip atime */ gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh); +restart: ret = gfs2_glock_nq(gh); if (ret) goto out_uninit; + current->restart_hack = 0; ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0); gfs2_glock_dq(gh); + if (unlikely(ret == -EFAULT) && current->restart_hack) + goto restart; out_uninit: gfs2_holder_uninit(gh); return ret; @@ -803,6 +812,8 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from, * VFS does. */ gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh); + +restart: ret = gfs2_glock_nq(gh); if (ret) goto out_uninit; @@ -811,11 +822,14 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from, if (offset + len > i_size_read(&ip->i_inode)) goto out; + current->restart_hack = 0; ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0); if (ret == -ENOTBLK) ret = 0; out: gfs2_glock_dq(gh); + if (unlikely(ret == -EFAULT) && current->restart_hack) + goto restart; out_uninit: gfs2_holder_uninit(gh); return ret; @@ -834,7 +848,9 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to) return ret; iocb->ki_flags &= ~IOCB_DIRECT; } +restart1: iocb->ki_flags |= IOCB_NOIO; + current->restart_hack = 0; ret = generic_file_read_iter(iocb, to); iocb->ki_flags &= ~IOCB_NOIO; if (ret >= 0) { @@ -842,6 +858,8 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to) return ret; written = ret; } else { + if (unlikely(ret == -EFAULT) && current->restart_hack) + goto restart1; if (ret != -EAGAIN) return ret; if (iocb->ki_flags & IOCB_NOWAIT) @@ -849,13 +867,18 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to) } ip = GFS2_I(iocb->ki_filp->f_mapping->host); gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh); + +restart2: ret = gfs2_glock_nq(&gh); if (ret) goto out_uninit; + current->restart_hack = 0; ret = generic_file_read_iter(iocb, to); if (ret > 0) written += ret; gfs2_glock_dq(&gh); + if (unlikely(ret == -EFAULT) && current->restart_hack) + goto restart2; out_uninit: gfs2_holder_uninit(&gh); return written ? written : ret; @@ -912,13 +935,18 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from) goto out_unlock; iocb->ki_flags |= IOCB_DSYNC; +restart1: + current->restart_hack = 0; current->backing_dev_info = inode_to_bdi(inode); buffered = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops); current->backing_dev_info = NULL; if (unlikely(buffered <= 0)) { + if (unlikely(buffered == -EFAULT) && current->restart_hack) + goto restart1; if (!ret) ret = buffered; goto out_unlock; + } /* * We need to ensure that the page cache pages are written to @@ -935,10 +963,15 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from) if (!ret || ret2 > 0) ret += ret2; } else { +restart2: + current->restart_hack = 0; current->backing_dev_info = inode_to_bdi(inode); ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops); current->backing_dev_info = NULL; - if (likely(ret > 0)) { + if (unlikely(ret <= 0)) { + if (unlikely(ret == -EFAULT) && current->restart_hack) + goto restart2; + } else { iocb->ki_pos += ret; ret = generic_write_sync(iocb, ret); } diff --git a/include/linux/sched.h b/include/linux/sched.h index d2c881384517..de053ac8d8d6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1067,6 +1067,7 @@ struct task_struct { /* Journalling filesystem info: */ void *journal_info; + unsigned restart_hack:1; /* Stacked block device info: */ struct bio_list *bio_list;