Message ID: 20210803191818.993968-1-agruenba@redhat.com (mailing list archive)
Series: gfs2: Fix mmap + page fault deadlocks
On Tue, Aug 3, 2021 at 12:18 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> With this patch queue, fstest generic/208 (aio-dio-invalidate-failure.c)
> endlessly spins in gfs2_file_direct_write. It looks as if there's a bug
> in get_user_pages_fast when called with FOLL_FAST_ONLY:
>
> (1) The test case performs an aio write into a 32 MB buffer.
>
> (2) The buffer is initially not in memory, so when iomap_dio_rw() ->
>     ... -> bio_iov_iter_get_pages() is called with the iter->noio flag
>     set, we get to get_user_pages_fast() with FOLL_FAST_ONLY set.
>     get_user_pages_fast() returns 0, which causes
>     bio_iov_iter_get_pages to return -EFAULT.
>
> (3) Then gfs2_file_direct_write faults in the entire buffer with
>     fault_in_iov_iter_readable(), which succeeds.
>
> (4) With the buffer in memory, we retry the iomap_dio_rw() ->
>     ... -> bio_iov_iter_get_pages() -> ... -> get_user_pages_fast().
>     This should succeed now, but get_user_pages_fast() still returns 0.

Hmm. Have you tried to figure out why that "still returns 0" happens?

One option - for debugging only - would be to introduce a new flag to
get_user_pages_fast() that says "print out reason if failed" and make
the retry (but not the original one) have that flag set.

There are a couple of things of note when it comes to "get_user_pages_fast()":

 (a) some architectures don't even enable it

 (b) it can be very picky about the page table contents, and wants the
     accessed bit to already be set (or the dirty bit, in the case of a
     write).

but (a) shouldn't be an issue on any common platform, and (b) shouldn't
be an issue with fault_in_iov_iter_readable(), which actually does a
__get_user() and so accesses through the page tables.

(It might be more of an issue with fault_in_iov_iter_writeable() due to
walking the page tables by hand - if we don't do the proper
access/dirty setting, I could see get_user_pages_fast() failing.)
Anyway, for reason (a) I do think that eventually we should probably
introduce FOLL_NOFAULT, and allow the full "slow" page table walk -
just not calling down to handle_mm_fault() if it fails.

But (a) should be a non-issue in your test environment, and so it
would be interesting to hear what it is that fails. Because scanning
through the patches, they all _look_ fine to me (apart from the one
comment about return values, which is more about being consistent with
copy_to/from_user() and making the code simpler - not about
correctness).

              Linus
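[Editor's note: the fault-in-and-retry dance described in the bug report can be sketched in userspace C. All types and function names below are hypothetical stand-ins for the kernel's, chosen for illustration only: a direct-I/O write first tries a FOLL_FAST_ONLY-style page lookup, and on -EFAULT faults the buffer in and retries.]

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the iov_iter state. */
struct fake_iter {
	bool resident;   /* are the user pages populated? */
	bool accessed;   /* is the pte "accessed" bit set? */
};

/* Stand-in for get_user_pages_fast(..., FOLL_FAST_ONLY): only
 * succeeds if the pages are present and already marked accessed. */
static int gup_fast_only(const struct fake_iter *it)
{
	return it->resident && it->accessed;
}

/* Stand-in for bio_iov_iter_get_pages(): zero pages becomes -EFAULT. */
static int get_pages(const struct fake_iter *it)
{
	return gup_fast_only(it) ? 0 : -EFAULT;
}

/* Stand-in for fault_in_iov_iter_readable(): a __get_user()-style
 * read populates the pages and sets the accessed bit as a side effect. */
static void fault_in(struct fake_iter *it)
{
	it->resident = true;
	it->accessed = true;
}

/* gfs2_file_direct_write()-style loop with a single fault-in retry. */
static int direct_write(struct fake_iter *it)
{
	int ret = get_pages(it);

	if (ret == -EFAULT) {
		fault_in(it);
		ret = get_pages(it);  /* expected to succeed now */
	}
	return ret;
}
```

In this toy model the retry always succeeds; the whole thread is about a real case ((4) above) where the equivalent of `fault_in()` runs but the fast lookup still fails.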
On Tue, Aug 3, 2021 at 9:45 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Aug 3, 2021 at 12:18 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > With this patch queue, fstest generic/208 (aio-dio-invalidate-failure.c)
> > endlessly spins in gfs2_file_direct_write. It looks as if there's a bug
> > in get_user_pages_fast when called with FOLL_FAST_ONLY:
> >
> > (1) The test case performs an aio write into a 32 MB buffer.
> >
> > (2) The buffer is initially not in memory, so when iomap_dio_rw() ->
> >     ... -> bio_iov_iter_get_pages() is called with the iter->noio flag
> >     set, we get to get_user_pages_fast() with FOLL_FAST_ONLY set.
> >     get_user_pages_fast() returns 0, which causes
> >     bio_iov_iter_get_pages to return -EFAULT.
> >
> > (3) Then gfs2_file_direct_write faults in the entire buffer with
> >     fault_in_iov_iter_readable(), which succeeds.
> >
> > (4) With the buffer in memory, we retry the iomap_dio_rw() ->
> >     ... -> bio_iov_iter_get_pages() -> ... -> get_user_pages_fast().
> >     This should succeed now, but get_user_pages_fast() still returns 0.
>
> Hmm. Have you tried to figure out why that "still returns 0" happens?

The call stack is:

  gup_pte_range
  gup_pmd_range
  gup_pud_range
  gup_p4d_range
  gup_pgd_range
  lockless_pages_from_mm
  internal_get_user_pages_fast
  get_user_pages_fast
  iov_iter_get_pages
  __bio_iov_iter_get_pages
  bio_iov_iter_get_pages
  iomap_dio_bio_actor
  iomap_dio_actor
  iomap_apply
  iomap_dio_rw
  gfs2_file_direct_write

In gup_pte_range, pte_special(pte) is true and so we return 0.

> One option - for debugging only - would be to introduce a new flag to
> get_user_pages_fast() that says "print out reason if failed" and make
> the retry (but not the original one) have that flag set.
> There are a couple of things of note when it comes to "get_user_pages_fast()":
>
>  (a) some architectures don't even enable it
>
>  (b) it can be very picky about the page table contents, and wants the
>      accessed bit to already be set (or the dirty bit, in the case of a
>      write).
>
> but (a) shouldn't be an issue on any common platform and (b) shouldn't
> be an issue with fault_in_iov_iter_readable() that actually does a
> __get_user() so it will access through the page tables.
>
> (It might be more of an issue with fault_in_iov_iter_writeable() due to
> walking the page tables by hand - if we don't do the proper
> access/dirty setting, I could see get_user_pages_fast() failing).
>
> Anyway, for reason (a) I do think that eventually we should probably
> introduce FOLL_NOFAULT, and allow the full "slow" page table walk -
> just not calling down to handle_mm_fault() if it fails.
>
> But (a) should be a non-issue in your test environment, and so it
> would be interesting to hear what it is that fails. Because scanning
> through the patches, they all _look_ fine to me (apart from the one
> comment about return values, which is more about being consistent with
> copy_to/from_user() and making the code simpler - not about
> correctness)
>
> Linus

Thanks,
Andreas
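[Editor's note: the pte_special() finding above is the crux of the bug report, and can be illustrated with a small userspace sketch. The types and the helper name are made up; only the control flow mirrors the check in gup_pte_range(). The lockless fast path holds no mmap lock, so it cannot consult the vma to interpret a special mapping; faulting the pages in makes them present, but never makes them non-special.]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pte representation for illustration. */
struct fake_pte {
	bool present;
	bool special;   /* e.g. set for VM_PFNMAP/VM_IO-style mappings */
};

/* Mimics the gup_pte_range() bail-out: returns the number of pages
 * grabbed from this pte (1 or 0). */
static int gup_pte_range_sketch(const struct fake_pte *pte)
{
	if (!pte->present)
		return 0;   /* not populated: fast path gives up */
	if (pte->special)
		return 0;   /* pte_special(): fast path gives up too */
	return 1;
}
```

This is why step (4) of the report loops: fault-in flips `present` but not `special`, so the FOLL_FAST_ONLY lookup keeps returning 0.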
[ Sorry for the delay, I was on the road and this fell through the cracks ]

On Mon, Aug 16, 2021 at 12:14 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> On Tue, Aug 3, 2021 at 9:45 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Hmm. Have you tried to figure out why that "still returns 0" happens?
>
> The call stack is:
>
>   gup_pte_range
>   gup_pmd_range
>   gup_pud_range
>   gup_p4d_range
>   gup_pgd_range
>   lockless_pages_from_mm
>   internal_get_user_pages_fast
>   get_user_pages_fast
>   iov_iter_get_pages
>   __bio_iov_iter_get_pages
>   bio_iov_iter_get_pages
>   iomap_dio_bio_actor
>   iomap_dio_actor
>   iomap_apply
>   iomap_dio_rw
>   gfs2_file_direct_write
>
> In gup_pte_range, pte_special(pte) is true and so we return 0.

Ok, so that is indeed something that the fast-case can't handle,
because some of the special code wants to have the mm_lock so that it
can look at the vma flags (eg "vm_normal_page()" and friends).

That said, some of these cases even the full GUP won't ever handle,
simply because a mapping doesn't necessarily even _have_ a 'struct
page' associated with it if it's a VM_IO mapping.

So it turns out that you can't just always do
fault_in_iov_iter_readable() and then assume that you can do
iov_iter_get_pages() and repeat until successful.

We could certainly make get_user_pages_fast() handle a few more cases,
but I get the feeling that we need to have separate error cases for
EFAULT - no page exists - and the "page exists, but cannot be mapped
as a 'struct page'" case.

I also do still think that even regardless of that, we want to just
add a FOLL_NOFAULT flag that just disables calling handle_mm_fault(),
and then you can use the regular get_user_pages().

That at least gives us the full _normal_ page handling stuff.

              Linus
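[Editor's note: the FOLL_NOFAULT idea proposed above can be sketched in userspace. The flag value, struct, and function below are illustrative inventions, not the kernel's: the point is only the semantics - take the full "slow" path, which may walk the page tables and inspect the vma, but never call handle_mm_fault() to populate a missing page.]

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define FOLL_NOFAULT_SK  0x1   /* hypothetical flag bit */

/* Hypothetical vma state for illustration. */
struct fake_vma {
	bool mapped;            /* is the page populated? */
	bool has_struct_page;   /* false for VM_IO-style mappings */
};

/* Slow-path lookup sketch: with FOLL_NOFAULT_SK set we refuse to
 * fault, but we still get the full vma-aware handling. */
static int slow_gup_sketch(struct fake_vma *v, unsigned int flags)
{
	if (!v->mapped) {
		if (flags & FOLL_NOFAULT_SK)
			return -EFAULT;  /* would need a fault; refuse */
		v->mapped = true;        /* handle_mm_fault() path */
	}
	if (!v->has_struct_page)
		return -EFAULT;  /* no 'struct page' to hand back */
	return 1;
}
```

Note the second failure mode fails regardless of the flag, matching the point above that a VM_IO mapping can never be handed back as a 'struct page' even by the full slow path.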
On Wed, Aug 18, 2021 at 11:49 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> [ Sorry for the delay, I was on the road and this fell through the cracks ]

No harm done, I was busy enough implementing your previous suggestions.

> On Mon, Aug 16, 2021 at 12:14 PM Andreas Gruenbacher
> <agruenba@redhat.com> wrote:
> >
> > On Tue, Aug 3, 2021 at 9:45 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > Hmm. Have you tried to figure out why that "still returns 0" happens?
> >
> > The call stack is:
> >
> >   gup_pte_range
> >   gup_pmd_range
> >   gup_pud_range
> >   gup_p4d_range
> >   gup_pgd_range
> >   lockless_pages_from_mm
> >   internal_get_user_pages_fast
> >   get_user_pages_fast
> >   iov_iter_get_pages
> >   __bio_iov_iter_get_pages
> >   bio_iov_iter_get_pages
> >   iomap_dio_bio_actor
> >   iomap_dio_actor
> >   iomap_apply
> >   iomap_dio_rw
> >   gfs2_file_direct_write
> >
> > In gup_pte_range, pte_special(pte) is true and so we return 0.
>
> Ok, so that is indeed something that the fast-case can't handle,
> because some of the special code wants to have the mm_lock so that it
> can look at the vma flags (eg "vm_normal_page()" and friends).
>
> That said, some of these cases even the full GUP won't ever handle,
> simply because a mapping doesn't necessarily even _have_ a 'struct
> page' associated with it if it's a VM_IO mapping.
>
> So it turns out that you can't just always do
> fault_in_iov_iter_readable() and then assume that you can do
> iov_iter_get_pages() and repeat until successful.
>
> We could certainly make get_user_pages_fast() handle a few more cases,
> but I get the feeling that we need to have separate error cases for
> EFAULT - no page exists - and the "page exists, but cannot be mapped
> as a 'struct page'" case.

Hmm, what if GUP is made to skip VM_IO vmas without adding anything to
the pages array? That would match fault_in_iov_iter_writeable, which
is modeled after __mm_populate and which skips VM_IO and VM_PFNMAP
vmas.
The other strategy I've added is to scale back the fault-in window to
a single page if faulting in multiple pages didn't help, and to give
up if the I/O operation still fails after that. So at least
pathological cases won't loop indefinitely anymore.

> I also do still think that even regardless of that, we want to just
> add a FOLL_NOFAULT flag that just disables calling handle_mm_fault(),
> and then you can use the regular get_user_pages().
>
> That at least gives us the full _normal_ page handling stuff.

And it does fix the generic/208 failure.

Thanks,
Andreas
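[Editor's note: the window-scaling strategy described above can be sketched as follows. The interface and the flaky I/O stub are invented for illustration; the real series operates on iov_iter windows inside the filesystem read/write paths. The essential property is that the number of retries is bounded: full window, then a single page, then give up.]

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Try the I/O with the full fault-in window; on failure retry once
 * with a single-page window; if that also fails, give up rather
 * than looping forever. */
static int do_io_with_backoff(int (*try_io)(size_t pages, void *ctx),
			      size_t full_window, void *ctx)
{
	int ret = try_io(full_window, ctx);

	if (ret == -EFAULT)
		ret = try_io(1, ctx);   /* single-page window */
	return ret;                     /* still -EFAULT: caller gives up */
}

/* A deliberately flaky I/O stub for exercising the loop bound. */
struct io_state {
	int attempts;
	int succeed_on_attempt;   /* 0 means: never succeeds */
};

static int flaky_io(size_t pages, void *ctx)
{
	struct io_state *s = ctx;

	(void)pages;
	s->attempts++;
	if (s->succeed_on_attempt && s->attempts >= s->succeed_on_attempt)
		return 0;
	return -EFAULT;
}
```

Even an I/O that can never make progress results in exactly two attempts, which is the "won't loop indefinitely" property claimed above.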
On Thu, Aug 19, 2021 at 12:41 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> Hmm, what if GUP is made to skip VM_IO vmas without adding anything to
> the pages array? That would match fault_in_iov_iter_writeable, which
> is modeled after __mm_populate and which skips VM_IO and VM_PFNMAP
> vmas.

I don't understand what you mean. GUP already skips VM_IO (and
VM_PFNMAP) pages. It just returns EFAULT.

We could make it return another error. We already have DAX and
FOLL_LONGTERM returning -EOPNOTSUPP.

Of course, I think some code ends up always just returning "number of
pages looked up" and might return 0 for "no pages" rather than the
error for the first page.

So we may end up having interfaces that then lose that explanatory
error code, but I didn't check.

But we couldn't make it just say "skip them and try later addresses",
if that is what you meant. THAT makes no sense - that would just make
GUP look up some other address than what was asked for.

> > I also do still think that even regardless of that, we want to just
> > add a FOLL_NOFAULT flag that just disables calling handle_mm_fault(),
> > and then you can use the regular get_user_pages().
> >
> > That at least gives us the full _normal_ page handling stuff.
>
> And it does fix the generic/208 failure.

Good. So I think the approach is usable, even if we might have corner
cases left.

So I think the remaining issue is exactly things like VM_IO and
VM_PFNMAP. Do the fstests have test-cases for things like this? It
_is_ quite specialized, so it might be a good idea to have that.

Of course, doing direct-IO from special memory regions with zerocopy
might be something special people actually want to do. But I think
we've had that VM_IO flag testing there basically forever, so I don't
think it has ever worked (for some definition of "ever").

              Linus
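[Editor's note: the "interfaces lose that error code" concern above can be illustrated with a made-up helper (not the kernel's GUP): an interface that reports "number of pages looked up" returns a positive count on partial success, so a descriptive error such as -EOPNOTSUPP on a later page never reaches the caller; only a failure on the very first page survives as an error.]

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical lookup over per-page results: page_errs[i] is 0 on
 * success or a negative errno for that page. Returns the number of
 * leading successful pages, or the first page's error if there were
 * no successful pages at all. */
static int pin_pages_sketch(const int *page_errs, int nr)
{
	int i;

	for (i = 0; i < nr; i++)
		if (page_errs[i])
			break;
	return i ? i : page_errs[0];   /* count, or first page's error */
}
```

A caller that only checks for a nonnegative return thus cannot distinguish "short because of -EFAULT" from "short because of -EOPNOTSUPP" unless it retries at the failing offset.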
On Thu, Aug 19, 2021 at 10:14 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Aug 19, 2021 at 12:41 PM Andreas Gruenbacher
> <agruenba@redhat.com> wrote:
> >
> > Hmm, what if GUP is made to skip VM_IO vmas without adding anything to
> > the pages array? That would match fault_in_iov_iter_writeable, which
> > is modeled after __mm_populate and which skips VM_IO and VM_PFNMAP
> > vmas.
>
> I don't understand what you mean. GUP already skips VM_IO (and
> VM_PFNMAP) pages. It just returns EFAULT.
>
> We could make it return another error. We already have DAX and
> FOLL_LONGTERM returning -EOPNOTSUPP.
>
> Of course, I think some code ends up always just returning "number of
> pages looked up" and might return 0 for "no pages" rather than the
> error for the first page.
>
> So we may end up having interfaces that then lose that explanatory
> error code, but I didn't check.
>
> But we couldn't make it just say "skip them and try later addresses",
> if that is what you meant. THAT makes no sense - that would just make
> GUP look up some other address than what was asked for.

get_user_pages has a start and a nr_pages argument, which together
specify an address range from start to start + nr_pages * PAGE_SIZE.
If pages != NULL, it adds a pointer to that array for each PAGE_SIZE
subpage. I was thinking of skipping over VM_IO vmas in that process:
when the range starts in a mappable vma, runs into a VM_IO vma, and
ends in a mappable vma, the pages in the pages array would be
discontiguous; they would only cover the mappable vmas. But that would
make it difficult to make sense of what's in the pages array. So
scratch that idea.

> > > I also do still think that even regardless of that, we want to just
> > > add a FOLL_NOFAULT flag that just disables calling handle_mm_fault(),
> > > and then you can use the regular get_user_pages().
> > >
> > > That at least gives us the full _normal_ page handling stuff.
> >
> > And it does fix the generic/208 failure.
>
> Good. So I think the approach is usable, even if we might have corner
> cases left.
>
> So I think the remaining issue is exactly things like VM_IO and
> VM_PFNMAP. Do the fstests have test-cases for things like this? It
> _is_ quite specialized, so it might be a good idea to have that.
>
> Of course, doing direct-IO from special memory regions with zerocopy
> might be something special people actually want to do. But I think
> we've had that VM_IO flag testing there basically forever, so I don't
> think it has ever worked (for some definition of "ever").

The v6 patch queue should handle those cases acceptably well for now,
but I don't think we have tests covering that at all.

Thanks,
Andreas