
ceph: only dirty ITER_IOVEC pages for direct read

Message ID CAOi1vP9f5Jr+kY0ZjM0JBRhcW+Nw8z9bmaMsAPQtT07O=hHuvw@mail.gmail.com (mailing list archive)
State New, archived

Commit Message

Ilya Dryomov April 1, 2018, 4:33 p.m. UTC
On Fri, Mar 30, 2018 at 9:14 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Fri, 2018-03-30 at 11:24 +0100, Luis Henriques wrote:
>> (CC'ing Jeff)
>>
>> Ilya Dryomov <idryomov@gmail.com> writes:
>>
>> > On Fri, Mar 30, 2018 at 9:26 AM, Yan, Zheng <zyan@redhat.com> wrote:
>> > >
>> > >
>> > > > On 30 Mar 2018, at 15:20, Ilya Dryomov <idryomov@gmail.com> wrote:
>> > > >
>> > > > On Fri, Mar 30, 2018 at 5:21 AM, Yan, Zheng <zyan@redhat.com> wrote:
>> > > > >
>> > > > >
>> > > > > > On 30 Mar 2018, at 00:49, Ilya Dryomov <idryomov@gmail.com> wrote:
>> > > > > >
>> > > > > > On Fri, Mar 16, 2018 at 4:32 AM, Yan, Zheng <zyan@redhat.com> wrote:
>> > > > > >
>> > > > > > Hi Zheng,
>> > > > > >
>> > > > > > Can you explain how this fixes the invalid memory access reported in
>> > > > > > "KASAN: use-after-free Read in set_page_dirty_lock"?  Did you come up
>> > > > > > with a different reproducer?
>> > > > > >
>> > > > >
>> > > > > I don't know why KASAN reports use-after-free (like false alarm) in
>> > > > > this case. I compared the code with other filesystem, found recent commits
>> > > > > make other filesystems only dirty ITER_IOVEC pages. The change should also
>> > > > > make KASAN silent.
>> > > >
>> > > > Did you manage to reproduce that KASAN report?
>> > > >
>> > >
>> > > no
>> > >
>> > > > If it is a false positive, the patch shouldn't have been marked for
>> > > > stable.
>> > > >
>> > >
>> > > I CCed stable because fuse did the same
>> > >
>> > > commit 8fba54aebbdf1f999738121922e74bf796ad60ee
>> > > Author: Miklos Szeredi <mszeredi@redhat.com>
>> > > Date:   Wed Aug 24 18:17:04 2016 +0200
>> > >
>> > >     fuse: direct-io: don't dirty ITER_BVEC pages
>> > >
>> > >     When reading from a loop device backed by a fuse file it deadlocks on
>> > >     lock_page().
>> > >
>> > >     This is because the page is already locked by the read() operation done on
>> > >     the loop device.  In this case we don't want to either lock the page or
>> > >     dirty it.
>> > >
>> > >     So do what fs/direct-io.c does: only dirty the page for ITER_IOVEC vectors.
>> >
>> > OK, so it is a potential deadlock which has nothing to do with that
>> > KASAN report.  I have rewritten the changelog to reflect that, please
>> > take a look:
>> >
>> >   https://github.com/ceph/ceph-client/commit/85784f9395987a422fa04263e7c0fb13da11eb5c
>>
>> Just as a side note, it's still trivial to get a cephfs-related KASAN
>> report by running a fuzzer (trinity in my case) against a mainline
>> kernel 4.16.0-rc7 with this fix backported.
>>
>
> I suspect trinity is clobbering the array
>> As Jeff mentioned in a different thread, splice syscall is broken on
>> cephfs and the fix for it is still tagged as "DO NOT MERGE" in the
>> ceph-client testing branch.  I still think there's some bug in Jeff's
>> fix as I still see a crash occasionally, but I haven't yet finished
>> debugging it.  Unfortunately, Jeff's fix is probably not appropriate for
>> stable kernels (but may I'm wrong).
>>
>> Cheers,
>
> (cc'ing Al)
>
> Yeah, I should have revisited this a while ago. So the deal is that I
> posted this patch upstream around a year ago, but Al didn't really like
> it:
>
>     https://patchwork.kernel.org/patch/9541829/
>
> He wanted to add some iov_iter_for_each_page infrastructure and base
> the new function on that. I had thought he had something queued up
> along those lines at one point, but I don't see anything in mainline so
> far.
>
> There is a iov_iter_for_each_range, but it doesn't handle userland
> iovecs and doesn't seem to care about alignment. I think we'll need
> both here.
>
> I'm not sure however that a per-page callback is really that helpful
> for callers like ceph_direct_read_write. We'll just end up having to do
> more  work in the fs to batch up the pages before we add the op to the
> req.
>
> iov_iter_get_pages_alloc is really sort of what we want in this case,
> and we want that function to stitch together longer arrays when it can.
>
> Al, do you have any thoughts on what we should do here?

I think this can be fixed entirely in ceph by walking away from page
vectors (barely tested):

  https://github.com/ceph/ceph-client/commits/wip-splice

If folks would like to see something like iov_iter_get_bvecs{,_alloc}()
in iov_iter.c, I can work on generalizing my helper (i.e. add maxbvecs
and maxsize parameters -- I didn't need maxbvecs and maxsize is just
"make a copy of the iterator and truncate it").  Otherwise we will
probably merge it into ceph, as this code has remained broken for way
too long...

Attached is the main patch from wip-splice.
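
Roughly, a generalized helper of that shape might look like the sketch
below.  This is only an illustration built around the loop in the
attached patch -- the name iter_to_bvecs, the maxbvecs/maxsize handling
and the decision to leave page releasing to the caller are assumptions,
not the actual wip-splice code:

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/bvec.h>
#include <linux/uio.h>

/*
 * Sketch: describe up to @maxsize bytes of @iter as at most @maxbvecs
 * bio_vecs, advancing the iterator as we go.  Returns the number of
 * bytes described, or a negative error if nothing could be converted.
 * Page references taken by iov_iter_get_pages() are left for the
 * caller to drop.
 */
static ssize_t iter_to_bvecs(struct iov_iter *iter, struct bio_vec *bvecs,
                             unsigned int maxbvecs, size_t maxsize)
{
        unsigned int bvec_idx = 0;
        size_t total = 0;

        while (iov_iter_count(iter) && total < maxsize &&
               bvec_idx < maxbvecs) {
                struct page *pages[16];
                ssize_t bytes;
                size_t start;
                int idx = 0;

                bytes = iov_iter_get_pages(iter, pages, maxsize - total,
                                           min_t(unsigned int, 16,
                                                 maxbvecs - bvec_idx),
                                           &start);
                if (bytes < 0)
                        return total ?: bytes;

                /* split the contiguous chunk into per-page bio_vecs */
                for ( ; bytes; idx++, bvec_idx++) {
                        struct bio_vec bv = {
                                .bv_page = pages[idx],
                                .bv_len = min_t(size_t, bytes,
                                                PAGE_SIZE - start),
                                .bv_offset = start,
                        };

                        bvecs[bvec_idx] = bv;
                        iov_iter_advance(iter, bv.bv_len);
                        total += bv.bv_len;
                        bytes -= bv.bv_len;
                        start = 0;
                }
        }

        return total;
}

Since the iterator is advanced as each bvec is recorded, a short
conversion leaves it positioned right after the bytes that were
actually described.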

Thanks,

                Ilya

Comments

Yan, Zheng April 2, 2018, 7:29 a.m. UTC | #1
> On Apr 2, 2018, at 00:33, Ilya Dryomov <idryomov@gmail.com> wrote:
> 
> On Fri, Mar 30, 2018 at 9:14 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> On Fri, 2018-03-30 at 11:24 +0100, Luis Henriques wrote:
>>> (CC'ing Jeff)
>>> 
>>> Ilya Dryomov <idryomov@gmail.com> writes:
>>> 
>>>> On Fri, Mar 30, 2018 at 9:26 AM, Yan, Zheng <zyan@redhat.com> wrote:
>>>>> 
>>>>> 
>>>>>> On 30 Mar 2018, at 15:20, Ilya Dryomov <idryomov@gmail.com> wrote:
>>>>>> 
>>>>>> On Fri, Mar 30, 2018 at 5:21 AM, Yan, Zheng <zyan@redhat.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> On 30 Mar 2018, at 00:49, Ilya Dryomov <idryomov@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> On Fri, Mar 16, 2018 at 4:32 AM, Yan, Zheng <zyan@redhat.com> wrote:
>>>>>>>> 
>>>>>>>> Hi Zheng,
>>>>>>>> 
>>>>>>>> Can you explain how this fixes the invalid memory access reported in
>>>>>>>> "KASAN: use-after-free Read in set_page_dirty_lock"?  Did you come up
>>>>>>>> with a different reproducer?
>>>>>>>> 
>>>>>>> 
>>>>>>> I don't know why KASAN reports use-after-free (like false alarm) in
>>>>>>> this case. I compared the code with other filesystem, found recent commits
>>>>>>> make other filesystems only dirty ITER_IOVEC pages. The change should also
>>>>>>> make KASAN silent.
>>>>>> 
>>>>>> Did you manage to reproduce that KASAN report?
>>>>>> 
>>>>> 
>>>>> no
>>>>> 
>>>>>> If it is a false positive, the patch shouldn't have been marked for
>>>>>> stable.
>>>>>> 
>>>>> 
>>>>> I CCed stable because fuse did the same
>>>>> 
>>>>> commit 8fba54aebbdf1f999738121922e74bf796ad60ee
>>>>> Author: Miklos Szeredi <mszeredi@redhat.com>
>>>>> Date:   Wed Aug 24 18:17:04 2016 +0200
>>>>> 
>>>>>    fuse: direct-io: don't dirty ITER_BVEC pages
>>>>> 
>>>>>    When reading from a loop device backed by a fuse file it deadlocks on
>>>>>    lock_page().
>>>>> 
>>>>>    This is because the page is already locked by the read() operation done on
>>>>>    the loop device.  In this case we don't want to either lock the page or
>>>>>    dirty it.
>>>>> 
>>>>>    So do what fs/direct-io.c does: only dirty the page for ITER_IOVEC vectors.
>>>> 
>>>> OK, so it is a potential deadlock which has nothing to do with that
>>>> KASAN report.  I have rewritten the changelog to reflect that, please
>>>> take a look:
>>>> 
>>>>  https://github.com/ceph/ceph-client/commit/85784f9395987a422fa04263e7c0fb13da11eb5c
>>> 
>>> Just as a side note, it's still trivial to get a cephfs-related KASAN
>>> report by running a fuzzer (trinity in my case) against a mainline
>>> kernel 4.16.0-rc7 with this fix backported.
>>> 
>> 
>> I suspect trinity is clobbering the array
>>> As Jeff mentioned in a different thread, splice syscall is broken on
>>> cephfs and the fix for it is still tagged as "DO NOT MERGE" in the
>>> ceph-client testing branch.  I still think there's some bug in Jeff's
>>> fix as I still see a crash occasionally, but I haven't yet finished
>>> debugging it.  Unfortunately, Jeff's fix is probably not appropriate for
>>> stable kernels (but may I'm wrong).
>>> 
>>> Cheers,
>> 
>> (cc'ing Al)
>> 
>> Yeah, I should have revisited this a while ago. So the deal is that I
>> posted this patch upstream around a year ago, but Al didn't really like
>> it:
>> 
>>    https://patchwork.kernel.org/patch/9541829/
>> 
>> He wanted to add some iov_iter_for_each_page infrastructure and base
>> the new function on that. I had thought he had something queued up
>> along those lines at one point, but I don't see anything in mainline so
>> far.
>> 
>> There is a iov_iter_for_each_range, but it doesn't handle userland
>> iovecs and doesn't seem to care about alignment. I think we'll need
>> both here.
>> 
>> I'm not sure however that a per-page callback is really that helpful
>> for callers like ceph_direct_read_write. We'll just end up having to do
>> more  work in the fs to batch up the pages before we add the op to the
>> req.
>> 
>> iov_iter_get_pages_alloc is really sort of what we want in this case,
>> and we want that function to stitch together longer arrays when it can.
>> 
>> Al, do you have any thoughts on what we should do here?
> 
> I think this can be fixed entirely in ceph by walking away from page
> vectors (barely tested):
> 
>  https://github.com/ceph/ceph-client/commits/wip-splice

The patch looks good to me. Using bio_vecs makes the code clearer.

Regards
Yan, Zheng


> 
> If folks would like to see something like iov_iter_get_bvecs{,_alloc}()
> in iov_iter.c, I can work on generalizing my helper (i.e. add maxbvecs
> and maxsize parameters -- I didn't need maxbvecs and maxsize is just
> "make a copy of the iterator and truncate it").  Otherwise we will
> probably merge it into ceph, as this code has remained broken for way
> too long...
> 
> Attached is the main patch from wip-splice.
> 
> Thanks,
> 
>                Ilya
> <0003-ceph-fix-iov_iter-issues-in-ceph_direct_read_write.patch>

Jeff Layton April 2, 2018, 11:10 a.m. UTC | #2
On Sun, 2018-04-01 at 18:33 +0200, Ilya Dryomov wrote:
> On Fri, Mar 30, 2018 at 9:14 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > On Fri, 2018-03-30 at 11:24 +0100, Luis Henriques wrote:
> > > (CC'ing Jeff)
> > > 
> > > Ilya Dryomov <idryomov@gmail.com> writes:
> > > 
> > > > On Fri, Mar 30, 2018 at 9:26 AM, Yan, Zheng <zyan@redhat.com> wrote:
> > > > > 
> > > > > 
> > > > > > On 30 Mar 2018, at 15:20, Ilya Dryomov <idryomov@gmail.com> wrote:
> > > > > > 
> > > > > > On Fri, Mar 30, 2018 at 5:21 AM, Yan, Zheng <zyan@redhat.com> wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > > On 30 Mar 2018, at 00:49, Ilya Dryomov <idryomov@gmail.com> wrote:
> > > > > > > > 
> > > > > > > > On Fri, Mar 16, 2018 at 4:32 AM, Yan, Zheng <zyan@redhat.com> wrote:
> > > > > > > > 
> > > > > > > > Hi Zheng,
> > > > > > > > 
> > > > > > > > Can you explain how this fixes the invalid memory access reported in
> > > > > > > > "KASAN: use-after-free Read in set_page_dirty_lock"?  Did you come up
> > > > > > > > with a different reproducer?
> > > > > > > > 
> > > > > > > 
> > > > > > > I don't know why KASAN reports use-after-free (like false alarm) in
> > > > > > > this case. I compared the code with other filesystem, found recent commits
> > > > > > > make other filesystems only dirty ITER_IOVEC pages. The change should also
> > > > > > > make KASAN silent.
> > > > > > 
> > > > > > Did you manage to reproduce that KASAN report?
> > > > > > 
> > > > > 
> > > > > no
> > > > > 
> > > > > > If it is a false positive, the patch shouldn't have been marked for
> > > > > > stable.
> > > > > > 
> > > > > 
> > > > > I CCed stable because fuse did the same
> > > > > 
> > > > > commit 8fba54aebbdf1f999738121922e74bf796ad60ee
> > > > > Author: Miklos Szeredi <mszeredi@redhat.com>
> > > > > Date:   Wed Aug 24 18:17:04 2016 +0200
> > > > > 
> > > > >     fuse: direct-io: don't dirty ITER_BVEC pages
> > > > > 
> > > > >     When reading from a loop device backed by a fuse file it deadlocks on
> > > > >     lock_page().
> > > > > 
> > > > >     This is because the page is already locked by the read() operation done on
> > > > >     the loop device.  In this case we don't want to either lock the page or
> > > > >     dirty it.
> > > > > 
> > > > >     So do what fs/direct-io.c does: only dirty the page for ITER_IOVEC vectors.
> > > > 
> > > > OK, so it is a potential deadlock which has nothing to do with that
> > > > KASAN report.  I have rewritten the changelog to reflect that, please
> > > > take a look:
> > > > 
> > > >   https://github.com/ceph/ceph-client/commit/85784f9395987a422fa04263e7c0fb13da11eb5c
> > > 
> > > Just as a side note, it's still trivial to get a cephfs-related KASAN
> > > report by running a fuzzer (trinity in my case) against a mainline
> > > kernel 4.16.0-rc7 with this fix backported.
> > > 
> > 
> > I suspect trinity is clobbering the array
> > > As Jeff mentioned in a different thread, splice syscall is broken on
> > > cephfs and the fix for it is still tagged as "DO NOT MERGE" in the
> > > ceph-client testing branch.  I still think there's some bug in Jeff's
> > > fix as I still see a crash occasionally, but I haven't yet finished
> > > debugging it.  Unfortunately, Jeff's fix is probably not appropriate for
> > > stable kernels (but may I'm wrong).
> > > 
> > > Cheers,
> > 
> > (cc'ing Al)
> > 
> > Yeah, I should have revisited this a while ago. So the deal is that I
> > posted this patch upstream around a year ago, but Al didn't really like
> > it:
> > 
> >     https://patchwork.kernel.org/patch/9541829/
> > 
> > He wanted to add some iov_iter_for_each_page infrastructure and base
> > the new function on that. I had thought he had something queued up
> > along those lines at one point, but I don't see anything in mainline so
> > far.
> > 
> > There is a iov_iter_for_each_range, but it doesn't handle userland
> > iovecs and doesn't seem to care about alignment. I think we'll need
> > both here.
> > 
> > I'm not sure however that a per-page callback is really that helpful
> > for callers like ceph_direct_read_write. We'll just end up having to do
> > more  work in the fs to batch up the pages before we add the op to the
> > req.
> > 
> > iov_iter_get_pages_alloc is really sort of what we want in this case,
> > and we want that function to stitch together longer arrays when it can.
> > 
> > Al, do you have any thoughts on what we should do here?
> 
> I think this can be fixed entirely in ceph by walking away from page
> vectors (barely tested):
> 
>   https://github.com/ceph/ceph-client/commits/wip-splice
> 
> If folks would like to see something like iov_iter_get_bvecs{,_alloc}()
> in iov_iter.c, I can work on generalizing my helper (i.e. add maxbvecs
> and maxsize parameters -- I didn't need maxbvecs and maxsize is just
> "make a copy of the iterator and truncate it").  Otherwise we will
> probably merge it into ceph, as this code has remained broken for way
> too long...
> 

I think we should try to keep as much of the gory details in
lib/iov_iter.c as we can, so I'd support lifting most of
iter_get_bvecs_alloc into there.

> Attached is the main patch from wip-splice.
> 
> Thanks,
> 
>                 Ilya


+	while (iov_iter_count(i)) {
+		struct page *pages[16];
+		ssize_t bytes;
+		size_t start;
+		int idx = 0;
+
+		bytes = iov_iter_get_pages(i, pages, SIZE_MAX, 16, &start);
+		if (bytes < 0)
+			return total ?: bytes;
+
+		for ( ; bytes; idx++, bvec_idx++) {
+			struct bio_vec bv = {
+				.bv_page = pages[idx],
+				.bv_len = min_t(int, bytes, PAGE_SIZE - start),
+				.bv_offset = start,
+			};
+
+			bvecs[bvec_idx] = bv;
+			iov_iter_advance(i, bv.bv_len);
+			total += bv.bv_len;
+			bytes -= bv.bv_len;
+			start = 0;
+		}
+	}
+
+	return total;

It's possible that we'll end up with a set of contiguous bvecs that do
not begin or end on page boundaries. Will the lower-level ceph OSD
client code handle that correctly? If so, then this should be ok. If
not, then we need to be more careful when we're stitching together
iovecs like this and check the alignment as we go.
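
For what it's worth, if the OSD client did turn out to need page-aligned
segments, the stitching code could check for that as it goes.  A rough
sketch of such a check (the helper name is made up; this is not part of
wip-splice):

#include <linux/types.h>
#include <linux/bvec.h>
#include <linux/mm.h>

/*
 * Sketch: return true if only the first bvec may start mid-page and
 * only the last bvec may end mid-page, i.e. all interior segments are
 * whole, naturally aligned pages.
 */
static bool bvecs_page_aligned(const struct bio_vec *bvecs, int n)
{
        int i;

        for (i = 0; i < n; i++) {
                /* interior segments must start on a page boundary */
                if (i > 0 && bvecs[i].bv_offset != 0)
                        return false;
                /* all but the last segment must end on a page boundary */
                if (i < n - 1 &&
                    bvecs[i].bv_offset + bvecs[i].bv_len != PAGE_SIZE)
                        return false;
        }
        return true;
}
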
Ilya Dryomov April 2, 2018, 12:11 p.m. UTC | #3
On Mon, Apr 2, 2018 at 1:10 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Sun, 2018-04-01 at 18:33 +0200, Ilya Dryomov wrote:
>> On Fri, Mar 30, 2018 at 9:14 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> > On Fri, 2018-03-30 at 11:24 +0100, Luis Henriques wrote:
>> > > (CC'ing Jeff)
>> > >
>> > > Ilya Dryomov <idryomov@gmail.com> writes:
>> > >
>> > > > On Fri, Mar 30, 2018 at 9:26 AM, Yan, Zheng <zyan@redhat.com> wrote:
>> > > > >
>> > > > >
>> > > > > > On 30 Mar 2018, at 15:20, Ilya Dryomov <idryomov@gmail.com> wrote:
>> > > > > >
>> > > > > > On Fri, Mar 30, 2018 at 5:21 AM, Yan, Zheng <zyan@redhat.com> wrote:
>> > > > > > >
>> > > > > > >
>> > > > > > > > On 30 Mar 2018, at 00:49, Ilya Dryomov <idryomov@gmail.com> wrote:
>> > > > > > > >
>> > > > > > > > On Fri, Mar 16, 2018 at 4:32 AM, Yan, Zheng <zyan@redhat.com> wrote:
>> > > > > > > >
>> > > > > > > > Hi Zheng,
>> > > > > > > >
>> > > > > > > > Can you explain how this fixes the invalid memory access reported in
>> > > > > > > > "KASAN: use-after-free Read in set_page_dirty_lock"?  Did you come up
>> > > > > > > > with a different reproducer?
>> > > > > > > >
>> > > > > > >
>> > > > > > > I don't know why KASAN reports use-after-free (like false alarm) in
>> > > > > > > this case. I compared the code with other filesystem, found recent commits
>> > > > > > > make other filesystems only dirty ITER_IOVEC pages. The change should also
>> > > > > > > make KASAN silent.
>> > > > > >
>> > > > > > Did you manage to reproduce that KASAN report?
>> > > > > >
>> > > > >
>> > > > > no
>> > > > >
>> > > > > > If it is a false positive, the patch shouldn't have been marked for
>> > > > > > stable.
>> > > > > >
>> > > > >
>> > > > > I CCed stable because fuse did the same
>> > > > >
>> > > > > commit 8fba54aebbdf1f999738121922e74bf796ad60ee
>> > > > > Author: Miklos Szeredi <mszeredi@redhat.com>
>> > > > > Date:   Wed Aug 24 18:17:04 2016 +0200
>> > > > >
>> > > > >     fuse: direct-io: don't dirty ITER_BVEC pages
>> > > > >
>> > > > >     When reading from a loop device backed by a fuse file it deadlocks on
>> > > > >     lock_page().
>> > > > >
>> > > > >     This is because the page is already locked by the read() operation done on
>> > > > >     the loop device.  In this case we don't want to either lock the page or
>> > > > >     dirty it.
>> > > > >
>> > > > >     So do what fs/direct-io.c does: only dirty the page for ITER_IOVEC vectors.
>> > > >
>> > > > OK, so it is a potential deadlock which has nothing to do with that
>> > > > KASAN report.  I have rewritten the changelog to reflect that, please
>> > > > take a look:
>> > > >
>> > > >   https://github.com/ceph/ceph-client/commit/85784f9395987a422fa04263e7c0fb13da11eb5c
>> > >
>> > > Just as a side note, it's still trivial to get a cephfs-related KASAN
>> > > report by running a fuzzer (trinity in my case) against a mainline
>> > > kernel 4.16.0-rc7 with this fix backported.
>> > >
>> >
>> > I suspect trinity is clobbering the array
>> > > As Jeff mentioned in a different thread, splice syscall is broken on
>> > > cephfs and the fix for it is still tagged as "DO NOT MERGE" in the
>> > > ceph-client testing branch.  I still think there's some bug in Jeff's
>> > > fix as I still see a crash occasionally, but I haven't yet finished
>> > > debugging it.  Unfortunately, Jeff's fix is probably not appropriate for
>> > > stable kernels (but may I'm wrong).
>> > >
>> > > Cheers,
>> >
>> > (cc'ing Al)
>> >
>> > Yeah, I should have revisited this a while ago. So the deal is that I
>> > posted this patch upstream around a year ago, but Al didn't really like
>> > it:
>> >
>> >     https://patchwork.kernel.org/patch/9541829/
>> >
>> > He wanted to add some iov_iter_for_each_page infrastructure and base
>> > the new function on that. I had thought he had something queued up
>> > along those lines at one point, but I don't see anything in mainline so
>> > far.
>> >
>> > There is a iov_iter_for_each_range, but it doesn't handle userland
>> > iovecs and doesn't seem to care about alignment. I think we'll need
>> > both here.
>> >
>> > I'm not sure however that a per-page callback is really that helpful
>> > for callers like ceph_direct_read_write. We'll just end up having to do
>> > more  work in the fs to batch up the pages before we add the op to the
>> > req.
>> >
>> > iov_iter_get_pages_alloc is really sort of what we want in this case,
>> > and we want that function to stitch together longer arrays when it can.
>> >
>> > Al, do you have any thoughts on what we should do here?
>>
>> I think this can be fixed entirely in ceph by walking away from page
>> vectors (barely tested):
>>
>>   https://github.com/ceph/ceph-client/commits/wip-splice
>>
>> If folks would like to see something like iov_iter_get_bvecs{,_alloc}()
>> in iov_iter.c, I can work on generalizing my helper (i.e. add maxbvecs
>> and maxsize parameters -- I didn't need maxbvecs and maxsize is just
>> "make a copy of the iterator and truncate it").  Otherwise we will
>> probably merge it into ceph, as this code has remained broken for way
>> too long...
>>
>
> I think we should try to keep as much of the gory details in
> lib/iov_iter.c as we can, so I'd support lifting most of
> iter_get_bvecs_alloc into there.
>
>> Attached is the main patch from wip-splice.
>>
>> Thanks,
>>
>>                 Ilya
>
>
> +       while (iov_iter_count(i)) {
> +               struct page *pages[16];
> +               ssize_t bytes;
> +               size_t start;
> +               int idx = 0;
> +
> +               bytes = iov_iter_get_pages(i, pages, SIZE_MAX, 16, &start);
> +               if (bytes < 0)
> +                       return total ?: bytes;
> +
> +               for ( ; bytes; idx++, bvec_idx++) {
> +                       struct bio_vec bv = {
> +                               .bv_page = pages[idx],
> +                               .bv_len = min_t(int, bytes, PAGE_SIZE - start),
> +                               .bv_offset = start,
> +                       };
> +
> +                       bvecs[bvec_idx] = bv;
> +                       iov_iter_advance(i, bv.bv_len);
> +                       total += bv.bv_len;
> +                       bytes -= bv.bv_len;
> +                       start = 0;
> +               }
> +       }
> +
> +       return total;
>
> It's possible that we'll end up with a set of contiguous bvecs that do
> not begin or end on page boundaries. Will the lower-level ceph OSD
> client code handle that correctly? If so, then this should be ok. If
> not, then we need to be more careful when we're stitching together
> iovecs like this and check the alignment as we go.

Yes, it should be handled correctly, just like a bio with an arbitrary
set of bvecs in it.

Thanks,

                Ilya
Luis Henriques April 2, 2018, 2:36 p.m. UTC | #4
On Sun, Apr 01, 2018 at 06:33:31PM +0200, Ilya Dryomov wrote:
> On Fri, Mar 30, 2018 at 9:14 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > On Fri, 2018-03-30 at 11:24 +0100, Luis Henriques wrote:
> >> (CC'ing Jeff)
> >>
> >> Ilya Dryomov <idryomov@gmail.com> writes:
> >>
> >> > On Fri, Mar 30, 2018 at 9:26 AM, Yan, Zheng <zyan@redhat.com> wrote:
> >> > >
> >> > >
> >> > > > On 30 Mar 2018, at 15:20, Ilya Dryomov <idryomov@gmail.com> wrote:
> >> > > >
> >> > > > On Fri, Mar 30, 2018 at 5:21 AM, Yan, Zheng <zyan@redhat.com> wrote:
> >> > > > >
> >> > > > >
> >> > > > > > On 30 Mar 2018, at 00:49, Ilya Dryomov <idryomov@gmail.com> wrote:
> >> > > > > >
> >> > > > > > On Fri, Mar 16, 2018 at 4:32 AM, Yan, Zheng <zyan@redhat.com> wrote:
> >> > > > > >
> >> > > > > > Hi Zheng,
> >> > > > > >
> >> > > > > > Can you explain how this fixes the invalid memory access reported in
> >> > > > > > "KASAN: use-after-free Read in set_page_dirty_lock"?  Did you come up
> >> > > > > > with a different reproducer?
> >> > > > > >
> >> > > > >
> >> > > > > I don't know why KASAN reports use-after-free (like false alarm) in
> >> > > > > this case. I compared the code with other filesystem, found recent commits
> >> > > > > make other filesystems only dirty ITER_IOVEC pages. The change should also
> >> > > > > make KASAN silent.
> >> > > >
> >> > > > Did you manage to reproduce that KASAN report?
> >> > > >
> >> > >
> >> > > no
> >> > >
> >> > > > If it is a false positive, the patch shouldn't have been marked for
> >> > > > stable.
> >> > > >
> >> > >
> >> > > I CCed stable because fuse did the same
> >> > >
> >> > > commit 8fba54aebbdf1f999738121922e74bf796ad60ee
> >> > > Author: Miklos Szeredi <mszeredi@redhat.com>
> >> > > Date:   Wed Aug 24 18:17:04 2016 +0200
> >> > >
> >> > >     fuse: direct-io: don't dirty ITER_BVEC pages
> >> > >
> >> > >     When reading from a loop device backed by a fuse file it deadlocks on
> >> > >     lock_page().
> >> > >
> >> > >     This is because the page is already locked by the read() operation done on
> >> > >     the loop device.  In this case we don't want to either lock the page or
> >> > >     dirty it.
> >> > >
> >> > >     So do what fs/direct-io.c does: only dirty the page for ITER_IOVEC vectors.
> >> >
> >> > OK, so it is a potential deadlock which has nothing to do with that
> >> > KASAN report.  I have rewritten the changelog to reflect that, please
> >> > take a look:
> >> >
> >> >   https://github.com/ceph/ceph-client/commit/85784f9395987a422fa04263e7c0fb13da11eb5c
> >>
> >> Just as a side note, it's still trivial to get a cephfs-related KASAN
> >> report by running a fuzzer (trinity in my case) against a mainline
> >> kernel 4.16.0-rc7 with this fix backported.
> >>
> >
> > I suspect trinity is clobbering the array
> >> As Jeff mentioned in a different thread, splice syscall is broken on
> >> cephfs and the fix for it is still tagged as "DO NOT MERGE" in the
> >> ceph-client testing branch.  I still think there's some bug in Jeff's
> >> fix as I still see a crash occasionally, but I haven't yet finished
> >> debugging it.  Unfortunately, Jeff's fix is probably not appropriate for
> >> stable kernels (but may I'm wrong).
> >>
> >> Cheers,
> >
> > (cc'ing Al)
> >
> > Yeah, I should have revisited this a while ago. So the deal is that I
> > posted this patch upstream around a year ago, but Al didn't really like
> > it:
> >
> >     https://patchwork.kernel.org/patch/9541829/
> >
> > He wanted to add some iov_iter_for_each_page infrastructure and base
> > the new function on that. I had thought he had something queued up
> > along those lines at one point, but I don't see anything in mainline so
> > far.
> >
> > There is a iov_iter_for_each_range, but it doesn't handle userland
> > iovecs and doesn't seem to care about alignment. I think we'll need
> > both here.
> >
> > I'm not sure however that a per-page callback is really that helpful
> > for callers like ceph_direct_read_write. We'll just end up having to do
> > more  work in the fs to batch up the pages before we add the op to the
> > req.
> >
> > iov_iter_get_pages_alloc is really sort of what we want in this case,
> > and we want that function to stitch together longer arrays when it can.
> >
> > Al, do you have any thoughts on what we should do here?
> 
> I think this can be fixed entirely in ceph by walking away from page
> vectors (barely tested):
> 
>   https://github.com/ceph/ceph-client/commits/wip-splice
> 
> If folks would like to see something like iov_iter_get_bvecs{,_alloc}()
> in iov_iter.c, I can work on generalizing my helper (i.e. add maxbvecs
> and maxsize parameters -- I didn't need maxbvecs and maxsize is just
> "make a copy of the iterator and truncate it").  Otherwise we will
> probably merge it into ceph, as this code has remained broken for way
> too long...
> 
> Attached is the main patch from wip-splice.
> 
> Thanks,
> 
>                 Ilya

I haven't (yet) spent much time looking at the code, but a quick test
shows me the below failure in the sanity() tests from lib/iov_iter.c.

Cheers,
--
Luís


[  235.134588] idx = 7, offset = 4096
[  235.134917] curbuf = 8, nrbufs = 1, buffers = 16
[  235.135388] [          (null) 00000000a5376b82 0 4096]
[  235.135914] [          (null) 00000000fb52416a 0 4096]
[  235.136421] [          (null) 0000000009c7c0fc 0 4096]
[  235.136901] [          (null) 000000008b9911f0 0 4096]
[  235.137411] [          (null) 0000000042e4a3e2 0 4096]
[  235.137879] [          (null) 00000000110f4fc8 0 4096]
[  235.138358] [          (null) 00000000e16bdd11 0 4096]
[  235.138910] [          (null) 000000007d61a8f6 0 4096]
[  235.139399] [00000000af9d005b 000000003b50ae93 0 4096]
[  235.139879] [          (null) 00000000c55231a4 0 4096]
[  235.140378] [          (null) 00000000eba243d2 0 4096]
[  235.140851] [          (null) 00000000f5617952 0 4096]
[  235.141344] [          (null) 00000000fec2c691 0 4096]
[  235.141836] [          (null) 00000000ca09f9a4 0 4096]
[  235.142360] [          (null) 0000000087cfbb92 0 4096]
[  235.142886] [          (null) 000000009432c839 0 4096]
[  235.143441] WARNING: CPU: 1 PID: 261 at lib/iov_iter.c:350 sanity+0x162/0x210
[  235.144208] CPU: 1 PID: 261 Comm: trinity-c5 Not tainted 4.16.0-rc7+ #52
[  235.144942] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
[  235.146057] RIP: 0010:sanity+0x162/0x210
[  235.146509] RSP: 0018:ffff8800656970e8 EFLAGS: 00010246
[  235.147078] RAX: 0000000000000000 RBX: ffff88006852c868 RCX: ffffffff81589cb3
[  235.147841] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff88006852c8a8
[  235.148636] RBP: 0000000000000010 R08: fffffbfff059aa08 R09: ffff88006852c8e0
[  235.149470] R10: 0000000000000003 R11: fffffbfff059aa07 R12: ffff88006406dd38
[  235.150340] R13: ffff88006852c8a8 R14: 0000000000000000 R15: 0000000000001000
[  235.151184] FS:  00007fd75163eb40(0000) GS:ffff88006d700000(0000) knlGS:0000000000000000
[  235.152028] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  235.152562] CR2: 00007fd751501000 CR3: 000000006a940000 CR4: 00000000000006a0
[  235.153232] DR0: 00007fd7514fe000 DR1: 0000000000000000 DR2: 0000000000000000
[  235.153887] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000f0602
[  235.154644] Call Trace:
[  235.154914]  iov_iter_get_pages+0x2a3/0x680
[  235.155377]  ? iov_iter_gap_alignment+0x490/0x490
[  235.155882]  ? iov_iter_advance+0x360/0x780
[  235.156334]  ? generic_file_splice_read+0x21b/0x320
[  235.156993]  ? SyS_splice+0x914/0x9a0
[  235.157498]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  235.158177]  ? iov_iter_copy_from_user_atomic+0x6e0/0x6e0
[  235.158957]  ? kmemleak_disable+0x70/0x70
[  235.159456]  __iter_get_bvecs+0x117/0x290
[  235.159955]  ? ceph_fallocate+0xc10/0xc10
[  235.160455]  ? iov_iter_npages+0x309/0x540
[  235.160974]  ? __kmalloc+0xf9/0x1f0
[  235.161410]  ceph_direct_read_write+0x4e7/0x1380
[  235.161981]  ? ceph_aio_complete_req+0x750/0x750
[  235.162503]  ? ceph_write_iter+0x103b/0x1530
[  235.162986]  ? __rb_erase_color+0xd50/0xd50
[  235.163449]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  235.164043]  ? ceph_fscache_revalidate_cookie+0x80/0x1d0
[  235.164630]  ? ceph_fscache_unregister_fs+0x1f0/0x1f0
[  235.165201]  ? __ceph_mark_dirty_caps+0x212/0x610
[  235.165713]  ? ceph_get_caps+0x2bc/0x700
[  235.166147]  ? _raw_spin_unlock_bh+0xc0/0xc0
[  235.166681]  ? ceph_put_cap_refs+0x5a0/0x5a0
[  235.167182]  ? __delete_object+0xcd/0x100
[  235.167636]  ? put_object+0x40/0x40
[  235.168033]  ? up_read+0x20/0x20
[  235.168396]  ? ceph_write_iter+0x103b/0x1530
[  235.168865]  ? kmem_cache_free+0x80/0x1d0
[  235.169252]  ? ceph_write_iter+0x1047/0x1530
[  235.169692]  ? ceph_read_iter+0xc3e/0x10b0
[  235.170097]  ceph_read_iter+0xc3e/0x10b0
[  235.170519]  ? __tty_buffer_request_room+0xd0/0x320
[  235.171138]  ? ceph_write_iter+0x1530/0x1530
[  235.171667]  ? timerqueue_add+0xd2/0x100
[  235.172239]  ? enqueue_hrtimer+0xfb/0x1d0
[  235.172789]  ? hrtimer_reprogram+0x130/0x130
[  235.173361]  ? _raw_spin_unlock_bh+0xc0/0xc0
[  235.173913]  ? _raw_spin_unlock_bh+0xc0/0xc0
[  235.174464]  ? hrtimer_start_range_ns+0x32d/0x440
[  235.174979]  ? hrtimer_forward+0x120/0x120
[  235.175394]  ? hrtimer_try_to_cancel+0x83/0x250
[  235.176033]  ? hrtimer_active+0x240/0x240
[  235.176630]  ? __hrtimer_get_remaining+0xc6/0x100
[  235.177285]  ? hrtimer_start_range_ns+0x440/0x440
[  235.177980]  ? generic_file_splice_read+0x21b/0x320
[  235.178694]  ? ceph_write_iter+0x1530/0x1530
[  235.179325]  generic_file_splice_read+0x21b/0x320
[  235.180011]  ? splice_shrink_spd+0x40/0x40
[  235.180619]  ? schedule+0x50/0x280
[  235.181126]  ? rw_verify_area+0x1f/0x70
[  235.181641]  ? do_splice_to+0x6a/0xc0
[  235.182075]  SyS_splice+0x914/0x9a0
[  235.182405]  ? __task_pid_nr_ns+0x1a9/0x1c0
[  235.182825]  ? compat_SyS_vmsplice+0x100/0x100
[  235.183266]  ? do_syscall_64+0x8d/0x300
[  235.183701]  ? compat_SyS_vmsplice+0x100/0x100
[  235.184223]  do_syscall_64+0x164/0x300
[  235.184662]  ? syscall_return_slowpath+0x1c0/0x1c0
[  235.185213]  ? page_fault+0x25/0x50
[  235.185616]  ? syscall_return_slowpath+0x13c/0x1c0
[  235.186124]  ? prepare_exit_to_usermode+0xdb/0x140
[  235.186692]  ? syscall_trace_enter+0x320/0x320
[  235.187206]  ? __put_user_4+0x1c/0x30
[  235.187858]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  235.188726] RIP: 0033:0x7fd750f61229
[  235.189312] RSP: 002b:00007fff1ad745c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000113
[  235.190568] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fd750f61229
[  235.191687] RDX: 000000000000003d RSI: 0000000000000000 RDI: 0000000000000098
[  235.192598] RBP: 00007fd751614000 R08: 0000000000fffff7 R09: 0000000000000001
[  235.193528] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000113
[  235.194443] R13: 00007fd751614048 R14: 00007fd75163ead8 R15: 00007fd751614000
[  235.195364] Code: 4c 89 ff e8 f1 43 cf ff 8b 73 3c 44 89 e1 48 c7 c7 80 13 11 82 89 ea e8 87 8d b8 ff 4c 89 ef e8 d5 43 cf ff 8b 43 40 85 c0 75 13 <0f> 0b 48 83 c4 10 31 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 4c 8d 
[  235.197769] ---[ end trace 0228543143d962c4 ]---
Jeff Layton April 2, 2018, 4:27 p.m. UTC | #5
On Mon, 2018-04-02 at 15:36 +0100, Luis Henriques wrote:
> On Sun, Apr 01, 2018 at 06:33:31PM +0200, Ilya Dryomov wrote:
> > On Fri, Mar 30, 2018 at 9:14 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > On Fri, 2018-03-30 at 11:24 +0100, Luis Henriques wrote:
> > > > (CC'ing Jeff)
> > > > 
> > > > Ilya Dryomov <idryomov@gmail.com> writes:
> > > > 
> > > > > On Fri, Mar 30, 2018 at 9:26 AM, Yan, Zheng <zyan@redhat.com> wrote:
> > > > > > 
> > > > > > 
> > > > > > > On 30 Mar 2018, at 15:20, Ilya Dryomov <idryomov@gmail.com> wrote:
> > > > > > > 
> > > > > > > On Fri, Mar 30, 2018 at 5:21 AM, Yan, Zheng <zyan@redhat.com> wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > On 30 Mar 2018, at 00:49, Ilya Dryomov <idryomov@gmail.com> wrote:
> > > > > > > > > 
> > > > > > > > > On Fri, Mar 16, 2018 at 4:32 AM, Yan, Zheng <zyan@redhat.com> wrote:
> > > > > > > > > 
> > > > > > > > > Hi Zheng,
> > > > > > > > > 
> > > > > > > > > Can you explain how this fixes the invalid memory access reported in
> > > > > > > > > "KASAN: use-after-free Read in set_page_dirty_lock"?  Did you come up
> > > > > > > > > with a different reproducer?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > I don't know why KASAN reports use-after-free (like false alarm) in
> > > > > > > > this case. I compared the code with other filesystem, found recent commits
> > > > > > > > make other filesystems only dirty ITER_IOVEC pages. The change should also
> > > > > > > > make KASAN silent.
> > > > > > > 
> > > > > > > Did you manage to reproduce that KASAN report?
> > > > > > > 
> > > > > > 
> > > > > > no
> > > > > > 
> > > > > > > If it is a false positive, the patch shouldn't have been marked for
> > > > > > > stable.
> > > > > > > 
> > > > > > 
> > > > > > I CCed stable because fuse did the same
> > > > > > 
> > > > > > commit 8fba54aebbdf1f999738121922e74bf796ad60ee
> > > > > > Author: Miklos Szeredi <mszeredi@redhat.com>
> > > > > > Date:   Wed Aug 24 18:17:04 2016 +0200
> > > > > > 
> > > > > >     fuse: direct-io: don't dirty ITER_BVEC pages
> > > > > > 
> > > > > >     When reading from a loop device backed by a fuse file it deadlocks on
> > > > > >     lock_page().
> > > > > > 
> > > > > >     This is because the page is already locked by the read() operation done on
> > > > > >     the loop device.  In this case we don't want to either lock the page or
> > > > > >     dirty it.
> > > > > > 
> > > > > >     So do what fs/direct-io.c does: only dirty the page for ITER_IOVEC vectors.
> > > > > 
> > > > > OK, so it is a potential deadlock which has nothing to do with that
> > > > > KASAN report.  I have rewritten the changelog to reflect that, please
> > > > > take a look:
> > > > > 
> > > > >   https://github.com/ceph/ceph-client/commit/85784f9395987a422fa04263e7c0fb13da11eb5c
> > > > 
> > > > Just as a side note, it's still trivial to get a cephfs-related KASAN
> > > > report by running a fuzzer (trinity in my case) against a mainline
> > > > kernel 4.16.0-rc7 with this fix backported.
> > > > 
> > > 
> > > I suspect trinity is clobbering the array
> > > > As Jeff mentioned in a different thread, splice syscall is broken on
> > > > cephfs and the fix for it is still tagged as "DO NOT MERGE" in the
> > > > ceph-client testing branch.  I still think there's some bug in Jeff's
> > > > fix as I still see a crash occasionally, but I haven't yet finished
> > > > debugging it.  Unfortunately, Jeff's fix is probably not appropriate for
> > > > stable kernels (but may I'm wrong).
> > > > 
> > > > Cheers,
> > > 
> > > (cc'ing Al)
> > > 
> > > Yeah, I should have revisited this a while ago. So the deal is that I
> > > posted this patch upstream around a year ago, but Al didn't really like
> > > it:
> > > 
> > >     https://patchwork.kernel.org/patch/9541829/
> > > 
> > > He wanted to add some iov_iter_for_each_page infrastructure and base
> > > the new function on that. I had thought he had something queued up
> > > along those lines at one point, but I don't see anything in mainline so
> > > far.
> > > 
> > > There is a iov_iter_for_each_range, but it doesn't handle userland
> > > iovecs and doesn't seem to care about alignment. I think we'll need
> > > both here.
> > > 
> > > I'm not sure however that a per-page callback is really that helpful
> > > for callers like ceph_direct_read_write. We'll just end up having to do
> > > more  work in the fs to batch up the pages before we add the op to the
> > > req.
> > > 
> > > iov_iter_get_pages_alloc is really sort of what we want in this case,
> > > and we want that function to stitch together longer arrays when it can.
> > > 
> > > Al, do you have any thoughts on what we should do here?
> > 
> > I think this can be fixed entirely in ceph by walking away from page
> > vectors (barely tested):
> > 
> >   https://github.com/ceph/ceph-client/commits/wip-splice
> > 
> > If folks would like to see something like iov_iter_get_bvecs{,_alloc}()
> > in iov_iter.c, I can work on generalizing my helper (i.e. add maxbvecs
> > and maxsize parameters -- I didn't need maxbvecs and maxsize is just
> > "make a copy of the iterator and truncate it").  Otherwise we will
> > probably merge it into ceph, as this code has remained broken for way
> > too long...
> > 
> > Attached is the main patch from wip-splice.
> > 
> > Thanks,
> > 
> >                 Ilya
> 
> I haven't (yet) spent much time looking at the code, but a quick test
> shows me the below failure in the sanity() tests from lib/iov_iter.c.
> 
> Cheers,
> --
> Luís
> 
> 
> [  235.134588] idx = 7, offset = 4096
> [  235.134917] curbuf = 8, nrbufs = 1, buffers = 16
> [  235.135388] [          (null) 00000000a5376b82 0 4096]
> [  235.135914] [          (null) 00000000fb52416a 0 4096]
> [  235.136421] [          (null) 0000000009c7c0fc 0 4096]
> [  235.136901] [          (null) 000000008b9911f0 0 4096]
> [  235.137411] [          (null) 0000000042e4a3e2 0 4096]
> [  235.137879] [          (null) 00000000110f4fc8 0 4096]
> [  235.138358] [          (null) 00000000e16bdd11 0 4096]
> [  235.138910] [          (null) 000000007d61a8f6 0 4096]
> [  235.139399] [00000000af9d005b 000000003b50ae93 0 4096]
> [  235.139879] [          (null) 00000000c55231a4 0 4096]
> [  235.140378] [          (null) 00000000eba243d2 0 4096]
> [  235.140851] [          (null) 00000000f5617952 0 4096]
> [  235.141344] [          (null) 00000000fec2c691 0 4096]
> [  235.141836] [          (null) 00000000ca09f9a4 0 4096]
> [  235.142360] [          (null) 0000000087cfbb92 0 4096]
> [  235.142886] [          (null) 000000009432c839 0 4096]
> [  235.143441] WARNING: CPU: 1 PID: 261 at lib/iov_iter.c:350 sanity+0x162/0x210
> [  235.144208] CPU: 1 PID: 261 Comm: trinity-c5 Not tainted 4.16.0-rc7+ #52
> [  235.144942] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
> [  235.146057] RIP: 0010:sanity+0x162/0x210
> [  235.146509] RSP: 0018:ffff8800656970e8 EFLAGS: 00010246
> [  235.147078] RAX: 0000000000000000 RBX: ffff88006852c868 RCX: ffffffff81589cb3
> [  235.147841] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff88006852c8a8
> [  235.148636] RBP: 0000000000000010 R08: fffffbfff059aa08 R09: ffff88006852c8e0
> [  235.149470] R10: 0000000000000003 R11: fffffbfff059aa07 R12: ffff88006406dd38
> [  235.150340] R13: ffff88006852c8a8 R14: 0000000000000000 R15: 0000000000001000
> [  235.151184] FS:  00007fd75163eb40(0000) GS:ffff88006d700000(0000) knlGS:0000000000000000
> [  235.152028] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  235.152562] CR2: 00007fd751501000 CR3: 000000006a940000 CR4: 00000000000006a0
> [  235.153232] DR0: 00007fd7514fe000 DR1: 0000000000000000 DR2: 0000000000000000
> [  235.153887] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000f0602
> [  235.154644] Call Trace:
> [  235.154914]  iov_iter_get_pages+0x2a3/0x680
> [  235.155377]  ? iov_iter_gap_alignment+0x490/0x490
> [  235.155882]  ? iov_iter_advance+0x360/0x780
> [  235.156334]  ? generic_file_splice_read+0x21b/0x320
> [  235.156993]  ? SyS_splice+0x914/0x9a0
> [  235.157498]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [  235.158177]  ? iov_iter_copy_from_user_atomic+0x6e0/0x6e0
> [  235.158957]  ? kmemleak_disable+0x70/0x70
> [  235.159456]  __iter_get_bvecs+0x117/0x290
> [  235.159955]  ? ceph_fallocate+0xc10/0xc10
> [  235.160455]  ? iov_iter_npages+0x309/0x540
> [  235.160974]  ? __kmalloc+0xf9/0x1f0
> [  235.161410]  ceph_direct_read_write+0x4e7/0x1380
> [  235.161981]  ? ceph_aio_complete_req+0x750/0x750
> [  235.162503]  ? ceph_write_iter+0x103b/0x1530
> [  235.162986]  ? __rb_erase_color+0xd50/0xd50
> [  235.163449]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [  235.164043]  ? ceph_fscache_revalidate_cookie+0x80/0x1d0
> [  235.164630]  ? ceph_fscache_unregister_fs+0x1f0/0x1f0
> [  235.165201]  ? __ceph_mark_dirty_caps+0x212/0x610
> [  235.165713]  ? ceph_get_caps+0x2bc/0x700
> [  235.166147]  ? _raw_spin_unlock_bh+0xc0/0xc0
> [  235.166681]  ? ceph_put_cap_refs+0x5a0/0x5a0
> [  235.167182]  ? __delete_object+0xcd/0x100
> [  235.167636]  ? put_object+0x40/0x40
> [  235.168033]  ? up_read+0x20/0x20
> [  235.168396]  ? ceph_write_iter+0x103b/0x1530
> [  235.168865]  ? kmem_cache_free+0x80/0x1d0
> [  235.169252]  ? ceph_write_iter+0x1047/0x1530
> [  235.169692]  ? ceph_read_iter+0xc3e/0x10b0
> [  235.170097]  ceph_read_iter+0xc3e/0x10b0
> [  235.170519]  ? __tty_buffer_request_room+0xd0/0x320
> [  235.171138]  ? ceph_write_iter+0x1530/0x1530
> [  235.171667]  ? timerqueue_add+0xd2/0x100
> [  235.172239]  ? enqueue_hrtimer+0xfb/0x1d0
> [  235.172789]  ? hrtimer_reprogram+0x130/0x130
> [  235.173361]  ? _raw_spin_unlock_bh+0xc0/0xc0
> [  235.173913]  ? _raw_spin_unlock_bh+0xc0/0xc0
> [  235.174464]  ? hrtimer_start_range_ns+0x32d/0x440
> [  235.174979]  ? hrtimer_forward+0x120/0x120
> [  235.175394]  ? hrtimer_try_to_cancel+0x83/0x250
> [  235.176033]  ? hrtimer_active+0x240/0x240
> [  235.176630]  ? __hrtimer_get_remaining+0xc6/0x100
> [  235.177285]  ? hrtimer_start_range_ns+0x440/0x440
> [  235.177980]  ? generic_file_splice_read+0x21b/0x320
> [  235.178694]  ? ceph_write_iter+0x1530/0x1530
> [  235.179325]  generic_file_splice_read+0x21b/0x320
> [  235.180011]  ? splice_shrink_spd+0x40/0x40
> [  235.180619]  ? schedule+0x50/0x280
> [  235.181126]  ? rw_verify_area+0x1f/0x70
> [  235.181641]  ? do_splice_to+0x6a/0xc0
> [  235.182075]  SyS_splice+0x914/0x9a0
> [  235.182405]  ? __task_pid_nr_ns+0x1a9/0x1c0
> [  235.182825]  ? compat_SyS_vmsplice+0x100/0x100
> [  235.183266]  ? do_syscall_64+0x8d/0x300
> [  235.183701]  ? compat_SyS_vmsplice+0x100/0x100
> [  235.184223]  do_syscall_64+0x164/0x300
> [  235.184662]  ? syscall_return_slowpath+0x1c0/0x1c0
> [  235.185213]  ? page_fault+0x25/0x50
> [  235.185616]  ? syscall_return_slowpath+0x13c/0x1c0
> [  235.186124]  ? prepare_exit_to_usermode+0xdb/0x140
> [  235.186692]  ? syscall_trace_enter+0x320/0x320
> [  235.187206]  ? __put_user_4+0x1c/0x30
> [  235.187858]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [  235.188726] RIP: 0033:0x7fd750f61229
> [  235.189312] RSP: 002b:00007fff1ad745c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000113
> [  235.190568] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fd750f61229
> [  235.191687] RDX: 000000000000003d RSI: 0000000000000000 RDI: 0000000000000098
> [  235.192598] RBP: 00007fd751614000 R08: 0000000000fffff7 R09: 0000000000000001
> [  235.193528] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000113
> [  235.194443] R13: 00007fd751614048 R14: 00007fd75163ead8 R15: 00007fd751614000
> [  235.195364] Code: 4c 89 ff e8 f1 43 cf ff 8b 73 3c 44 89 e1 48 c7 c7 80 13 11 82 89 ea e8 87 8d b8 ff 4c 89 ef e8 d5 43 cf ff 8b 43 40 85 c0 75 13 <0f> 0b 48 83 c4 10 31 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 4c 8d 
> [  235.197769] ---[ end trace 0228543143d962c4 ]---

Interesting. It really looks like trinity is thrashing the pipe in such
a way to create races while we're trying to iterate over it? ceph is
just calling pipe_get_pages, but something is hosed down inside the pipe
tracking and it fails before we even get to doing anything with the
pipe.

I'm not that familiar with the pipe code, so I'm not sure what prevents
that info in the pipe inode from changing while we're trying to sanity
check it here.

Does this testcase work properly on other filesystems?
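
For reference, the path being exercised here is essentially a splice()
from a cephfs file into a pipe, which reaches ceph_read_iter() with a
pipe-backed iov_iter and, when the file is opened with O_DIRECT, falls
through to ceph_direct_read_write() as in the trace above.  A minimal
userspace sketch of that call (not the trinity testcase itself; the
mount path is made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int pipefd[2];
        int fd;
        ssize_t n;

        fd = open("/mnt/cephfs/testfile", O_RDONLY | O_DIRECT);
        if (fd < 0 || pipe(pipefd) < 0) {
                perror("setup");
                return 1;
        }

        /* move up to 64 KiB from the file straight into the pipe */
        n = splice(fd, NULL, pipefd[1], NULL, 65536, 0);
        if (n < 0)
                perror("splice");
        else
                printf("spliced %zd bytes\n", n);

        close(pipefd[0]);
        close(pipefd[1]);
        close(fd);
        return 0;
}
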
Luis Henriques April 3, 2018, 8:16 a.m. UTC | #6
On Mon, Apr 02, 2018 at 12:27:28PM -0400, Jeff Layton wrote:
> On Mon, 2018-04-02 at 15:36 +0100, Luis Henriques wrote:
> > On Sun, Apr 01, 2018 at 06:33:31PM +0200, Ilya Dryomov wrote:
> > > On Fri, Mar 30, 2018 at 9:14 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > > On Fri, 2018-03-30 at 11:24 +0100, Luis Henriques wrote:
> > > > > (CC'ing Jeff)
> > > > > 
> > > > > Ilya Dryomov <idryomov@gmail.com> writes:
> > > > > 
> > > > > > On Fri, Mar 30, 2018 at 9:26 AM, Yan, Zheng <zyan@redhat.com> wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > > On 30 Mar 2018, at 15:20, Ilya Dryomov <idryomov@gmail.com> wrote:
> > > > > > > > 
> > > > > > > > On Fri, Mar 30, 2018 at 5:21 AM, Yan, Zheng <zyan@redhat.com> wrote:
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > On 30 Mar 2018, at 00:49, Ilya Dryomov <idryomov@gmail.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > On Fri, Mar 16, 2018 at 4:32 AM, Yan, Zheng <zyan@redhat.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > Hi Zheng,
> > > > > > > > > > 
> > > > > > > > > > Can you explain how this fixes the invalid memory access reported in
> > > > > > > > > > "KASAN: use-after-free Read in set_page_dirty_lock"?  Did you come up
> > > > > > > > > > with a different reproducer?
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > I don't know why KASAN reports use-after-free (like false alarm) in
> > > > > > > > > this case. I compared the code with other filesystem, found recent commits
> > > > > > > > > make other filesystems only dirty ITER_IOVEC pages. The change should also
> > > > > > > > > make KASAN silent.
> > > > > > > > 
> > > > > > > > Did you manage to reproduce that KASAN report?
> > > > > > > > 
> > > > > > > 
> > > > > > > no
> > > > > > > 
> > > > > > > > If it is a false positive, the patch shouldn't have been marked for
> > > > > > > > stable.
> > > > > > > > 
> > > > > > > 
> > > > > > > I CCed stable because fuse did the same
> > > > > > > 
> > > > > > > commit 8fba54aebbdf1f999738121922e74bf796ad60ee
> > > > > > > Author: Miklos Szeredi <mszeredi@redhat.com>
> > > > > > > Date:   Wed Aug 24 18:17:04 2016 +0200
> > > > > > > 
> > > > > > >     fuse: direct-io: don't dirty ITER_BVEC pages
> > > > > > > 
> > > > > > >     When reading from a loop device backed by a fuse file it deadlocks on
> > > > > > >     lock_page().
> > > > > > > 
> > > > > > >     This is because the page is already locked by the read() operation done on
> > > > > > >     the loop device.  In this case we don't want to either lock the page or
> > > > > > >     dirty it.
> > > > > > > 
> > > > > > >     So do what fs/direct-io.c does: only dirty the page for ITER_IOVEC vectors.
> > > > > > 
> > > > > > OK, so it is a potential deadlock which has nothing to do with that
> > > > > > KASAN report.  I have rewritten the changelog to reflect that, please
> > > > > > take a look:
> > > > > > 
> > > > > >   https://github.com/ceph/ceph-client/commit/85784f9395987a422fa04263e7c0fb13da11eb5c
> > > > > 
> > > > > Just as a side note, it's still trivial to get a cephfs-related KASAN
> > > > > report by running a fuzzer (trinity in my case) against a mainline
> > > > > kernel 4.16.0-rc7 with this fix backported.
> > > > > 
> > > > 
> > > > I suspect trinity is clobbering the array
> > > > > As Jeff mentioned in a different thread, splice syscall is broken on
> > > > > cephfs and the fix for it is still tagged as "DO NOT MERGE" in the
> > > > > ceph-client testing branch.  I still think there's some bug in Jeff's
> > > > > fix as I still see a crash occasionally, but I haven't yet finished
> > > > > debugging it.  Unfortunately, Jeff's fix is probably not appropriate for
> > > > > stable kernels (but may I'm wrong).
> > > > > 
> > > > > Cheers,
> > > > 
> > > > (cc'ing Al)
> > > > 
> > > > Yeah, I should have revisited this a while ago. So the deal is that I
> > > > posted this patch upstream around a year ago, but Al didn't really like
> > > > it:
> > > > 
> > > >     https://patchwork.kernel.org/patch/9541829/
> > > > 
> > > > He wanted to add some iov_iter_for_each_page infrastructure and base
> > > > the new function on that. I had thought he had something queued up
> > > > along those lines at one point, but I don't see anything in mainline so
> > > > far.
> > > > 
> > > > There is a iov_iter_for_each_range, but it doesn't handle userland
> > > > iovecs and doesn't seem to care about alignment. I think we'll need
> > > > both here.
> > > > 
> > > > I'm not sure however that a per-page callback is really that helpful
> > > > for callers like ceph_direct_read_write. We'll just end up having to do
> > > > more  work in the fs to batch up the pages before we add the op to the
> > > > req.
> > > > 
> > > > iov_iter_get_pages_alloc is really sort of what we want in this case,
> > > > and we want that function to stitch together longer arrays when it can.
> > > > 
> > > > Al, do you have any thoughts on what we should do here?
> > > 
> > > I think this can be fixed entirely in ceph by walking away from page
> > > vectors (barely tested):
> > > 
> > >   https://github.com/ceph/ceph-client/commits/wip-splice
> > > 
> > > If folks would like to see something like iov_iter_get_bvecs{,_alloc}()
> > > in iov_iter.c, I can work on generalizing my helper (i.e. add maxbvecs
> > > and maxsize parameters -- I didn't need maxbvecs and maxsize is just
> > > "make a copy of the iterator and truncate it").  Otherwise we will
> > > probably merge it into ceph, as this code has remained broken for way
> > > too long...
> > > 
> > > Attached is the main patch from wip-splice.
> > > 
> > > Thanks,
> > > 
> > >                 Ilya
> > 
> > I haven't (yet) spent much time looking at the code, but a quick test
> > shows me the below failure in the sanity() tests from lib/iov_iter.c.
> > 
> > Cheers,
> > --
> > Luís
> > 
> > 
> > [  235.134588] idx = 7, offset = 4096
> > [  235.134917] curbuf = 8, nrbufs = 1, buffers = 16
> > [  235.135388] [          (null) 00000000a5376b82 0 4096]
> > [  235.135914] [          (null) 00000000fb52416a 0 4096]
> > [  235.136421] [          (null) 0000000009c7c0fc 0 4096]
> > [  235.136901] [          (null) 000000008b9911f0 0 4096]
> > [  235.137411] [          (null) 0000000042e4a3e2 0 4096]
> > [  235.137879] [          (null) 00000000110f4fc8 0 4096]
> > [  235.138358] [          (null) 00000000e16bdd11 0 4096]
> > [  235.138910] [          (null) 000000007d61a8f6 0 4096]
> > [  235.139399] [00000000af9d005b 000000003b50ae93 0 4096]
> > [  235.139879] [          (null) 00000000c55231a4 0 4096]
> > [  235.140378] [          (null) 00000000eba243d2 0 4096]
> > [  235.140851] [          (null) 00000000f5617952 0 4096]
> > [  235.141344] [          (null) 00000000fec2c691 0 4096]
> > [  235.141836] [          (null) 00000000ca09f9a4 0 4096]
> > [  235.142360] [          (null) 0000000087cfbb92 0 4096]
> > [  235.142886] [          (null) 000000009432c839 0 4096]
> > [  235.143441] WARNING: CPU: 1 PID: 261 at lib/iov_iter.c:350 sanity+0x162/0x210
> > [  235.144208] CPU: 1 PID: 261 Comm: trinity-c5 Not tainted 4.16.0-rc7+ #52
> > [  235.144942] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
> > [  235.146057] RIP: 0010:sanity+0x162/0x210
> > [  235.146509] RSP: 0018:ffff8800656970e8 EFLAGS: 00010246
> > [  235.147078] RAX: 0000000000000000 RBX: ffff88006852c868 RCX: ffffffff81589cb3
> > [  235.147841] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff88006852c8a8
> > [  235.148636] RBP: 0000000000000010 R08: fffffbfff059aa08 R09: ffff88006852c8e0
> > [  235.149470] R10: 0000000000000003 R11: fffffbfff059aa07 R12: ffff88006406dd38
> > [  235.150340] R13: ffff88006852c8a8 R14: 0000000000000000 R15: 0000000000001000
> > [  235.151184] FS:  00007fd75163eb40(0000) GS:ffff88006d700000(0000) knlGS:0000000000000000
> > [  235.152028] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  235.152562] CR2: 00007fd751501000 CR3: 000000006a940000 CR4: 00000000000006a0
> > [  235.153232] DR0: 00007fd7514fe000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  235.153887] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000f0602
> > [  235.154644] Call Trace:
> > [  235.154914]  iov_iter_get_pages+0x2a3/0x680
> > [  235.155377]  ? iov_iter_gap_alignment+0x490/0x490
> > [  235.155882]  ? iov_iter_advance+0x360/0x780
> > [  235.156334]  ? generic_file_splice_read+0x21b/0x320
> > [  235.156993]  ? SyS_splice+0x914/0x9a0
> > [  235.157498]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > [  235.158177]  ? iov_iter_copy_from_user_atomic+0x6e0/0x6e0
> > [  235.158957]  ? kmemleak_disable+0x70/0x70
> > [  235.159456]  __iter_get_bvecs+0x117/0x290
> > [  235.159955]  ? ceph_fallocate+0xc10/0xc10
> > [  235.160455]  ? iov_iter_npages+0x309/0x540
> > [  235.160974]  ? __kmalloc+0xf9/0x1f0
> > [  235.161410]  ceph_direct_read_write+0x4e7/0x1380
> > [  235.161981]  ? ceph_aio_complete_req+0x750/0x750
> > [  235.162503]  ? ceph_write_iter+0x103b/0x1530
> > [  235.162986]  ? __rb_erase_color+0xd50/0xd50
> > [  235.163449]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > [  235.164043]  ? ceph_fscache_revalidate_cookie+0x80/0x1d0
> > [  235.164630]  ? ceph_fscache_unregister_fs+0x1f0/0x1f0
> > [  235.165201]  ? __ceph_mark_dirty_caps+0x212/0x610
> > [  235.165713]  ? ceph_get_caps+0x2bc/0x700
> > [  235.166147]  ? _raw_spin_unlock_bh+0xc0/0xc0
> > [  235.166681]  ? ceph_put_cap_refs+0x5a0/0x5a0
> > [  235.167182]  ? __delete_object+0xcd/0x100
> > [  235.167636]  ? put_object+0x40/0x40
> > [  235.168033]  ? up_read+0x20/0x20
> > [  235.168396]  ? ceph_write_iter+0x103b/0x1530
> > [  235.168865]  ? kmem_cache_free+0x80/0x1d0
> > [  235.169252]  ? ceph_write_iter+0x1047/0x1530
> > [  235.169692]  ? ceph_read_iter+0xc3e/0x10b0
> > [  235.170097]  ceph_read_iter+0xc3e/0x10b0
> > [  235.170519]  ? __tty_buffer_request_room+0xd0/0x320
> > [  235.171138]  ? ceph_write_iter+0x1530/0x1530
> > [  235.171667]  ? timerqueue_add+0xd2/0x100
> > [  235.172239]  ? enqueue_hrtimer+0xfb/0x1d0
> > [  235.172789]  ? hrtimer_reprogram+0x130/0x130
> > [  235.173361]  ? _raw_spin_unlock_bh+0xc0/0xc0
> > [  235.173913]  ? _raw_spin_unlock_bh+0xc0/0xc0
> > [  235.174464]  ? hrtimer_start_range_ns+0x32d/0x440
> > [  235.174979]  ? hrtimer_forward+0x120/0x120
> > [  235.175394]  ? hrtimer_try_to_cancel+0x83/0x250
> > [  235.176033]  ? hrtimer_active+0x240/0x240
> > [  235.176630]  ? __hrtimer_get_remaining+0xc6/0x100
> > [  235.177285]  ? hrtimer_start_range_ns+0x440/0x440
> > [  235.177980]  ? generic_file_splice_read+0x21b/0x320
> > [  235.178694]  ? ceph_write_iter+0x1530/0x1530
> > [  235.179325]  generic_file_splice_read+0x21b/0x320
> > [  235.180011]  ? splice_shrink_spd+0x40/0x40
> > [  235.180619]  ? schedule+0x50/0x280
> > [  235.181126]  ? rw_verify_area+0x1f/0x70
> > [  235.181641]  ? do_splice_to+0x6a/0xc0
> > [  235.182075]  SyS_splice+0x914/0x9a0
> > [  235.182405]  ? __task_pid_nr_ns+0x1a9/0x1c0
> > [  235.182825]  ? compat_SyS_vmsplice+0x100/0x100
> > [  235.183266]  ? do_syscall_64+0x8d/0x300
> > [  235.183701]  ? compat_SyS_vmsplice+0x100/0x100
> > [  235.184223]  do_syscall_64+0x164/0x300
> > [  235.184662]  ? syscall_return_slowpath+0x1c0/0x1c0
> > [  235.185213]  ? page_fault+0x25/0x50
> > [  235.185616]  ? syscall_return_slowpath+0x13c/0x1c0
> > [  235.186124]  ? prepare_exit_to_usermode+0xdb/0x140
> > [  235.186692]  ? syscall_trace_enter+0x320/0x320
> > [  235.187206]  ? __put_user_4+0x1c/0x30
> > [  235.187858]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > [  235.188726] RIP: 0033:0x7fd750f61229
> > [  235.189312] RSP: 002b:00007fff1ad745c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000113
> > [  235.190568] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fd750f61229
> > [  235.191687] RDX: 000000000000003d RSI: 0000000000000000 RDI: 0000000000000098
> > [  235.192598] RBP: 00007fd751614000 R08: 0000000000fffff7 R09: 0000000000000001
> > [  235.193528] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000113
> > [  235.194443] R13: 00007fd751614048 R14: 00007fd75163ead8 R15: 00007fd751614000
> > [  235.195364] Code: 4c 89 ff e8 f1 43 cf ff 8b 73 3c 44 89 e1 48 c7 c7 80 13 11 82 89 ea e8 87 8d b8 ff 4c 89 ef e8 d5 43 cf ff 8b 43 40 85 c0 75 13 <0f> 0b 48 83 c4 10 31 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 4c 8d 
> > [  235.197769] ---[ end trace 0228543143d962c4 ]---
> 
> Interesting. It really looks like trinity is thrashing the pipe in such
> a way to create races while we're trying to iterate over it? ceph is
> just calling pipe_get_pages, but something is hosed down inside the pipe
> tracking and it fails before we even get to doing anything with the
> pipe.
> 
> I'm not that familiar with the pipe code, so I'm not sure what prevents
> that info in the pipe inode from changing while we're trying to sanity
> check it here.
> 
> Does this testcase work properly on other filesystems?

I can't reproduce it on btrfs or ext4, but it's possible that it's just
more difficult to trigger.


Cheers,
--
Luís
Jeff Layton April 3, 2018, 12:46 p.m. UTC | #7
On Tue, 2018-04-03 at 09:16 +0100, Luis Henriques wrote:
> On Mon, Apr 02, 2018 at 12:27:28PM -0400, Jeff Layton wrote:
> > On Mon, 2018-04-02 at 15:36 +0100, Luis Henriques wrote:
> > > On Sun, Apr 01, 2018 at 06:33:31PM +0200, Ilya Dryomov wrote:
> > > > On Fri, Mar 30, 2018 at 9:14 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > > > On Fri, 2018-03-30 at 11:24 +0100, Luis Henriques wrote:
> > > > > > (CC'ing Jeff)
> > > > > > 
> > > > > > Ilya Dryomov <idryomov@gmail.com> writes:
> > > > > > 
> > > > > > > On Fri, Mar 30, 2018 at 9:26 AM, Yan, Zheng <zyan@redhat.com> wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > On 30 Mar 2018, at 15:20, Ilya Dryomov <idryomov@gmail.com> wrote:
> > > > > > > > > 
> > > > > > > > > On Fri, Mar 30, 2018 at 5:21 AM, Yan, Zheng <zyan@redhat.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > On 30 Mar 2018, at 00:49, Ilya Dryomov <idryomov@gmail.com> wrote:
> > > > > > > > > > > 
> > > > > > > > > > > On Fri, Mar 16, 2018 at 4:32 AM, Yan, Zheng <zyan@redhat.com> wrote:
> > > > > > > > > > > 
> > > > > > > > > > > Hi Zheng,
> > > > > > > > > > > 
> > > > > > > > > > > Can you explain how this fixes the invalid memory access reported in
> > > > > > > > > > > "KASAN: use-after-free Read in set_page_dirty_lock"?  Did you come up
> > > > > > > > > > > with a different reproducer?
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > I don't know why KASAN reports use-after-free (like false alarm) in
> > > > > > > > > > this case. I compared the code with other filesystem, found recent commits
> > > > > > > > > > make other filesystems only dirty ITER_IOVEC pages. The change should also
> > > > > > > > > > make KASAN silent.
> > > > > > > > > 
> > > > > > > > > Did you manage to reproduce that KASAN report?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > no
> > > > > > > > 
> > > > > > > > > If it is a false positive, the patch shouldn't have been marked for
> > > > > > > > > stable.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > I CCed stable because fuse did the same
> > > > > > > > 
> > > > > > > > commit 8fba54aebbdf1f999738121922e74bf796ad60ee
> > > > > > > > Author: Miklos Szeredi <mszeredi@redhat.com>
> > > > > > > > Date:   Wed Aug 24 18:17:04 2016 +0200
> > > > > > > > 
> > > > > > > >     fuse: direct-io: don't dirty ITER_BVEC pages
> > > > > > > > 
> > > > > > > >     When reading from a loop device backed by a fuse file it deadlocks on
> > > > > > > >     lock_page().
> > > > > > > > 
> > > > > > > >     This is because the page is already locked by the read() operation done on
> > > > > > > >     the loop device.  In this case we don't want to either lock the page or
> > > > > > > >     dirty it.
> > > > > > > > 
> > > > > > > >     So do what fs/direct-io.c does: only dirty the page for ITER_IOVEC vectors.
> > > > > > > 
> > > > > > > OK, so it is a potential deadlock which has nothing to do with that
> > > > > > > KASAN report.  I have rewritten the changelog to reflect that, please
> > > > > > > take a look:
> > > > > > > 
> > > > > > >   https://github.com/ceph/ceph-client/commit/85784f9395987a422fa04263e7c0fb13da11eb5c
> > > > > > 
> > > > > > Just as a side note, it's still trivial to get a cephfs-related KASAN
> > > > > > report by running a fuzzer (trinity in my case) against a mainline
> > > > > > kernel 4.16.0-rc7 with this fix backported.
> > > > > > 
> > > > > 
> > > > > I suspect trinity is clobbering the array
> > > > > > As Jeff mentioned in a different thread, splice syscall is broken on
> > > > > > cephfs and the fix for it is still tagged as "DO NOT MERGE" in the
> > > > > > ceph-client testing branch.  I still think there's some bug in Jeff's
> > > > > > fix as I still see a crash occasionally, but I haven't yet finished
> > > > > > debugging it.  Unfortunately, Jeff's fix is probably not appropriate for
> > > > > > stable kernels (but may I'm wrong).
> > > > > > 
> > > > > > Cheers,
> > > > > 
> > > > > (cc'ing Al)
> > > > > 
> > > > > Yeah, I should have revisited this a while ago. So the deal is that I
> > > > > posted this patch upstream around a year ago, but Al didn't really like
> > > > > it:
> > > > > 
> > > > >     https://patchwork.kernel.org/patch/9541829/
> > > > > 
> > > > > He wanted to add some iov_iter_for_each_page infrastructure and base
> > > > > the new function on that. I had thought he had something queued up
> > > > > along those lines at one point, but I don't see anything in mainline so
> > > > > far.
> > > > > 
> > > > > There is a iov_iter_for_each_range, but it doesn't handle userland
> > > > > iovecs and doesn't seem to care about alignment. I think we'll need
> > > > > both here.
> > > > > 
> > > > > I'm not sure however that a per-page callback is really that helpful
> > > > > for callers like ceph_direct_read_write. We'll just end up having to do
> > > > > more  work in the fs to batch up the pages before we add the op to the
> > > > > req.
> > > > > 
> > > > > iov_iter_get_pages_alloc is really sort of what we want in this case,
> > > > > and we want that function to stitch together longer arrays when it can.
> > > > > 
> > > > > Al, do you have any thoughts on what we should do here?
> > > > 
> > > > I think this can be fixed entirely in ceph by walking away from page
> > > > vectors (barely tested):
> > > > 
> > > >   https://github.com/ceph/ceph-client/commits/wip-splice
> > > > 
> > > > If folks would like to see something like iov_iter_get_bvecs{,_alloc}()
> > > > in iov_iter.c, I can work on generalizing my helper (i.e. add maxbvecs
> > > > and maxsize parameters -- I didn't need maxbvecs and maxsize is just
> > > > "make a copy of the iterator and truncate it").  Otherwise we will
> > > > probably merge it into ceph, as this code has remained broken for way
> > > > too long...
> > > > 
> > > > Attached is the main patch from wip-splice.
> > > > 
> > > > Thanks,
> > > > 
> > > >                 Ilya
> > > 
> > > I haven't (yet) spent much time looking at the code, but a quick test
> > > shows me the below failure in the sanity() tests from lib/iov_iter.c.
> > > 
> > > Cheers,
> > > --
> > > Luís
> > > 
> > > 
> > > [  235.134588] idx = 7, offset = 4096
> > > [  235.134917] curbuf = 8, nrbufs = 1, buffers = 16
> > > [  235.135388] [          (null) 00000000a5376b82 0 4096]
> > > [  235.135914] [          (null) 00000000fb52416a 0 4096]
> > > [  235.136421] [          (null) 0000000009c7c0fc 0 4096]
> > > [  235.136901] [          (null) 000000008b9911f0 0 4096]
> > > [  235.137411] [          (null) 0000000042e4a3e2 0 4096]
> > > [  235.137879] [          (null) 00000000110f4fc8 0 4096]
> > > [  235.138358] [          (null) 00000000e16bdd11 0 4096]
> > > [  235.138910] [          (null) 000000007d61a8f6 0 4096]
> > > [  235.139399] [00000000af9d005b 000000003b50ae93 0 4096]
> > > [  235.139879] [          (null) 00000000c55231a4 0 4096]
> > > [  235.140378] [          (null) 00000000eba243d2 0 4096]
> > > [  235.140851] [          (null) 00000000f5617952 0 4096]
> > > [  235.141344] [          (null) 00000000fec2c691 0 4096]
> > > [  235.141836] [          (null) 00000000ca09f9a4 0 4096]
> > > [  235.142360] [          (null) 0000000087cfbb92 0 4096]
> > > [  235.142886] [          (null) 000000009432c839 0 4096]
> > > [  235.143441] WARNING: CPU: 1 PID: 261 at lib/iov_iter.c:350 sanity+0x162/0x210
> > > [  235.144208] CPU: 1 PID: 261 Comm: trinity-c5 Not tainted 4.16.0-rc7+ #52
> > > [  235.144942] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
> > > [  235.146057] RIP: 0010:sanity+0x162/0x210
> > > [  235.146509] RSP: 0018:ffff8800656970e8 EFLAGS: 00010246
> > > [  235.147078] RAX: 0000000000000000 RBX: ffff88006852c868 RCX: ffffffff81589cb3
> > > [  235.147841] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff88006852c8a8
> > > [  235.148636] RBP: 0000000000000010 R08: fffffbfff059aa08 R09: ffff88006852c8e0
> > > [  235.149470] R10: 0000000000000003 R11: fffffbfff059aa07 R12: ffff88006406dd38
> > > [  235.150340] R13: ffff88006852c8a8 R14: 0000000000000000 R15: 0000000000001000
> > > [  235.151184] FS:  00007fd75163eb40(0000) GS:ffff88006d700000(0000) knlGS:0000000000000000
> > > [  235.152028] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [  235.152562] CR2: 00007fd751501000 CR3: 000000006a940000 CR4: 00000000000006a0
> > > [  235.153232] DR0: 00007fd7514fe000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [  235.153887] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000f0602
> > > [  235.154644] Call Trace:
> > > [  235.154914]  iov_iter_get_pages+0x2a3/0x680
> > > [  235.155377]  ? iov_iter_gap_alignment+0x490/0x490
> > > [  235.155882]  ? iov_iter_advance+0x360/0x780
> > > [  235.156334]  ? generic_file_splice_read+0x21b/0x320
> > > [  235.156993]  ? SyS_splice+0x914/0x9a0
> > > [  235.157498]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > > [  235.158177]  ? iov_iter_copy_from_user_atomic+0x6e0/0x6e0
> > > [  235.158957]  ? kmemleak_disable+0x70/0x70
> > > [  235.159456]  __iter_get_bvecs+0x117/0x290
> > > [  235.159955]  ? ceph_fallocate+0xc10/0xc10
> > > [  235.160455]  ? iov_iter_npages+0x309/0x540
> > > [  235.160974]  ? __kmalloc+0xf9/0x1f0
> > > [  235.161410]  ceph_direct_read_write+0x4e7/0x1380
> > > [  235.161981]  ? ceph_aio_complete_req+0x750/0x750
> > > [  235.162503]  ? ceph_write_iter+0x103b/0x1530
> > > [  235.162986]  ? __rb_erase_color+0xd50/0xd50
> > > [  235.163449]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > > [  235.164043]  ? ceph_fscache_revalidate_cookie+0x80/0x1d0
> > > [  235.164630]  ? ceph_fscache_unregister_fs+0x1f0/0x1f0
> > > [  235.165201]  ? __ceph_mark_dirty_caps+0x212/0x610
> > > [  235.165713]  ? ceph_get_caps+0x2bc/0x700
> > > [  235.166147]  ? _raw_spin_unlock_bh+0xc0/0xc0
> > > [  235.166681]  ? ceph_put_cap_refs+0x5a0/0x5a0
> > > [  235.167182]  ? __delete_object+0xcd/0x100
> > > [  235.167636]  ? put_object+0x40/0x40
> > > [  235.168033]  ? up_read+0x20/0x20
> > > [  235.168396]  ? ceph_write_iter+0x103b/0x1530
> > > [  235.168865]  ? kmem_cache_free+0x80/0x1d0
> > > [  235.169252]  ? ceph_write_iter+0x1047/0x1530
> > > [  235.169692]  ? ceph_read_iter+0xc3e/0x10b0
> > > [  235.170097]  ceph_read_iter+0xc3e/0x10b0
> > > [  235.170519]  ? __tty_buffer_request_room+0xd0/0x320
> > > [  235.171138]  ? ceph_write_iter+0x1530/0x1530
> > > [  235.171667]  ? timerqueue_add+0xd2/0x100
> > > [  235.172239]  ? enqueue_hrtimer+0xfb/0x1d0
> > > [  235.172789]  ? hrtimer_reprogram+0x130/0x130
> > > [  235.173361]  ? _raw_spin_unlock_bh+0xc0/0xc0
> > > [  235.173913]  ? _raw_spin_unlock_bh+0xc0/0xc0
> > > [  235.174464]  ? hrtimer_start_range_ns+0x32d/0x440
> > > [  235.174979]  ? hrtimer_forward+0x120/0x120
> > > [  235.175394]  ? hrtimer_try_to_cancel+0x83/0x250
> > > [  235.176033]  ? hrtimer_active+0x240/0x240
> > > [  235.176630]  ? __hrtimer_get_remaining+0xc6/0x100
> > > [  235.177285]  ? hrtimer_start_range_ns+0x440/0x440
> > > [  235.177980]  ? generic_file_splice_read+0x21b/0x320
> > > [  235.178694]  ? ceph_write_iter+0x1530/0x1530
> > > [  235.179325]  generic_file_splice_read+0x21b/0x320
> > > [  235.180011]  ? splice_shrink_spd+0x40/0x40
> > > [  235.180619]  ? schedule+0x50/0x280
> > > [  235.181126]  ? rw_verify_area+0x1f/0x70
> > > [  235.181641]  ? do_splice_to+0x6a/0xc0
> > > [  235.182075]  SyS_splice+0x914/0x9a0
> > > [  235.182405]  ? __task_pid_nr_ns+0x1a9/0x1c0
> > > [  235.182825]  ? compat_SyS_vmsplice+0x100/0x100
> > > [  235.183266]  ? do_syscall_64+0x8d/0x300
> > > [  235.183701]  ? compat_SyS_vmsplice+0x100/0x100
> > > [  235.184223]  do_syscall_64+0x164/0x300
> > > [  235.184662]  ? syscall_return_slowpath+0x1c0/0x1c0
> > > [  235.185213]  ? page_fault+0x25/0x50
> > > [  235.185616]  ? syscall_return_slowpath+0x13c/0x1c0
> > > [  235.186124]  ? prepare_exit_to_usermode+0xdb/0x140
> > > [  235.186692]  ? syscall_trace_enter+0x320/0x320
> > > [  235.187206]  ? __put_user_4+0x1c/0x30
> > > [  235.187858]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > > [  235.188726] RIP: 0033:0x7fd750f61229
> > > [  235.189312] RSP: 002b:00007fff1ad745c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000113
> > > [  235.190568] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fd750f61229
> > > [  235.191687] RDX: 000000000000003d RSI: 0000000000000000 RDI: 0000000000000098
> > > [  235.192598] RBP: 00007fd751614000 R08: 0000000000fffff7 R09: 0000000000000001
> > > [  235.193528] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000113
> > > [  235.194443] R13: 00007fd751614048 R14: 00007fd75163ead8 R15: 00007fd751614000
> > > [  235.195364] Code: 4c 89 ff e8 f1 43 cf ff 8b 73 3c 44 89 e1 48 c7 c7 80 13 11 82 89 ea e8 87 8d b8 ff 4c 89 ef e8 d5 43 cf ff 8b 43 40 85 c0 75 13 <0f> 0b 48 83 c4 10 31 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 4c 8d 
> > > [  235.197769] ---[ end trace 0228543143d962c4 ]---
> > 
> > Interesting. It really looks like trinity is thrashing the pipe in such
> > a way to create races while we're trying to iterate over it? ceph is
> > just calling pipe_get_pages, but something is hosed down inside the pipe
> > tracking and it fails before we even get to doing anything with the
> > pipe.
> > 
> > I'm not that familiar with the pipe code, so I'm not sure what prevents
> > that info in the pipe inode from changing while we're trying to sanity
> > check it here.
> > 

Looks like the pipe_lock is what protects it, and the splice code takes
that at a fairly high level. I think the problem is elsewhere.
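
For reference, the file-to-pipe splice path looks roughly like the sketch
below (paraphrased from memory of 4.16 fs/splice.c, with a hypothetical helper
name and the wait-for-pipe-space step reduced to a comment -- not the exact
source).  The pipe stays locked across the whole ->splice_read() call, which
is why pipe->curbuf and pipe->nrbufs shouldn't be able to change underneath
generic_file_splice_read():

static long file_to_pipe_sketch(struct file *in, loff_t *ppos,
                                struct pipe_inode_info *opipe,
                                size_t len, unsigned int flags)
{
    long ret;

    pipe_lock(opipe);
    /* the real code also waits for room in the pipe here */
    ret = do_splice_to(in, ppos, opipe, len, flags);
    pipe_unlock(opipe);

    /* pipe readers are woken up after the lock is dropped */
    return ret;
}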

> > Does this testcase work properly on other filesystems?
> 
> I can't reproduce it on btrfs or ext4, but it's possible that it's just
> more difficult to trigger.
> 

I can reproduce what you're seeing with trinity. It fails almost
immediately for me on cephfs, but runs more or less indefinitely on xfs.
That suggests that there is something wrong with what's currently in
ceph-client.

What's there _does_ seem to fix xfstests generic/095, which implies that
there is some raciness involved.

One interesting bit:

    [  235.134588] idx = 7, offset = 4096
    [  235.134917] curbuf = 8, nrbufs = 1, buffers = 16

The offset always seems to be 4096 when I see this problem. I'm not sure
if that's significant yet.
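
For anyone following along, here is a rough paraphrase of what sanity() in
lib/iov_iter.c checks for ITER_PIPE iterators (reconstructed from memory of
the 4.16 code, with a hypothetical helper name -- not the actual source): the
iterator's (idx, iov_offset) pair has to agree with the pipe's own
curbuf/nrbufs bookkeeping.  In the splat above the only occupied buffer
appears to be index 8 (curbuf = 8, nrbufs = 1, and the ninth dumped entry is
the one with a non-NULL ops pointer), while the iterator still claims to sit
at offset 4096 in buffer 7 -- the pipe has moved on but the iterator hasn't,
which is the kind of divergence you'd expect if pipe state was advanced
behind the iterator's back.

#include <linux/uio.h>
#include <linux/pipe_fs_i.h>

/* Hypothetical helper; paraphrases the sanity() check, not the exact source. */
static bool pipe_iter_consistent(const struct iov_iter *i)
{
    struct pipe_inode_info *pipe = i->pipe;
    unsigned int mask = pipe->buffers - 1;    /* buffers is a power of two */
    int next = pipe->curbuf + pipe->nrbufs;

    if (i->iov_offset) {
        const struct pipe_buffer *p;

        if (!pipe->nrbufs)
            return false;                     /* pipe must be non-empty */
        if (i->idx != ((next - 1) & mask))
            return false;                     /* must be the last buffer ... */
        p = &pipe->bufs[i->idx];
        /* ... and the iterator must sit exactly at the end of its data */
        return p->offset + p->len == i->iov_offset;
    }

    /* iov_offset == 0: must point right past the last occupied buffer */
    return i->idx == (next & mask);
}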
Yan, Zheng April 3, 2018, 1:11 p.m. UTC | #8
iter_get_bvecs_alloc() copies the iov_iter, so iov_iter_advance() ends up being called on both the old and the new iov_iter.  I think pipe_advance() does the wrong thing in that case.
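
If that is what's happening, here is a minimal sketch of the hazard (a
hypothetical helper, simplified from the attached patch -- illustrative only):
the copy and the original both point at the same pipe_inode_info, so any pipe
work done through the copy leaves the original's idx/iov_offset stale.

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/uio.h>

static void pipe_copy_hazard(struct iov_iter *iter, size_t maxsize)
{
    struct iov_iter copy = *iter;    /* copy.pipe == iter->pipe (shared!) */
    struct page *pages[16];
    size_t start;
    ssize_t bytes;

    iov_iter_truncate(&copy, maxsize);

    /* For ITER_PIPE this pushes freshly allocated buffers into the
     * shared pipe ... */
    bytes = iov_iter_get_pages(&copy, pages, SIZE_MAX, 16, &start);
    if (bytes > 0)
        /* ... and this updates the copy's position and truncates the
         * shared pipe past the new position. */
        iov_iter_advance(&copy, bytes);

    /*
     * iter->idx / iter->iov_offset were never updated, so they no longer
     * match pipe->curbuf / pipe->nrbufs -- the next iov_iter_get_pages()
     * or iov_iter_advance() on the original iterator can trip sanity()
     * in lib/iov_iter.c.
     */
}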

Regards
Yan, Zheng


> On Apr 3, 2018, at 20:46, Jeff Layton <jlayton@redhat.com> wrote:
> 
> On Tue, 2018-04-03 at 09:16 +0100, Luis Henriques wrote:
>> On Mon, Apr 02, 2018 at 12:27:28PM -0400, Jeff Layton wrote:
>>> On Mon, 2018-04-02 at 15:36 +0100, Luis Henriques wrote:
>>>> On Sun, Apr 01, 2018 at 06:33:31PM +0200, Ilya Dryomov wrote:
>>>>> On Fri, Mar 30, 2018 at 9:14 PM, Jeff Layton <jlayton@redhat.com> wrote:
>>>>>> On Fri, 2018-03-30 at 11:24 +0100, Luis Henriques wrote:
>>>>>>> (CC'ing Jeff)
>>>>>>> 
>>>>>>> Ilya Dryomov <idryomov@gmail.com> writes:
>>>>>>> 
>>>>>>>> On Fri, Mar 30, 2018 at 9:26 AM, Yan, Zheng <zyan@redhat.com> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On 30 Mar 2018, at 15:20, Ilya Dryomov <idryomov@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> On Fri, Mar 30, 2018 at 5:21 AM, Yan, Zheng <zyan@redhat.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On 30 Mar 2018, at 00:49, Ilya Dryomov <idryomov@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Mar 16, 2018 at 4:32 AM, Yan, Zheng <zyan@redhat.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Zheng,
>>>>>>>>>>>> 
>>>>>>>>>>>> Can you explain how this fixes the invalid memory access reported in
>>>>>>>>>>>> "KASAN: use-after-free Read in set_page_dirty_lock"?  Did you come up
>>>>>>>>>>>> with a different reproducer?
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I don't know why KASAN reports use-after-free (like false alarm) in
>>>>>>>>>>> this case. I compared the code with other filesystem, found recent commits
>>>>>>>>>>> make other filesystems only dirty ITER_IOVEC pages. The change should also
>>>>>>>>>>> make KASAN silent.
>>>>>>>>>> 
>>>>>>>>>> Did you manage to reproduce that KASAN report?
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> no
>>>>>>>>> 
>>>>>>>>>> If it is a false positive, the patch shouldn't have been marked for
>>>>>>>>>> stable.
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I CCed stable because fuse did the same
>>>>>>>>> 
>>>>>>>>> commit 8fba54aebbdf1f999738121922e74bf796ad60ee
>>>>>>>>> Author: Miklos Szeredi <mszeredi@redhat.com>
>>>>>>>>> Date:   Wed Aug 24 18:17:04 2016 +0200
>>>>>>>>> 
>>>>>>>>>    fuse: direct-io: don't dirty ITER_BVEC pages
>>>>>>>>> 
>>>>>>>>>    When reading from a loop device backed by a fuse file it deadlocks on
>>>>>>>>>    lock_page().
>>>>>>>>> 
>>>>>>>>>    This is because the page is already locked by the read() operation done on
>>>>>>>>>    the loop device.  In this case we don't want to either lock the page or
>>>>>>>>>    dirty it.
>>>>>>>>> 
>>>>>>>>>    So do what fs/direct-io.c does: only dirty the page for ITER_IOVEC vectors.
>>>>>>>> 
>>>>>>>> OK, so it is a potential deadlock which has nothing to do with that
>>>>>>>> KASAN report.  I have rewritten the changelog to reflect that, please
>>>>>>>> take a look:
>>>>>>>> 
>>>>>>>>  https://github.com/ceph/ceph-client/commit/85784f9395987a422fa04263e7c0fb13da11eb5c
>>>>>>> 
>>>>>>> Just as a side note, it's still trivial to get a cephfs-related KASAN
>>>>>>> report by running a fuzzer (trinity in my case) against a mainline
>>>>>>> kernel 4.16.0-rc7 with this fix backported.
>>>>>>> 
>>>>>> 
>>>>>> I suspect trinity is clobbering the array
>>>>>>> As Jeff mentioned in a different thread, splice syscall is broken on
>>>>>>> cephfs and the fix for it is still tagged as "DO NOT MERGE" in the
>>>>>>> ceph-client testing branch.  I still think there's some bug in Jeff's
>>>>>>> fix as I still see a crash occasionally, but I haven't yet finished
>>>>>>> debugging it.  Unfortunately, Jeff's fix is probably not appropriate for
>>>>>>> stable kernels (but may I'm wrong).
>>>>>>> 
>>>>>>> Cheers,
>>>>>> 
>>>>>> (cc'ing Al)
>>>>>> 
>>>>>> Yeah, I should have revisited this a while ago. So the deal is that I
>>>>>> posted this patch upstream around a year ago, but Al didn't really like
>>>>>> it:
>>>>>> 
>>>>>>    https://patchwork.kernel.org/patch/9541829/
>>>>>> 
>>>>>> He wanted to add some iov_iter_for_each_page infrastructure and base
>>>>>> the new function on that. I had thought he had something queued up
>>>>>> along those lines at one point, but I don't see anything in mainline so
>>>>>> far.
>>>>>> 
>>>>>> There is a iov_iter_for_each_range, but it doesn't handle userland
>>>>>> iovecs and doesn't seem to care about alignment. I think we'll need
>>>>>> both here.
>>>>>> 
>>>>>> I'm not sure however that a per-page callback is really that helpful
>>>>>> for callers like ceph_direct_read_write. We'll just end up having to do
>>>>>> more  work in the fs to batch up the pages before we add the op to the
>>>>>> req.
>>>>>> 
>>>>>> iov_iter_get_pages_alloc is really sort of what we want in this case,
>>>>>> and we want that function to stitch together longer arrays when it can.
>>>>>> 
>>>>>> Al, do you have any thoughts on what we should do here?
>>>>> 
>>>>> I think this can be fixed entirely in ceph by walking away from page
>>>>> vectors (barely tested):
>>>>> 
>>>>>  https://github.com/ceph/ceph-client/commits/wip-splice
>>>>> 
>>>>> If folks would like to see something like iov_iter_get_bvecs{,_alloc}()
>>>>> in iov_iter.c, I can work on generalizing my helper (i.e. add maxbvecs
>>>>> and maxsize parameters -- I didn't need maxbvecs and maxsize is just
>>>>> "make a copy of the iterator and truncate it").  Otherwise we will
>>>>> probably merge it into ceph, as this code has remained broken for way
>>>>> too long...
>>>>> 
>>>>> Attached is the main patch from wip-splice.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>>                Ilya
>>>> 
>>>> I haven't (yet) spent much time looking at the code, but a quick test
>>>> shows me the below failure in the sanity() tests from lib/iov_iter.c.
>>>> 
>>>> Cheers,
>>>> --
>>>> Luís
>>>> 
>>>> 
>>>> [ splat snipped -- identical to the sanity() trace quoted earlier in the thread ]
>>> 
>>> Interesting. It really looks like trinity is thrashing the pipe in such
>>> a way to create races while we're trying to iterate over it? ceph is
>>> just calling pipe_get_pages, but something is hosed down inside the pipe
>>> tracking and it fails before we even get to doing anything with the
>>> pipe.
>>> 
>>> I'm not that familiar with the pipe code, so I'm not sure what prevents
>>> that info in the pipe inode from changing while we're trying to sanity
>>> check it here.
>>> 
> 
> Looks like the pipe_lock is what protects it, and the splice code takes
> that at a fairly high level. I think the problem is elsewhere.
> 
>>> Does this testcase work properly on other filesystems?
>> 
>> I can't reproduce it on btrfs or ext4, but it's possible that it's just
>> more difficult to trigger.
>> 
> 
> I can reproduce what you're seeing with trinity. It fails almost
> immediately for me on cephfs, but runs more or less indefinitely on xfs.
> That suggests that there is something wrong with what's currently in
> ceph-client.
> 
> What's there _does_ seem to fix xfstests generic/095, which implies that
> there is some raciness involved.
> 
> One interesting bit:
> 
>    [  235.134588] idx = 7, offset = 4096
>    [  235.134917] curbuf = 8, nrbufs = 1, buffers = 16
> 
> The offset always seems to be 4096 when I see this problem. I'm not sure
> if that's significant yet.
> 
> -- 
> Jeff Layton <jlayton@redhat.com>

Ilya Dryomov April 3, 2018, 1:42 p.m. UTC | #9
On Tue, Apr 3, 2018 at 3:11 PM, Yan, Zheng <zyan@redhat.com> wrote:
> iter_get_bvecs_alloc() copies the iov_iter, so iov_iter_advance() ends up being called on both the old and the new iov_iter.  I think pipe_advance() does the wrong thing in that case.

Hrm...  I guess duplicating ITER_PIPE iterators is a bad idea.  I was
trying to keep the structure of ceph_direct_read_write() the same, but
we can refactor it if necessary.  I'll take a look.
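
If it helps, a rough sketch of one possible direction (an illustrative
assumption, not necessarily what will end up being merged): let the helper
consume the caller's iterator directly, bounded by maxsize, so pipe state only
ever advances through a single iov_iter and nothing can go stale.

#include <linux/bvec.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/uio.h>

/* Hypothetical variant of __iter_get_bvecs() that never copies the iterator. */
static ssize_t iter_get_bvecs_sketch(struct iov_iter *iter, size_t maxsize,
                                     struct bio_vec *bvecs)
{
    size_t total = 0;
    int bvec_idx = 0;

    while (iov_iter_count(iter) && total < maxsize) {
        struct page *pages[16];
        ssize_t bytes;
        size_t start;
        int idx = 0;

        bytes = iov_iter_get_pages(iter, pages, maxsize - total, 16, &start);
        if (bytes < 0)
            return total ?: bytes;

        /* advance the one and only iterator -- no stale copies */
        iov_iter_advance(iter, bytes);

        for ( ; bytes; idx++, bvec_idx++) {
            int len = min_t(int, bytes, PAGE_SIZE - start);

            bvecs[bvec_idx] = (struct bio_vec) {
                .bv_page   = pages[idx],
                .bv_len    = len,
                .bv_offset = start,
            };
            total += len;
            bytes -= len;
            start = 0;
        }
    }

    /* a real version would also hand bvec_idx back to the caller */
    return total;
}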

Thanks,

                Ilya

Patch

From b3f10930acba4bd9a805921dcb98070475a05261 Mon Sep 17 00:00:00 2001
From: Ilya Dryomov <idryomov@gmail.com>
Date: Sun, 1 Apr 2018 17:20:31 +0200
Subject: [PATCH 3/3] ceph: fix iov_iter issues in ceph_direct_read_write()

dio_get_pagev_size() and dio_get_pages_alloc() introduced in commit
b5b98989dc7e ("ceph: combine as many iovec as possile into one OSD
request") assume that the passed iov_iter is ITER_IOVEC.  This isn't
the case with splice where it ends up poking into the guts of ITER_BVEC
or ITER_PIPE iterators, causing lockups and crashes.

Rather than trying to figure out gap alignment and stuff pages into
a page vector, add a helper for going from iov_iter to a bio_vec array
and make use of the new CEPH_OSD_DATA_TYPE_BVECS code.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
---
 fs/ceph/file.c | 171 ++++++++++++++++++++++++++++++---------------------------
 1 file changed, 91 insertions(+), 80 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 4a92acba1e9c..66f5e13a0804 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -69,70 +69,76 @@  static __le32 ceph_flags_sys2wire(u32 flags)
  * need to wait for MDS acknowledgement.
  */
 
-/*
- * Calculate the length sum of direct io vectors that can
- * be combined into one page vector.
- */
-static size_t dio_get_pagev_size(const struct iov_iter *it)
+static ssize_t __iter_get_bvecs(struct iov_iter *i, struct bio_vec *bvecs)
 {
-    const struct iovec *iov = it->iov;
-    const struct iovec *iovend = iov + it->nr_segs;
-    size_t size;
-
-    size = iov->iov_len - it->iov_offset;
-    /*
-     * An iov can be page vectored when both the current tail
-     * and the next base are page aligned.
-     */
-    while (PAGE_ALIGNED((iov->iov_base + iov->iov_len)) &&
-           (++iov < iovend && PAGE_ALIGNED((iov->iov_base)))) {
-        size += iov->iov_len;
-    }
-    dout("dio_get_pagevlen len = %zu\n", size);
-    return size;
+	size_t total = 0;
+	int bvec_idx = 0;
+
+	while (iov_iter_count(i)) {
+		struct page *pages[16];
+		ssize_t bytes;
+		size_t start;
+		int idx = 0;
+
+		bytes = iov_iter_get_pages(i, pages, SIZE_MAX, 16, &start);
+		if (bytes < 0)
+			return total ?: bytes;
+
+		for ( ; bytes; idx++, bvec_idx++) {
+			struct bio_vec bv = {
+				.bv_page = pages[idx],
+				.bv_len = min_t(int, bytes, PAGE_SIZE - start),
+				.bv_offset = start,
+			};
+
+			bvecs[bvec_idx] = bv;
+			iov_iter_advance(i, bv.bv_len);
+			total += bv.bv_len;
+			bytes -= bv.bv_len;
+			start = 0;
+		}
+	}
+
+	return total;
 }
 
-/*
- * Allocate a page vector based on (@it, @nbytes).
- * The return value is the tuple describing a page vector,
- * that is (@pages, @page_align, @num_pages).
- */
-static struct page **
-dio_get_pages_alloc(const struct iov_iter *it, size_t nbytes,
-		    size_t *page_align, int *num_pages)
+static ssize_t iter_get_bvecs_alloc(struct iov_iter *i, size_t maxsize,
+				    struct bio_vec **bvecs, int *num_bvecs)
 {
-	struct iov_iter tmp_it = *it;
-	size_t align;
-	struct page **pages;
-	int ret = 0, idx, npages;
+	struct iov_iter ii = *i;
+	struct bio_vec *bv;
+	ssize_t bytes;
+	int npages;
+
+	iov_iter_truncate(&ii, maxsize);
+	npages = iov_iter_npages(&ii, INT_MAX);
+	bv = kcalloc(npages, sizeof(*bv), GFP_KERNEL);
+	if (!bv)
+		return -ENOMEM;
 
-	align = (unsigned long)(it->iov->iov_base + it->iov_offset) &
-		(PAGE_SIZE - 1);
-	npages = calc_pages_for(align, nbytes);
-	pages = kvmalloc(sizeof(*pages) * npages, GFP_KERNEL);
-	if (!pages)
-		return ERR_PTR(-ENOMEM);
+	bytes = __iter_get_bvecs(&ii, bv);
+	if (bytes < 0) {
+		kfree(bv);
+		return bytes;
+	}
 
-	for (idx = 0; idx < npages; ) {
-		size_t start;
-		ret = iov_iter_get_pages(&tmp_it, pages + idx, nbytes,
-					 npages - idx, &start);
-		if (ret < 0)
-			goto fail;
+	*bvecs = bv;
+	*num_bvecs = npages;
+	return bytes;
+}
 
-		iov_iter_advance(&tmp_it, ret);
-		nbytes -= ret;
-		idx += (ret + start + PAGE_SIZE - 1) / PAGE_SIZE;
-	}
+static void put_bvecs(struct bio_vec *bvecs, int num_bvecs, bool should_dirty)
+{
+	int i;
 
-	BUG_ON(nbytes != 0);
-	*num_pages = npages;
-	*page_align = align;
-	dout("dio_get_pages_alloc: got %d pages align %zu\n", npages, align);
-	return pages;
-fail:
-	ceph_put_page_vector(pages, idx, false);
-	return ERR_PTR(ret);
+	for (i = 0; i < num_bvecs; i++) {
+		if (bvecs[i].bv_page) {
+			if (should_dirty)
+				set_page_dirty_lock(bvecs[i].bv_page);
+			put_page(bvecs[i].bv_page);
+		}
+	}
+	kfree(bvecs);
 }
 
 /*
@@ -744,11 +750,12 @@  static void ceph_aio_complete_req(struct ceph_osd_request *req)
 	struct inode *inode = req->r_inode;
 	struct ceph_aio_request *aio_req = req->r_priv;
 	struct ceph_osd_data *osd_data = osd_req_op_extent_osd_data(req, 0);
-	int num_pages = calc_pages_for((u64)osd_data->alignment,
-				       osd_data->length);
 
-	dout("ceph_aio_complete_req %p rc %d bytes %llu\n",
-	     inode, rc, osd_data->length);
+	BUG_ON(osd_data->type != CEPH_OSD_DATA_TYPE_BVECS);
+	BUG_ON(!osd_data->num_bvecs);
+
+	dout("ceph_aio_complete_req %p rc %d bytes %u\n",
+	     inode, rc, osd_data->bvec_pos.iter.bi_size);
 
 	if (rc == -EOLDSNAPC) {
 		struct ceph_aio_work *aio_work;
@@ -766,9 +773,10 @@  static void ceph_aio_complete_req(struct ceph_osd_request *req)
 	} else if (!aio_req->write) {
 		if (rc == -ENOENT)
 			rc = 0;
-		if (rc >= 0 && osd_data->length > rc) {
-			int zoff = osd_data->alignment + rc;
-			int zlen = osd_data->length - rc;
+		if (rc >= 0 && osd_data->bvec_pos.iter.bi_size > rc) {
+			struct iov_iter i;
+			int zlen = osd_data->bvec_pos.iter.bi_size - rc;
+
 			/*
 			 * If read is satisfied by single OSD request,
 			 * it can pass EOF. Otherwise read is within
@@ -783,13 +791,16 @@  static void ceph_aio_complete_req(struct ceph_osd_request *req)
 				aio_req->total_len = rc + zlen;
 			}
 
-			if (zlen > 0)
-				ceph_zero_page_vector_range(zoff, zlen,
-							    osd_data->pages);
+			iov_iter_bvec(&i, ITER_BVEC, osd_data->bvec_pos.bvecs,
+				      osd_data->num_bvecs,
+				      osd_data->bvec_pos.iter.bi_size);
+			iov_iter_advance(&i, rc);
+			iov_iter_zero(zlen, &i);
 		}
 	}
 
-	ceph_put_page_vector(osd_data->pages, num_pages, aio_req->should_dirty);
+	put_bvecs(osd_data->bvec_pos.bvecs, osd_data->num_bvecs,
+		  aio_req->should_dirty);
 	ceph_osdc_put_request(req);
 
 	if (rc < 0)
@@ -877,7 +888,7 @@  ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 	struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
 	struct ceph_vino vino;
 	struct ceph_osd_request *req;
-	struct page **pages;
+	struct bio_vec *bvecs;
 	struct ceph_aio_request *aio_req = NULL;
 	int num_pages = 0;
 	int flags;
@@ -912,8 +923,7 @@  ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 	}
 
 	while (iov_iter_count(iter) > 0) {
-		u64 size = dio_get_pagev_size(iter);
-		size_t start = 0;
+		u64 size = iov_iter_count(iter);
 		ssize_t len;
 
 		vino = ceph_vino(inode);
@@ -936,11 +946,10 @@  ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 		else
 			size = min_t(u64, size, fsc->mount_options->rsize);
 
-		len = size;
-		pages = dio_get_pages_alloc(iter, len, &start, &num_pages);
-		if (IS_ERR(pages)) {
+		len = iter_get_bvecs_alloc(iter, size, &bvecs, &num_pages);
+		if (len < 0) {
 			ceph_osdc_put_request(req);
-			ret = PTR_ERR(pages);
+			ret = len;
 			break;
 		}
 
@@ -975,8 +984,7 @@  ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 			req->r_mtime = mtime;
 		}
 
-		osd_req_op_extent_osd_data_pages(req, 0, pages, len, start,
-						 false, false);
+		osd_req_op_extent_osd_data_bvecs(req, 0, bvecs, num_pages, len);
 
 		if (aio_req) {
 			aio_req->total_len += len;
@@ -1002,18 +1010,21 @@  ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 			if (ret == -ENOENT)
 				ret = 0;
 			if (ret >= 0 && ret < len && pos + ret < size) {
+				struct iov_iter i;
 				int zlen = min_t(size_t, len - ret,
 						 size - pos - ret);
-				ceph_zero_page_vector_range(start + ret, zlen,
-							    pages);
+
+				iov_iter_bvec(&i, ITER_BVEC, bvecs, num_pages,
+					      len);
+				iov_iter_advance(&i, ret);
+				iov_iter_zero(zlen, &i);
 				ret += zlen;
 			}
 			if (ret >= 0)
 				len = ret;
 		}
 
-		ceph_put_page_vector(pages, num_pages, should_dirty);
-
+		put_bvecs(bvecs, num_pages, should_dirty);
 		ceph_osdc_put_request(req);
 		if (ret < 0)
 			break;
-- 
2.4.3