Message ID | 20150701062752.GC17109@ZenIV.linux.org.uk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 07/01/2015 09:27 AM, Al Viro wrote: > On Mon, Jun 22, 2015 at 03:02:11PM +0300, Andrey Ryabinin wrote: >> On 06/22/2015 12:12 AM, Al Viro wrote: >>> On Thu, Apr 23, 2015 at 01:16:15PM +0300, Andrey Ryabinin wrote: >>>> This change caused following: >>> >>>> This could happen when p9pdu_readf() changes 'count' to some value > iov_iter_count(from): >>>> >>>> p9_client_write(): >>>> <...> >>>> int count = iov_iter_count(from); >>>> <...> >>>> *err = p9pdu_readf(req->rc, clnt->proto_version, "d", &count); >>>> <...> >>>> iov_iter_advance(from, count); >>> >>> *blink* >>> >>> That's a bug, all right, but I would love to see how you trigger it. >>> It would require server to respond to "write that many bytes" with "OK, >>> <greater number> bytes written". We certainly need to cope with that >>> (we can't trust the server to be sane), but if that's what is going on, >>> you've got a server bug as well. >>> >>> Could you check if the patch below triggers WARN_ON() in it on your >>> reproducer? p9_client_read() has a similar issue as well... >>> >> >> I've tried something like your patch before to check the read side >> and I haven't seen anything before and don't see it right now. >> Though, this doesn't mean that there is no problem with read. >> I mean that trinity hits this on write and may just not hit this on read. > > "This" being the WARN_ON() in that patch? Yes. > Could you please run the same > test with the following delta and post its printks? # dmesg | grep fucked [ 114.732166] fucked: sent 2037, server says it got 2047 (err = 0) [ 124.937105] fucked: sent 27, server says it got 4096 (err = 0) [ 154.075400] fucked: sent 19, server says it got 4096 (err = 0) > It's one thing if > you are hitting a buggy server, it gets confused and tells you it has > written more bytes than you told it to write. Quite a different story > in case if we are miscalculating the size we are putting into RWRITE > packet and/or advancing the iterator when we shouldn't... > > What server are you using, BTW? And which transport (virtio or network - > IOW, is it zero-copy path or not)? qemu v2.2.1, virtio transport. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jul 01, 2015 at 10:50:59AM +0300, Andrey Ryabinin wrote: > # dmesg | grep fucked > > [ 114.732166] fucked: sent 2037, server says it got 2047 (err = 0) > [ 124.937105] fucked: sent 27, server says it got 4096 (err = 0) > [ 154.075400] fucked: sent 19, server says it got 4096 (err = 0) > > > It's one thing if > > you are hitting a buggy server, it gets confused and tells you it has > > written more bytes than you told it to write. Quite a different story > > in case if we are miscalculating the size we are putting into RWRITE TWRITE, sorry about the braino. > > packet and/or advancing the iterator when we shouldn't... > > > > What server are you using, BTW? And which transport (virtio or network - > > IOW, is it zero-copy path or not)? > > qemu v2.2.1, virtio transport. Hmm... The first one should've hit the zero-copy path then, the last two are too short for that... *grumble* I wonder if these short writes are getting matched with response from wrong request - that would explain full-page responses... Could you check if 3.19 was getting anything similar? I.e. in p9_client_write() there add if (count > rsize) printk(KERN_ERR "bogus RWRITE: %d -> %d\n", rsize, count); just before p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count); and see if that triggers... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/01/2015 11:27 AM, Al Viro wrote: > > Could you check if 3.19 was getting anything similar? I.e. in > p9_client_write() there add > if (count > rsize) > printk(KERN_ERR "bogus RWRITE: %d -> %d\n", rsize, count); > just before > p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count); > and see if that triggers... > Yeah, the same thing: [ 125.962374] bogus RWRITE: 27 -> 4096 [ 207.587632] bogus RWRITE: 27 -> 4096 [ 215.055627] bogus RWRITE: 27 -> 4096 [ 235.583138] bogus RWRITE: 27 -> 4096 [ 245.749174] bogus RWRITE: 27 -> 4096 [ 246.759270] bogus RWRITE: 27 -> 4096 [ 248.020787] bogus RWRITE: 27 -> 4096 -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jul 01, 2015 at 11:41:04AM +0300, Andrey Ryabinin wrote: > On 07/01/2015 11:27 AM, Al Viro wrote: > > > > Could you check if 3.19 was getting anything similar? I.e. in > > p9_client_write() there add > > if (count > rsize) > > printk(KERN_ERR "bogus RWRITE: %d -> %d\n", rsize, count); > > just before > > p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count); > > and see if that triggers... > > > > Yeah, the same thing: > [ 125.962374] bogus RWRITE: 27 -> 4096 > [ 207.587632] bogus RWRITE: 27 -> 4096 > [ 215.055627] bogus RWRITE: 27 -> 4096 > [ 235.583138] bogus RWRITE: 27 -> 4096 > [ 245.749174] bogus RWRITE: 27 -> 4096 > [ 246.759270] bogus RWRITE: 27 -> 4096 > [ 248.020787] bogus RWRITE: 27 -> 4096 Hrm... Could you add (int)req->rc->id, (int)req->rc->tag and (int)req->tc->tag to that printk (on either kernel, the problem's apparently not new)? The question is whether we are mismatching replies, sending bogus requests or if it's really the server sending bogus replies. Which qemu version are you using, BTW? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/01/2015 11:55 AM, Al Viro wrote: > On Wed, Jul 01, 2015 at 11:41:04AM +0300, Andrey Ryabinin wrote: >> On 07/01/2015 11:27 AM, Al Viro wrote: >>> >>> Could you check if 3.19 was getting anything similar? I.e. in >>> p9_client_write() there add >>> if (count > rsize) >>> printk(KERN_ERR "bogus RWRITE: %d -> %d\n", rsize, count); >>> just before >>> p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count); >>> and see if that triggers... >>> >> >> Yeah, the same thing: >> [ 125.962374] bogus RWRITE: 27 -> 4096 >> [ 207.587632] bogus RWRITE: 27 -> 4096 >> [ 215.055627] bogus RWRITE: 27 -> 4096 >> [ 235.583138] bogus RWRITE: 27 -> 4096 >> [ 245.749174] bogus RWRITE: 27 -> 4096 >> [ 246.759270] bogus RWRITE: 27 -> 4096 >> [ 248.020787] bogus RWRITE: 27 -> 4096 > > Hrm... Could you add (int)req->rc->id, (int)req->rc->tag and (int)req->tc->tag > to that printk (on either kernel, the problem's apparently not new)? > I've attached gdb instead. So, after message "bogus RWRITE: 93 -> 4096" I've got this: (gdb) p *req->rc $11 = {size = 11, id = 119 'w', tag = 3, offset = 11, capacity = 8192, sdata = 0xffff8802347b8020 "\v"} (gdb) p *req->tc $12 = {size = 116, id = 118 'v', tag = 3, offset = 0, capacity = 8192, sdata = 0xffff88023479c020 "t"} > The question is whether we are mismatching replies, sending bogus requests or > if it's really the server sending bogus replies. Which qemu version are > you using, BTW? > As I said before qemu's version is 2.2.1. So, I've decided to try kvmtool. It took a bit longer to trigger, but still: [ 466.552432] bogus RWRITE: 57 -> 8168 [ 969.317058] bogus RWRITE: 27 -> 8168 -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/9p/client.c b/net/9p/client.c index 6f4c4c8..80e45a5 100644 --- a/net/9p/client.c +++ b/net/9p/client.c @@ -1638,6 +1638,9 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err) req = p9_client_rpc(clnt, P9_TWRITE, "dqV", fid->fid, offset, rsize, from); } + if (iov_iter_count(from) != count) + printk(KERN_ERR "fucked: iterator got advanced [%d -> %zd]\n", + count, iov_iter_count(from)); if (IS_ERR(req)) { *err = PTR_ERR(req); break; @@ -1649,6 +1652,10 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err) p9_free_req(clnt, req); } + if (count > rsize) + printk(KERN_ERR "fucked: sent %d, server says it got %d (err = %d)\n", + rsize, count, *err); + p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count); p9_free_req(clnt, req);