[git,pull] vfs part 2
diff mbox

Message ID 20150701184408.GF17109@ZenIV.linux.org.uk
State New
Headers show

Commit Message

Al Viro July 1, 2015, 6:44 p.m. UTC
On Wed, Jul 01, 2015 at 02:25:43PM +0300, Andrey Ryabinin wrote:

> I've attached gdb instead.
> So, after message "bogus RWRITE: 93 -> 4096"
> I've got this:
> 
> (gdb) p *req->rc
> $11 = {size = 11, id = 119 'w', tag = 3, offset = 11, capacity = 8192, sdata = 0xffff8802347b8020 "\v"}
> (gdb) p *req->tc
> $12 = {size = 116, id = 118 'v', tag = 3, offset = 0, capacity = 8192, sdata = 0xffff88023479c020 "t"}

*grumble*

Request: id = P9_TWRITE, tag = 3.  Response: id = P9_RWRITE, tag = 3.
Size of request is reasonable: it should be 32bit request size + 8bit id
(TWRITE) + 16bit tag + 32bit fid + 64bit offset + 32bit count + payload,
i.e. 23 + payload size.  23 + 93 == 116, which is what we have there.
Success response is 32bit request size + 8bit id (RWRITE) + 16bit tag + 32bit
count, i.e. 11 bytes and that seems to be what we are getting here.

That would appear to exclude the possibility of bogus request - even if we had
somehow ended up with count == 4096 in TWRITE arguments, server wouldn't have
anywhere to get that much data from and either the things are *really* badly
fucked on server, or it should've replied with RERROR.

To exclude it completely we could check 4 bytes at req->tc->sdata + 19
(that's where count is), but I don't believe that this is where the problem
is.

The thing is, looking through qemu hw/9pfs/virtio-9p.c:v9fs_write() and the
stuff it calls, I don't see any way for that kind of crap to happen there...
Just in case - after the do-while loop in qemu v9fs_write(), just prior
to
    offset = 7;
    err = pdu_marshal(pdu, offset, "d", total);
    if (err < 0) {
        goto out;
    }

could you slap something like
	if (total > count)
		*(char *)0 = 0;

and see if it dumps core on your test?  Or use some more humane form of
debugging - it's been a while since I last ran qemu under gdb and I'd
need more coffee to bring that memory up...

Mismatched reply could also be a possibility, but only if we end up with
sending more than one request with the same tag without waiting for response
for the first one.

The reason why I don't want to let it go just with the "cap count with rsize
in a couple of places in net/9p/client.c" (which we'll definitely need - we
can't assume that server is sane, not compromised, etc.) is that it smells
like a symptom of something very fishy and I'd prefer to get to the root of
that thing...

Oh, lovely - I do see one bug in there that could've lead to bogosities around
the request lifetimes, but it's more recent than 3.19, so we have something
else going on.  In any case, the patch below is needed to fix a fuckup
introduced in 4.0 - getting truncated RWRITE packet from server (as in
"badly malformed response", not "short write") ends up with double-free.
Almost certainly not what we are hitting here, though.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Al Viro July 2, 2015, 3:20 a.m. UTC | #1
On Wed, Jul 01, 2015 at 07:44:08PM +0100, Al Viro wrote:
> Mismatched reply could also be a possibility, but only if we end up with
> sending more than one request with the same tag without waiting for response
> for the first one.

... and I think I see what's going on.  Tags are 16bit.  Suppose the
server stalls for some reason *and* we keep piling the requests up.
New tags keep being grabbed by this:

        tag = P9_NOTAG;
        if (type != P9_TVERSION) {
                tag = p9_idpool_get(c->tagpool);
                if (tag < 0)
                        return ERR_PTR(-ENOMEM);
        }
tag is int here.  Then we pass tag to
        req = p9_tag_alloc(c, tag, req_size);
and that's what sets req->tc->tag.  OK, but... The argument of p9_tag_alloc()
in u16, so after 2^16 pending requests we'll wrap around.  p9_idpool_get()
will happily return values greater than 65535 - it's using idr and it's
used (with different pools) for 16bit tags and 32bit FIDs.

Now, p9_tag_alloc(c, 65539, max_size) will return the same req we'd got from
p9_tag_alloc(c, 3, max_size).  And we are fucked - as far as the server is
concerned, we'd just sent another request with tag 3.  And on the client
there are two threads waiting for responses on the same p9_req_t.  Both
happen to be TWRITE.  Response to the first request arrives and we happen
to let the second thread go at it first.  Voila - the first request had
been for page-sized write() and got successfully handled.  The _second_ one
had been short and is very surprised to see confirmation of 4Kb worth of
data having been written.

It should be easy to confirm - in p9_client_prepare_req() add
		if (WARN_ON_ONCE(tag != (u16)tag)) {
			p9_idpool_put(tag, c->tagpool);
			return ERR_PTR(-ENOMEM);
		}
right after
                tag = p9_idpool_get(c->tagpool);
                if (tag < 0)
                        return ERR_PTR(-ENOMEM);

and see if it triggers.  I'm not sure if failing with ENOMEM is the
right response (another variant is to sleep there until the pile
gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
not for the real work, but it will do for confirming that this is what
we are hitting.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro July 2, 2015, 4:10 a.m. UTC | #2
[9p and sunrpc folks added to Cc]

On Thu, Jul 02, 2015 at 04:20:42AM +0100, Al Viro wrote:
> On Wed, Jul 01, 2015 at 07:44:08PM +0100, Al Viro wrote:
> > Mismatched reply could also be a possibility, but only if we end up with
> > sending more than one request with the same tag without waiting for response
> > for the first one.
> 
> ... and I think I see what's going on.  Tags are 16bit.  Suppose the
> server stalls for some reason *and* we keep piling the requests up.
> New tags keep being grabbed by this:
> 
>         tag = P9_NOTAG;
>         if (type != P9_TVERSION) {
>                 tag = p9_idpool_get(c->tagpool);
>                 if (tag < 0)
>                         return ERR_PTR(-ENOMEM);
>         }
> tag is int here.  Then we pass tag to
>         req = p9_tag_alloc(c, tag, req_size);
> and that's what sets req->tc->tag.  OK, but... The argument of p9_tag_alloc()
> in u16, so after 2^16 pending requests we'll wrap around.  p9_idpool_get()
> will happily return values greater than 65535 - it's using idr and it's
> used (with different pools) for 16bit tags and 32bit FIDs.
> 
> Now, p9_tag_alloc(c, 65539, max_size) will return the same req we'd got from
> p9_tag_alloc(c, 3, max_size).  And we are fucked - as far as the server is
> concerned, we'd just sent another request with tag 3.  And on the client
> there are two threads waiting for responses on the same p9_req_t.  Both
> happen to be TWRITE.  Response to the first request arrives and we happen
> to let the second thread go at it first.  Voila - the first request had
> been for page-sized write() and got successfully handled.  The _second_ one
> had been short and is very surprised to see confirmation of 4Kb worth of
> data having been written.
> 
> It should be easy to confirm - in p9_client_prepare_req() add
> 		if (WARN_ON_ONCE(tag != (u16)tag)) {
> 			p9_idpool_put(tag, c->tagpool);
> 			return ERR_PTR(-ENOMEM);
> 		}
> right after
>                 tag = p9_idpool_get(c->tagpool);
>                 if (tag < 0)
>                         return ERR_PTR(-ENOMEM);
> 
> and see if it triggers.  I'm not sure if failing with ENOMEM is the
> right response (another variant is to sleep there until the pile
> gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
> not for the real work, but it will do for confirming that this is what
> we are hitting.

FWIW, we probably would be better off with throttling rather than ENOMEM
in such situations.  I'm not familiar with sunrpc enough to be sure how
to do that right way (note that RPC equivalent of 9P tags is 32bit, so
the throttling there is based on memory shortage rather than running out
of XID space), but the interesting issues should be similar - potential
deadlocks in near-OOM situations.  Suggestions?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrey Ryabinin July 2, 2015, 7:50 a.m. UTC | #3
[repeating, since my previous email didn't reach mailing lists]

2015-07-02 7:10 GMT+03:00 Al Viro <viro@zeniv.linux.org.uk>:
>> It should be easy to confirm - in p9_client_prepare_req() add
>>               if (WARN_ON_ONCE(tag != (u16)tag)) {
>>                       p9_idpool_put(tag, c->tagpool);
>>                       return ERR_PTR(-ENOMEM);
>>               }
>> right after
>>                 tag = p9_idpool_get(c->tagpool);
>>                 if (tag < 0)
>>                         return ERR_PTR(-ENOMEM);
>>
>> and see if it triggers.  I'm not sure if failing with ENOMEM is the
>> right response (another variant is to sleep there until the pile
>> gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
>> not for the real work, but it will do for confirming that this is what
>> we are hitting.
>

Apparently, I'm seeing something else. That WARN_ON_ONCE didn't trigger.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro July 2, 2015, 7:50 a.m. UTC | #4
On Thu, Jul 02, 2015 at 10:19:07AM +0300, Andrey Ryabinin wrote:
> On 07/02/2015 07:10 AM, Al Viro wrote:
> >>
> >> It should be easy to confirm - in p9_client_prepare_req() add
> >> 		if (WARN_ON_ONCE(tag != (u16)tag)) {
> >> 			p9_idpool_put(tag, c->tagpool);
> >> 			return ERR_PTR(-ENOMEM);
> >> 		}
> >> right after
> >>                 tag = p9_idpool_get(c->tagpool);
> >>                 if (tag < 0)
> >>                         return ERR_PTR(-ENOMEM);
> >>
> >> and see if it triggers.  I'm not sure if failing with ENOMEM is the
> >> right response (another variant is to sleep there until the pile
> >> gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
> >> not for the real work, but it will do for confirming that this is what
> >> we are hitting.
> > 
> 
> Apparently, I'm seeing something else. That WARN_ON_ONCE didn't trigger.

While the one in p9_client_write() (on rsize < count) did?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro July 2, 2015, 7:59 a.m. UTC | #5
On Thu, Jul 02, 2015 at 10:50:05AM +0300, Andrey Ryabinin wrote:

> >> and see if it triggers.  I'm not sure if failing with ENOMEM is the
> >> right response (another variant is to sleep there until the pile
> >> gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
> >> not for the real work, but it will do for confirming that this is what
> >> we are hitting.
> >
> 
> Apparently, I'm seeing something else. That WARN_ON_ONCE didn't trigger.

Summary for those who'd missed the beginning of the thread: what we are
seeing is p9_client_write() issing TWRITE and getting RWRITE in reply
(tags match, packets look plausible) with count in RWRITE way more than
that in TWRITE.

IOW, we are telling the server to write e.g. 93 bytes and are getting told
that yes, the write had been successful - all 4096 bytes of it.

qemu virtio-9p for server; from my reading of qemu side of things, it can't
be sending reply with count greater than that in request.

The bug I suspected to be the cause of that is in tag allocation in
net/9p/client.c - we could end up wrapping around 2^16 with enough pending
requests and that would have triggered that kind of mess.  However, Andrey
doesn't see that test (tag wraparound in p9_client_prepare_req()) trigger.
BTW, was that on the run where debugging printk in p9_client_write() *did*
trigger?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrey Ryabinin July 2, 2015, 8:19 a.m. UTC | #6
On 07/02/2015 10:59 AM, Al Viro wrote:
> On Thu, Jul 02, 2015 at 10:50:05AM +0300, Andrey Ryabinin wrote:
>
>>>> and see if it triggers.  I'm not sure if failing with ENOMEM is the
>>>> right response (another variant is to sleep there until the pile
>>>> gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
>>>> not for the real work, but it will do for confirming that this is what
>>>> we are hitting.
>>>
>>
>> Apparently, I'm seeing something else. That WARN_ON_ONCE didn't trigger.
>
> Summary for those who'd missed the beginning of the thread: what we are
> seeing is p9_client_write() issing TWRITE and getting RWRITE in reply
> (tags match, packets look plausible) with count in RWRITE way more than
> that in TWRITE.
>
> IOW, we are telling the server to write e.g. 93 bytes and are getting told
> that yes, the write had been successful - all 4096 bytes of it.
>
> qemu virtio-9p for server; from my reading of qemu side of things, it can't
> be sending reply with count greater than that in request.

Besides qemu, I've also tried kvmtool with the same result. IOW I'm seeing
this under kvmtool as well. It just takes a bit longer to reproduce
this in kvmtool.

> The bug I suspected to be the cause of that is in tag allocation in
> net/9p/client.c - we could end up wrapping around 2^16 with enough pending
> requests and that would have triggered that kind of mess.  However, Andrey
> doesn't see that test (tag wraparound in p9_client_prepare_req()) trigger.
> BTW, was that on the run where debugging printk in p9_client_write() *did*
> trigger?

Yes, WARN_ON_ONCE() in p9_client_prepare_req() didn't trigger,
but debug printk in p9_client_write() *did* trigger.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro July 2, 2015, 8:25 a.m. UTC | #7
On Thu, Jul 02, 2015 at 11:19:03AM +0300, Andrey Ryabinin wrote:
> Besides qemu, I've also tried kvmtool with the same result. IOW I'm seeing
> this under kvmtool as well. It just takes a bit longer to reproduce
> this in kvmtool.
> 
> > The bug I suspected to be the cause of that is in tag allocation in
> > net/9p/client.c - we could end up wrapping around 2^16 with enough pending
> > requests and that would have triggered that kind of mess.  However, Andrey
> > doesn't see that test (tag wraparound in p9_client_prepare_req()) trigger.
> > BTW, was that on the run where debugging printk in p9_client_write() *did*
> > trigger?
> 
> Yes, WARN_ON_ONCE() in p9_client_prepare_req() didn't trigger,
> but debug printk in p9_client_write() *did* trigger.

Bloody wonderful...  Could you check if v9fs_write() in qemu
hw/9pfs/virtio-9p.c ever gets to
    offset = 7;
    err = pdu_marshal(pdu, offset, "d", total);
with total > count on your testcase?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro July 2, 2015, 8:42 a.m. UTC | #8
On Thu, Jul 02, 2015 at 09:25:30AM +0100, Al Viro wrote:
> On Thu, Jul 02, 2015 at 11:19:03AM +0300, Andrey Ryabinin wrote:
> > Besides qemu, I've also tried kvmtool with the same result. IOW I'm seeing
> > this under kvmtool as well. It just takes a bit longer to reproduce
> > this in kvmtool.
> > 
> > > The bug I suspected to be the cause of that is in tag allocation in
> > > net/9p/client.c - we could end up wrapping around 2^16 with enough pending
> > > requests and that would have triggered that kind of mess.  However, Andrey
> > > doesn't see that test (tag wraparound in p9_client_prepare_req()) trigger.
> > > BTW, was that on the run where debugging printk in p9_client_write() *did*
> > > trigger?
> > 
> > Yes, WARN_ON_ONCE() in p9_client_prepare_req() didn't trigger,
> > but debug printk in p9_client_write() *did* trigger.
> 
> Bloody wonderful...  Could you check if v9fs_write() in qemu
> hw/9pfs/virtio-9p.c ever gets to
>     offset = 7;
>     err = pdu_marshal(pdu, offset, "d", total);
> with total > count on your testcase?

Another thing that might be worth checking: in p9_tag_alloc() (net/9p/client.c)
before
        req->status = REQ_STATUS_ALLOC;
check that req->status == REQ_STATUS_IDLE and yell if it isn't.

BTW, the loop in there (
                /* check again since original check was outside of lock */
                while (tag >= c->max_tag) {
) looks fishy.  If we get more than P9_ROW_MAXTAG allocations at once,
we'll have trouble, but I doubt that this is what we are hitting.  In any
case, adding WARN_ON(c->req[row]); right after
                        row = (tag / P9_ROW_MAXTAG);
wouldn't hurt.  I would be very surprised if that one triggered, though.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Layton July 2, 2015, noon UTC | #9
On Thu, 2 Jul 2015 04:20:42 +0100
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Wed, Jul 01, 2015 at 07:44:08PM +0100, Al Viro wrote:
> > Mismatched reply could also be a possibility, but only if we end up with
> > sending more than one request with the same tag without waiting for response
> > for the first one.
> 
> ... and I think I see what's going on.  Tags are 16bit.  Suppose the
> server stalls for some reason *and* we keep piling the requests up.
> New tags keep being grabbed by this:
> 
>         tag = P9_NOTAG;
>         if (type != P9_TVERSION) {
>                 tag = p9_idpool_get(c->tagpool);
>                 if (tag < 0)
>                         return ERR_PTR(-ENOMEM);
>         }
> tag is int here.  Then we pass tag to
>         req = p9_tag_alloc(c, tag, req_size);
> and that's what sets req->tc->tag.  OK, but... The argument of p9_tag_alloc()
> in u16, so after 2^16 pending requests we'll wrap around.  p9_idpool_get()
> will happily return values greater than 65535 - it's using idr and it's
> used (with different pools) for 16bit tags and 32bit FIDs.
> 
> Now, p9_tag_alloc(c, 65539, max_size) will return the same req we'd got from
> p9_tag_alloc(c, 3, max_size).  And we are fucked - as far as the server is
> concerned, we'd just sent another request with tag 3.  And on the client
> there are two threads waiting for responses on the same p9_req_t.  Both
> happen to be TWRITE.  Response to the first request arrives and we happen
> to let the second thread go at it first.  Voila - the first request had
> been for page-sized write() and got successfully handled.  The _second_ one
> had been short and is very surprised to see confirmation of 4Kb worth of
> data having been written.
> 
> It should be easy to confirm - in p9_client_prepare_req() add
> 		if (WARN_ON_ONCE(tag != (u16)tag)) {
> 			p9_idpool_put(tag, c->tagpool);
> 			return ERR_PTR(-ENOMEM);
> 		}
> right after
>                 tag = p9_idpool_get(c->tagpool);
>                 if (tag < 0)
>                         return ERR_PTR(-ENOMEM);
> 
> and see if it triggers.  I'm not sure if failing with ENOMEM is the
> right response (another variant is to sleep there until the pile
> gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
> not for the real work, but it will do for confirming that this is what
> we are hitting.

ISTM that pd_idpool_get ought to be using idr_alloc_cyclic instead.
That should ensure that it's only allocating values from within the
given range.
Jeff Layton July 2, 2015, 12:07 p.m. UTC | #10
On Thu, 2 Jul 2015 08:00:26 -0400
Jeff Layton <jlayton@poochiereds.net> wrote:

> On Thu, 2 Jul 2015 04:20:42 +0100
> Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
> > On Wed, Jul 01, 2015 at 07:44:08PM +0100, Al Viro wrote:
> > > Mismatched reply could also be a possibility, but only if we end up with
> > > sending more than one request with the same tag without waiting for response
> > > for the first one.
> > 
> > ... and I think I see what's going on.  Tags are 16bit.  Suppose the
> > server stalls for some reason *and* we keep piling the requests up.
> > New tags keep being grabbed by this:
> > 
> >         tag = P9_NOTAG;
> >         if (type != P9_TVERSION) {
> >                 tag = p9_idpool_get(c->tagpool);
> >                 if (tag < 0)
> >                         return ERR_PTR(-ENOMEM);
> >         }
> > tag is int here.  Then we pass tag to
> >         req = p9_tag_alloc(c, tag, req_size);
> > and that's what sets req->tc->tag.  OK, but... The argument of p9_tag_alloc()
> > in u16, so after 2^16 pending requests we'll wrap around.  p9_idpool_get()
> > will happily return values greater than 65535 - it's using idr and it's
> > used (with different pools) for 16bit tags and 32bit FIDs.
> > 
> > Now, p9_tag_alloc(c, 65539, max_size) will return the same req we'd got from
> > p9_tag_alloc(c, 3, max_size).  And we are fucked - as far as the server is
> > concerned, we'd just sent another request with tag 3.  And on the client
> > there are two threads waiting for responses on the same p9_req_t.  Both
> > happen to be TWRITE.  Response to the first request arrives and we happen
> > to let the second thread go at it first.  Voila - the first request had
> > been for page-sized write() and got successfully handled.  The _second_ one
> > had been short and is very surprised to see confirmation of 4Kb worth of
> > data having been written.
> > 
> > It should be easy to confirm - in p9_client_prepare_req() add
> > 		if (WARN_ON_ONCE(tag != (u16)tag)) {
> > 			p9_idpool_put(tag, c->tagpool);
> > 			return ERR_PTR(-ENOMEM);
> > 		}
> > right after
> >                 tag = p9_idpool_get(c->tagpool);
> >                 if (tag < 0)
> >                         return ERR_PTR(-ENOMEM);
> > 
> > and see if it triggers.  I'm not sure if failing with ENOMEM is the
> > right response (another variant is to sleep there until the pile
> > gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
> > not for the real work, but it will do for confirming that this is what
> > we are hitting.
> 
> ISTM that pd_idpool_get ought to be using idr_alloc_cyclic instead.
> That should ensure that it's only allocating values from within the
> given range.
> 

Erm...and why is it passing in '0' to idr_alloc for the end value if it
can't deal with more than 16 bits? That seems like a plain old bug...

The other stuff you've noted should also be fixed of course, but the
IDR usage here could use a little rework.
Al Viro July 2, 2015, 4:45 p.m. UTC | #11
On Thu, Jul 02, 2015 at 08:07:38AM -0400, Jeff Layton wrote:

> Erm...and why is it passing in '0' to idr_alloc for the end value if it
> can't deal with more than 16 bits? That seems like a plain old bug...

Because they are using the same function (with different pool, obviously)
for FID allocation, and those are 32bit...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Layton July 2, 2015, 5:01 p.m. UTC | #12
On Thu, 2 Jul 2015 17:45:35 +0100
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Thu, Jul 02, 2015 at 08:07:38AM -0400, Jeff Layton wrote:
> 
> > Erm...and why is it passing in '0' to idr_alloc for the end value if it
> > can't deal with more than 16 bits? That seems like a plain old bug...
> 
> Because they are using the same function (with different pool, obviously)
> for FID allocation, and those are 32bit...

Ahh, right...

So p9_idpool_create should take an argument for the "end" value, and
then store that in a new field in p9_idpool. Then they can pass that in
as the "end" parm in idr_alloc. Or, they could give up using the same
function there and use a different one for tags and FIDs.

In any case...allowing this thing to allocate tag values that can
collide seems fundamentally wrong. Using idr_alloc_cyclic might also
not hurt either, particularly given that these tag values are supposed
to function something like an XID and you probably don't want to be
reusing them too quickly.
Dominique Martinet July 2, 2015, 5:56 p.m. UTC | #13
Jeff Layton wrote on Thu, Jul 02, 2015:
> So p9_idpool_create should take an argument for the "end" value, and
> then store that in a new field in p9_idpool. Then they can pass that in
> as the "end" parm in idr_alloc. Or, they could give up using the same
> function there and use a different one for tags and FIDs.
> 
> In any case...allowing this thing to allocate tag values that can
> collide seems fundamentally wrong. Using idr_alloc_cyclic might also
> not hurt either, particularly given that these tag values are supposed
> to function something like an XID and you probably don't want to be
> reusing them too quickly.

Using cache=none here so behavious is likely different with cache, but
basically you can't get more than one tag per user thread accessing the
9P mount...
And in RDMA there's a credit so I can't get past whatever sq option was
given (defaults to 32) -- tbh even with other transports I doubt it's
going to get much higher.

Still definitely needs fixing, but I think the issue is somewhere
else... If Andrey could share the workload he uses I can try with other
servers, would be nice if we can rule a qemu bug out completely :)
Al Viro July 2, 2015, 6:40 p.m. UTC | #14
On Thu, Jul 02, 2015 at 01:01:39PM -0400, Jeff Layton wrote:

> So p9_idpool_create should take an argument for the "end" value, and
> then store that in a new field in p9_idpool. Then they can pass that in
> as the "end" parm in idr_alloc. Or, they could give up using the same
> function there and use a different one for tags and FIDs.
> 
> In any case...allowing this thing to allocate tag values that can
> collide seems fundamentally wrong. Using idr_alloc_cyclic might also
> not hurt either, particularly given that these tag values are supposed
> to function something like an XID and you probably don't want to be
> reusing them too quickly.

All they are used for is matching response to request.  Basically, you
can have up to 65535 pending requests.  Reusing it right after getting
the response is fine.

Keep in mind that it's not supposed to be used on top of something like
UDP - *all* retransmits, etc., are responsibility of transport.  It's
connection-oriented, reliable and order-preserving, with a shitload of state
tied to connection, starting with FIDs - unlike FHANDLE, FID meaning depends
upon connection history.  Tags are even more transient.

Basically, the rules are
	* request bears a 16bit tag.
	* server can process pending requests in any order (with one exception)
and it must put the same tag into responses.
	* client can ask to abort a pending request by issuing Tflush[old_tag];
	* server must handle Tflush immediately.  It must drop any pending
request matching old_tag and send Rflush confirming that.  No response to
any request matching old_tag sent before Tflush should be issued after issuing
Rflush.
	* if client has not issued Tflush, it must not reuse the tag until
getting a response bearing that tag.
	* if client *has* issued Tflush, it must not reuse the tag until
getting Rflush, even if it does get response to the request it has tried to
abort. 

BTW, failure to send Tflush means that we should leave the tag in use,
period.  As it is, p9_client_rpc()/p9_client_zc_rpc() ignore such
situations - failure from p9_client_flush() is simply not noticed.
I seriously doubt that this is what we are hitting here, but it's a bug
all the same.

We also must _not_ let p9_client_cb() do anything unless req is non-NULL
and req->status is REQ_STATUS_SENT - stray tags from server shouldn't
be acted upon.  As it is, the whole thing is trivial to oops - just have
server send _any_ R-message with something like 0xfff0 for tag.  End of
story, p9_tag_lookup() returns NULL, p9_client_cb() oopses.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro July 2, 2015, 6:43 p.m. UTC | #15
On Thu, Jul 02, 2015 at 07:56:29PM +0200, Dominique Martinet wrote:

> Using cache=none here so behavious is likely different with cache, but
> basically you can't get more than one tag per user thread accessing the
> 9P mount...

Yes, and...?  You can get a lot more than one user thread...  Andrey is
using trinity(1) on client, and that's *definitely* not single-threaded -
the whole point is stressing the damn thing.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Layton July 2, 2015, 6:59 p.m. UTC | #16
On Thu, 2 Jul 2015 19:56:29 +0200
Dominique Martinet <dominique.martinet@cea.fr> wrote:

> Jeff Layton wrote on Thu, Jul 02, 2015:
> > So p9_idpool_create should take an argument for the "end" value, and
> > then store that in a new field in p9_idpool. Then they can pass that in
> > as the "end" parm in idr_alloc. Or, they could give up using the same
> > function there and use a different one for tags and FIDs.
> > 
> > In any case...allowing this thing to allocate tag values that can
> > collide seems fundamentally wrong. Using idr_alloc_cyclic might also
> > not hurt either, particularly given that these tag values are supposed
> > to function something like an XID and you probably don't want to be
> > reusing them too quickly.
> 
> Using cache=none here so behavious is likely different with cache, but
> basically you can't get more than one tag per user thread accessing the
> 9P mount...
> And in RDMA there's a credit so I can't get past whatever sq option was
> given (defaults to 32) -- tbh even with other transports I doubt it's
> going to get much higher.
> 
> Still definitely needs fixing, but I think the issue is somewhere
> else... If Andrey could share the workload he uses I can try with other
> servers, would be nice if we can rule a qemu bug out completely :)
> 

Fair enough...

If you're in there and decide to fix this up, then consider moving this
over to IDA instead of IDR. The pointers stored are not terribly
interesting (always the same as the p9_idpool), so by doing that you'll
save quite a bit of memory as well.
Linus Torvalds July 2, 2015, 7:16 p.m. UTC | #17
On Thu, Jul 2, 2015 at 11:40 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> All they are used for is matching response to request.  Basically, you
> can have up to 65535 pending requests.  Reusing it right after getting
> the response is fine.

Reusing a tag right after getting the completion may be fine in
theory, but it still sounds like a bad idea. Sure, it's used to match
the command with the reply, but using those kinds of things for
matching re-sends and to index into various "current data structures"
is also very common (not having looked at p9 I don't know how much it
does), and basically reusing tags "soon" tends to make those kidns of
things fragile.

Which can easily turn a "this _should_ work" into "it doesn't
_actually_ work" just because it ends up making things like race
conditions and re-ordering of replies trigger worse behavior. For
example, things like "done with previous command X" and "now starting
new command X" - if the tag is the same and those *independent*
messages get re-ordered, the tag just failed in what it was supposed
to do.

So circular allocators are likely a good idea even if there are other
layers that should handle retransmits etc.

So it does sound like it would be better to use a circular tag
allocator rather than a "lowest tag first" allocator.

            Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrey Ryabinin July 2, 2015, 8:36 p.m. UTC | #18
2015-07-02 20:56 GMT+03:00 Dominique Martinet <dominique.martinet@cea.fr>:
>
> Still definitely needs fixing, but I think the issue is somewhere
> else... If Andrey could share the workload he uses I can try with other
> servers, would be nice if we can rule a qemu bug out completely :)
>

I simply run trinity from 9p rootfs:
         ./trinity  -qqq --dangerous -xinit_module -C100

qemu guest, virtio transport.

BTW, I've discovered that all this bogus writes comes from trinity
logger (log.c)
which just opens one file per thread (trinity-childX.log) and writes to it.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro July 2, 2015, 8:44 p.m. UTC | #19
On Thu, Jul 02, 2015 at 12:16:14PM -0700, Linus Torvalds wrote:
> On Thu, Jul 2, 2015 at 11:40 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > All they are used for is matching response to request.  Basically, you
> > can have up to 65535 pending requests.  Reusing it right after getting
> > the response is fine.
> 
> Reusing a tag right after getting the completion may be fine in
> theory, but it still sounds like a bad idea. Sure, it's used to match
> the command with the reply, but using those kinds of things for
> matching re-sends and to index into various "current data structures"
> is also very common (not having looked at p9 I don't know how much it
> does), and basically reusing tags "soon" tends to make those kidns of
> things fragile.

_All_ retransmits are done in transport layer there.  It's not NFS - it
really expects reliable ordered connection for transport.  No retransmits,
no duplicates, etc.  I'm not dead against circular allocation, but I would
really like to figure out what's going on first.

I still wonder if we are seeing wraparound (should've posted a diff instead
of verbal description - mea culpa).  If we are not, it smells like response
to request having arrived while the tag had been not in use from the client
POV, _or_ buggered barriers of some kind.  Maybe buggered ordering of
replies somewhere, but that's only if Tflush had been involved (as in
-> Twrite tag = 3
-> Tflush tag = 42 old_tag = 3		<- Rwrite tag = 3
<- Rflush tag = 42
mark tag 3 free to be reused
reuse tag 3
... somehow get to seeing Rwrite only now

But I don't see where such ordering violation could've happened at the moment.
The way it's supposed to work is that the sequence
-> Twhatever tag = N
-> Tflush old_tag = N
must either end up with no response to the former arriving at all, or
arriving before the response to the latter.  Transport itself does preserve
ordering (TCP certainly would, but virtio queue also does, AFAICS) and
we really need to have p9_client_cb() called in order of arrival.

Hmm...  This is a stab in the dark, but... we have vring_interrupt() calling
req_done(), which does
        while (1) {
                spin_lock_irqsave(&chan->lock, flags);
                rc = virtqueue_get_buf(chan->vq, &len);
                if (rc == NULL) {
                        spin_unlock_irqrestore(&chan->lock, flags);
                        break;
                }
                chan->ring_bufs_avail = 1;
                spin_unlock_irqrestore(&chan->lock, flags);
                /* Wakeup if anyone waiting for VirtIO ring space. */
                wake_up(chan->vc_wq);
                p9_debug(P9_DEBUG_TRANS, ": rc %p\n", rc);
                p9_debug(P9_DEBUG_TRANS, ": lookup tag %d\n", rc->tag);
                req = p9_tag_lookup(chan->client, rc->tag);
                p9_client_cb(chan->client, req, REQ_STATUS_RCVD);
        }

What's to prevent *another* vring_interrupt() (called from some kind of
IRQ handler) hitting on another CPU and competing with this one for the
queue?  While we are at it, both p9_tag_lookup() and p9_client_cb()
should be find with being called under spin_lock_irqsave, so why not
hold it outside of the loop?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dominique Martinet July 2, 2015, 9 p.m. UTC | #20
Al Viro wrote on Thu, Jul 02, 2015:
> On Thu, Jul 02, 2015 at 07:56:29PM +0200, Dominique Martinet wrote:
> > Using cache=none here so behavious is likely different with cache, but
> > basically you can't get more than one tag per user thread accessing the
> > 9P mount...
> 
> Yes, and...?  You can get a lot more than one user thread...  Andrey is
> using trinity(1) on client, and that's *definitely* not single-threaded -
> the whole point is stressing the damn thing.

I have run trinity quite a bit and it doesn't fork bomb as far as I can
recall, with him running it with -C100 we're not quite at 2^16 yet?

I do agree it's a problem, just don't think it's the one we're
hitting -- I'll try again on a recent kernel to see if anything changed
with rdma/tcp as well, but I'm starting to doubt I'll get any luck with
anything other than virtio; which doesn't really help since it's not the
same order of latencies.

FWIW I don't *think* trinity can issue TFlush either without user
interaction, that's a really special call. It can only happen in rpc()
or zc_rpc() if it's interrupted by ERESTARTSYS which I understand as ^C?



(I'll look into making the pools use IDA unless someone else steps up,
sure. Thanks Jeff)

Patch
diff mbox

diff --git a/net/9p/client.c b/net/9p/client.c
index 6f4c4c8..ca3b342 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1647,6 +1647,7 @@  p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
 		if (*err) {
 			trace_9p_protocol_dump(clnt, req->rc);
 			p9_free_req(clnt, req);
+			break;
 		}
 
 		p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count);