
[V9fs-developer] 9pfs hangs since 4.7

Message ID 20170104013355.4a8923b6@duuni (mailing list archive)
State Superseded, archived

Commit Message

Tuomas Tynkkynen Jan. 3, 2017, 11:34 p.m. UTC
On Mon, 2 Jan 2017 16:23:09 +0000
Al Viro <viro@ZenIV.linux.org.uk> wrote:


> 
> What I'd like to see is a log of 9p traffic in those; to hell with the
> payloads, just the type and tag of each message [...]

Thanks for the suggestions. With the following patch to QEMU (the full diff
appears in the Patch section at the end of this page):


I got these logs from the server & client with 9p tracepoints enabled:

https://gist.githubusercontent.com/dezgeg/02447100b3182167403099fe7de4d941/raw/3772e408ddf586fb662ac9148fc10943529a6b99/dmesg%2520with%25209p%2520trace
https://gist.githubusercontent.com/dezgeg/e1e0c7f354042e1d9bdf7e9135934a65/raw/3a0e3b4f7a5229fd0be032c6839b578d47a21ce4/qemu.log



Comments

Al Viro Jan. 4, 2017, 1:47 a.m. UTC | #1
On Wed, Jan 04, 2017 at 01:34:50AM +0200, Tuomas Tynkkynen wrote:

> I got these logs from the server & client with 9p tracepoints enabled:
> 
> https://gist.githubusercontent.com/dezgeg/02447100b3182167403099fe7de4d941/raw/3772e408ddf586fb662ac9148fc10943529a6b99/dmesg%2520with%25209p%2520trace
> https://gist.githubusercontent.com/dezgeg/e1e0c7f354042e1d9bdf7e9135934a65/raw/3a0e3b4f7a5229fd0be032c6839b578d47a21ce4/qemu.log
> 

Lovely...

27349:out 8 0001 		TSTATFS, tag 1
27350:out 12 0001		TLOPEN, tag 1
27351:complete 9 0001		RSTATFS, tag 1
27352:complete 13 0001		RLOPEN, tag 1

27677:out 8 0001		TSTATFS, tag 1
27678:out 12 0001		TLOPEN, tag 1
27679:complete 9 0001		RSTATFS, tag 1
27680:complete 13 0001		RLOPEN, tag 1

29667:out 8 0001		TSTATFS, tag 1
29668:out 110 0001		TWALK, tag 1
29669:complete 9 0001		RSTATFS, tag 1
29670:complete 7 0001		RLERROR, tag 1

42317:out 110 0001		TWALK, tag 1
42318:out 8 0001		TSTATFS, tag 1
42319:complete 9 0001		RSTATFS, tag 1
42320:complete 7 0001		RERROR, tag 1

Those are outright protocol violations: a tag can be reused only after either
a reply bearing the same tag has arrived *or* a TFLUSH for that tag has been
issued and a successful (type == RFLUSH) reply bearing the tag of that TFLUSH
has arrived.  Your log doesn't contain any TFLUSH (108) at all, so it should've
been plain and simple "no reuse until the server sends a reply with a matching
tag".

Otherwise the dump looks sane.  In the end all issued requests had been
served, so it's not as if the client had been waiting for a free tag or for
a response to be produced by the server.

Interesting...  dmesg doesn't seem to contain the beginning of the 9P trace -
only 89858 out of 108818 in the dump, even though it claims to have lost
only 4445 events.  The last exchanges in the trace are
> P9_TLOPEN tag  90 err 0
> P9_TLOPEN tag  25 err 0
> P9_TLOPEN tag  15 err 0
> P9_TLOPEN tag  104 err 0
> P9_TLOPEN tag  102 err 0
> P9_TLOPEN tag  91 err 0
> P9_TLOPEN tag  12 err 0
< P9_TREADLINK tag  12

which should've been
complete 13 005a
complete 13 0019
complete 13 000f
complete 13 0068
complete 13 0066
complete 13 005b
out 22 000c

and the only plausible match for that is

108784:complete 13 005a
108798:complete 13 000f
108799:complete 13 0019
108800:complete 13 0068
108801:complete 13 0066
108802:complete 13 005b

but there's a bunch of replies in between that hadn't been logged *and* such
a TREADLINK request isn't seen after
108341:out 22 000c

... and sure enough, there's a whole lot of processes stuck in readlink.

Interesting...  Which version of qemu/what arguments are you using there?

Tuomas Tynkkynen Jan. 4, 2017, 8:04 p.m. UTC | #2
On Wed, 4 Jan 2017 01:47:53 +0000
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Wed, Jan 04, 2017 at 01:34:50AM +0200, Tuomas Tynkkynen wrote:
> 
> > I got these logs from the server & client with 9p tracepoints enabled:
> > 
> > https://gist.githubusercontent.com/dezgeg/02447100b3182167403099fe7de4d941/raw/3772e408ddf586fb662ac9148fc10943529a6b99/dmesg%2520with%25209p%2520trace
> > https://gist.githubusercontent.com/dezgeg/e1e0c7f354042e1d9bdf7e9135934a65/raw/3a0e3b4f7a5229fd0be032c6839b578d47a21ce4/qemu.log
> >   
> 
> Lovely...
> 
> 27349:out 8 0001 		TSTATFS, tag 1
> 27350:out 12 0001		TLOPEN, tag 1
> 27351:complete 9 0001		RSTATFS, tag 1
> 27352:complete 13 0001		RLOPEN, tag 1
> 
> 27677:out 8 0001		TSTATFS, tag 1
> 27678:out 12 0001		TLOPEN, tag 1
> 27679:complete 9 0001		RSTATFS, tag 1
> 27680:complete 13 0001		RLOPEN, tag 1
> 
> 29667:out 8 0001		TSTATFS, tag 1
> 29668:out 110 0001		TWALK, tag 1
> 29669:complete 9 0001		RSTATFS, tag 1
> 29670:complete 7 0001		RLERROR, tag 1
> 
> 42317:out 110 0001		TWALK, tag 1
> 42318:out 8 0001		TSTATFS, tag 1
> 42319:complete 9 0001		RSTATFS, tag 1
> 42320:complete 7 0001		RERROR, tag 1
> 
> Those are outright protocol violations: tag can be reused only after either
> a reply bearing the same tag has arrived *or* TFLUSH for that tag had been
> issued and successful (type == RFLUSH) reply bearing the tag of that TFLUSH
> has arrived.  Your log doesn't contain any TFLUSH (108) at all, so it should've
> been plain and simple "no reuse until server sends a reply with matching tag".
> 

Argh, I had completely forgotten there is another 9p mount involved, so the log
is mixed between the two mounts. Apologies, there is now a new trace attached
with the V9fsState pointer dumped in the mix as well.

> Otherwise the the dump looks sane.  In the end all issued requests had been
> served, so it's not as if the client had been waiting for a free tag or for
> a response to be produced by the server.
> 
> Interesting...  dmesg doesn't seem to contain the beginning of the 9P trace -
> only 89858 out of 108818 in the dump, even though it claims to have lost
> only 4445 events.  [...]

Here's logs that should be complete this time:

https://gist.githubusercontent.com/dezgeg/08629d4c8ca79da794bc087e5951e518/raw/a1a82b9bc24e5282c82beb43a9dc91974ffcf75a/9p.qemu.log
https://gist.githubusercontent.com/dezgeg/1d5f1cc0647e336c59989fc62780eb2e/raw/d7623755a895b0441c608ddba366d20bbf47ab0b/9p.dmesg.log

> 
> Interesting...  Which version of qemu/what arguments are you using there?

This is QEMU 2.7.0, with these on the server side:
    -virtfs local,path=/nix/store,security_model=none,mount_tag=store
    -virtfs local,path=$TMPDIR/xchg,security_model=none,mount_tag=xchg

and on the client side:

store on /nix/store type 9p (rw,relatime,dirsync,trans=virtio,version=9p2000.L,cache=loose)
xchg on /tmp/xchg type 9p (rw,relatime,dirsync,trans=virtio,version=9p2000.L,cache=loose)

'store' is the one receiving the hammering, 'xchg' I think is mostly sitting idle.

Al Viro Jan. 4, 2017, 11:01 p.m. UTC | #3
> Here's logs that should be complete this time:
> 
> https://gist.githubusercontent.com/dezgeg/08629d4c8ca79da794bc087e5951e518/raw/a1a82b9bc24e5282c82beb43a9dc91974ffcf75a/9p.qemu.log
> https://gist.githubusercontent.com/dezgeg/1d5f1cc0647e336c59989fc62780eb2e/raw/d7623755a895b0441c608ddba366d20bbf47ab0b/9p.dmesg.log

Fun.  All requests prior to
[  360.110282]       dd-1899    1.... 18497262us : 9p_client_req: client 18446612134390128640 request P9_TWALK tag  25
line in dmesg had been seen by the servers; all requests starting with that
one had not.  Replies to earlier requests kept arriving just fine.

From the server side, everything looks nice and sane - it has processed
all requests it had seen, and aside from a slight difference in the arrival
order, the server-side and client-side logs match, except for the last 26
requests the client claims to have sent and the server has never seen.

All traffic for another client (there had been less of it) has ceased long
before that point, so we can't really tell if it's just this client that
got buggered.  Interesting...

Looking at the tracepoints, those requests got through p9_client_prepare_req();
we have no idea whether they got through p9_virtio_request().  OTOH, AFAICS
nothing had been sleeping in there...

FWIW, it might be interesting to try
	WARN_ON(!virtqueue_kick(chan->vq));
in p9_virtio_request() (instead of a blind virtqueue_kick(chan->vq)) and see
if it triggers.  Incidentally, it looks like p9_virtio_request() ought to
return an error if that happens...
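
For reference, a minimal sketch of what that would look like near the end of
p9_virtio_request() in net/9p/trans_virtio.c - the surrounding code is
paraphrased from memory, and returning -EIO is only an assumption about how
such a failure ought to be propagated:

	/* was: virtqueue_kick(chan->vq); */
	if (WARN_ON(!virtqueue_kick(chan->vq)))
		err = -EIO;	/* assumption: report the failed kick to the caller */
	spin_unlock_irqrestore(&chan->lock, flags);
	return err;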

Greg Kurz Jan. 6, 2017, 1:52 p.m. UTC | #4
On Wed, 4 Jan 2017 23:01:01 +0000
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> > Here's logs that should be complete this time:
> > 
> > https://gist.githubusercontent.com/dezgeg/08629d4c8ca79da794bc087e5951e518/raw/a1a82b9bc24e5282c82beb43a9dc91974ffcf75a/9p.qemu.log
> > https://gist.githubusercontent.com/dezgeg/1d5f1cc0647e336c59989fc62780eb2e/raw/d7623755a895b0441c608ddba366d20bbf47ab0b/9p.dmesg.log  
> 
> Fun.  All requests prior to
> [  360.110282]       dd-1899    1.... 18497262us : 9p_client_req: client 18446612134390128640 request P9_TWALK tag  25
> line in dmesg had been seen by the servers; all requests starting with that
> one had not.  Replies to earlier requests kept arriving just fine.
> 

Looking at the tag numbers, I think we're hitting the hardcoded limit of 128
simultaneous requests in QEMU (which doesn't produce any error, new requests
are silently dropped).

Tuomas, can you change MAX_REQ to some higher value (< 65535, since the tag is
2 bytes and 0xffff is reserved) to confirm?
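
For illustration, the experiment is just a matter of bumping the #define
wherever MAX_REQ lives in the QEMU 9pfs code (shown as a sketch; the exact
header and formatting may differ between versions):

-#define MAX_REQ         128
+#define MAX_REQ         1024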

> From the server side, everything looks nice and sane - it has processed
> all requests it had seen, and aside of slight difference in the arrival
> order server-side and client-side logs match, except for the last 26
> requests the client claims to have sent and the server has never seen.
> 
> All traffic for another client (there had been less of it) has ceased long
> before that point, so we can't really tell if it's just this client that
> got buggered.  Interesting...
> 
> Looking at the tracepoints, those requests got through p9_client_prepare_req();
> we have no idea whether it got through p9_virtio_request().  OTOH, AFAICS
> nothing had been sleeping in there...
> 
> FWIW, it might be interesting to try
> 	WARN_ON(!virtqueue_kick(chan->vq));
> in p9_virtio_request() (instead of blind virtqueue_kick(chan->vq)) and see
> if it triggers.  Incidentally, it looks like p9_virtio_request() ought to
> return an error if that happens...
> 


Al Viro Jan. 7, 2017, 6:26 a.m. UTC | #5
On Fri, Jan 06, 2017 at 02:52:35PM +0100, Greg Kurz wrote:

> Looking at the tag numbers, I think we're hitting the hardcoded limit of 128
> simultaneous requests in QEMU (which doesn't produce any error, new requests
> are silently dropped).
> 
> Tuomas, can you change MAX_REQ to some higher value (< 65535 since tag is
> 2-byte and 0xffff is reserved) to confirm ?

Huh?

Just how is a client supposed to cope with that behaviour?  9P is not
SunRPC - there's a reason why it doesn't live on top of UDP.  Sure, it's
datagram-oriented, but it really wants reliable transport...

Setting the ring size at MAX_REQ is fine; that'll give you ENOSPC on
attempt to put a request there, and p9_virtio_request() will wait for
things to clear, but if you've accepted a request, that's bloody it -
you really should go and handle it.
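
For context, that ENOSPC-and-wait path on the client side looks roughly like
this (paraphrased from net/9p/trans_virtio.c; the exact wait primitive and
error handling have varied between kernel versions):

	err = virtqueue_add_sgs(chan->vq, sgs, out_sgs, in_sgs, req, GFP_ATOMIC);
	if (err == -ENOSPC) {
		/* ring full: note it, wait for req_done() to signal free
		 * slots, then retry the whole submission (req_retry label) */
		chan->ring_bufs_avail = 0;
		spin_unlock_irqrestore(&chan->lock, flags);
		err = wait_event_interruptible(*chan->vc_wq, chan->ring_bufs_avail);
		if (err == -ERESTARTSYS)
			return err;
		goto req_retry;
	}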

How does it happen, anyway?  qemu-side, I mean...  Does it move the buffer
to used ring as soon as it has fetched the request?  AFAICS, it doesn't -
virtqueue_push() is called just before pdu_free(); we might get complications
in case of TFLUSH handling: queue with MAX_REQ-1 requests submitted, TFLUSH
arrives, cancel_pdu is found and ->cancelled is set on it, then v9fs_flush()
waits for it to complete.  Once the damn thing is done, the buffer is released
by virtqueue_push(), but pdu freeing is delayed until v9fs_flush() gets woken
up.  In the meanwhile, another request arrives into the slot freed by
that virtqueue_push() and we are out of pdus.

So it could happen, and things might get unpleasant to some extent, but...
no TFLUSH had been present in all that traffic.  And none of the stuck
processes had been spinning in p9_virtio_request(), so they *did* find
ring slots...

Greg Kurz Jan. 7, 2017, 3:10 p.m. UTC | #6
On Sat, 7 Jan 2017 06:26:47 +0000
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Fri, Jan 06, 2017 at 02:52:35PM +0100, Greg Kurz wrote:
> 
> > Looking at the tag numbers, I think we're hitting the hardcoded limit of 128
> > simultaneous requests in QEMU (which doesn't produce any error, new requests
> > are silently dropped).
> > 
> > Tuomas, can you change MAX_REQ to some higher value (< 65535 since tag is
> > 2-byte and 0xffff is reserved) to confirm ?  
> 
> Huh?
> 
> Just how is a client supposed to cope with that behaviour?  9P is not
> SunRPC - there's a reason why it doesn't live on top of UDP.  Sure, it's
> datagram-oriented, but it really wants reliable transport...
> 
> Setting the ring size at MAX_REQ is fine; that'll give you ENOSPC on
> attempt to put a request there, and p9_virtio_request() will wait for
> things to clear, but if you've accepted a request, that's bloody it -
> you really should go and handle it.
> 

Yes, you're right, and "dropped" in my previous mail actually meant "not
accepted" (virtqueue_pop() not called)... sorry for the confusion. :-\

> How does it happen, anyway?  qemu-side, I mean...  Does it move the buffer
> to used ring as soon as it has fetched the request?  AFAICS, it doesn't -
> virtqueue_push() is called just before pdu_free(); we might get complications
> in case of TFLUSH handling (queue with MAX_REQ-1 requests submitted, TFLUSH
> arrives, cancel_pdu is found and ->cancelled is set on it, then v9fs_flush()
> waits for it to complete.  Once the damn thing is done, buffer is released by
> virtqueue_push(), but pdu freeing is delayed until v9fs_flush() gets woken
> up.  In the meanwhile, another request arrives into the slot of freed by
> that virtqueue_push() and we are out of pdus.
> 

Indeed. Even if this doesn't seem to be the problem here, I guess this should
be fixed.

> So it could happen, and the things might get unpleasant to some extent, but...
> no TFLUSH had been present in all that traffic.  And none of the stuck
> processes had been spinning in p9_virtio_request(), so they *did* find
> ring slots...

So we're back to your previous proposal of checking if virtqueue_kick() returned
false...

Al Viro Jan. 7, 2017, 5:19 p.m. UTC | #7
On Sat, Jan 07, 2017 at 04:10:45PM +0100, Greg Kurz wrote:
> > virtqueue_push(), but pdu freeing is delayed until v9fs_flush() gets woken
> > up.  In the meanwhile, another request arrives into the slot of freed by
> > that virtqueue_push() and we are out of pdus.
> > 
> 
> Indeed. Even if this doesn't seem to be the problem here, I guess this should
> be fixed.

	FWIW, there's something that looks like an off-by-one in
v9fs_device_realize_common():
    /* initialize pdu allocator */
    QLIST_INIT(&s->free_list);
    QLIST_INIT(&s->active_list);
    for (i = 0; i < (MAX_REQ - 1); i++) {
        QLIST_INSERT_HEAD(&s->free_list, &s->pdus[i], next);
        s->pdus[i].s = s;
        s->pdus[i].idx = i;
    }

It has been there since the original merge of 9p support into qemu - that code
had moved around a bit, but it had never inserted s->pdus[MAX_REQ - 1] into
the free list.  So your scenario with a failing pdu_alloc() is still possible.
In that log the total number of pending requests had reached 128 for the
first time right when the requests had stopped being handled, and even
though it had dropped below that shortly after, extra requests being put
into the queue had not been processed at all...
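
FWIW, if that loop bound is indeed the only problem, the obvious fix would be
the one-liner (sketch of the s/(MAX_REQ - 1)/MAX_REQ/ change):

-    for (i = 0; i < (MAX_REQ - 1); i++) {
+    for (i = 0; i < MAX_REQ; i++) {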

I'm not familiar enough with qemu guts to tell if that's a plausible scenario,
though... shouldn't subsequent queue insertions (after enough slots had been
released) simply trigger virtio_queue_notify_vq() again?  It *is* a bug
(if we get a burst filling a previously empty queue all at once, there won't
be any slots becoming freed), but that's obviously not the case here -
slots were getting freed, after all.

Al Viro Jan. 7, 2017, 6:15 p.m. UTC | #8
On Sat, Jan 07, 2017 at 05:19:10PM +0000, Al Viro wrote:

> released) simply trigger virtio_queue_notify_vq() again?  It *is* a bug
> (if we get a burst filling a previously empty queue all at once, there won't
> be any slots becoming freed

Umm...  Nope, even in that scenario we'll have all requests except the last
one processed.  I'm trying to put together a simpler reproducer, but...
no luck so far.

Al Viro Jan. 8, 2017, 5:46 a.m. UTC | #9
On Sat, Jan 07, 2017 at 06:15:23PM +0000, Al Viro wrote:
> On Sat, Jan 07, 2017 at 05:19:10PM +0000, Al Viro wrote:
> 
> > released) simply trigger virtio_queue_notify_vq() again?  It *is* a bug
> > (if we get a burst filling a previously empty queue all at once, there won't
> > be any slots becoming freed
> 
> Umm...  Nope, even in that scenario we'll have all requests except the last
> one processed.  I'm trying to put together a simpler reproducer, but...
> no luck so far.

FWIW, here's another fun bug in qemu 9pfs: a client may issue multiple
Tflush on the same pending request.  The server must not reply to them
out of order, and it must reply to at least the last one.  If the client has
sent more than one Tflush, it must treat the arrival of an Rflush to one of
those as if it had received an Rflush to each of the earlier Tflush as
well (IOW, *their* tags are released).  The client may not reuse the tag of
the victim request until it has received an Rflush for the last Tflush it has
sent.

The Linux kernel-side 9p client doesn't issue such multiple Tflush, but the
protocol allows that.

The qemu server does not guarantee in-order replies to multiple Tflush; worse,
while it is likely to reply only to one of those, it may very well be
the _first_ one.  Subsequent ones will be lost; moreover, it'll leak
both coroutines and ring slots.

AFAICS, a clean way to deal with that would be to handle Tflush synchronously,
right in pdu_submit():
	if pdu->type == Tflush
		look the victim request up.
		if !victim || victim == pdu	// [*]
			send reply and free pdu immediately.
		if victim->type == Tflush	// [**]
			pdu->victim = victim->victim
		else
			pdu->victim = victim
			mark the victim cancelled
		add pdu to the tail of pdu->victim->flushes
and let pdu_complete(pdu) send a reply to each request in pdu->flushes
as well (freeing them as we go)
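
To make that a bit more concrete, here is a rough C sketch of the Tflush
branch that pdu_submit() could grow.  This is only an illustration, not
tested code: 'victim', the 'flushes' tail queue and its 'flush_next' link
are assumed new fields on V9fsPDU, and complete_flush_now() stands in for
whatever helper would marshal the Rflush and free the pdu:

    if (pdu->id == P9_TFLUSH) {
        V9fsState *s = pdu->s;
        V9fsPDU *victim = NULL, *p;
        uint16_t oldtag = 0;    /* would be unmarshalled from the Tflush payload */

        QLIST_FOREACH(p, &s->active_list, next) {
            if (p->tag == oldtag) {
                victim = p;
                break;
            }
        }
        if (!victim || victim == pdu) {
            /* nothing in flight, or a Tflush aimed at itself: reply right away */
            complete_flush_now(pdu);
            return;
        }
        if (victim->id == P9_TFLUSH) {
            /* Tflush to Tflush: chain to the original victim instead */
            pdu->victim = victim->victim;
        } else {
            pdu->victim = victim;
            victim->cancelled = true;
        }
        /* replied to (in order), and freed, from pdu_complete(pdu->victim) */
        QTAILQ_INSERT_TAIL(&pdu->victim->flushes, pdu, flush_next);
        return;
    }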

Some locking might be needed to avoid the races between pdu_submit() and
pdu_complete(), but it's not going to be a wide critical area.  I'm not
familiar with qemu locking primitives, sorry; what's providing an exclusion
between QLIST_INSERT_HEAD in pdu_alloc() (from handle_9p_output())
and QLIST_REMOVE in pdu_free() (from pdu_complete())?  Looks like the same
thing ought to suffice here...

[*] a cute way to hurt the current qemu server, BTW - coroutine that waits for
itself to complete...

[**] Tflush to Tflush is another fun corner case - it does *not* do anything
to the earlier Tflush, but the reply to it must follow that to the original
Tflush to avoid tag reuse problems.  Note that this is not the "multiple
Tflush" case mentioned above - those would all have oldtag pointing to the
same request; they are not chained and, unlike this one, can happen with
legitimate clients.  Tflush to Tflush, OTOH, is not something a sane client
would do.

Greg Kurz Jan. 9, 2017, 6:29 p.m. UTC | #10
On Sat, 7 Jan 2017 17:19:10 +0000
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Sat, Jan 07, 2017 at 04:10:45PM +0100, Greg Kurz wrote:
> > > virtqueue_push(), but pdu freeing is delayed until v9fs_flush() gets woken
> > > up.  In the meanwhile, another request arrives into the slot of freed by
> > > that virtqueue_push() and we are out of pdus.
> > >   
> > 
> > Indeed. Even if this doesn't seem to be the problem here, I guess this should
> > be fixed.  
> 
> 	FWIW, there's something that looks like an off-by-one in
> v9fs_device_realize_common():
>     /* initialize pdu allocator */
>     QLIST_INIT(&s->free_list);
>     QLIST_INIT(&s->active_list);
>     for (i = 0; i < (MAX_REQ - 1); i++) {
>         QLIST_INSERT_HEAD(&s->free_list, &s->pdus[i], next);
>         s->pdus[i].s = s;
>         s->pdus[i].idx = i;
>     }
> 
> Had been there since the original merge of 9p support into qemu - that code
> had moved around a bit, but it had never inserted s->pdus[MAX_REQ - 1] into
> free list.  So your scenario with failing pdu_alloc() is still possible.

Indeed, this (MAX_REQ - 1) thing looks wrong. Thanks for pointing that out.

> In that log the total amount of pending requests has reached 128 for the
> first time right when the requests had stopped being handled and even
> though it had dropped below that shortly after, extra requests being put
> into queue had not been processed at all...
> 
> I'm not familiar with qemu guts enough to tell if that's a plausible scenario,
> though... shouldn't subsequent queue insertions (after enough slots had been
> released) simply trigger virtio_queue_notify_vq() again?  It *is* a bug
> (if we get a burst filling a previously empty queue all at once, there won't
> be any slots becoming freed), but that's obviously not the case here -
> slots were getting freed, after all.

Tuomas Tynkkynen Jan. 9, 2017, 6:39 p.m. UTC | #11
On Sat, 7 Jan 2017 17:19:10 +0000
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Sat, Jan 07, 2017 at 04:10:45PM +0100, Greg Kurz wrote:
> > > virtqueue_push(), but pdu freeing is delayed until v9fs_flush() gets woken
> > > up.  In the meanwhile, another request arrives into the slot of freed by
> > > that virtqueue_push() and we are out of pdus.
> > >   
> > 
> > Indeed. Even if this doesn't seem to be the problem here, I guess this should
> > be fixed.  
> 
> 	FWIW, there's something that looks like an off-by-one in
> v9fs_device_realize_common():
>     /* initialize pdu allocator */
>     QLIST_INIT(&s->free_list);
>     QLIST_INIT(&s->active_list);
>     for (i = 0; i < (MAX_REQ - 1); i++) {
>         QLIST_INSERT_HEAD(&s->free_list, &s->pdus[i], next);
>         s->pdus[i].s = s;
>         s->pdus[i].idx = i;
>     }
> 
> Had been there since the original merge of 9p support into qemu - that code
> had moved around a bit, but it had never inserted s->pdus[MAX_REQ - 1] into
> free list.  So your scenario with failing pdu_alloc() is still possible.
> In that log the total amount of pending requests has reached 128 for the
> first time right when the requests had stopped being handled and even
> though it had dropped below that shortly after, extra requests being put
> into queue had not been processed at all...
> 

Yes, this does seem to be related to this off-by-one, or otherwise MAX_REQ-related!
- Bumping MAX_REQ up to 1024 makes the hang go away (on 4.7).
- Dropping it to 64 makes the same hang happen on kernels where it worked
  before (I tried 4.4.x).
- Doing s/(MAX_REQ - 1)/MAX_REQ/ makes the hang go away.

I tested QEMU 2.8.0 as well and the behaviour is the same there.

Here are the logs for 4.4 hanging with MAX_REQ == 64 without the loop
condition changed:

https://gist.githubusercontent.com/dezgeg/b5f0b7f8a0f3d8b6acb1566d7edcb2f0/raw/00241426890ea28d844986243c3b706881432fb4/9p-44.dmesg.log
https://gist.githubusercontent.com/dezgeg/2bffe6c0271c4c9c382ac6363ce1864b/raw/92329eaef38305f82090de5dde3c944561afa372/9p-44.qemu.log

Al Viro Jan. 9, 2017, 8:05 p.m. UTC | #12
On Mon, Jan 09, 2017 at 08:39:31PM +0200, Tuomas Tynkkynen wrote:

> Yes, this does seem to be related to this or otherwise MAX_REQ related!
> - Bumping MAX_REQ up to 1024 makes the hang go away (on 4.7).
> - Dropping it to 64 makes the same hang happen on kernels where it worked
>   before (I tried 4.4.x).
> - Doing s/(MAX_REQ - 1)/MAX_REQ/ makes the hang go away.

Note that it's still possible to trigger the same situation with that
off-by-one taken care of; if the client sends 64 Treadlink and 64 Tflush
(one for each of those), then follows up with another pile of Treadlink
(feeding them in as soon as free slots appear), the server can end up with a
failing pdu_alloc() - completion of a readlink will release its slot
immediately, but the pdu won't get freed until the corresponding flush has
gotten the CPU.

I'm _not_ familiar with scheduling in qemu and I don't quite understand
the mechanism of getting from "handle_9p_output() bailed out with some
requests still not processed" to "no further requests get processed", so
it might be that for some reason triggering the former as described above
won't escalate to the latter, but I wouldn't count upon that.

Another thing I'm very certain about is that 9 0 0 0 108 1 0 1 0 sent by a
broken client (Tflush tag = 1 oldtag = 1) will do nasty things to the qemu
server.  v9fs_flush() will try to find the pdu of the request to cancel,
find its own argument, put itself on its ->complete and yield the CPU,
expecting to get back once the victim gets through to pdu_complete().  Since
the victim is itself...

AFAICS, the things a client can expect wrt Tflush handling are
	* in no case should Rflush be sent before the reply to the request
it's trying to cancel
	* the only case when the server _may_ not send Rflush is the arrival
of more than one Tflush with the same oldtag; in that case it is allowed
to suppress replies to earlier ones.  If they are not suppressed, replies
should come in the order of Tflush arrivals.
	* if a reply to Tflush is sent (see above), it must be Rflush.
	* multiple Tflush with the same oldtag are allowed; the Linux kernel
client does not issue those, but other clients might.  As a matter of
fact, the Plan 9 kernel client *does* issue those.
	* Tflush to Tflush is a no-op; it still needs a reply, and ordering
constraints apply (it can't be sent before the reply to the Tflush it's
referring to, which, in turn, can't be sent before the reply to the request
the first Tflush refers to).  Normally such requests are not sent, but
in principle they are allowed.
	* Tflush to a request that isn't being processed should be
answered immediately.  The same goes for Tflush referring to itself.
The former is not an error (we might have already sent a reply), but
the latter might be worth a loud warning - clients are definitely not
supposed to do that.  It still needs an Rflush in response - Rerror is not
allowed.

Greg Kurz Jan. 10, 2017, 8:16 a.m. UTC | #13
On Sun, 8 Jan 2017 05:46:39 +0000
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Sat, Jan 07, 2017 at 06:15:23PM +0000, Al Viro wrote:
> > On Sat, Jan 07, 2017 at 05:19:10PM +0000, Al Viro wrote:
> >   
> > > released) simply trigger virtio_queue_notify_vq() again?  It *is* a bug
> > > (if we get a burst filling a previously empty queue all at once, there won't
> > > be any slots becoming freed  
> > 
> > Umm...  Nope, even in that scenario we'll have all requests except the last
> > one processed.  I'm trying to put together a simpler reproducer, but...
> > no luck so far.  
> 
> FWIW, here's another fun bug in qemu 9pfs: client may issue multiple
> Tflush on the same pending request.  Server must not reply to them
> out of order and it must reply to at least the last one.  If client has
> sent more than one Tflush, it must treat an arrival of Rflush to one of
> those as if it has received an Rflush to each of the earlier Tflush as
> well (IOW, *their* tags are released).  Client may not reuse the tag of
> victim request until it has received Rflush for the last Tflush it has
> sent.
> 
> Linux kernel-side 9p client doesn't issue such multiple Tflush, but the
> protocol allows that.
> 
> qemu server does not guarantee in-order replies to multiple Tflush; worse,
> while it is likely to reply only to one of those, it may very well be
> the _first_ one.  Subsequent ones will be lost; moreover, it'll leak
> both coroutines and ring slots.
> 

Yeah, I'm aware of that - I had started to work on a fix but it's low
priority on my TODO list since linux guests don't do that... and I got
scheduled.

> AFAICS, a clean way to deal with that would be to handle Tflush synchronously,
> right in pdu_submit():
> 	if pdu->type == Tflush
> 		look the victim request up.
> 		if !victim || victim == pdu	// [*]
> 			send reply and free pdu immediately.
> 		if victim->type == Tflush	// [**]
> 			pdu->victim = victim->victim
> 		else
> 			pdu->victim = victim
> 			mark the victim cancelled
> 		add pdu to the tail of pdu->victim->flushes
> and let pdu_complete(pdu) send a reply to each request in pdu->flushes
> as well (freeing them as we go)
> 

I had kept the asynchronous way and it resulted in quite convoluted code.
I'll give it a try, as per your suggestion.

> Some locking might be needed to avoid the races between pdu_submit() and
> pdu_complete(), but it's not going to be a wide critical area.  I'm not
> familiar with qemu locking primitives, sorry; what's providing an exclusion
> between QLIST_INSERT_HEAD in pdu_alloc() (from handle_9p_output())
> and QLIST_REMOVE in pdu_free() (from pdu_complete())?  Looks like the same
> thing ought to suffice here...
> 

This code runs in coroutines (stack switching, see include/qemu/coroutine.h for
details): it is serialized.

> [*] a cute way to hurt the current qemu server, BTW - coroutine that waits for
> itself to complete...
> 

Indeed :)

> [**] Tflush to Tflush is another fun corner case - it does *not* do anything
> to the earlier Tflush, but reply to it must follow that to original Tflush
> to avoid tag reuse problems.  Note that this is not the "multiple Tflush"
> case mentioned above - those would all have oldtag pointing to the same
> request; they are not chained and unlike this one can happen with legitimate
> clients.  Tflush to Tflush, OTOH, is not something a sane client would do.

I agree, I don't see when a client would want to do that, but FWIW it is
mentioned in the last paragraph of http://man.cat-v.org/plan_9/5/flush :)

Thanks for your valuable suggestions, Al!

Cheers.

--
Greg


Patch

diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
index aea7e9d..8a6b426 100644
--- a/hw/9pfs/9p.c
+++ b/hw/9pfs/9p.c
@@ -662,6 +662,7 @@  static void coroutine_fn pdu_complete(V9fsPDU *pdu, ssize_t len)
 
     /* fill out the header */
     pdu_marshal(pdu, 0, "dbw", (int32_t)len, id, pdu->tag);
+    fprintf(stderr, "complete %d %04x\n", id, pdu->tag);
 
     /* keep these in sync */
     pdu->size = len;
@@ -2347,6 +2348,7 @@  static void v9fs_flush(void *opaque)
         return;
     }
     trace_v9fs_flush(pdu->tag, pdu->id, tag);
+    fprintf(stderr, "flush %04x %04x\n", tag, pdu->tag);
 
     QLIST_FOREACH(cancel_pdu, &s->active_list, next) {
         if (cancel_pdu->tag == tag) {
diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
index 1782e4a..6a5ac04 100644
--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -76,6 +76,7 @@  static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
 
         pdu->id = out.id;
         pdu->tag = le16_to_cpu(out.tag_le);
+        fprintf(stderr, "out %d %04x\n", pdu->id, pdu->tag);
 
         qemu_co_queue_init(&pdu->complete);
         pdu_submit(pdu);