linux-aio: keep processing events if MAX_EVENTS reached

Message ID CAJrWOzChMP7tTogWzN+i=kNqE5rR5mX9Tcqb13N15nwhQRiotg@mail.gmail.com (mailing list archive)
State New, archived

Commit Message

Roman Pen July 12, 2016, 2:12 p.m. UTC
On Tue, Jun 28, 2016 at 11:42 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Mon, Jun 27, 2016 at 08:36:19PM +0200, Roman Penyaev wrote:
>> On Jun 27, 2016 6:37 PM, "Stefan Hajnoczi" <stefanha@redhat.com> wrote:
>> >
>> > Commit ccb9dc10129954d0bcd7814298ed445e684d5a2a ("linux-aio: Cancel BH
>> > if not needed") exposed a side-effect of scheduling the BH for nested
>> > event loops.  When io_getevents(2) fetches MAX_EVENTS events then it's
>> > necessary to call it again.  Failure to do so can lead to hung I/O
>> > because the eventfd has already been cleared and the completion BH will
>> > not run again.
>> >
>> > This patch fixes the hang by calling io_getevents(2) again if the events
>> > array was totally full.
>> >
>> > Reported-by: Roman Penyaev <roman.penyaev@profitbricks.com>
>> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>> > ---
>> >  block/linux-aio.c | 5 +++++
>> >  1 file changed, 5 insertions(+)
>> >
>> > diff --git a/block/linux-aio.c b/block/linux-aio.c
>> > index e468960..af15f85 100644
>> > --- a/block/linux-aio.c
>> > +++ b/block/linux-aio.c
>> > @@ -117,6 +117,7 @@ static void qemu_laio_completion_bh(void *opaque)
>> >      LinuxAioState *s = opaque;
>> >
>> >      /* Fetch more completion events when empty */
>> > +more_events:
>> >      if (s->event_idx == s->event_max) {
>> >          do {
>> >              struct timespec ts = { 0 };
>> > @@ -146,6 +147,10 @@ static void qemu_laio_completion_bh(void *opaque)
>> >          qemu_laio_process_completion(laiocb);
>> >      }
>> >
>> > +    if (s->event_idx == MAX_EVENTS) {
>> > +        goto more_events; /* there might still be events waiting for us */
>> > +    }
>> > +
>>
>> I am on vacation and can't check anything, but it seems you could
>> also reproduce the issue, and it seems you are right: what I've
>> also noticed is that I/O hangs only if the total number of requests
>> in the guest is > 128 (as defined in linux-aio.c).
>>
>> But I do not understand why you have to repeat io_getevents in
>> case of a return value == MAX_EVENTS.  According to my
>> understanding you can't submit more than 128, so the first
>> io_getevents call returns exactly what you've submitted.
>
> Hi Roman,
> There is nothing like discussions on qemu-devel from the beach.  True
> vacation!
>
> The issue is:
>
> 1. s->events[] is only MAX_EVENTS large so each io_getevents() call can
>    only fetch that maximum number of events.
>
> 2. qemu_laio_completion_cb() clears the eventfd so the event loop will
>    not call us again (unless additional I/O requests complete).
>
> Therefore, returning from qemu_laio_completion_bh() without having
> processed all events causes a hang.
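
To make point 2 concrete, here is a minimal standalone eventfd demo (an
editor's sketch, not code from the thread; the scenario is an assumption
drawn from the explanation above): once the handler drains the counter,
no further wakeup arrives until a new completion writes to the eventfd,
so events still sitting in the completion ring would never be reaped.

    #include <stdio.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/eventfd.h>

    int main(void)
    {
        int efd = eventfd(0, EFD_NONBLOCK);
        uint64_t v = 1;

        write(efd, &v, sizeof(v));   /* an I/O completion signals the eventfd */
        read(efd, &v, sizeof(v));    /* the handler drains the counter */

        /* No new completion arrived, so there will be no further wakeup:
         * this read fails with EAGAIN.  Any events still sitting in the
         * completion ring at this point would never be reaped. */
        if (read(efd, &v, sizeof(v)) < 0) {
            perror("second read");
        }
        close(efd);
        return 0;
    }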

Hi Stefan,

The issue is clear now.  The thing is that I had assumed that we
never submit more than MAX_EVENTS.  Now I see from ioq_submit() that
this is not true, so we can have more than MAX_EVENTS requests in flight.

Frankly, that seems like a bug to me, because when we call
io_setup(MAX_EVENTS) we promise the kernel not to submit more than the
specified number of requests.  The kernel allocates the ring buffer
aligned to the page size and also compensates to have enough free
requests for each CPU, so the final number of io events *can be*
greater than requested.  So eventually the kernel can "swallow" more
than MAX_EVENTS events.
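
As a minimal illustration of that contract (an editor's sketch using
libaio, not part of the patch; build with -laio): io_setup(128, ...)
only guarantees room for 128 in-flight requests, and anything beyond
that relies on the kernel's ring over-allocation.

    #include <stdio.h>
    #include <libaio.h>

    int main(void)
    {
        io_context_t ctx = 0;
        int ret = io_setup(128, &ctx);  /* promise: at most 128 in-flight iocbs */

        if (ret < 0) {
            fprintf(stderr, "io_setup failed: %d\n", ret);
            return 1;
        }
        /* Submitting more than 128 requests without reaping completions
         * relies on the page-aligned ring and per-CPU slack described
         * above; the API itself guarantees nothing beyond 128. */
        io_destroy(ctx);
        return 0;
    }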

I did the following patch (at the bottom of this email), which restricts
submissions to no more than MAX_EVENTS, and the interesting thing is
that it works faster than your current fix for this hang:

my setup: 1 iothread, VCPU=8, MQ=8

your current patch
"linux-aio: keep processing events if MAX_EVENTS reached":

   READ: io=48199MB, aggrb=1606.5MB/s, minb=1606.5MB/s, maxb=1606.5MB/s, mint=30003msec, maxt=30003msec
  WRITE: io=48056MB, aggrb=1601.8MB/s, minb=1601.8MB/s, maxb=1601.8MB/s, mint=30003msec, maxt=30003msec

my changes:

   READ: io=53294MB, aggrb=1776.3MB/s, minb=1776.3MB/s, maxb=1776.3MB/s, mint=30003msec, maxt=30003msec
  WRITE: io=53177MB, aggrb=1772.4MB/s, minb=1772.4MB/s, maxb=1772.4MB/s, mint=30003msec, maxt=30003msec

But the most important thing here is that reverting
"linux-aio: Cancel BH if not needed" brings these numbers:

   READ: io=56362MB, aggrb=1878.4MB/s, minb=1878.4MB/s, maxb=1878.4MB/s, mint=30007msec, maxt=30007msec
  WRITE: io=56255MB, aggrb=1874.8MB/s, minb=1874.8MB/s, maxb=1874.8MB/s, mint=30007msec, maxt=30007msec

So, it seems to me that "linux-aio: Cancel BH if not needed" introduces
a regression.

--
Roman

---
 block/linux-aio.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

Comments

Paolo Bonzini July 12, 2016, 2:59 p.m. UTC | #1
On 12/07/2016 16:12, Roman Penyaev wrote:
> But the most important thing here is that reverting
> "linux-aio: Cancel BH if not needed" brings these numbers:
> 
>    READ: io=56362MB, aggrb=1878.4MB/s, minb=1878.4MB/s, maxb=1878.4MB/s, mint=30007msec, maxt=30007msec
>   WRITE: io=56255MB, aggrb=1874.8MB/s, minb=1874.8MB/s, maxb=1874.8MB/s, mint=30007msec, maxt=30007msec
> 
> So, it seems to me that "linux-aio: Cancel BH if not needed" introduces
> a regression.

It's possible that this depends on the workload.  But thanks for
measuring it.  It's not a small effect (~5%).  We should indeed
consider reverting the patch.

> -    unsigned int n;
> +    unsigned int inqueue;
> +    unsigned int inflight;

Please use in_queue (or queue_length) and in_flight.

> @@ -203,9 +206,12 @@ static void ioq_submit(LinuxAioState *s)
> 
>      do {
>          len = 0;
> +        if (s->io_q.inflight >= MAX_EVENTS)
> +            break;
>          QSIMPLEQ_FOREACH(aiocb, &s->io_q.pending, next) {
>              iocbs[len++] = &aiocb->iocb;
> -            if (len == MAX_QUEUED_IO) {
> +            if (len == MAX_QUEUED_IO ||
> +                (s->io_q.inflight + len) >= MAX_EVENTS) {

Interesting!

I suggest an additional cleanup here, which is to replace MAX_QUEUED_IO
with MAX_EVENTS everywhere and remove the "len == MAX_QUEUED_IO" part.

Also, please remove the parentheses on the left of >=.

>      if (!s->io_q.blocked &&
> -        (!s->io_q.plugged || s->io_q.n >= MAX_QUEUED_IO)) {
> +        (!s->io_q.plugged || s->io_q.inqueue >= MAX_QUEUED_IO)) {

This should now be

    s->io_q.inflight + s->io_q.inqueue >= MAX_EVENTS

(the logic is: if we can do a "full" io_submit, do it).

Thanks!

Paolo
Roman Pen July 12, 2016, 4:34 p.m. UTC | #2
On Tue, Jul 12, 2016 at 4:59 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 12/07/2016 16:12, Roman Penyaev wrote:
>> But the most important thing here is that reverting
>> "linux-aio: Cancel BH if not needed" brings these numbers:
>> [...]
>>
>> So, it seems to me that "linux-aio: Cancel BH if not needed" introduces
>> a regression.
>
> It's possible that this depends on the workload.  But thanks for
> measuring it.  It's not a small effect (~5%).  We should indeed
> consider reverting the patch.

Without any doubt it depends.  But I can't say that my fio config
is anything special; take a look:

[global]
bssplit=512/20:1k/16:2k/9:4k/12:8k/19:16k/10:32k/8:64k/4
fadvise_hint=0
rw=randrw:2
direct=1

ioengine=libaio
iodepth=64
iodepth_batch_submit=64
iodepth_batch_complete=64
numjobs=8
gtod_reduce=1
group_reporting=1

time_based=1
runtime=30

[job]
filename=/dev/vda
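
A note on running it (an editor's assumption, not stated in the
thread): with the job file above saved as, say, job.fio, it would be
launched inside the guest as

    fio job.fio

exercising the virtio disk /dev/vda named in the [job] section.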

>
>> -    unsigned int n;
>> +    unsigned int inqueue;
>> +    unsigned int inflight;
>
> Please use in_queue (or queue_length) and in_flight.

Yeah, but that was just a quick exercise.
If Stefan or you want this as a proper patch, I will resend it
with all tweaks made.

>
>> @@ -203,9 +206,12 @@ static void ioq_submit(LinuxAioState *s)
>>
>>      do {
>>          len = 0;
>> +        if (s->io_q.inflight >= MAX_EVENTS)
>> +            break;
>>          QSIMPLEQ_FOREACH(aiocb, &s->io_q.pending, next) {
>>              iocbs[len++] = &aiocb->iocb;
>> -            if (len == MAX_QUEUED_IO) {
>> +            if (len == MAX_QUEUED_IO ||
>> +                (s->io_q.inflight + len) >= MAX_EVENTS) {
>
> Interesting!
>
> I suggest an additional cleanup here, which is to replace MAX_QUEUED_IO
> with MAX_EVENTS everywhere and remove the "len == MAX_QUEUED_IO" part.

Yes, that makes sense.

>
> Also, please remove the parentheses on the left of >=.

Ok.

>
>>      if (!s->io_q.blocked &&
>> -        (!s->io_q.plugged || s->io_q.n >= MAX_QUEUED_IO)) {
>> +        (!s->io_q.plugged || s->io_q.inqueue >= MAX_QUEUED_IO)) {
>
> This should now be
>
>     s->io_q.inflight + s->io_q.inqueue >= MAX_EVENTS
>
> (the logic is: if we can do a "full" io_submit, do it).

Ok.

--
Roman
Paolo Bonzini July 12, 2016, 4:58 p.m. UTC | #3
On 12/07/2016 18:34, Roman Penyaev wrote:
> On Tue, Jul 12, 2016 at 4:59 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 12/07/2016 16:12, Roman Penyaev wrote:
>>> [...]
>>
>> It's possible that this depends on the workload.  But thanks for
>> measuring it.  It's not a small effect (~5%).  We should indeed
>> consider reverting the patch.
> 
> Without any doubt it depends.  But I can't say that my fio config
> is anything special

Yup, I wasn't expecting anything special.

>>> -    unsigned int n;
>>> +    unsigned int inqueue;
>>> +    unsigned int inflight;
>>
>> Please use in_queue (or queue_length) and in_flight.
> 
> Yeah, but that was just a quick exercise.
> If Stefan or you want this as a proper patch, I will resend it
> with all tweaks made.

Yes, please do send it.

Paolo
Stefan Hajnoczi July 15, 2016, 2:18 p.m. UTC | #4
On Tue, Jul 12, 2016 at 04:12:42PM +0200, Roman Penyaev wrote:
> [...]
>
> Frankly, that seems like a bug to me, because when we call
> io_setup(MAX_EVENTS) we promise the kernel not to submit more than the
> specified number of requests.  The kernel allocates the ring buffer
> aligned to the page size and also compensates to have enough free
> requests for each CPU, so the final number of io events *can be*
> greater than requested.  So eventually the kernel can "swallow" more
> than MAX_EVENTS events.

You are right.  QEMU shouldn't exceed MAX_EVENTS in-flight requests,
even though the kernel overallocates resources so we don't see an error
(yet).

Your patch makes linux-aio.c more correct overall, so let's take that.

> I did the following patch (at the bottom of this email), which restricts
> submissions to no more than MAX_EVENTS, and the interesting thing is
> that it works faster than your current fix for this hang:
> [...]
>
> But the most important thing here is that reverting
> "linux-aio: Cancel BH if not needed" brings these numbers:
> [...]
>
> So, it seems to me that "linux-aio: Cancel BH if not needed" introduces
> a regression.

Unfortunately the patch does two things, so it would be best to identify
which one causes the performance regression.

Is it #1?

@@ -149,6 +149,8 @@ static void qemu_laio_completion_bh(void *opaque)
     if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
         ioq_submit(s);
     }
+
+    qemu_bh_cancel(s->completion_bh);
 }
 
 static void qemu_laio_completion_cb(EventNotifier *e)

That seems a little unlikely since qemu_bh_cancel() is very
light-weight.

Or #2?

@@ -156,7 +158,7 @@ static void qemu_laio_completion_cb(EventNotifier *e)
     LinuxAioState *s = container_of(e, LinuxAioState, e);
 
     if (event_notifier_test_and_clear(&s->e)) {
-        qemu_bh_schedule(s->completion_bh);
+        qemu_laio_completion_bh(s);
     }
 }

If it's #2 then there may be an fd handler/BH ordering issue that we
should investigate.  In other words, the problem isn't that calling
qemu_laio_completion_bh() directly is "slower" but that calling it here
reduces the number of completion events that io_getevents() can
harvest.  This is a form of batching but it could also happen if other
fd handlers/BHs are submitting I/O.
Roman Pen July 15, 2016, 2:42 p.m. UTC | #5
On Fri, Jul 15, 2016 at 4:18 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> [...]
>
> Unfortunately the patch does two things, so it would be best to identify
> which one causes the performance regression.
>
> Is it #1?
>
> @@ -149,6 +149,8 @@ static void qemu_laio_completion_bh(void *opaque)
>      if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
>          ioq_submit(s);
>      }
> +
> +    qemu_bh_cancel(s->completion_bh);
>  }
>
>  static void qemu_laio_completion_cb(EventNotifier *e)
>
> That seems a little unlikely since qemu_bh_cancel() is very
> light-weight.

Yes, it is the BH cancellation.  qemu_laio_completion_bh() should be
called more often to reduce latency and to give fio a chance to submit
more "fresh" requests.  I recently sent a description of the issue here:

http://lists.nongnu.org/archive/html/qemu-devel/2016-07/msg03375.html

You should also be in Cc.

>
> Or #2?
>
> @@ -156,7 +158,7 @@ static void qemu_laio_completion_cb(EventNotifier *e)
>      LinuxAioState *s = container_of(e, LinuxAioState, e);
>
>      if (event_notifier_test_and_clear(&s->e)) {
> -        qemu_bh_schedule(s->completion_bh);
> +        qemu_laio_completion_bh(s);
>      }
>  }
>
> If it's #2 then there may be an fd handler/BH ordering issue that we
> should investigate.

Not #2.  This change does not impact performance; at least I have not
noticed anything.

> In other words, the problem isn't that calling
> qemu_laio_completion_bh() directly is "slower" but that calling it here
> reduces the number of completion events that io_getevents() can
> harvest.

No, it is not slower, and its cost is almost zero, especially when we
can harvest completions from userspace.

The point is that we have to consume completed events more often.  On
fast devices (like nullb, which completes everything almost immediately)
we have to peek at completions ASAP.

>  This is a form of batching but it could also happen if other
> fd handlers/BHs are submitting I/O.

Please take a look at the link I gave above.

--
Roman

Patch

diff --git a/block/linux-aio.c b/block/linux-aio.c
index e468960..dd768cc 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -44,7 +44,8 @@  struct qemu_laiocb {

 typedef struct {
     int plugged;
-    unsigned int n;
+    unsigned int inqueue;
+    unsigned int inflight;
     bool blocked;
     QSIMPLEQ_HEAD(, qemu_laiocb) pending;
 } LaioQueue;
@@ -129,6 +130,7 @@  static void qemu_laio_completion_bh(void *opaque)
             s->event_max = 0;
             return; /* no more events */
         }
+        s->io_q.inflight -= s->event_max;
     }

     /* Reschedule so nested event loops see currently pending completions */
@@ -190,7 +192,8 @@  static void ioq_init(LaioQueue *io_q)
 {
     QSIMPLEQ_INIT(&io_q->pending);
     io_q->plugged = 0;
-    io_q->n = 0;
+    io_q->inqueue = 0;
+    io_q->inflight = 0;
     io_q->blocked = false;
 }

@@ -203,9 +206,12 @@  static void ioq_submit(LinuxAioState *s)

     do {
         len = 0;
+        if (s->io_q.inflight >= MAX_EVENTS)
+            break;
         QSIMPLEQ_FOREACH(aiocb, &s->io_q.pending, next) {
             iocbs[len++] = &aiocb->iocb;
-            if (len == MAX_QUEUED_IO) {
+            if (len == MAX_QUEUED_IO ||
+                (s->io_q.inflight + len) >= MAX_EVENTS) {
                 break;
             }
         }
@@ -218,11 +224,12 @@  static void ioq_submit(LinuxAioState *s)
             abort();
         }

-        s->io_q.n -= ret;
+        s->io_q.inflight += ret;
+        s->io_q.inqueue  -= ret;
         aiocb = container_of(iocbs[ret - 1], struct qemu_laiocb, iocb);
         QSIMPLEQ_SPLIT_AFTER(&s->io_q.pending, aiocb, next, &completed);
     } while (ret == len && !QSIMPLEQ_EMPTY(&s->io_q.pending));
-    s->io_q.blocked = (s->io_q.n > 0);
+    s->io_q.blocked = (s->io_q.inqueue > 0);
 }

 void laio_io_plug(BlockDriverState *bs, LinuxAioState *s)
@@ -263,9 +270,9 @@  static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
     io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));

     QSIMPLEQ_INSERT_TAIL(&s->io_q.pending, laiocb, next);
-    s->io_q.n++;
+    s->io_q.inqueue++;
     if (!s->io_q.blocked &&
-        (!s->io_q.plugged || s->io_q.n >= MAX_QUEUED_IO)) {
+        (!s->io_q.plugged || s->io_q.inqueue >= MAX_QUEUED_IO)) {
         ioq_submit(s);
     }