diff mbox series

[4/4] dm: support I/O polling

Message ID 20210302190555.201228400@debian-a64.vm (mailing list archive)
State Superseded, archived
Delegated to: Mike Snitzer
Headers show
Series Reworked device mapper polling patchset | expand

Commit Message

Mikulas Patocka March 2, 2021, 7:05 p.m. UTC
Support I/O polling if submit_bio_noacct_mq_direct returned non-empty
cookie.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm.c |    5 +++++
 1 file changed, 5 insertions(+)


--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

Comments

Jingbo Xu March 3, 2021, 2:53 a.m. UTC | #1
On 3/3/21 3:05 AM, Mikulas Patocka wrote:

> Support I/O polling if submit_bio_noacct_mq_direct returned non-empty
> cookie.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> 
> ---
>  drivers/md/dm.c |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> Index: linux-2.6/drivers/md/dm.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm.c	2021-03-02 19:26:34.000000000 +0100
> +++ linux-2.6/drivers/md/dm.c	2021-03-02 19:26:34.000000000 +0100
> @@ -1682,6 +1682,11 @@ static void __split_and_process_bio(stru
>  		}
>  	}
>  
> +	if (ci.poll_cookie != BLK_QC_T_NONE) {
> +		while (atomic_read(&ci.io->io_count) > 1 &&
> +		       blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
> +	}
> +
>  	/* drop the extra reference count */
>  	dec_pending(ci.io, errno_to_blk_status(error));
>  }

It seems that the general idea of your design is to
1) submit *one* split bio
2) blk_poll(), waiting the previously submitted split bio complets

and then submit next split bio, repeating the above process. I'm afraid
the performance may be an issue here, since the batch every time
blk_poll() reaps may decrease.


Besides, the submitting routine and polling routine is bound together
here, i.e., polling is always synchronous.
Mikulas Patocka March 3, 2021, 10:09 a.m. UTC | #2
On Wed, 3 Mar 2021, JeffleXu wrote:

> 
> 
> On 3/3/21 3:05 AM, Mikulas Patocka wrote:
> 
> > Support I/O polling if submit_bio_noacct_mq_direct returned non-empty
> > cookie.
> > 
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > 
> > ---
> >  drivers/md/dm.c |    5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > Index: linux-2.6/drivers/md/dm.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/md/dm.c	2021-03-02 19:26:34.000000000 +0100
> > +++ linux-2.6/drivers/md/dm.c	2021-03-02 19:26:34.000000000 +0100
> > @@ -1682,6 +1682,11 @@ static void __split_and_process_bio(stru
> >  		}
> >  	}
> >  
> > +	if (ci.poll_cookie != BLK_QC_T_NONE) {
> > +		while (atomic_read(&ci.io->io_count) > 1 &&
> > +		       blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
> > +	}
> > +
> >  	/* drop the extra reference count */
> >  	dec_pending(ci.io, errno_to_blk_status(error));
> >  }
> 
> It seems that the general idea of your design is to
> 1) submit *one* split bio
> 2) blk_poll(), waiting the previously submitted split bio complets

No, I submit all the bios and poll for the last one.

> and then submit next split bio, repeating the above process. I'm afraid
> the performance may be an issue here, since the batch every time
> blk_poll() reaps may decrease.

Could you benchmark it?

> Besides, the submitting routine and polling routine is bound together
> here, i.e., polling is always synchronous.

__split_and_process_bio calls __split_and_process_non_flush in a loop and 
__split_and_process_non_flush records the poll cookie in ci.poll_cookie. 
When we processed all the bios, we poll for the last cookie here:

        if (ci.poll_cookie != BLK_QC_T_NONE) {
                while (atomic_read(&ci.io->io_count) > 1 &&
                       blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
        }


Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Jingbo Xu March 4, 2021, 2:57 a.m. UTC | #3
On 3/3/21 6:09 PM, Mikulas Patocka wrote:
> 
> 
> On Wed, 3 Mar 2021, JeffleXu wrote:
> 
>>
>>
>> On 3/3/21 3:05 AM, Mikulas Patocka wrote:
>>
>>> Support I/O polling if submit_bio_noacct_mq_direct returned non-empty
>>> cookie.
>>>
>>> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>>>
>>> ---
>>>  drivers/md/dm.c |    5 +++++
>>>  1 file changed, 5 insertions(+)
>>>
>>> Index: linux-2.6/drivers/md/dm.c
>>> ===================================================================
>>> --- linux-2.6.orig/drivers/md/dm.c	2021-03-02 19:26:34.000000000 +0100
>>> +++ linux-2.6/drivers/md/dm.c	2021-03-02 19:26:34.000000000 +0100
>>> @@ -1682,6 +1682,11 @@ static void __split_and_process_bio(stru
>>>  		}
>>>  	}
>>>  
>>> +	if (ci.poll_cookie != BLK_QC_T_NONE) {
>>> +		while (atomic_read(&ci.io->io_count) > 1 &&
>>> +		       blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
>>> +	}
>>> +
>>>  	/* drop the extra reference count */
>>>  	dec_pending(ci.io, errno_to_blk_status(error));
>>>  }
>>
>> It seems that the general idea of your design is to
>> 1) submit *one* split bio
>> 2) blk_poll(), waiting the previously submitted split bio complets
> 
> No, I submit all the bios and poll for the last one.
> 
>> and then submit next split bio, repeating the above process. I'm afraid
>> the performance may be an issue here, since the batch every time
>> blk_poll() reaps may decrease.
> 
> Could you benchmark it?
> 

I will once I finished some other issues.


>> Besides, the submitting routine and polling routine is bound together
>> here, i.e., polling is always synchronous.
> 
> __split_and_process_bio calls __split_and_process_non_flush in a loop

I also noticed that you sent this patch.
https://patchwork.kernel.org/project/dm-devel/patch/alpine.LRH.2.02.2103010457510.631@file01.intranet.prod.int.rdu2.redhat.com/

I agree with you that this while() loop here is unnecessary. And thus
there's no loop calling __split_and_process_non_flush() in
__split_and_process_bio().


> __split_and_process_non_flush records the poll cookie in ci.poll_cookie. 
> When we processed all the bios, we poll for the last cookie here:
> 
>         if (ci.poll_cookie != BLK_QC_T_NONE) {
>                 while (atomic_read(&ci.io->io_count) > 1 &&
>                        blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
>         }

So what will happen if one bio submitted to dm device crosses the device
boundary among several target devices (e.g., dm-stripe)? Please refer
the following call graph.

```
submit_bio
  __submit_bio_noacct
    disk->fops->submit_bio(), calling into __split_and_process_bio(),
call __split_and_process_non_flush() once, submitting the *first* split bio
    disk->fops->submit_bio(), calling into __split_and_process_bio(),
call __split_and_process_non_flush() once, submitting the *second* split bio
    ...
```


So the loop is in __submit_bio_noacct(), rather than
__split_and_process_bio(). Your design will send the first split bio,
and then poll on this split bio, then send the next split bio, polling
on this, go on and on...
Mikulas Patocka March 4, 2021, 10:09 a.m. UTC | #4
On Thu, 4 Mar 2021, JeffleXu wrote:

> > __split_and_process_non_flush records the poll cookie in ci.poll_cookie. 
> > When we processed all the bios, we poll for the last cookie here:
> > 
> >         if (ci.poll_cookie != BLK_QC_T_NONE) {
> >                 while (atomic_read(&ci.io->io_count) > 1 &&
> >                        blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
> >         }
> 
> So what will happen if one bio submitted to dm device crosses the device
> boundary among several target devices (e.g., dm-stripe)? Please refer
> the following call graph.
> 
> ```
> submit_bio
>   __submit_bio_noacct
>     disk->fops->submit_bio(), calling into __split_and_process_bio(),
> call __split_and_process_non_flush() once, submitting the *first* split bio
>     disk->fops->submit_bio(), calling into __split_and_process_bio(),
> call __split_and_process_non_flush() once, submitting the *second* split bio
>     ...
> ```
> 
> 
> So the loop is in __submit_bio_noacct(), rather than
> __split_and_process_bio(). Your design will send the first split bio,
> and then poll on this split bio, then send the next split bio, polling
> on this, go on and on...

No. It will send all the bios and poll for the last one.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Jeff Moyer March 4, 2021, 3:01 p.m. UTC | #5
Hi, Mikulas,

Mikulas Patocka <mpatocka@redhat.com> writes:

> On Wed, 3 Mar 2021, JeffleXu wrote:
>
>> 
>> 
>> On 3/3/21 3:05 AM, Mikulas Patocka wrote:
>> 
>> > Support I/O polling if submit_bio_noacct_mq_direct returned non-empty
>> > cookie.
>> > 
>> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>> > 
>> > ---
>> >  drivers/md/dm.c |    5 +++++
>> >  1 file changed, 5 insertions(+)
>> > 
>> > Index: linux-2.6/drivers/md/dm.c
>> > ===================================================================
>> > --- linux-2.6.orig/drivers/md/dm.c	2021-03-02 19:26:34.000000000 +0100
>> > +++ linux-2.6/drivers/md/dm.c	2021-03-02 19:26:34.000000000 +0100
>> > @@ -1682,6 +1682,11 @@ static void __split_and_process_bio(stru
>> >  		}
>> >  	}
>> >  
>> > +	if (ci.poll_cookie != BLK_QC_T_NONE) {
>> > +		while (atomic_read(&ci.io->io_count) > 1 &&
>> > +		       blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
>> > +	}
>> > +
>> >  	/* drop the extra reference count */
>> >  	dec_pending(ci.io, errno_to_blk_status(error));
>> >  }
>> 
>> It seems that the general idea of your design is to
>> 1) submit *one* split bio
>> 2) blk_poll(), waiting the previously submitted split bio complets
>
> No, I submit all the bios and poll for the last one.

What happens if the last bio completes first?  It looks like you will
call blk_poll with a cookie that already completed, and I'm pretty sure
that's invalid.

Thanks,
Jeff

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Mike Snitzer March 4, 2021, 3:11 p.m. UTC | #6
On Thu, Mar 04 2021 at 10:01am -0500,
Jeff Moyer <jmoyer@redhat.com> wrote:

> Hi, Mikulas,
> 
> Mikulas Patocka <mpatocka@redhat.com> writes:
> 
> > On Wed, 3 Mar 2021, JeffleXu wrote:
> >
> >> 
> >> 
> >> On 3/3/21 3:05 AM, Mikulas Patocka wrote:
> >> 
> >> > Support I/O polling if submit_bio_noacct_mq_direct returned non-empty
> >> > cookie.
> >> > 
> >> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> >> > 
> >> > ---
> >> >  drivers/md/dm.c |    5 +++++
> >> >  1 file changed, 5 insertions(+)
> >> > 
> >> > Index: linux-2.6/drivers/md/dm.c
> >> > ===================================================================
> >> > --- linux-2.6.orig/drivers/md/dm.c	2021-03-02 19:26:34.000000000 +0100
> >> > +++ linux-2.6/drivers/md/dm.c	2021-03-02 19:26:34.000000000 +0100
> >> > @@ -1682,6 +1682,11 @@ static void __split_and_process_bio(stru
> >> >  		}
> >> >  	}
> >> >  
> >> > +	if (ci.poll_cookie != BLK_QC_T_NONE) {
> >> > +		while (atomic_read(&ci.io->io_count) > 1 &&
> >> > +		       blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
> >> > +	}
> >> > +
> >> >  	/* drop the extra reference count */
> >> >  	dec_pending(ci.io, errno_to_blk_status(error));
> >> >  }
> >> 
> >> It seems that the general idea of your design is to
> >> 1) submit *one* split bio
> >> 2) blk_poll(), waiting the previously submitted split bio complets
> >
> > No, I submit all the bios and poll for the last one.
> 
> What happens if the last bio completes first?  It looks like you will
> call blk_poll with a cookie that already completed, and I'm pretty sure
> that's invalid.

In addition, I'm concerned this approach to have DM internalize IO
polling is a non-starter.

I just don't think this approach adheres to the io_uring + IO polling
interface.. it never returns a cookie to upper layers... so there is
really no opportunity for standard io_uring + IO polling interface to
work is there?

But Heinz and Mikulas are about to kick off some fio io_uring + hipri=1
(io polling) testing of Jeffle's latest v5 patchset:
https://patchwork.kernel.org/project/dm-devel/list/?series=442075

compared to Mikulas' patchset:
https://patchwork.kernel.org/project/dm-devel/list/?series=440719

We should have definitive answers soon enough, just using Jeffle's fio
config (with hipri=1 for IO polling) that was documented here:
https://listman.redhat.com/archives/dm-devel/2020-October/msg00129.html

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Mikulas Patocka March 4, 2021, 3:12 p.m. UTC | #7
On Thu, 4 Mar 2021, Jeff Moyer wrote:

> Hi, Mikulas,
> 
> Mikulas Patocka <mpatocka@redhat.com> writes:
> 
> > On Wed, 3 Mar 2021, JeffleXu wrote:
> >
> >> 
> >> 
> >> On 3/3/21 3:05 AM, Mikulas Patocka wrote:
> >> 
> >> > Support I/O polling if submit_bio_noacct_mq_direct returned non-empty
> >> > cookie.
> >> > 
> >> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> >> > 
> >> > ---
> >> >  drivers/md/dm.c |    5 +++++
> >> >  1 file changed, 5 insertions(+)
> >> > 
> >> > Index: linux-2.6/drivers/md/dm.c
> >> > ===================================================================
> >> > --- linux-2.6.orig/drivers/md/dm.c	2021-03-02 19:26:34.000000000 +0100
> >> > +++ linux-2.6/drivers/md/dm.c	2021-03-02 19:26:34.000000000 +0100
> >> > @@ -1682,6 +1682,11 @@ static void __split_and_process_bio(stru
> >> >  		}
> >> >  	}
> >> >  
> >> > +	if (ci.poll_cookie != BLK_QC_T_NONE) {
> >> > +		while (atomic_read(&ci.io->io_count) > 1 &&
> >> > +		       blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
> >> > +	}
> >> > +
> >> >  	/* drop the extra reference count */
> >> >  	dec_pending(ci.io, errno_to_blk_status(error));
> >> >  }
> >> 
> >> It seems that the general idea of your design is to
> >> 1) submit *one* split bio
> >> 2) blk_poll(), waiting the previously submitted split bio complets
> >
> > No, I submit all the bios and poll for the last one.
> 
> What happens if the last bio completes first?  It looks like you will
> call blk_poll with a cookie that already completed, and I'm pretty sure
> that's invalid.
> 
> Thanks,
> Jeff

If the last bio completes first, the other bios will use interrupt-driven 
endio.

Calling blk_poll with already completed cookie is IMHO legal - it happens 
with the current direct-io code too. The direct-io code will check for bio 
completion and if the bio is not completed, call blk_poll. The bio may be 
completed from an interrupt after the check and before blk_poll - in this 
case, we will call blk_poll with already completed cookie.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Jingbo Xu March 5, 2021, 9:52 a.m. UTC | #8
On 3/3/21 6:09 PM, Mikulas Patocka wrote:
> 
> 
> On Wed, 3 Mar 2021, JeffleXu wrote:
> 
>>
>>
>> On 3/3/21 3:05 AM, Mikulas Patocka wrote:
>>
>>> Support I/O polling if submit_bio_noacct_mq_direct returned non-empty
>>> cookie.
>>>
>>> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>>>
>>> ---
>>>  drivers/md/dm.c |    5 +++++
>>>  1 file changed, 5 insertions(+)
>>>
>>> Index: linux-2.6/drivers/md/dm.c
>>> ===================================================================
>>> --- linux-2.6.orig/drivers/md/dm.c	2021-03-02 19:26:34.000000000 +0100
>>> +++ linux-2.6/drivers/md/dm.c	2021-03-02 19:26:34.000000000 +0100
>>> @@ -1682,6 +1682,11 @@ static void __split_and_process_bio(stru
>>>  		}
>>>  	}
>>>  
>>> +	if (ci.poll_cookie != BLK_QC_T_NONE) {
>>> +		while (atomic_read(&ci.io->io_count) > 1 &&
>>> +		       blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
>>> +	}
>>> +
>>>  	/* drop the extra reference count */
>>>  	dec_pending(ci.io, errno_to_blk_status(error));
>>>  }
>>
>> It seems that the general idea of your design is to
>> 1) submit *one* split bio
>> 2) blk_poll(), waiting the previously submitted split bio complets
> 
> No, I submit all the bios and poll for the last one.
> 
>> and then submit next split bio, repeating the above process. I'm afraid
>> the performance may be an issue here, since the batch every time
>> blk_poll() reaps may decrease.
> 
> Could you benchmark it?

I only tested dm-linear.

The configuration (dm table) of dm-linear is:
0 1048576 linear /dev/nvme0n1 0
1048576 1048576 linear /dev/nvme2n1 0
2097152 1048576 linear /dev/nvme5n1 0


fio script used is:
```
$cat fio.conf
[global]
name=iouring-sqpoll-iopoll-1
ioengine=io_uring
iodepth=128
numjobs=1
thread
rw=randread
direct=1
registerfiles=1
hipri=1
runtime=10
time_based
group_reporting
randrepeat=0
filename=/dev/mapper/testdev
bs=4k

[job-1]
cpus_allowed=14
```

IOPS (IRQ mode) | IOPS (iopoll mode (hipri=1))
--------------- | --------------------
           213k |		   19k

At least, it doesn't work well with io_uring interface.
Heinz Mauelshagen March 5, 2021, 5:46 p.m. UTC | #9
On 3/5/21 10:52 AM, JeffleXu wrote:
>
> On 3/3/21 6:09 PM, Mikulas Patocka wrote:
>>
>> On Wed, 3 Mar 2021, JeffleXu wrote:
>>
>>>
>>> On 3/3/21 3:05 AM, Mikulas Patocka wrote:
>>>
>>>> Support I/O polling if submit_bio_noacct_mq_direct returned non-empty
>>>> cookie.
>>>>
>>>> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>>>>
>>>> ---
>>>>   drivers/md/dm.c |    5 +++++
>>>>   1 file changed, 5 insertions(+)
>>>>
>>>> Index: linux-2.6/drivers/md/dm.c
>>>> ===================================================================
>>>> --- linux-2.6.orig/drivers/md/dm.c	2021-03-02 19:26:34.000000000 +0100
>>>> +++ linux-2.6/drivers/md/dm.c	2021-03-02 19:26:34.000000000 +0100
>>>> @@ -1682,6 +1682,11 @@ static void __split_and_process_bio(stru
>>>>   		}
>>>>   	}
>>>>   
>>>> +	if (ci.poll_cookie != BLK_QC_T_NONE) {
>>>> +		while (atomic_read(&ci.io->io_count) > 1 &&
>>>> +		       blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
>>>> +	}
>>>> +
>>>>   	/* drop the extra reference count */
>>>>   	dec_pending(ci.io, errno_to_blk_status(error));
>>>>   }
>>> It seems that the general idea of your design is to
>>> 1) submit *one* split bio
>>> 2) blk_poll(), waiting the previously submitted split bio complets
>> No, I submit all the bios and poll for the last one.
>>
>>> and then submit next split bio, repeating the above process. I'm afraid
>>> the performance may be an issue here, since the batch every time
>>> blk_poll() reaps may decrease.
>> Could you benchmark it?
> I only tested dm-linear.
>
> The configuration (dm table) of dm-linear is:
> 0 1048576 linear /dev/nvme0n1 0
> 1048576 1048576 linear /dev/nvme2n1 0
> 2097152 1048576 linear /dev/nvme5n1 0
>
>
> fio script used is:
> ```
> $cat fio.conf
> [global]
> name=iouring-sqpoll-iopoll-1
> ioengine=io_uring
> iodepth=128
> numjobs=1
> thread
> rw=randread
> direct=1
> registerfiles=1
> hipri=1
> runtime=10
> time_based
> group_reporting
> randrepeat=0
> filename=/dev/mapper/testdev
> bs=4k
>
> [job-1]
> cpus_allowed=14
> ```
>
> IOPS (IRQ mode) | IOPS (iopoll mode (hipri=1))
> --------------- | --------------------
>             213k |		   19k
>
> At least, it doesn't work well with io_uring interface.
>
>


Jeffe,

I ran your above fio test on a linear LV split across 3 NVMes to second your split mapping
(system: 32 core Intel, 256GiB RAM) comparing io engines sync, libaio and io_uring,
the latter w/ and w/o hipri (sync+libaio obviously w/o registerfiles and hipri) which resulted ok:



sync  |  libaio  |  IRQ mode (hipri=0) | iopoll (hipri=1) 
------|----------|---------------------|----------------- 56.3K |    
290K  |                329K |             351K I can't second your 
drastic hipri=1 drop here...
Heinz

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Heinz Mauelshagen March 5, 2021, 5:56 p.m. UTC | #10
On 3/5/21 6:46 PM, Heinz Mauelshagen wrote:
> On 3/5/21 10:52 AM, JeffleXu wrote:
>>
>> On 3/3/21 6:09 PM, Mikulas Patocka wrote:
>>>
>>> On Wed, 3 Mar 2021, JeffleXu wrote:
>>>
>>>>
>>>> On 3/3/21 3:05 AM, Mikulas Patocka wrote:
>>>>
>>>>> Support I/O polling if submit_bio_noacct_mq_direct returned non-empty
>>>>> cookie.
>>>>>
>>>>> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>>>>>
>>>>> ---
>>>>>   drivers/md/dm.c |    5 +++++
>>>>>   1 file changed, 5 insertions(+)
>>>>>
>>>>> Index: linux-2.6/drivers/md/dm.c
>>>>> ===================================================================
>>>>> --- linux-2.6.orig/drivers/md/dm.c    2021-03-02 
>>>>> 19:26:34.000000000 +0100
>>>>> +++ linux-2.6/drivers/md/dm.c    2021-03-02 19:26:34.000000000 +0100
>>>>> @@ -1682,6 +1682,11 @@ static void __split_and_process_bio(stru
>>>>>           }
>>>>>       }
>>>>>   +    if (ci.poll_cookie != BLK_QC_T_NONE) {
>>>>> +        while (atomic_read(&ci.io->io_count) > 1 &&
>>>>> +               blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
>>>>> +    }
>>>>> +
>>>>>       /* drop the extra reference count */
>>>>>       dec_pending(ci.io, errno_to_blk_status(error));
>>>>>   }
>>>> It seems that the general idea of your design is to
>>>> 1) submit *one* split bio
>>>> 2) blk_poll(), waiting the previously submitted split bio complets
>>> No, I submit all the bios and poll for the last one.
>>>
>>>> and then submit next split bio, repeating the above process. I'm 
>>>> afraid
>>>> the performance may be an issue here, since the batch every time
>>>> blk_poll() reaps may decrease.
>>> Could you benchmark it?
>> I only tested dm-linear.
>>
>> The configuration (dm table) of dm-linear is:
>> 0 1048576 linear /dev/nvme0n1 0
>> 1048576 1048576 linear /dev/nvme2n1 0
>> 2097152 1048576 linear /dev/nvme5n1 0
>>
>>
>> fio script used is:
>> ```
>> $cat fio.conf
>> [global]
>> name=iouring-sqpoll-iopoll-1
>> ioengine=io_uring
>> iodepth=128
>> numjobs=1
>> thread
>> rw=randread
>> direct=1
>> registerfiles=1
>> hipri=1
>> runtime=10
>> time_based
>> group_reporting
>> randrepeat=0
>> filename=/dev/mapper/testdev
>> bs=4k
>>
>> [job-1]
>> cpus_allowed=14
>> ```
>>
>> IOPS (IRQ mode) | IOPS (iopoll mode (hipri=1))
>> --------------- | --------------------
>>             213k |           19k
>>
>> At least, it doesn't work well with io_uring interface.
>>
>>
>
>
> Jeffle,
>
> I ran your above fio test on a linear LV split across 3 NVMes to 
> second your split mapping
> (system: 32 core Intel, 256GiB RAM) comparing io engines sync, libaio 
> and io_uring,
> the latter w/ and w/o hipri (sync+libaio obviously w/o registerfiles 
> and hipri) which resulted ok:
>
>
>
> sync  |  libaio  |  IRQ mode (hipri=0) | iopoll (hipri=1) 
> ------|----------|---------------------|----------------- 56.3K |    
> 290K  |                329K |             351K I can't second your 
> drastic hipri=1 drop here...


Sorry, email mess.


sync   |  libaio  |  IRQ mode (hipri=0) | iopoll (hipri=1)
-------|----------|---------------------|-----------------
56.3K  |    290K  |                329K |             351K



I can't second your drastic hipri=1 drop here...


> Heinz 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Mike Snitzer March 5, 2021, 6:09 p.m. UTC | #11
On Fri, Mar 05 2021 at 12:56pm -0500,
Heinz Mauelshagen <heinzm@redhat.com> wrote:

> 
> On 3/5/21 6:46 PM, Heinz Mauelshagen wrote:
> >On 3/5/21 10:52 AM, JeffleXu wrote:
> >>
> >>On 3/3/21 6:09 PM, Mikulas Patocka wrote:
> >>>
> >>>On Wed, 3 Mar 2021, JeffleXu wrote:
> >>>
> >>>>
> >>>>On 3/3/21 3:05 AM, Mikulas Patocka wrote:
> >>>>
> >>>>>Support I/O polling if submit_bio_noacct_mq_direct returned non-empty
> >>>>>cookie.
> >>>>>
> >>>>>Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> >>>>>
> >>>>>---
> >>>>>  drivers/md/dm.c |    5 +++++
> >>>>>  1 file changed, 5 insertions(+)
> >>>>>
> >>>>>Index: linux-2.6/drivers/md/dm.c
> >>>>>===================================================================
> >>>>>--- linux-2.6.orig/drivers/md/dm.c    2021-03-02
> >>>>>19:26:34.000000000 +0100
> >>>>>+++ linux-2.6/drivers/md/dm.c    2021-03-02 19:26:34.000000000 +0100
> >>>>>@@ -1682,6 +1682,11 @@ static void __split_and_process_bio(stru
> >>>>>          }
> >>>>>      }
> >>>>>  +    if (ci.poll_cookie != BLK_QC_T_NONE) {
> >>>>>+        while (atomic_read(&ci.io->io_count) > 1 &&
> >>>>>+               blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
> >>>>>+    }
> >>>>>+
> >>>>>      /* drop the extra reference count */
> >>>>>      dec_pending(ci.io, errno_to_blk_status(error));
> >>>>>  }
> >>>>It seems that the general idea of your design is to
> >>>>1) submit *one* split bio
> >>>>2) blk_poll(), waiting the previously submitted split bio complets
> >>>No, I submit all the bios and poll for the last one.
> >>>
> >>>>and then submit next split bio, repeating the above process.
> >>>>I'm afraid
> >>>>the performance may be an issue here, since the batch every time
> >>>>blk_poll() reaps may decrease.
> >>>Could you benchmark it?
> >>I only tested dm-linear.
> >>
> >>The configuration (dm table) of dm-linear is:
> >>0 1048576 linear /dev/nvme0n1 0
> >>1048576 1048576 linear /dev/nvme2n1 0
> >>2097152 1048576 linear /dev/nvme5n1 0
> >>
> >>
> >>fio script used is:
> >>```
> >>$cat fio.conf
> >>[global]
> >>name=iouring-sqpoll-iopoll-1
> >>ioengine=io_uring
> >>iodepth=128
> >>numjobs=1
> >>thread
> >>rw=randread
> >>direct=1
> >>registerfiles=1
> >>hipri=1
> >>runtime=10
> >>time_based
> >>group_reporting
> >>randrepeat=0
> >>filename=/dev/mapper/testdev
> >>bs=4k
> >>
> >>[job-1]
> >>cpus_allowed=14
> >>```
> >>
> >>IOPS (IRQ mode) | IOPS (iopoll mode (hipri=1))
> >>--------------- | --------------------
> >>            213k |           19k
> >>
> >>At least, it doesn't work well with io_uring interface.
> >>
> >>
> >
> >
> >Jeffle,
> >
> >I ran your above fio test on a linear LV split across 3 NVMes to
> >second your split mapping
> >(system: 32 core Intel, 256GiB RAM) comparing io engines sync,
> >libaio and io_uring,
> >the latter w/ and w/o hipri (sync+libaio obviously w/o
> >registerfiles and hipri) which resulted ok:
> >
> >
> >
> >sync  |  libaio  |  IRQ mode (hipri=0) | iopoll (hipri=1)
> >------|----------|---------------------|----------------- 56.3K
> >|    290K  |                329K |             351K I can't second
> >your drastic hipri=1 drop here...
> 
> 
> Sorry, email mess.
> 
> 
> sync   |  libaio  |  IRQ mode (hipri=0) | iopoll (hipri=1)
> -------|----------|---------------------|-----------------
> 56.3K  |    290K  |                329K |             351K
> 
> 
> 
> I can't second your drastic hipri=1 drop here...

I think your result is just showcasing your powerful system's ability to
poll every related HW queue.. whereas Jeffle's system is likely somehow
more constrained (on a cpu level, memory, whatever).

My basis for this is that Mikulas' changes simply always return an
invalid cookie (BLK_QC_T_NONE) for purposes of intelligent IO polling.

Such an implementation is completely invalid.

I discussed with Jens and he said:
"it needs to return something that f_op->iopoll() can make sense of.
otherwise you have no option but to try everything."

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Heinz Mauelshagen March 5, 2021, 6:19 p.m. UTC | #12
On 3/5/21 7:09 PM, Mike Snitzer wrote:
> On Fri, Mar 05 2021 at 12:56pm -0500,
> Heinz Mauelshagen <heinzm@redhat.com> wrote:
>
>> On 3/5/21 6:46 PM, Heinz Mauelshagen wrote:
>>> On 3/5/21 10:52 AM, JeffleXu wrote:
>>>> On 3/3/21 6:09 PM, Mikulas Patocka wrote:
>>>>> On Wed, 3 Mar 2021, JeffleXu wrote:
>>>>>
>>>>>> On 3/3/21 3:05 AM, Mikulas Patocka wrote:
>>>>>>
>>>>>>> Support I/O polling if submit_bio_noacct_mq_direct returned non-empty
>>>>>>> cookie.
>>>>>>>
>>>>>>> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>>>>>>>
>>>>>>> ---
>>>>>>>    drivers/md/dm.c |    5 +++++
>>>>>>>    1 file changed, 5 insertions(+)
>>>>>>>
>>>>>>> Index: linux-2.6/drivers/md/dm.c
>>>>>>> ===================================================================
>>>>>>> --- linux-2.6.orig/drivers/md/dm.c    2021-03-02
>>>>>>> 19:26:34.000000000 +0100
>>>>>>> +++ linux-2.6/drivers/md/dm.c    2021-03-02 19:26:34.000000000 +0100
>>>>>>> @@ -1682,6 +1682,11 @@ static void __split_and_process_bio(stru
>>>>>>>            }
>>>>>>>        }
>>>>>>>    +    if (ci.poll_cookie != BLK_QC_T_NONE) {
>>>>>>> +        while (atomic_read(&ci.io->io_count) > 1 &&
>>>>>>> +               blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
>>>>>>> +    }
>>>>>>> +
>>>>>>>        /* drop the extra reference count */
>>>>>>>        dec_pending(ci.io, errno_to_blk_status(error));
>>>>>>>    }
>>>>>> It seems that the general idea of your design is to
>>>>>> 1) submit *one* split bio
>>>>>> 2) blk_poll(), waiting the previously submitted split bio complets
>>>>> No, I submit all the bios and poll for the last one.
>>>>>
>>>>>> and then submit next split bio, repeating the above process.
>>>>>> I'm afraid
>>>>>> the performance may be an issue here, since the batch every time
>>>>>> blk_poll() reaps may decrease.
>>>>> Could you benchmark it?
>>>> I only tested dm-linear.
>>>>
>>>> The configuration (dm table) of dm-linear is:
>>>> 0 1048576 linear /dev/nvme0n1 0
>>>> 1048576 1048576 linear /dev/nvme2n1 0
>>>> 2097152 1048576 linear /dev/nvme5n1 0
>>>>
>>>>
>>>> fio script used is:
>>>> ```
>>>> $cat fio.conf
>>>> [global]
>>>> name=iouring-sqpoll-iopoll-1
>>>> ioengine=io_uring
>>>> iodepth=128
>>>> numjobs=1
>>>> thread
>>>> rw=randread
>>>> direct=1
>>>> registerfiles=1
>>>> hipri=1
>>>> runtime=10
>>>> time_based
>>>> group_reporting
>>>> randrepeat=0
>>>> filename=/dev/mapper/testdev
>>>> bs=4k
>>>>
>>>> [job-1]
>>>> cpus_allowed=14
>>>> ```
>>>>
>>>> IOPS (IRQ mode) | IOPS (iopoll mode (hipri=1))
>>>> --------------- | --------------------
>>>>              213k |           19k
>>>>
>>>> At least, it doesn't work well with io_uring interface.
>>>>
>>>>
>>>
>>> Jeffle,
>>>
>>> I ran your above fio test on a linear LV split across 3 NVMes to
>>> second your split mapping
>>> (system: 32 core Intel, 256GiB RAM) comparing io engines sync,
>>> libaio and io_uring,
>>> the latter w/ and w/o hipri (sync+libaio obviously w/o
>>> registerfiles and hipri) which resulted ok:
>>>
>>>
>>>
>>> sync  |  libaio  |  IRQ mode (hipri=0) | iopoll (hipri=1)
>>> ------|----------|---------------------|----------------- 56.3K
>>> |    290K  |                329K |             351K I can't second
>>> your drastic hipri=1 drop here...
>>
>> Sorry, email mess.
>>
>>
>> sync   |  libaio  |  IRQ mode (hipri=0) | iopoll (hipri=1)
>> -------|----------|---------------------|-----------------
>> 56.3K  |    290K  |                329K |             351K
>>
>>
>>
>> I can't second your drastic hipri=1 drop here...
> I think your result is just showcasing your powerful system's ability to
> poll every related HW queue.. whereas Jeffle's system is likely somehow
> more constrained (on a cpu level, memory, whatever).

Jeffle,

mind explaining your test system configuration a bit more (cores, ram, 
i/o config to your NVMes)
so that we have a better base for such assumption?

Thanks,
Heinz

>
> My basis for this is that Mikulas' changes simply always return an
> invalid cookie (BLK_QC_T_NONE) for purposes of intelligent IO polling.
>
> Such an implementation is completely invalid.
>
> I discussed with Jens and he said:
> "it needs to return something that f_op->iopoll() can make sense of.
> otherwise you have no option but to try everything."
>
> Mike
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://listman.redhat.com/mailman/listinfo/dm-devel

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Jens Axboe March 5, 2021, 6:21 p.m. UTC | #13
On 3/4/21 3:09 AM, Mikulas Patocka wrote:
> 
> 
> On Thu, 4 Mar 2021, JeffleXu wrote:
> 
>>> __split_and_process_non_flush records the poll cookie in ci.poll_cookie. 
>>> When we processed all the bios, we poll for the last cookie here:
>>>
>>>         if (ci.poll_cookie != BLK_QC_T_NONE) {
>>>                 while (atomic_read(&ci.io->io_count) > 1 &&
>>>                        blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
>>>         }
>>
>> So what will happen if one bio submitted to dm device crosses the device
>> boundary among several target devices (e.g., dm-stripe)? Please refer
>> the following call graph.
>>
>> ```
>> submit_bio
>>   __submit_bio_noacct
>>     disk->fops->submit_bio(), calling into __split_and_process_bio(),
>> call __split_and_process_non_flush() once, submitting the *first* split bio
>>     disk->fops->submit_bio(), calling into __split_and_process_bio(),
>> call __split_and_process_non_flush() once, submitting the *second* split bio
>>     ...
>> ```
>>
>>
>> So the loop is in __submit_bio_noacct(), rather than
>> __split_and_process_bio(). Your design will send the first split bio,
>> and then poll on this split bio, then send the next split bio, polling
>> on this, go on and on...
> 
> No. It will send all the bios and poll for the last one.

I took a quick look, and this seems very broken. You must not poll off
the submission path, polling should be invoked by the higher layer when
someone wants to reap events. IOW, dm should not be calling blk_poll()
by itself, only off mq_ops->poll(). Your patch seems to do it off
submission once you submit the last bio in that batch, effectively
implementing sync polling for that series. That's not right.
Jingbo Xu March 8, 2021, 3:54 a.m. UTC | #14
On 3/6/21 1:56 AM, Heinz Mauelshagen wrote:
> 
> On 3/5/21 6:46 PM, Heinz Mauelshagen wrote:
>> On 3/5/21 10:52 AM, JeffleXu wrote:
>>>
>>> On 3/3/21 6:09 PM, Mikulas Patocka wrote:
>>>>
>>>> On Wed, 3 Mar 2021, JeffleXu wrote:
>>>>
>>>>>
>>>>> On 3/3/21 3:05 AM, Mikulas Patocka wrote:
>>>>>
>>>>>> Support I/O polling if submit_bio_noacct_mq_direct returned non-empty
>>>>>> cookie.
>>>>>>
>>>>>> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>>>>>>
>>>>>> ---
>>>>>>   drivers/md/dm.c |    5 +++++
>>>>>>   1 file changed, 5 insertions(+)
>>>>>>
>>>>>> Index: linux-2.6/drivers/md/dm.c
>>>>>> ===================================================================
>>>>>> --- linux-2.6.orig/drivers/md/dm.c    2021-03-02
>>>>>> 19:26:34.000000000 +0100
>>>>>> +++ linux-2.6/drivers/md/dm.c    2021-03-02 19:26:34.000000000 +0100
>>>>>> @@ -1682,6 +1682,11 @@ static void __split_and_process_bio(stru
>>>>>>           }
>>>>>>       }
>>>>>>   +    if (ci.poll_cookie != BLK_QC_T_NONE) {
>>>>>> +        while (atomic_read(&ci.io->io_count) > 1 &&
>>>>>> +               blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
>>>>>> +    }
>>>>>> +
>>>>>>       /* drop the extra reference count */
>>>>>>       dec_pending(ci.io, errno_to_blk_status(error));
>>>>>>   }
>>>>> It seems that the general idea of your design is to
>>>>> 1) submit *one* split bio
>>>>> 2) blk_poll(), waiting the previously submitted split bio complets
>>>> No, I submit all the bios and poll for the last one.
>>>>
>>>>> and then submit next split bio, repeating the above process. I'm
>>>>> afraid
>>>>> the performance may be an issue here, since the batch every time
>>>>> blk_poll() reaps may decrease.
>>>> Could you benchmark it?
>>> I only tested dm-linear.
>>>
>>> The configuration (dm table) of dm-linear is:
>>> 0 1048576 linear /dev/nvme0n1 0
>>> 1048576 1048576 linear /dev/nvme2n1 0
>>> 2097152 1048576 linear /dev/nvme5n1 0
>>>
>>>
>>> fio script used is:
>>> ```
>>> $cat fio.conf
>>> [global]
>>> name=iouring-sqpoll-iopoll-1
>>> ioengine=io_uring
>>> iodepth=128
>>> numjobs=1
>>> thread
>>> rw=randread
>>> direct=1
>>> registerfiles=1
>>> hipri=1
>>> runtime=10
>>> time_based
>>> group_reporting
>>> randrepeat=0
>>> filename=/dev/mapper/testdev
>>> bs=4k
>>>
>>> [job-1]
>>> cpus_allowed=14
>>> ```
>>>
>>> IOPS (IRQ mode) | IOPS (iopoll mode (hipri=1))
>>> --------------- | --------------------
>>>             213k |           19k
>>>
>>> At least, it doesn't work well with io_uring interface.
>>>
>>>
>>
>>
>> Jeffle,
>>
>> I ran your above fio test on a linear LV split across 3 NVMes to
>> second your split mapping
>> (system: 32 core Intel, 256GiB RAM) comparing io engines sync, libaio
>> and io_uring,
>> the latter w/ and w/o hipri (sync+libaio obviously w/o registerfiles
>> and hipri) which resulted ok:
>>
>>
>>
>> sync  |  libaio  |  IRQ mode (hipri=0) | iopoll (hipri=1)
>> ------|----------|---------------------|----------------- 56.3K |   
>> 290K  |                329K |             351K I can't second your
>> drastic hipri=1 drop here...
> 
> 
> Sorry, email mess.
> 
> 
> sync   |  libaio  |  IRQ mode (hipri=0) | iopoll (hipri=1)
> -------|----------|---------------------|-----------------
> 56.3K  |    290K  |                329K |             351K
> 
> 
> 
> I can't second your drastic hipri=1 drop here...
> 

Hummm, that's indeed somewhat strange...

My test environment:
- CPU: 128 cores, though only one CPU core is used since
'cpus_allowed=14' in fio configuration
- memory: 983G memory free
- NVMe: Huawai ES3510P (HWE52P434T0L005N), with 'nvme.poll_queues=3'

Maybe you didn't specify 'nvme.poll_queues=XXX'? In this case, IO still
goes into IRQ mode, even you have specified 'hipri=1'?
Jens Axboe March 8, 2021, 3:55 a.m. UTC | #15
On 3/7/21 8:54 PM, JeffleXu wrote:
> 
> 
> On 3/6/21 1:56 AM, Heinz Mauelshagen wrote:
>>
>> On 3/5/21 6:46 PM, Heinz Mauelshagen wrote:
>>> On 3/5/21 10:52 AM, JeffleXu wrote:
>>>>
>>>> On 3/3/21 6:09 PM, Mikulas Patocka wrote:
>>>>>
>>>>> On Wed, 3 Mar 2021, JeffleXu wrote:
>>>>>
>>>>>>
>>>>>> On 3/3/21 3:05 AM, Mikulas Patocka wrote:
>>>>>>
>>>>>>> Support I/O polling if submit_bio_noacct_mq_direct returned non-empty
>>>>>>> cookie.
>>>>>>>
>>>>>>> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>>>>>>>
>>>>>>> ---
>>>>>>>   drivers/md/dm.c |    5 +++++
>>>>>>>   1 file changed, 5 insertions(+)
>>>>>>>
>>>>>>> Index: linux-2.6/drivers/md/dm.c
>>>>>>> ===================================================================
>>>>>>> --- linux-2.6.orig/drivers/md/dm.c    2021-03-02
>>>>>>> 19:26:34.000000000 +0100
>>>>>>> +++ linux-2.6/drivers/md/dm.c    2021-03-02 19:26:34.000000000 +0100
>>>>>>> @@ -1682,6 +1682,11 @@ static void __split_and_process_bio(stru
>>>>>>>           }
>>>>>>>       }
>>>>>>>   +    if (ci.poll_cookie != BLK_QC_T_NONE) {
>>>>>>> +        while (atomic_read(&ci.io->io_count) > 1 &&
>>>>>>> +               blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
>>>>>>> +    }
>>>>>>> +
>>>>>>>       /* drop the extra reference count */
>>>>>>>       dec_pending(ci.io, errno_to_blk_status(error));
>>>>>>>   }
>>>>>> It seems that the general idea of your design is to
>>>>>> 1) submit *one* split bio
>>>>>> 2) blk_poll(), waiting the previously submitted split bio complets
>>>>> No, I submit all the bios and poll for the last one.
>>>>>
>>>>>> and then submit next split bio, repeating the above process. I'm
>>>>>> afraid
>>>>>> the performance may be an issue here, since the batch every time
>>>>>> blk_poll() reaps may decrease.
>>>>> Could you benchmark it?
>>>> I only tested dm-linear.
>>>>
>>>> The configuration (dm table) of dm-linear is:
>>>> 0 1048576 linear /dev/nvme0n1 0
>>>> 1048576 1048576 linear /dev/nvme2n1 0
>>>> 2097152 1048576 linear /dev/nvme5n1 0
>>>>
>>>>
>>>> fio script used is:
>>>> ```
>>>> $cat fio.conf
>>>> [global]
>>>> name=iouring-sqpoll-iopoll-1
>>>> ioengine=io_uring
>>>> iodepth=128
>>>> numjobs=1
>>>> thread
>>>> rw=randread
>>>> direct=1
>>>> registerfiles=1
>>>> hipri=1
>>>> runtime=10
>>>> time_based
>>>> group_reporting
>>>> randrepeat=0
>>>> filename=/dev/mapper/testdev
>>>> bs=4k
>>>>
>>>> [job-1]
>>>> cpus_allowed=14
>>>> ```
>>>>
>>>> IOPS (IRQ mode) | IOPS (iopoll mode (hipri=1))
>>>> --------------- | --------------------
>>>>             213k |           19k
>>>>
>>>> At least, it doesn't work well with io_uring interface.
>>>>
>>>>
>>>
>>>
>>> Jeffle,
>>>
>>> I ran your above fio test on a linear LV split across 3 NVMes to
>>> second your split mapping
>>> (system: 32 core Intel, 256GiB RAM) comparing io engines sync, libaio
>>> and io_uring,
>>> the latter w/ and w/o hipri (sync+libaio obviously w/o registerfiles
>>> and hipri) which resulted ok:
>>>
>>>
>>>
>>> sync  |  libaio  |  IRQ mode (hipri=0) | iopoll (hipri=1)
>>> ------|----------|---------------------|----------------- 56.3K |   
>>> 290K  |                329K |             351K I can't second your
>>> drastic hipri=1 drop here...
>>
>>
>> Sorry, email mess.
>>
>>
>> sync   |  libaio  |  IRQ mode (hipri=0) | iopoll (hipri=1)
>> -------|----------|---------------------|-----------------
>> 56.3K  |    290K  |                329K |             351K
>>
>>
>>
>> I can't second your drastic hipri=1 drop here...
>>
> 
> Hummm, that's indeed somewhat strange...
> 
> My test environment:
> - CPU: 128 cores, though only one CPU core is used since
> 'cpus_allowed=14' in fio configuration
> - memory: 983G memory free
> - NVMe: Huawai ES3510P (HWE52P434T0L005N), with 'nvme.poll_queues=3'
> 
> Maybe you didn't specify 'nvme.poll_queues=XXX'? In this case, IO still
> goes into IRQ mode, even you have specified 'hipri=1'?

That would be my guess too, and the patches also have a very suspicious
clear of HIPRI which shouldn't be there (which would let that fly through).
Heinz Mauelshagen March 9, 2021, 11:42 a.m. UTC | #16
On 3/8/21 4:54 AM, JeffleXu wrote:
>
> On 3/6/21 1:56 AM, Heinz Mauelshagen wrote:
>> On 3/5/21 6:46 PM, Heinz Mauelshagen wrote:
>>> On 3/5/21 10:52 AM, JeffleXu wrote:
>>>> On 3/3/21 6:09 PM, Mikulas Patocka wrote:
>>>>> On Wed, 3 Mar 2021, JeffleXu wrote:
>>>>>
>>>>>> On 3/3/21 3:05 AM, Mikulas Patocka wrote:
>>>>>>
>>>>>>> Support I/O polling if submit_bio_noacct_mq_direct returned non-empty
>>>>>>> cookie.
>>>>>>>
>>>>>>> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>>>>>>>
>>>>>>> ---
>>>>>>>    drivers/md/dm.c |    5 +++++
>>>>>>>    1 file changed, 5 insertions(+)
>>>>>>>
>>>>>>> Index: linux-2.6/drivers/md/dm.c
>>>>>>> ===================================================================
>>>>>>> --- linux-2.6.orig/drivers/md/dm.c    2021-03-02
>>>>>>> 19:26:34.000000000 +0100
>>>>>>> +++ linux-2.6/drivers/md/dm.c    2021-03-02 19:26:34.000000000 +0100
>>>>>>> @@ -1682,6 +1682,11 @@ static void __split_and_process_bio(stru
>>>>>>>            }
>>>>>>>        }
>>>>>>>    +    if (ci.poll_cookie != BLK_QC_T_NONE) {
>>>>>>> +        while (atomic_read(&ci.io->io_count) > 1 &&
>>>>>>> +               blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
>>>>>>> +    }
>>>>>>> +
>>>>>>>        /* drop the extra reference count */
>>>>>>>        dec_pending(ci.io, errno_to_blk_status(error));
>>>>>>>    }
>>>>>> It seems that the general idea of your design is to
>>>>>> 1) submit *one* split bio
>>>>>> 2) blk_poll(), waiting the previously submitted split bio complets
>>>>> No, I submit all the bios and poll for the last one.
>>>>>
>>>>>> and then submit next split bio, repeating the above process. I'm
>>>>>> afraid
>>>>>> the performance may be an issue here, since the batch every time
>>>>>> blk_poll() reaps may decrease.
>>>>> Could you benchmark it?
>>>> I only tested dm-linear.
>>>>
>>>> The configuration (dm table) of dm-linear is:
>>>> 0 1048576 linear /dev/nvme0n1 0
>>>> 1048576 1048576 linear /dev/nvme2n1 0
>>>> 2097152 1048576 linear /dev/nvme5n1 0
>>>>
>>>>
>>>> fio script used is:
>>>> ```
>>>> $cat fio.conf
>>>> [global]
>>>> name=iouring-sqpoll-iopoll-1
>>>> ioengine=io_uring
>>>> iodepth=128
>>>> numjobs=1
>>>> thread
>>>> rw=randread
>>>> direct=1
>>>> registerfiles=1
>>>> hipri=1
>>>> runtime=10
>>>> time_based
>>>> group_reporting
>>>> randrepeat=0
>>>> filename=/dev/mapper/testdev
>>>> bs=4k
>>>>
>>>> [job-1]
>>>> cpus_allowed=14
>>>> ```
>>>>
>>>> IOPS (IRQ mode) | IOPS (iopoll mode (hipri=1))
>>>> --------------- | --------------------
>>>>              213k |           19k
>>>>
>>>> At least, it doesn't work well with io_uring interface.
>>>>
>>>>
>>>
>>> Jeffle,
>>>
>>> I ran your above fio test on a linear LV split across 3 NVMes to
>>> second your split mapping
>>> (system: 32 core Intel, 256GiB RAM) comparing io engines sync, libaio
>>> and io_uring,
>>> the latter w/ and w/o hipri (sync+libaio obviously w/o registerfiles
>>> and hipri) which resulted ok:
>>>
>>>
>>>
>>> sync  |  libaio  |  IRQ mode (hipri=0) | iopoll (hipri=1)
>>> ------|----------|---------------------|----------------- 56.3K |
>>> 290K  |                329K |             351K I can't second your
>>> drastic hipri=1 drop here...
>>
>> Sorry, email mess.
>>
>>
>> sync   |  libaio  |  IRQ mode (hipri=0) | iopoll (hipri=1)
>> -------|----------|---------------------|-----------------
>> 56.3K  |    290K  |                329K |             351K
>>
>>
>>
>> I can't second your drastic hipri=1 drop here...
>>
> Hummm, that's indeed somewhat strange...
>
> My test environment:
> - CPU: 128 cores, though only one CPU core is used since
> 'cpus_allowed=14' in fio configuration
> - memory: 983G memory free
> - NVMe: Huawai ES3510P (HWE52P434T0L005N), with 'nvme.poll_queues=3'
>
> Maybe you didn't specify 'nvme.poll_queues=XXX'? In this case, IO still
> goes into IRQ mode, even you have specified 'hipri=1'?
>
Jeffle,

nvme.poll_queues was zero indeed.

At 3 results are hipri=0 / hipri=1 : 1699K / 1702K IOPS (all cores)

Single core results : 315K / 329K

Still no extreme drop...

FWIW:

Thanks,
Heinz

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
diff mbox series

Patch

Index: linux-2.6/drivers/md/dm.c
===================================================================
--- linux-2.6.orig/drivers/md/dm.c	2021-03-02 19:26:34.000000000 +0100
+++ linux-2.6/drivers/md/dm.c	2021-03-02 19:26:34.000000000 +0100
@@ -1682,6 +1682,11 @@  static void __split_and_process_bio(stru
 		}
 	}
 
+	if (ci.poll_cookie != BLK_QC_T_NONE) {
+		while (atomic_read(&ci.io->io_count) > 1 &&
+		       blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
+	}
+
 	/* drop the extra reference count */
 	dec_pending(ci.io, errno_to_blk_status(error));
 }