[V2,1/1] pmem: set QUEUE_FLAG_NOWAIT

Message ID 20230731224617.8665-2-kch@nvidia.com (mailing list archive)
State New, archived
Series pmem: set QUEUE_FLAG_NOWAIT

Commit Message

Chaitanya Kulkarni July 31, 2023, 10:46 p.m. UTC
Set QUEUE_FLAG_NOWAIT. The following are performance numbers with the
io_uring fio engine for random reads; note that the device was fully
populated with a randwrite workload before taking these numbers :-

linux-block (pmem-nowait-on) # grep IOPS  pmem*fio | column -t
pmem-nowait-off-1.fio:  read:  IOPS=3683k,  BW=14.0GiB/s
pmem-nowait-off-2.fio:  read:  IOPS=3819k,  BW=14.6GiB/s
pmem-nowait-off-3.fio:  read:  IOPS=3999k,  BW=15.3GiB/s

pmem-nowait-on-1.fio:   read:  IOPS=5837k,  BW=22.3GiB/s
pmem-nowait-on-2.fio:   read:  IOPS=5936k,  BW=22.6GiB/s
pmem-nowait-on-3.fio:   read:  IOPS=5945k,  BW=22.7GiB/s

linux-block (pmem-nowait-on) # grep cpu  pmem*fio | column -t
pmem-nowait-off-1.fio:  cpu  :  usr=7.09%,   sys=29.65%,  ctx=198742065
pmem-nowait-off-2.fio:  cpu  :  usr=6.89%,   sys=30.56%,  ctx=205817652
pmem-nowait-off-3.fio:  cpu  :  usr=6.86%,   sys=30.94%,  ctx=222627094

pmem-nowait-on-1.fio:   cpu  :  usr=10.58%,  sys=88.44%,  ctx=27181   
pmem-nowait-on-2.fio:   cpu  :  usr=10.50%,  sys=87.75%,  ctx=25746   
pmem-nowait-on-3.fio:   cpu  :  usr=10.67%,  sys=88.60%,  ctx=28261   

linux-block (pmem-nowait-on) # grep slat  pmem*fio | column -t
pmem-nowait-off-1.fio:  slat  (nsec):  min=432,   max=50847k,  avg=9324.69
pmem-nowait-off-2.fio:  slat  (nsec):  min=441,   max=52557k,  avg=9132.45
pmem-nowait-off-3.fio:  slat  (nsec):  min=430,   max=36113k,  avg=9132.63

pmem-nowait-on-1.fio:   slat  (nsec):  min=1393,  max=68090k,  avg=7615.31
pmem-nowait-on-2.fio:   slat  (nsec):  min=1222,  max=44137k,  avg=7493.77
pmem-nowait-on-3.fio:   slat  (nsec):  min=1493,  max=40100k,  avg=7486.36

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/nvdimm/pmem.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Jeff Moyer Aug. 1, 2023, 3:23 p.m. UTC | #1
Chaitanya Kulkarni <kch@nvidia.com> writes:

> Set QUEUE_FLAG_NOWAIT. The following are performance numbers with the
> io_uring fio engine for random reads; note that the device was fully
> populated with a randwrite workload before taking these numbers :-

I am slightly embarrassed to have to ask this question, but what are the
implications of setting this queue flag?  Is the submit_bio routine
expected to never block?  Is the I/O expected to be performed
asynchronously?  If either of those is the case, then pmem does not
qualify.

-Jeff

>
> linux-block (pmem-nowait-on) # grep IOPS  pmem*fio | column -t
> pmem-nowait-off-1.fio:  read:  IOPS=3683k,  BW=14.0GiB/s
> pmem-nowait-off-2.fio:  read:  IOPS=3819k,  BW=14.6GiB/s
> pmem-nowait-off-3.fio:  read:  IOPS=3999k,  BW=15.3GiB/s
>
> pmem-nowait-on-1.fio:   read:  IOPS=5837k,  BW=22.3GiB/s
> pmem-nowait-on-2.fio:   read:  IOPS=5936k,  BW=22.6GiB/s
> pmem-nowait-on-3.fio:   read:  IOPS=5945k,  BW=22.7GiB/s
>
> linux-block (pmem-nowait-on) # grep cpu  pmem*fio | column -t
> pmem-nowait-off-1.fio:  cpu  :  usr=7.09%,   sys=29.65%,  ctx=198742065
> pmem-nowait-off-2.fio:  cpu  :  usr=6.89%,   sys=30.56%,  ctx=205817652
> pmem-nowait-off-3.fio:  cpu  :  usr=6.86%,   sys=30.94%,  ctx=222627094
>
> pmem-nowait-on-1.fio:   cpu  :  usr=10.58%,  sys=88.44%,  ctx=27181   
> pmem-nowait-on-2.fio:   cpu  :  usr=10.50%,  sys=87.75%,  ctx=25746   
> pmem-nowait-on-3.fio:   cpu  :  usr=10.67%,  sys=88.60%,  ctx=28261   
>
> linux-block (pmem-nowait-on) # grep slat  pmem*fio | column -t
> pmem-nowait-off-1.fio:  slat  (nsec):  min=432,   max=50847k,  avg=9324.69
> pmem-nowait-off-2.fio:  slat  (nsec):  min=441,   max=52557k,  avg=9132.45
> pmem-nowait-off-3.fio:  slat  (nsec):  min=430,   max=36113k,  avg=9132.63
>
> pmem-nowait-on-1.fio:   slat  (nsec):  min=1393,  max=68090k,  avg=7615.31
> pmem-nowait-on-2.fio:   slat  (nsec):  min=1222,  max=44137k,  avg=7493.77
> pmem-nowait-on-3.fio:   slat  (nsec):  min=1493,  max=40100k,  avg=7486.36
>
> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> ---
>  drivers/nvdimm/pmem.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 46e094e56159..ddd485c377eb 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -543,6 +543,7 @@ static int pmem_attach_disk(struct device *dev,
>  	blk_queue_max_hw_sectors(q, UINT_MAX);
>  	blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
>  	blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, q);
> +	blk_queue_flag_set(QUEUE_FLAG_NOWAIT, q);
>  	if (pmem->pfn_flags & PFN_MAP)
>  		blk_queue_flag_set(QUEUE_FLAG_DAX, q);
Christoph Hellwig Aug. 1, 2023, 3:59 p.m. UTC | #2
On Tue, Aug 01, 2023 at 11:23:36AM -0400, Jeff Moyer wrote:
> I am slightly embarrassed to have to ask this question, but what are the
> implications of setting this queue flag?  Is the submit_bio routine
> expected to never block?

Yes, at least not significantly.

> Is the I/O expected to be performed
> asynchronously?

Not necessarily if it is fast enough..
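For context, the practical effect of the flag is in the block core's
submission path: without QUEUE_FLAG_NOWAIT, a REQ_NOWAIT bio is failed
outright and io_uring has to punt the request to an io-wq worker thread,
which is where the large context-switch counts in the numbers above come
from.  Roughly, paraphrased from kernels of this era (not a verbatim
quote):

/* include/linux/blkdev.h: helper that tests the queue flag */
static inline bool bdev_nowait(struct block_device *bdev)
{
        return test_bit(QUEUE_FLAG_NOWAIT, &bdev_get_queue(bdev)->queue_flags);
}

/* block/blk-core.c, in submit_bio_noacct(): */
        if ((bio->bi_opf & REQ_NOWAIT) && !bdev_nowait(bio->bi_bdev))
                goto not_supported;     /* bio completes with BLK_STS_NOTSUPP */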
Jeff Moyer Aug. 1, 2023, 5:49 p.m. UTC | #3
Hi, Christoph,

Christoph Hellwig <hch@lst.de> writes:

> On Tue, Aug 01, 2023 at 11:23:36AM -0400, Jeff Moyer wrote:
>> I am slightly embarrassed to have to ask this question, but what are the
>> implications of setting this queue flag?  Is the submit_bio routine
>> expected to never block?
>
> Yes, at least not significantly.

If there happens to be poisoned memory, the write path can make an ACPI
device-specific method (DSM) call.  That involves taking a mutex (see the
call chain starting at acpi_ex_enter_interpreter()).  I'm not sure how
long a DSM takes, but I doubt it's fast.
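For reference, the chain being referred to runs roughly like this from the
driver's side (paraphrased; exact intermediate helpers vary by release):

/*
 * pmem_do_write()
 *   -> pmem_clear_poison()       only when the range overlaps badblocks
 *     -> nvdimm_clear_poison()
 *       -> ACPI NFIT _DSM        AML execution enters the interpreter via
 *                                acpi_ex_enter_interpreter(), which takes
 *                                a mutex, i.e. this path may sleep
 */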

>> Is the I/O expected to be performed
>> asynchronously?
>
> Not necessarily if it is fast enough..

Clear as mud.  :) It's a memcpy on potentially "slow" memory.  So, the
amount of time spent copying depends on the speed of the cpu, the media
and the size of the I/O.  I don't know off-hand what the upper bound
would be on today's pmem.

-Jeff
Chaitanya Kulkarni Aug. 2, 2023, 1:25 a.m. UTC | #4
On 8/1/23 10:49, Jeff Moyer wrote:
> Hi, Christoph,
>
> Christoph Hellwig <hch@lst.de> writes:
>
>> On Tue, Aug 01, 2023 at 11:23:36AM -0400, Jeff Moyer wrote:
>>> I am slightly embarrassed to have to ask this question, but what are the
>>> implications of setting this queue flag?  Is the submit_bio routine
>>> expected to never block?
>> Yes, at least not significantly.
> If there happens to be poisoned memory, the write path can make an ACPI
> device-specific method (DSM) call.  That involves taking a mutex (see the
> call chain starting at acpi_ex_enter_interpreter()).  I'm not sure how
> long a DSM takes, but I doubt it's fast.

For this case I can add a bio->bi_opf & REQ_NOWAIT check and return
with bio_wouldblock_error(); that will take care of the blocking case,
e.g. something like this (untested) as a prep patch for the write path :-

linux-block (pmem-nowait-on) # git diff
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ddd485c377eb..eff100f74357 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -179,12 +179,16 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem,
 
 static blk_status_t pmem_do_write(struct pmem_device *pmem,
                        struct page *page, unsigned int page_off,
-                       sector_t sector, unsigned int len)
+                       sector_t sector, unsigned int len, struct bio *bio)
 {
        phys_addr_t pmem_off = to_offset(pmem, sector);
        void *pmem_addr = pmem->virt_addr + pmem_off;
 
        if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) {
+               if (bio && bio->bi_opf & REQ_NOWAIT) {
+                       bio_wouldblock_error(bio);
+                       return BLK_STS_AGAIN;
+               }
                blk_status_t rc = pmem_clear_poison(pmem, pmem_off, len);
 
                if (rc != BLK_STS_OK)
@@ -217,7 +221,7 @@ static void pmem_submit_bio(struct bio *bio)
        bio_for_each_segment(bvec, bio, iter) {
                if (op_is_write(bio_op(bio)))
                        rc = pmem_do_write(pmem, bvec.bv_page, bvec.bv_offset,
-                               iter.bi_sector, bvec.bv_len);
+                               iter.bi_sector, bvec.bv_len, bio);
                else
                        rc = pmem_do_read(pmem, bvec.bv_page, bvec.bv_offset,
                                iter.bi_sector, bvec.bv_len);
@@ -297,7 +301,7 @@ static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 
        return blk_status_to_errno(pmem_do_write(pmem, ZERO_PAGE(0), 0,
                                   PFN_PHYS(pgoff) >> SECTOR_SHIFT,
-                                  PAGE_SIZE));
+                                  PAGE_SIZE, NULL));
 }
 
 static long pmem_dax_direct_access(struct dax_device *dax_dev,

>>> Is the I/O expected to be performed
>>> asynchronously?
>> Not necessarily if it is fast enough..
> Clear as mud.  :) It's a memcpy on potentially "slow" memory.  So, the
> amount of time spent copying depends on the speed of the cpu, the media
> and the size of the I/O.  I don't know off-hand what the upper bound
> would be on today's pmem.

The above scenario depends on the system, and I'm not sure we can
generalize it for all deployments. If we do end up having to generalize
it, we can always add a modparam to disable nowait in an incremental
patch, so the user has total control over whether to enable or disable
it.
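A minimal, untested sketch of that opt-out (the parameter name and the
default are made up for illustration):

/* hypothetical modparam gating the flag set in pmem_attach_disk() */
static bool pmem_nowait = true;
module_param_named(nowait, pmem_nowait, bool, 0444);
MODULE_PARM_DESC(nowait, "set QUEUE_FLAG_NOWAIT on pmem queues (default: Y)");

        if (pmem_nowait)
                blk_queue_flag_set(QUEUE_FLAG_NOWAIT, q);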

For the deployments that are not suffering from the "memcpy on potentially
slow memory" issue, this patch shows a clear performance win from enabling
NOWAIT for io_uring ..

-ck
Christoph Hellwig Aug. 2, 2023, 12:30 p.m. UTC | #5
Given that pmem simply loops over an arbitrarily large bio I think
we also need a threshold for which to allow nowait I/O.  While it
won't block for giant I/Os, doing all of them in the submitter
context isn't exactly the idea behind the nowait I/O.

I'm not really sure what a good threshold would be, though.
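One possible shape for such a cutoff, purely illustrative (the 1 MiB value
below is an arbitrary placeholder, not a recommendation):

/* cap how much a REQ_NOWAIT submission may copy inline, in pmem_submit_bio() */
#define PMEM_NOWAIT_MAX_BYTES  (1U << 20)      /* placeholder value */

        if ((bio->bi_opf & REQ_NOWAIT) &&
            bio->bi_iter.bi_size > PMEM_NOWAIT_MAX_BYTES) {
                bio_wouldblock_error(bio);      /* caller retries from a blocking context */
                return;
        }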

Btw, please also always add linux-block to the Cc list for block
driver patches that are even the slightest bit about the block
layer interface.
Jeff Moyer Aug. 2, 2023, 3:27 p.m. UTC | #6
Christoph Hellwig <hch@lst.de> writes:

> Given that pmem simply loops over an arbitrarily large bio I think
> we also need a threshold for which to allow nowait I/O.  While it
> won't block for giant I/Os, doing all of them in the submitter
> context isn't exactly the idea behind the nowait I/O.
>
> I'm not really sure what a good threshold would be, though.

There's no mention of the latency of the submission side in the
documentation for RWF_NOWAIT.  The man page says "Do not wait for data
which is not immediately available."  Data in pmem *is* immediately
available.  If we wanted to enforce a latency threshold for submission,
it would have to be configurable and, ideally, a part of the API.  I
don't think it's something we should even try to guarantee, though,
unless application writers are asking for it.
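For reference, the userspace contract under discussion is just this EAGAIN
behaviour; io_uring's first, inline submission attempt uses the same nowait
semantics when the queue advertises the flag.  A minimal sketch (the device
path is only an example):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>

int main(void)
{
        char buf[4096];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        int fd = open("/dev/pmem0", O_RDONLY); /* example device path */
        ssize_t ret;

        if (fd < 0)
                return 1;

        /* RWF_NOWAIT: complete immediately or fail with EAGAIN; the man page
         * does not bound how long "immediately" may take on the submit side. */
        ret = preadv2(fd, &iov, 1, 0, RWF_NOWAIT);
        if (ret < 0 && errno == EAGAIN)
                printf("data not immediately available; retry or punt\n");
        else if (ret >= 0)
                printf("read %zd bytes without blocking\n", ret);
        return 0;
}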

So, I think with the change to return -EAGAIN for writes to poisoned
memory, this patch is probably ok.

Chaitanya, could you send a v2?

Thanks,
Jeff
Jens Axboe Aug. 2, 2023, 3:36 p.m. UTC | #7
On 8/2/23 6:30 AM, Christoph Hellwig wrote:
> Given that pmem simply loops over an arbitrarily large bio I think
> we also need a threshold for which to allow nowait I/O.  While it
> won't block for giant I/Os, doing all of them in the submitter
> context isn't exactly the idea behind the nowait I/O.

You can do a LOT of looping over a giant bio and still come out way
ahead compared to needing to punt to a different thread. So I do think
it's the right choice. But I'm making assumptions here on what it looks
like, as I haven't seen the patch...

> Btw, please also always add linux-block to the Cc list for block
> driver patches that are even the slightest bit about the block
> layer interface.

Indeed. Particularly for these nowait changes, as some of them have been
pretty broken in the past.
Christoph Hellwig Aug. 2, 2023, 4:25 p.m. UTC | #8
On Wed, Aug 02, 2023 at 09:36:32AM -0600, Jens Axboe wrote:
> You can do a LOT of looping over a giant bio and still come out way
> ahead compared to needing to punt to a different thread. So I do think
> it's the right choice. But I'm making assumptions here on what it looks
> like, as I haven't seen the patch...

"a LOT" is relative.  A GB or two will block the submission thread for
quite a while.
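(For rough scale, assuming the copy runs at the ~15-22 GiB/s the fio
numbers earlier in the thread suggest: an inline copy of a 2 GiB bio works
out to roughly 100 ms of submitter-context time.)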
Chaitanya Kulkarni Aug. 3, 2023, 3:17 a.m. UTC | #9
On 8/2/23 05:30, Christoph Hellwig wrote:
> Given that pmem simply loops over an arbitrarily large bio I think
> we also need a threshold for which to allow nowait I/O.  While it
> won't block for giant I/Os, doing all of them in the submitter
> context isn't exactly the idea behind the nowait I/O.
>
>> I'm not really sure what a good threshold would be, though.
>
> Btw, please also always add linux-block to the Cc list for block
> driver patches that are even the slightest bit about the block
> layer interface.

will do ..

-ck
Chaitanya Kulkarni Aug. 3, 2023, 3:21 a.m. UTC | #10
On 8/2/23 08:36, Jens Axboe wrote:
> On 8/2/23 6:30 AM, Christoph Hellwig wrote:
>> Given that pmem simply loops over an arbitrarily large bio I think
>> we also need a threshold for which to allow nowait I/O.  While it
>> won't block for giant I/Os, doing all of them in the submitter
>> context isn't exactly the idea behind the nowait I/O.
> You can do a LOT of looping over a giant bio and still come out way
> ahead compared to needing to punt to a different thread. So I do think
> it's the right choice. But I'm making assumptions here on what it looks
> like, as I haven't seen the patch...
>

will send out the V3 soon and CC you and linux-block ...

>> Btw, please also always add linux-block to the Cc list for block
>> driver patches that are even the slightest bit about the block
>> layer interface.
> Indeed. Particularly for these nowait changes, as some of them have been
> pretty broken in the past.
>

yes, I didn't forget; lately I've been dealing with a bunch of stupidity.
I'll send the zram and null_blk nowait changes soon with your comments
fixed ..

-ck
Chaitanya Kulkarni Aug. 3, 2023, 3:24 a.m. UTC | #11
On 8/2/23 08:27, Jeff Moyer wrote:
> Christoph Hellwig <hch@lst.de> writes:
>
>> Given that pmem simply loops over an arbitrarily large bio I think
>> we also need a threshold for which to allow nowait I/O.  While it
>> won't block for giant I/Os, doing all of them in the submitter
>> context isn't exactly the idea behind the nowait I/O.
>>
>> I'm not really sure what a good threshold would be, though.
> There's no mention of the latency of the submission side in the
> documentation for RWF_NOWAIT.  The man page says "Do not wait for data
> which is not immediately available."  Data in pmem *is* immediately
> available.  If we wanted to enforce a latency threshold for submission,
> it would have to be configurable and, ideally, a part of the API.  I
> don't think it's something we should even try to guarantee, though,
> unless application writers are asking for it.

I'd need to see the use case from an application writer who cannot come
up with a solution on their own, such that the kernel has to provide this
interface; then I'll be happy to do that ..

> So, I think with the change to return -EAGAIN for writes to poisoned
> memory, this patch is probably ok.

I believe you mean the same incremental one I've provided earlier ..

> Chaitanya, could you send a v2?

sure ...

-ck
Jeff Moyer Aug. 3, 2023, 1:11 p.m. UTC | #12
Chaitanya Kulkarni <chaitanyak@nvidia.com> writes:

> On 8/2/23 08:27, Jeff Moyer wrote:
>> So, I think with the change to return -EAGAIN for writes to poisoned
>> memory, this patch is probably ok.
>
> I believe you mean the same one I've  provided earlier incremental ..

Yes, sorry if that wasn't clear.

>> Chaitanya, could you send a v2?
>
> sure ...

and I guess I should have said v3.  ;-)

Cheers,
Jeff

Patch

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 46e094e56159..ddd485c377eb 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -543,6 +543,7 @@ static int pmem_attach_disk(struct device *dev,
 	blk_queue_max_hw_sectors(q, UINT_MAX);
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
 	blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, q);
+	blk_queue_flag_set(QUEUE_FLAG_NOWAIT, q);
 	if (pmem->pfn_flags & PFN_MAP)
 		blk_queue_flag_set(QUEUE_FLAG_DAX, q);