[RFC] nvdimm: pmem: always flush nvdimm for write request

Message ID 1555051638-32354-1-git-send-email-lirongqing@baidu.com (mailing list archive)
State New, archived
Series: [RFC] nvdimm: pmem: always flush nvdimm for write request

Commit Message

Li RongQing April 12, 2019, 6:47 a.m. UTC
Flushing the nvdimm on every write request speeds up random writes,
giving about a 20% performance improvement.

Below is the result of an fio 4k random write test against the nvdimm exposed as /dev/pmem0:

Before:
Jobs: 32 (f=32): [W(32)][14.2%][w=1884MiB/s][w=482k IOPS][eta 01m:43s]
After:
Jobs: 32 (f=32): [W(32)][8.3%][w=2378MiB/s][w=609k IOPS][eta 01m:50s]

This also makes sure that the newly written data is durable.

Co-developed-by: Liang ZhiCheng <liangzhicheng@baidu.com>
Signed-off-by: Liang ZhiCheng <liangzhicheng@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
---
This test was done on Intel AEP (Optane DC persistent memory)

 drivers/nvdimm/pmem.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Dan Williams April 12, 2019, 4:32 p.m. UTC | #1
On Thu, Apr 11, 2019 at 11:47 PM Li RongQing <lirongqing@baidu.com> wrote:
>
> flush nvdimm for write request can speed the random write, give
> about 20% performance
>
> The below is result of fio 4k random write nvdimm as /dev/pmem0
>
> Before:
> Jobs: 32 (f=32): [W(32)][14.2%][w=1884MiB/s][w=482k IOPS][eta 01m:43s]
> After:
> Jobs: 32 (f=32): [W(32)][8.3%][w=2378MiB/s][w=609k IOPS][eta 01m:50s]

Interesting result. Another experiment proposed below...

> This makes sure that the newly written data is durable too

It's overkill for durability because ADR already handles the necessary
write-queue flush at power-loss.

> Co-developed-by: Liang ZhiCheng <liangzhicheng@baidu.com>
> Signed-off-by: Liang ZhiCheng <liangzhicheng@baidu.com>
> Signed-off-by: Li RongQing <lirongqing@baidu.com>
> ---
> This test is done on intel AEP
>
>  drivers/nvdimm/pmem.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 1d432c5ed..9f8f25880 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -197,6 +197,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
>         unsigned long start;
>         struct bio_vec bvec;
>         struct bvec_iter iter;
> +       unsigned int op = bio_op(bio);
>         struct pmem_device *pmem = q->queuedata;
>         struct nd_region *nd_region = to_region(pmem);
>
> @@ -206,7 +207,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
>         do_acct = nd_iostat_start(bio, &start);
>         bio_for_each_segment(bvec, bio, iter) {
>                 rc = pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
> -                               bvec.bv_offset, bio_op(bio), iter.bi_sector);
> +                               bvec.bv_offset, op, iter.bi_sector);
>                 if (rc) {
>                         bio->bi_status = rc;
>                         break;
> @@ -215,7 +216,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
>         if (do_acct)
>                 nd_iostat_end(bio, start);
>
> -       if (bio->bi_opf & REQ_FUA)
> +       if (bio->bi_opf & REQ_FUA || op_is_write(op))
>                 nvdimm_flush(nd_region);

One question is whether this benefit is coming from just the write
ordering from the barrier, or the WPQ flush itself. Can you try the
following and see if you get the same result?

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index bc2f700feef8..10c24de52395 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -217,6 +217,8 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)

        if (bio->bi_opf & REQ_FUA)
                nvdimm_flush(nd_region);
+       else if (op_is_write(bio_op(bio)))
+               wmb();

        bio_endio(bio);
        return BLK_QC_T_NONE;
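
For reference, nvdimm_flush() in drivers/nvdimm/region_devs.c did roughly the following at the time (a simplified sketch, not the exact code): on top of the sfence that a bare wmb() provides, it also writes to each DIMM's write-pending-queue flush hint address, which is exactly the extra work the experiment above drops.

void nvdimm_flush(struct nd_region *nd_region)
{
	struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
	int i, idx;

	/* spread accesses across the limited set of flush hint addresses */
	idx = this_cpu_add_return(flush_idx, hash_32(current->pid, 8));

	/* sfence: make the prior memcpy_flushcache() stores globally visible */
	wmb();

	/* write each DIMM's flush hint to trigger a WPQ flush */
	for (i = 0; i < nd_region->ndr_mappings; i++)
		if (ndrd_get_flush_wpq(ndrd, i, 0))
			writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));

	/* order the flush hint writes themselves */
	wmb();
}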
Elliott, Robert (Servers) April 14, 2019, 3:20 a.m. UTC | #2
>> @@ -215,7 +216,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
>> 	if (do_acct)
>> 		nd_iostat_end(bio, start);
>> 
>> -	if (bio->bi_opf & REQ_FUA)
>> +	if (bio->bi_opf & REQ_FUA || op_is_write(op))
>>  		nvdimm_flush(nd_region);
...
>> Before:
>> Jobs: 32 (f=32): [W(32)][14.2%][w=1884MiB/s][w=482k IOPS][eta 01m:43s]
>> After:
>> Jobs: 32 (f=32): [W(32)][8.3%][w=2378MiB/s][w=609k IOPS][eta 01m:50s]
>>
>> -RongQing


Doing more work cannot be faster than doing less work, so something else
must be happening here.

Please post the full fio job file and how you invoke it (i.e., with numactl).

These tools help show what is happening on the CPUs and memory channels:
    perf top
    pcm.x
    pcm-memory.x -pmm
Li RongQing April 15, 2019, 3:26 a.m. UTC | #3
> -----Original Message-----
> From: Elliott, Robert (Servers) [mailto:elliott@hpe.com]
> Sent: April 14, 2019 11:21
> To: Li,Rongqing <lirongqing@baidu.com>; Dan Williams
> <dan.j.williams@intel.com>
> Cc: linux-nvdimm <linux-nvdimm@lists.01.org>
> Subject: RE: [PATCH][RFC] nvdimm: pmem: always flush nvdimm for write request
> 
> 
> 
> >> @@ -215,7 +216,7 @@ static blk_qc_t pmem_make_request(struct
> request_queue *q, struct bio *bio)
> >> 	if (do_acct)
> >> 		nd_iostat_end(bio, start);
> >>
> >> -	if (bio->bi_opf & REQ_FUA)
> >> +	if (bio->bi_opf & REQ_FUA || op_is_write(op))
> >>  		nvdimm_flush(nd_region);
> ...
> >> Before:
> >> Jobs: 32 (f=32): [W(32)][14.2%][w=1884MiB/s][w=482k IOPS][eta
> >> 01m:43s]
> >> After:
> >> Jobs: 32 (f=32): [W(32)][8.3%][w=2378MiB/s][w=609k IOPS][eta 01m:50s]
> >>
> >> -RongQing
> 
> 
> Doing more work cannot be faster than doing less work, so something else
> must be happening here.
> 

Dan Williams may know more.


> Please post the full fio job file and how you invoke it (i.e., with numactl).
> 
The fio job file is below, and we bind fio to a CPU set and NUMA node:  numactl --membind=0 taskset -c 2-24 ./fio test_io_raw

[global]
numjobs=32
direct=1
filename=/dev/pmem0.1
iodepth=32
ioengine=libaio
group_reporting=1
bs=4K
time_based=1
 
[write1]
rw=randwrite
runtime=60
stonewall

> These tools help show what is happening on the CPUs and memory channels:
>     perf top

 62.40%  [kernel]            [k] memcpy_flushcache
 21.17%  [kernel]            [k] fput
  6.12%  [kernel]            [k] apic_timer_interrupt
  0.89%  [kernel]            [k] rq_qos_done_bio
  0.66%  [kernel]            [k] bio_endio
  0.44%  [kernel]            [k] aio_complete_rw
  0.39%  [kernel]            [k] blkdev_bio_end_io
  0.31%  [kernel]            [k] entry_SYSCALL_64
  0.26%  [kernel]            [k] bio_disassociate_task
  0.23%  [kernel]            [k] read_tsc
  0.21%  fio                 [.] axmap_isset
  0.20%  [kernel]            [k] ktime_get_raw_ts64
  0.19%  [vdso]              [.] 0x7ffc475e2b30
  0.18%  [kernel]            [k] gup_pgd_range
  0.18%  [kernel]            [k] entry_SYSCALL_64_after_hwframe
  0.16%  [kernel]            [k] __audit_syscall_exit
  0.16%  [kernel]            [k] __x86_indirect_thunk_rax
  0.13%  [kernel]            [k] copy_user_enhanced_fast_string
  0.13%  [kernel]            [k] syscall_return_via_sysret
  0.12%  [kernel]            [k] preempt_count_add
  0.12%  [kernel]            [k] preempt_count_sub
  0.11%  [kernel]            [k] __x64_sys_clock_gettime
  0.11%  [kernel]            [k] tracer_hardirqs_off
  0.10%  [kernel]            [k] native_write_msr
  0.10%  [kernel]            [k] posix_get_monotonic_raw
  0.10%  fio                 [.] get_io_u

>     pcm.x

http://pasted.co/6fc93b42

>     pcm-memory.x -pmm
> 
http://pasted.co/d5c0c96b


If fio is not bound to a CPU set and NUMA node, the performance is much lower, but this optimization helps in both cases; it sometimes gives about a 40% improvement.

-Li
Dan Williams April 15, 2019, 5:04 p.m. UTC | #4
On Sun, Apr 14, 2019 at 8:26 PM Li,Rongqing <lirongqing@baidu.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Elliott, Robert (Servers) [mailto:elliott@hpe.com]
> > Sent: April 14, 2019 11:21
> > To: Li,Rongqing <lirongqing@baidu.com>; Dan Williams
> > <dan.j.williams@intel.com>
> > Cc: linux-nvdimm <linux-nvdimm@lists.01.org>
> > Subject: RE: [PATCH][RFC] nvdimm: pmem: always flush nvdimm for write request
> >
> >
> >
> > >> @@ -215,7 +216,7 @@ static blk_qc_t pmem_make_request(struct
> > request_queue *q, struct bio *bio)
> > >>    if (do_acct)
> > >>            nd_iostat_end(bio, start);
> > >>
> > >> -  if (bio->bi_opf & REQ_FUA)
> > >> +  if (bio->bi_opf & REQ_FUA || op_is_write(op))
> > >>            nvdimm_flush(nd_region);
> > ...
> > >> Before:
> > >> Jobs: 32 (f=32): [W(32)][14.2%][w=1884MiB/s][w=482k IOPS][eta
> > >> 01m:43s]
> > >> After:
> > >> Jobs: 32 (f=32): [W(32)][8.3%][w=2378MiB/s][w=609k IOPS][eta 01m:50s]
> > >>
> > >> -RongQing
> >
> >
> > Doing more work cannot be faster than doing less work, so something else
> > must be happening here.
> >
>
> Dan Williams maybe know more.

One thought is that back-pressure from awaiting write-posted-queue
flush completion causes full buffer writes to coalesce at the
controller, i.e. write-combining effects from the flush-delay.

Patch

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 1d432c5ed..9f8f25880 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -197,6 +197,7 @@  static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 	unsigned long start;
 	struct bio_vec bvec;
 	struct bvec_iter iter;
+	unsigned int op = bio_op(bio);
 	struct pmem_device *pmem = q->queuedata;
 	struct nd_region *nd_region = to_region(pmem);
 
@@ -206,7 +207,7 @@  static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 	do_acct = nd_iostat_start(bio, &start);
 	bio_for_each_segment(bvec, bio, iter) {
 		rc = pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
-				bvec.bv_offset, bio_op(bio), iter.bi_sector);
+				bvec.bv_offset, op, iter.bi_sector);
 		if (rc) {
 			bio->bi_status = rc;
 			break;
@@ -215,7 +216,7 @@  static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 	if (do_acct)
 		nd_iostat_end(bio, start);
 
-	if (bio->bi_opf & REQ_FUA)
+	if (bio->bi_opf & REQ_FUA || op_is_write(op))
 		nvdimm_flush(nd_region);
 
 	bio_endio(bio);