[2/3] zram: support page-based parallel write

Message ID 20161005020153.GA2988@bbox (mailing list archive)
State New, archived

Commit Message

Minchan Kim Oct. 5, 2016, 2:01 a.m. UTC
Hi Sergey,

On Tue, Oct 04, 2016 at 01:43:14PM +0900, Sergey Senozhatsky wrote:

< snip >

> TEST
> ****
> 
> new tests results; same tests, same conditions, same .config.
> 4-way test:
> - BASE zram, fio direct=1
> - BASE zram, fio fsync_on_close=1
> - NEW zram, fio direct=1
> - NEW zram, fio fsync_on_close=1
> 
> 
> 
> and what I see is that:
>  - new zram is 3x slower when we do a lot of direct=1 IO
> and
>  - 10% faster when we use buffered IO (fsync_on_close); but not always;
>    for instance, test execution time is longer (a reproducible behavior)
>    when the number of jobs equals the number of CPUs (4).
> 
> 
> 
> if flushing is a problem for new zram during direct=1 test, then I would
> assume that writing a huge number of small files (creat/write 4k/close)
> would probably show the same fsync_on_close=1 performance as direct=1.
> 
> 
> ENV
> ===
> 
>    x86_64 SMP (4 CPUs), "bare zram" 3g, lzo, static compression buffer.
> 
> 
> TEST COMMAND
> ============
> 
>   ZRAM_SIZE=3G ZRAM_COMP_ALG=lzo LOG_SUFFIX={NEW, OLD} FIO_LOOPS=2 ./zram-fio-test.sh
> 
> 
> EXECUTED TESTS
> ==============
> 
>   - [seq-read]
>   - [rand-read]
>   - [seq-write]
>   - [rand-write]
>   - [mixed-seq]
>   - [mixed-rand]
> 
> 
> fio-perf-o-meter.sh test-fio-zram-OLD test-fio-zram-OLD-flush test-fio-zram-NEW test-fio-zram-NEW-flush
> Processing test-fio-zram-OLD
> Processing test-fio-zram-OLD-flush
> Processing test-fio-zram-NEW
> Processing test-fio-zram-NEW-flush
> 
>                 BASE             BASE              NEW             NEW
>                 direct=1         fsync_on_close=1  direct=1        fsync_on_close=1
> 
> #jobs1                         	                	                	                
> READ:           2345.1MB/s	 2177.2MB/s	 2373.2MB/s	 2185.8MB/s
> READ:           1948.2MB/s	 1417.7MB/s	 1987.7MB/s	 1447.4MB/s
> WRITE:          1292.7MB/s	 1406.1MB/s	 275277KB/s	 1521.1MB/s
> WRITE:          1047.5MB/s	 1143.8MB/s	 257140KB/s	 1202.4MB/s
> READ:           429530KB/s	 779523KB/s	 175450KB/s	 782237KB/s
> WRITE:          429840KB/s	 780084KB/s	 175576KB/s	 782800KB/s
> READ:           414074KB/s	 408214KB/s	 164091KB/s	 383426KB/s
> WRITE:          414402KB/s	 408539KB/s	 164221KB/s	 383730KB/s


I tested your benchmark for job 1 on my 4-CPU machine with this diff.

Nothing is different, except that I:

1. changed the ordering of test execution, hoping to reduce testing time by
   populating blocks before the first read (instead of reading just zero pages)
2. used fsync_on_close instead of direct IO
3. dropped perf to avoid noise
4. echo 0 > /sys/block/zram0/use_aio to test synchronous IO, i.e. the old
   behavior (see the setup sketch below)
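
For reference, a minimal standalone setup sketch (assuming the use_aio
knob added by this patchset defaults to async and, judging by the script
change at the bottom, has to be written before disksize is set):

  modprobe zram num_devices=1
  echo lzo > /sys/block/zram0/comp_algorithm
  # 0: synchronous IO (old behavior), 1: page-based parallel write
  echo 0 > /sys/block/zram0/use_aio
  echo 3G > /sys/block/zram0/disksize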


And got the following results:

1. ZRAM_SIZE=3G ZRAM_COMP_ALG=lzo LOG_SUFFIX=async FIO_LOOPS=2 MAX_ITER=1 ./zram-fio-test.sh
2. modify script to disable aio via /sys/block/zram0/use_aio
   ZRAM_SIZE=3G ZRAM_COMP_ALG=lzo LOG_SUFFIX=sync FIO_LOOPS=2 MAX_ITER=1 ./zram-fio-test.sh

                    no_aio    use_aio       gain
      seq-write     380930     474325     124.52%
     rand-write     286183     357469     124.91%
       seq-read     266813     265731      99.59%
      rand-read     211747     210670      99.49%
   mixed-seq(R)     145750     171232     117.48%
   mixed-seq(W)     145736     171215     117.48%
  mixed-rand(R)     115355     125239     108.57%
  mixed-rand(W)     115371     125256     108.57%

LZO compression is fast, so with one CPU queueing while 3 CPUs compress,
it cannot saturate the full CPU bandwidth. Nonetheless, it shows a 24%
enhancement. It could be more on a slow CPU, e.g. embedded.

I tested it with deflate. The result is a 300% enhancement.

                    no_aio    use_aio       gain
      seq-write      33598     109882     327.05%
     rand-write      32815     102293     311.73%
       seq-read     154323     153765      99.64%
      rand-read     129978     129241      99.43%
   mixed-seq(R)      15887      44995     283.22%
   mixed-seq(W)      15885      44990     283.22%
  mixed-rand(R)      25074      55491     221.31%
  mixed-rand(W)      25078      55499     221.31%
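
(A back-of-envelope model consistent with both results, assuming each
page costs q to queue and c to compress: the synchronous path is bounded
by 1/(q + c) pages/sec on the submitting CPU, while the async path
pipelines to roughly 1/max(q, c/3) with 3 compressing CPUs. For LZO, c
is small, so q dominates and the gain is modest; for deflate, c >> q, so
the gain approaches 3x, which matches the numbers above.)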

So, I'm curious about your test.
Is my test in sync with yours? If you cannot see an enhancement with
job 1, could you test with deflate? It seems your CPU is really fast.
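
Something like this should do it (LOG_SUFFIX is arbitrary; the script
simply writes ZRAM_COMP_ALG to /sys/block/zram0/comp_algorithm, so any
algorithm the kernel advertises there should work):

  ZRAM_SIZE=3G ZRAM_COMP_ALG=deflate LOG_SUFFIX=deflate FIO_LOOPS=2 MAX_ITER=1 ./zram-fio-test.sh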

Comments

Sergey Senozhatsky Oct. 6, 2016, 8:29 a.m. UTC | #1
Hello Minchan,

On (10/05/16 11:01), Minchan Kim wrote:
[..]
> 1. changed the ordering of test execution, hoping to reduce testing time by
>    populating blocks before the first read (instead of reading just zero pages)
> 2. used fsync_on_close instead of direct IO
> 3. dropped perf to avoid noise
> 4. echo 0 > /sys/block/zram0/use_aio to test synchronous IO, i.e. the old
>    behavior

ok, will use it in the tests below.

> 1. ZRAM_SIZE=3G ZRAM_COMP_ALG=lzo LOG_SUFFIX=async FIO_LOOPS=2 MAX_ITER=1 ./zram-fio-test.sh
> 2. modify script to disable aio via /sys/block/zram0/use_aio
>    ZRAM_SIZE=3G ZRAM_COMP_ALG=lzo LOG_SUFFIX=sync FIO_LOOPS=2 MAX_ITER=1 ./zram-fio-test.sh
>
>       seq-write     380930     474325     124.52%
>      rand-write     286183     357469     124.91%
>        seq-read     266813     265731      99.59%
>       rand-read     211747     210670      99.49%
>    mixed-seq(R)     145750     171232     117.48%
>    mixed-seq(W)     145736     171215     117.48%
>   mixed-rand(R)     115355     125239     108.57%
>   mixed-rand(W)     115371     125256     108.57%

                no_aio           use_aio

WRITE:          1432.9MB/s	 1511.5MB/s
WRITE:          1173.9MB/s	 1186.9MB/s
READ:           912699KB/s	 912170KB/s
WRITE:          912497KB/s	 911968KB/s
READ:           725658KB/s	 726747KB/s
READ:           579003KB/s	 594543KB/s
READ:           373276KB/s	 373719KB/s
WRITE:          373572KB/s	 374016KB/s

seconds elapsed        45.399702511	44.280199716

> LZO compression is fast, so with one CPU queueing while 3 CPUs compress,
> it cannot saturate the full CPU bandwidth. Nonetheless, it shows a 24%
> enhancement. It could be more on a slow CPU, e.g. embedded.
>
> I tested it with deflate. The result is a 300% enhancement.
> 
>       seq-write      33598     109882     327.05%
>      rand-write      32815     102293     311.73%
>        seq-read     154323     153765      99.64%
>       rand-read     129978     129241      99.43%
>    mixed-seq(R)      15887      44995     283.22%
>    mixed-seq(W)      15885      44990     283.22%
>   mixed-rand(R)      25074      55491     221.31%
>   mixed-rand(W)      25078      55499     221.31%
>
> So, I'm curious about your test.
> Is my test in sync with yours? If you cannot see an enhancement with
> job 1, could you test with deflate? It seems your CPU is really fast.

interesting observation.

                no_aio           use_aio
WRITE:          47882KB/s	 158931KB/s
WRITE:          47714KB/s	 156484KB/s
READ:           42914KB/s	 137997KB/s
WRITE:          42904KB/s	 137967KB/s
READ:           333764KB/s	 332828KB/s
READ:           293883KB/s	 294709KB/s
READ:           51243KB/s	 129701KB/s
WRITE:          51284KB/s	 129804KB/s

seconds elapsed        480.869169882	181.678431855

yes, it looks like with lzo the CPU manages to process bdi writeback
fast enough to keep the fio-template-static-buffer workers active.

to prove this theory: direct=1 cures zram-deflate.

                no_aio           use_aio
WRITE:          41873KB/s	 34257KB/s
WRITE:          41455KB/s	 34087KB/s
READ:           36705KB/s	 28960KB/s
WRITE:          36697KB/s	 28954KB/s
READ:           327902KB/s	 327270KB/s
READ:           316217KB/s	 316886KB/s
READ:           35980KB/s	 28131KB/s
WRITE:          36008KB/s	 28153KB/s

seconds elapsed        515.575252170	629.114626795
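
getting direct=1 back with the patched fio template is a one-liner,
something like:

  sed -i 's/^fsync_on_close=1/direct=1/' conf/fio-template-static-buffer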



as soon as the wb flush kworker can't keep up anymore, things go off
the rails. most of the time, the fio-template-static-buffer workers are
in D state, while the biggest bdi flush kworker is doing the job (a lot
of job):

  PID USER      PR  NI    VIRT    RES  %CPU %MEM     TIME+ S COMMAND
 6274 root      20   0    0.0m   0.0m 100.0  0.0   1:15.60 R [kworker/u8:1]
11169 root      20   0  718.1m   1.6m  16.6  0.0   0:01.88 D fio ././conf/fio-template-static-buffer
11171 root      20   0  718.1m   1.6m   3.3  0.0   0:01.15 D fio ././conf/fio-template-static-buffer
11170 root      20   0  718.1m   3.3m   2.6  0.1   0:00.98 D fio ././conf/fio-template-static-buffer


and still working...

 6274 root      20   0    0.0m   0.0m 100.0  0.0   3:05.49 R [kworker/u8:1]
12048 root      20   0  718.1m   1.6m  16.7  0.0   0:01.80 R fio ././conf/fio-template-static-buffer
12047 root      20   0  718.1m   1.6m   3.3  0.0   0:01.12 D fio ././conf/fio-template-static-buffer
12049 root      20   0  718.1m   1.6m   3.3  0.0   0:01.12 D fio ././conf/fio-template-static-buffer
12050 root      20   0  718.1m   1.6m   2.0  0.0   0:00.98 D fio ././conf/fio-template-static-buffer

and working...


[ 4159.338731] CPU: 0 PID: 105 Comm: kworker/u8:4
[ 4159.338734] Workqueue: writeback wb_workfn (flush-254:0)
[ 4159.338746]  [<ffffffffa01d8cff>] zram_make_request+0x4a3/0x67b [zram]
[ 4159.338748]  [<ffffffff810543fe>] ? try_to_wake_up+0x201/0x213
[ 4159.338750]  [<ffffffff810ae9d3>] ? mempool_alloc+0x5e/0x124
[ 4159.338752]  [<ffffffff811a9922>] generic_make_request+0xb8/0x156
[ 4159.338753]  [<ffffffff811a9aaf>] submit_bio+0xef/0xf8
[ 4159.338755]  [<ffffffff81121a97>] submit_bh_wbc.isra.10+0x16b/0x178
[ 4159.338757]  [<ffffffff811223ec>] __block_write_full_page+0x1b2/0x2a6
[ 4159.338758]  [<ffffffff8112403e>] ? bh_submit_read+0x5a/0x5a
[ 4159.338760]  [<ffffffff81120f9a>] ? end_buffer_write_sync+0x36/0x36
[ 4159.338761]  [<ffffffff8112403e>] ? bh_submit_read+0x5a/0x5a
[ 4159.338763]  [<ffffffff811226d8>] block_write_full_page+0xf6/0xff
[ 4159.338765]  [<ffffffff81124342>] blkdev_writepage+0x13/0x15
[ 4159.338767]  [<ffffffff810b498c>] __writepage+0xe/0x26
[ 4159.338768]  [<ffffffff810b65aa>] write_cache_pages+0x28c/0x376
[ 4159.338770]  [<ffffffff810b497e>] ? __wb_calc_thresh+0x83/0x83
[ 4159.338772]  [<ffffffff810b66dc>] generic_writepages+0x48/0x67
[ 4159.338773]  [<ffffffff81124318>] blkdev_writepages+0x9/0xb
[ 4159.338775]  [<ffffffff81124318>] ? blkdev_writepages+0x9/0xb
[ 4159.338776]  [<ffffffff810b6716>] do_writepages+0x1b/0x24
[ 4159.338778]  [<ffffffff8111b12c>] __writeback_single_inode+0x3d/0x155
[ 4159.338779]  [<ffffffff8111b407>] writeback_sb_inodes+0x1c3/0x32c
[ 4159.338781]  [<ffffffff8111b5e1>] __writeback_inodes_wb+0x71/0xa9
[ 4159.338783]  [<ffffffff8111b7ce>] wb_writeback+0x10f/0x1a1
[ 4159.338785]  [<ffffffff8111be32>] wb_workfn+0x1c9/0x24c
[ 4159.338786]  [<ffffffff8111be32>] ? wb_workfn+0x1c9/0x24c
[ 4159.338788]  [<ffffffff8104a2e2>] process_one_work+0x1a4/0x2a7
[ 4159.338790]  [<ffffffff8104ae32>] worker_thread+0x23b/0x37c
[ 4159.338792]  [<ffffffff8104abf7>] ? rescuer_thread+0x2eb/0x2eb
[ 4159.338793]  [<ffffffff8104f285>] kthread+0xce/0xd6
[ 4159.338794]  [<ffffffff8104f1b7>] ? kthread_create_on_node+0x1ad/0x1ad
[ 4159.338796]  [<ffffffff8145ad12>] ret_from_fork+0x22/0x30


so the question is -- can we move this parallelization out of zram
and instead flush bdi in more than one kthread? how bad would that
be? can anyone else benefit from this?



[1] https://lwn.net/Articles/353844/
[2] https://lwn.net/Articles/354852/

	-ss
Minchan Kim Oct. 7, 2016, 6:33 a.m. UTC | #2
Hi Sergey,

On Thu, Oct 06, 2016 at 05:29:15PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> 
> On (10/05/16 11:01), Minchan Kim wrote:
> [..]
> > 1. changed the ordering of test execution, hoping to reduce testing time by
> >    populating blocks before the first read (instead of reading just zero pages)
> > 2. used fsync_on_close instead of direct IO
> > 3. dropped perf to avoid noise
> > 4. echo 0 > /sys/block/zram0/use_aio to test synchronous IO, i.e. the old
> >    behavior
> 
> ok, will use it in the tests below.
> 
> > 1. ZRAM_SIZE=3G ZRAM_COMP_ALG=lzo LOG_SUFFIX=async FIO_LOOPS=2 MAX_ITER=1 ./zram-fio-test.sh
> > 2. modify script to disable aio via /sys/block/zram0/use_aio
> >    ZRAM_SIZE=3G ZRAM_COMP_ALG=lzo LOG_SUFFIX=sync FIO_LOOPS=2 MAX_ITER=1 ./zram-fio-test.sh
> >
> >       seq-write     380930     474325     124.52%
> >      rand-write     286183     357469     124.91%
> >        seq-read     266813     265731      99.59%
> >       rand-read     211747     210670      99.49%
> >    mixed-seq(R)     145750     171232     117.48%
> >    mixed-seq(W)     145736     171215     117.48%
> >   mixed-rand(R)     115355     125239     108.57%
> >   mixed-rand(W)     115371     125256     108.57%
> 
>                 no_aio           use_aio
> 
> WRITE:          1432.9MB/s	 1511.5MB/s
> WRITE:          1173.9MB/s	 1186.9MB/s
> READ:           912699KB/s	 912170KB/s
> WRITE:          912497KB/s	 911968KB/s
> READ:           725658KB/s	 726747KB/s
> READ:           579003KB/s	 594543KB/s
> READ:           373276KB/s	 373719KB/s
> WRITE:          373572KB/s	 374016KB/s
> 
> seconds elapsed        45.399702511	44.280199716
> 
> > LZO compression is fast, so with one CPU queueing while 3 CPUs compress,
> > it cannot saturate the full CPU bandwidth. Nonetheless, it shows a 24%
> > enhancement. It could be more on a slow CPU, e.g. embedded.
> >
> > I tested it with deflate. The result is a 300% enhancement.
> > 
> >       seq-write      33598     109882     327.05%
> >      rand-write      32815     102293     311.73%
> >        seq-read     154323     153765      99.64%
> >       rand-read     129978     129241      99.43%
> >    mixed-seq(R)      15887      44995     283.22%
> >    mixed-seq(W)      15885      44990     283.22%
> >   mixed-rand(R)      25074      55491     221.31%
> >   mixed-rand(W)      25078      55499     221.31%
> >
> > So, I'm curious about your test.
> > Is my test in sync with yours? If you cannot see an enhancement with
> > job 1, could you test with deflate? It seems your CPU is really fast.
> 
> interesting observation.
> 
>                 no_aio           use_aio
> WRITE:          47882KB/s	 158931KB/s
> WRITE:          47714KB/s	 156484KB/s
> READ:           42914KB/s	 137997KB/s
> WRITE:          42904KB/s	 137967KB/s
> READ:           333764KB/s	 332828KB/s
> READ:           293883KB/s	 294709KB/s
> READ:           51243KB/s	 129701KB/s
> WRITE:          51284KB/s	 129804KB/s
> 
> seconds elapsed        480.869169882	181.678431855
> 
> yes, it looks like with lzo the CPU manages to process bdi writeback
> fast enough to keep the fio-template-static-buffer workers active.
> 
> to prove this theory: direct=1 cures zram-deflate.
> 
>                 no_aio           use_aio
> WRITE:          41873KB/s	 34257KB/s
> WRITE:          41455KB/s	 34087KB/s
> READ:           36705KB/s	 28960KB/s
> WRITE:          36697KB/s	 28954KB/s
> READ:           327902KB/s	 327270KB/s
> READ:           316217KB/s	 316886KB/s
> READ:           35980KB/s	 28131KB/s
> WRITE:          36008KB/s	 28153KB/s
> 
> seconds elapsed        515.575252170	629.114626795
> 
> 
> 
> as soon as the wb flush kworker can't keep up anymore, things go off
> the rails. most of the time, the fio-template-static-buffer workers are
> in D state, while the biggest bdi flush kworker is doing the job (a lot
> of job):
> 
>   PID USER      PR  NI    VIRT    RES  %CPU %MEM     TIME+ S COMMAND
>  6274 root      20   0    0.0m   0.0m 100.0  0.0   1:15.60 R [kworker/u8:1]
> 11169 root      20   0  718.1m   1.6m  16.6  0.0   0:01.88 D fio ././conf/fio-template-static-buffer
> 11171 root      20   0  718.1m   1.6m   3.3  0.0   0:01.15 D fio ././conf/fio-template-static-buffer
> 11170 root      20   0  718.1m   3.3m   2.6  0.1   0:00.98 D fio ././conf/fio-template-static-buffer
> 
> 
> and still working...
> 
>  6274 root      20   0    0.0m   0.0m 100.0  0.0   3:05.49 R [kworker/u8:1]
> 12048 root      20   0  718.1m   1.6m  16.7  0.0   0:01.80 R fio ././conf/fio-template-static-buffer
> 12047 root      20   0  718.1m   1.6m   3.3  0.0   0:01.12 D fio ././conf/fio-template-static-buffer
> 12049 root      20   0  718.1m   1.6m   3.3  0.0   0:01.12 D fio ././conf/fio-template-static-buffer
> 12050 root      20   0  718.1m   1.6m   2.0  0.0   0:00.98 D fio ././conf/fio-template-static-buffer
> 
> and working...
> 
> 
> [ 4159.338731] CPU: 0 PID: 105 Comm: kworker/u8:4
> [ 4159.338734] Workqueue: writeback wb_workfn (flush-254:0)
> [ 4159.338746]  [<ffffffffa01d8cff>] zram_make_request+0x4a3/0x67b [zram]
> [ 4159.338748]  [<ffffffff810543fe>] ? try_to_wake_up+0x201/0x213
> [ 4159.338750]  [<ffffffff810ae9d3>] ? mempool_alloc+0x5e/0x124
> [ 4159.338752]  [<ffffffff811a9922>] generic_make_request+0xb8/0x156
> [ 4159.338753]  [<ffffffff811a9aaf>] submit_bio+0xef/0xf8
> [ 4159.338755]  [<ffffffff81121a97>] submit_bh_wbc.isra.10+0x16b/0x178
> [ 4159.338757]  [<ffffffff811223ec>] __block_write_full_page+0x1b2/0x2a6
> [ 4159.338758]  [<ffffffff8112403e>] ? bh_submit_read+0x5a/0x5a
> [ 4159.338760]  [<ffffffff81120f9a>] ? end_buffer_write_sync+0x36/0x36
> [ 4159.338761]  [<ffffffff8112403e>] ? bh_submit_read+0x5a/0x5a
> [ 4159.338763]  [<ffffffff811226d8>] block_write_full_page+0xf6/0xff
> [ 4159.338765]  [<ffffffff81124342>] blkdev_writepage+0x13/0x15
> [ 4159.338767]  [<ffffffff810b498c>] __writepage+0xe/0x26
> [ 4159.338768]  [<ffffffff810b65aa>] write_cache_pages+0x28c/0x376
> [ 4159.338770]  [<ffffffff810b497e>] ? __wb_calc_thresh+0x83/0x83
> [ 4159.338772]  [<ffffffff810b66dc>] generic_writepages+0x48/0x67
> [ 4159.338773]  [<ffffffff81124318>] blkdev_writepages+0x9/0xb
> [ 4159.338775]  [<ffffffff81124318>] ? blkdev_writepages+0x9/0xb
> [ 4159.338776]  [<ffffffff810b6716>] do_writepages+0x1b/0x24
> [ 4159.338778]  [<ffffffff8111b12c>] __writeback_single_inode+0x3d/0x155
> [ 4159.338779]  [<ffffffff8111b407>] writeback_sb_inodes+0x1c3/0x32c
> [ 4159.338781]  [<ffffffff8111b5e1>] __writeback_inodes_wb+0x71/0xa9
> [ 4159.338783]  [<ffffffff8111b7ce>] wb_writeback+0x10f/0x1a1
> [ 4159.338785]  [<ffffffff8111be32>] wb_workfn+0x1c9/0x24c
> [ 4159.338786]  [<ffffffff8111be32>] ? wb_workfn+0x1c9/0x24c
> [ 4159.338788]  [<ffffffff8104a2e2>] process_one_work+0x1a4/0x2a7
> [ 4159.338790]  [<ffffffff8104ae32>] worker_thread+0x23b/0x37c
> [ 4159.338792]  [<ffffffff8104abf7>] ? rescuer_thread+0x2eb/0x2eb
> [ 4159.338793]  [<ffffffff8104f285>] kthread+0xce/0xd6
> [ 4159.338794]  [<ffffffff8104f1b7>] ? kthread_create_on_node+0x1ad/0x1ad
> [ 4159.338796]  [<ffffffff8145ad12>] ret_from_fork+0x22/0x30
> 
> 
> so the question is -- can we move this parallelization out of zram
> and instead flush bdi in more than one kthread? how bad would that
> be? can anyone else benefit from this?

Isn't blk-mq what you mentioned? With blk-mq, I have some concerns:

1. read speed degradation
2. it does not work with rw_page
3. a bigger memory footprint due to bio/request queue allocation

Having said that, it's worth looking into in more detail.
I will make time to study that approach and see what I can do
with it.

Thanks!

> 
> [1] https://lwn.net/Articles/353844/
> [2] https://lwn.net/Articles/354852/
> 
> 	-ss
Sergey Senozhatsky Oct. 7, 2016, 6:08 p.m. UTC | #3
Hello Minchan,

On (10/07/16 15:33), Minchan Kim wrote:
[..]
> > as soon as the wb flush kworker can't keep up anymore, things go off
> > the rails. most of the time, the fio-template-static-buffer workers are
> > in D state, while the biggest bdi flush kworker is doing the job (a lot
> > of job):
> > 
> >   PID USER      PR  NI    VIRT    RES  %CPU %MEM     TIME+ S COMMAND
> >  6274 root      20   0    0.0m   0.0m 100.0  0.0   1:15.60 R [kworker/u8:1]
> > 11169 root      20   0  718.1m   1.6m  16.6  0.0   0:01.88 D fio ././conf/fio-template-static-buffer
> > 11171 root      20   0  718.1m   1.6m   3.3  0.0   0:01.15 D fio ././conf/fio-template-static-buffer
> > 11170 root      20   0  718.1m   3.3m   2.6  0.1   0:00.98 D fio ././conf/fio-template-static-buffer
> > 
> > 
> > and still working...
> > 
> >  6274 root      20   0    0.0m   0.0m 100.0  0.0   3:05.49 R [kworker/u8:1]
> > 12048 root      20   0  718.1m   1.6m  16.7  0.0   0:01.80 R fio ././conf/fio-template-static-buffer
> > 12047 root      20   0  718.1m   1.6m   3.3  0.0   0:01.12 D fio ././conf/fio-template-static-buffer
> > 12049 root      20   0  718.1m   1.6m   3.3  0.0   0:01.12 D fio ././conf/fio-template-static-buffer
> > 12050 root      20   0  718.1m   1.6m   2.0  0.0   0:00.98 D fio ././conf/fio-template-static-buffer
> > 
> > and working...
[..]
> Isn't blk-mq what you mentioned? With blk-mq, I have some concerns:
>
> 1. read speed degradation
> 2. it does not work with rw_page
> 3. a bigger memory footprint due to bio/request queue allocation

yes, I did. and I've seen your concerns in another email - I
just don't have enough knowledge at the moment to say something
not entirely stupid. gotta look more at the whole thing.

> Having said that, it's worth looking into in more detail.
> I will make time to study that approach and see what I can do
> with it.

thanks a lot!
will keep looking as well.

	-ss
Minchan Kim Oct. 17, 2016, 5:04 a.m. UTC | #4
Hi Sergey,

On Fri, Oct 07, 2016 at 03:33:22PM +0900, Minchan Kim wrote:

< snip >

> > so the question is -- can we move this parallelization out of zram
> > and instead flush bdi in more than one kthread? how bad would that
> > be? can anyone else benefit from this?
> 
> Isn't blk-mq what you mentioned? With blk-mq, I have some concerns:
>
> 1. read speed degradation
> 2. it does not work with rw_page
> 3. a bigger memory footprint due to bio/request queue allocation
>
> Having said that, it's worth looking into in more detail.
> I will make time to study that approach and see what I can do
> with it.

queue_mode=2 bs=4096 nr_devices=1 submit_queues=4 hw_queue_depth=128
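
(loading null_blk with those parameters would be something like

  modprobe null_blk queue_mode=2 bs=4096 nr_devices=1 submit_queues=4 hw_queue_depth=128

where queue_mode=2 selects the multi-queue, i.e. blk-mq, path)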

Last week, I played with null_blk and blk-mq.c to get an idea of how
blk-mq works, and I realized it's not good for zram, because blk-mq aims
to solve 1) the dispatch queue bottleneck and 2) cache-friendly IO
completion through IRQ, so that it 3) avoids remote memory accesses.

For zram, whose primary use is embedded, the items listed above are not
a severe problem. The most important thing is that there is no model to
support a process queueing IO requests on *a* CPU while other CPUs issue
the queued IO to the driver.

Anyway, although blk-mq can support that model, it is a blk-layer thing.
IOW, it's software stuff for fast IO delivery, but what we need is
device parallelism of zram itself. So, even if we follow blk-mq, we
still need multiple threads to compress in parallel, which is most of
the code I wrote in this patchset.

If I cannot get a huge benefit (e.g., a big reduction of the
zram-specific code needed to support such a model) from blk-mq, I don't
feel like switching to the request model at the cost of the reasons I
stated above.

Thanks.

Sergey Senozhatsky Oct. 21, 2016, 6:08 a.m. UTC | #5
Hello Minchan,

On (10/17/16 14:04), Minchan Kim wrote:
> Hi Sergey,
> 
> On Fri, Oct 07, 2016 at 03:33:22PM +0900, Minchan Kim wrote:
> 
> < snip >
> 
> > > so the question is -- can we move this parallelization out of zram
> > > and instead flush bdi in more than one kthread? how bad would that
> > > be? can anyone else benefit from this?
> > 
> > Isn't blk-mq what you mentioned? With blk-mq, I have some concerns:
> >
> > 1. read speed degradation
> > 2. it does not work with rw_page
> > 3. a bigger memory footprint due to bio/request queue allocation
> >
> > Having said that, it's worth looking into in more detail.
> > I will make time to study that approach and see what I can do
> > with it.
> 
> queue_mode=2 bs=4096 nr_devices=1 submit_queues=4 hw_queue_depth=128
> 
> Last week, I played with null_blk and blk-mq.c to get an idea of how
> blk-mq works, and I realized it's not good for zram, because blk-mq aims
> to solve 1) the dispatch queue bottleneck and 2) cache-friendly IO
> completion through IRQ, so that it 3) avoids remote memory accesses.
>
> For zram, whose primary use is embedded, the items listed above are not
> a severe problem. The most important thing is that there is no model to
> support a process queueing IO requests on *a* CPU while other CPUs issue
> the queued IO to the driver.
>
> Anyway, although blk-mq can support that model, it is a blk-layer thing.
> IOW, it's software stuff for fast IO delivery, but what we need is
> device parallelism of zram itself. So, even if we follow blk-mq, we
> still need multiple threads to compress in parallel, which is most of
> the code I wrote in this patchset.

yes. but at least wb can be multi-threaded. well, sort of. seems like.
sometimes.

> If I cannot get a huge benefit (e.g., a big reduction of the
> zram-specific code needed to support such a model) from blk-mq, I don't
> feel like switching to the request model at the cost of the reasons I
> stated above.

thanks.
I'm looking at your patches.

	-ss
Minchan Kim Oct. 24, 2016, 4:51 a.m. UTC | #6
On Fri, Oct 21, 2016 at 03:08:09PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> 
> On (10/17/16 14:04), Minchan Kim wrote:
> > Hi Sergey,
> > 
> > On Fri, Oct 07, 2016 at 03:33:22PM +0900, Minchan Kim wrote:
> > 
> > < snip >
> > 
> > > > so the question is -- can we move this parallelization out of zram
> > > > and instead flush bdi in more than one kthread? how bad would that
> > > > be? can anyone else benefit from this?
> > > 
> > > Isn't blk-mq what you mentioned? With blk-mq, I have some concerns:
> > >
> > > 1. read speed degradation
> > > 2. it does not work with rw_page
> > > 3. a bigger memory footprint due to bio/request queue allocation
> > >
> > > Having said that, it's worth looking into in more detail.
> > > I will make time to study that approach and see what I can do
> > > with it.
> > 
> > queue_mode=2 bs=4096 nr_devices=1 submit_queues=4 hw_queue_depth=128
> > 
> > Last week, I played with null_blk and blk-mq.c to get an idea of how
> > blk-mq works, and I realized it's not good for zram, because blk-mq aims
> > to solve 1) the dispatch queue bottleneck and 2) cache-friendly IO
> > completion through IRQ, so that it 3) avoids remote memory accesses.
> >
> > For zram, whose primary use is embedded, the items listed above are not
> > a severe problem. The most important thing is that there is no model to
> > support a process queueing IO requests on *a* CPU while other CPUs issue
> > the queued IO to the driver.
> >
> > Anyway, although blk-mq can support that model, it is a blk-layer thing.
> > IOW, it's software stuff for fast IO delivery, but what we need is
> > device parallelism of zram itself. So, even if we follow blk-mq, we
> > still need multiple threads to compress in parallel, which is most of
> > the code I wrote in this patchset.
> 
> yes. but at least wb can be multi-threaded. well, sort of. seems like.
> sometimes.

Maybe, but it would be a rather greedy approach for zram, because zram
would do real IO (especially compression, which consumes a lot of time)
in that context, although the context is a shared resource of all
processes in the system.

> 
> > If I cannot get a huge benefit (e.g., a big reduction of the
> > zram-specific code needed to support such a model) from blk-mq, I don't
> > feel like switching to the request model at the cost of the reasons I
> > stated above.
> 
> thanks.
> I'm looking at your patches.

I found a subtle bug in my patchset, so I will resend it after hunting
that down, along with a fix for the bug you found.

Thanks, Sergey!



Patch

diff --git a/conf/fio-template-static-buffer b/conf/fio-template-static-buffer
index 1a9a473..22ddee8 100644
--- a/conf/fio-template-static-buffer
+++ b/conf/fio-template-static-buffer
@@ -1,7 +1,7 @@ 
 [global]
 bs=${BLOCK_SIZE}k
 ioengine=sync
-direct=1
+fsync_on_close=1
 nrfiles=${NRFILES}
 size=${SIZE}
 numjobs=${NUMJOBS}
@@ -14,18 +14,18 @@  new_group
 group_reporting
 threads=1
 
-[seq-read]
-rw=read
-
-[rand-read]
-rw=randread
-
 [seq-write]
 rw=write
 
 [rand-write]
 rw=randwrite
 
+[seq-read]
+rw=read
+
+[rand-read]
+rw=randread
+
 [mixed-seq]
 rw=rw
 
diff --git a/zram-fio-test.sh b/zram-fio-test.sh
index 39c11b3..ca2d065 100755
--- a/zram-fio-test.sh
+++ b/zram-fio-test.sh
@@ -1,4 +1,4 @@ 
-#!/bin/sh
+#!/bin/bash
 
 
 # Sergey Senozhatsky. sergey.senozhatsky@gmail.com
@@ -37,6 +37,7 @@  function create_zram
 	echo $ZRAM_COMP_ALG > /sys/block/zram0/comp_algorithm
 	cat /sys/block/zram0/comp_algorithm
 
+	echo 0 > /sys/block/zram0/use_aio
 	echo $ZRAM_SIZE > /sys/block/zram0/disksize
 	if [ $? != 0 ]; then
 		return -1
@@ -137,7 +138,7 @@  function main
 		echo "#jobs$i fio" >> $LOG
 
 		BLOCK_SIZE=4 SIZE=100% NUMJOBS=$i NRFILES=$i FIO_LOOPS=$FIO_LOOPS \
-			$PERF stat -o $LOG-perf-stat $FIO ./$FIO_TEMPLATE >> $LOG
+			$FIO ./$FIO_TEMPLATE > $LOG
 
 		echo -n "perfstat jobs$i" >> $LOG
 		cat $LOG-perf-stat >> $LOG