
[1/3] zram: allow user to set QUEUE_FLAG_NOWAIT

Message ID 20230512082958.6550-2-kch@nvidia.com (mailing list archive)
State New, archived
Series zram: queue flag nowait and minor cleanup

Commit Message

Chaitanya Kulkarni May 12, 2023, 8:29 a.m. UTC
Allow the user to optionally set QUEUE_FLAG_NOWAIT using a module
parameter, retaining the default behaviour otherwise. Also, update the
respective allocation flags in the write path. Following are the
performance numbers with the io_uring fio engine for random read; note
that the device has been fully populated with a randwrite workload
before taking these numbers :-

* linux-block (for-next) # grep IOPS  zram*fio | column -t 

default-nowait-off-1.fio:  read:  IOPS=802k,  BW=3133MiB/s
default-nowait-off-2.fio:  read:  IOPS=796k,  BW=3111MiB/s
default-nowait-off-3.fio:  read:  IOPS=796k,  BW=3108MiB/s

nowait-on-1.fio:           read:  IOPS=857k,  BW=3346MiB/s
nowait-on-2.fio:           read:  IOPS=857k,  BW=3347MiB/s
nowait-on-3.fio:           read:  IOPS=858k,  BW=3353MiB/s

* linux-block (for-next) # grep cpu  zram*fio | column -t 
default-nowait-off-1.fio:  cpu  :  usr=5.82%,  sys=13.54%,  ctx=36301915
default-nowait-off-2.fio:  cpu  :  usr=5.84%,  sys=13.03%,  ctx=37781937
default-nowait-off-3.fio:  cpu  :  usr=5.84%,  sys=12.90%,  ctx=37492533

nowait-on-1.fio:           cpu  :  usr=1.74%,  sys=97.62%,  ctx=24068,  
nowait-on-2.fio:           cpu  :  usr=1.74%,  sys=97.57%,  ctx=24674,  
nowait-on-3.fio:           cpu  :  usr=1.76%,  sys=97.59%,  ctx=24725,  

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/block/zram/zram_drv.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
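
The fio job file itself is not part of this posting; a run along the
following lines (device size, block size, queue depth and runtime are
assumptions, not values taken from the posting) exercises the same kind
of io_uring random-read test after pre-filling the device with randwrite,
with the module parameter added by this patch enabled at load time:

  modprobe zram nowait=1
  echo 4G > /sys/block/zram0/disksize
  fio --name=prefill --filename=/dev/zram0 --ioengine=io_uring --rw=randwrite \
      --bs=4k --direct=1 --size=100%
  fio --name=randread --filename=/dev/zram0 --ioengine=io_uring --rw=randread \
      --bs=4k --direct=1 --iodepth=32 --numjobs=8 --time_based --runtime=60 \
      --group_reporting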

Comments

Christoph Hellwig May 12, 2023, 2:31 p.m. UTC | #1
On Fri, May 12, 2023 at 01:29:56AM -0700, Chaitanya Kulkarni wrote:
> Allow user to set the QUEUE_FLAG_NOWAIT optionally using module
> parameter to retain the default behaviour. Also, update respective
> allocation flags in the write path. Following are the performance
> numbers with io_uring fio engine for random read, note that device has
> been populated fully with randwrite workload before taking these
> numbers :-

Why would you add a module option, except to make everyone's life hell?
Jens Axboe May 12, 2023, 2:34 p.m. UTC | #2
On 5/12/23 8:31 AM, Christoph Hellwig wrote:
> On Fri, May 12, 2023 at 01:29:56AM -0700, Chaitanya Kulkarni wrote:
>> Allow user to set the QUEUE_FLAG_NOWAIT optionally using module
>> parameter to retain the default behaviour. Also, update respective
>> allocation flags in the write path. Following are the performance
>> numbers with io_uring fio engine for random read, note that device has
>> been populated fully with randwrite workload before taking these
>> numbers :-
> 
> Why would you add a module option, except to make everyones life hell?

Yeah that makes no sense. Either the driver is nowait compatible and
it should just set the flag, or it's not.
Chaitanya Kulkarni May 13, 2023, 1:05 a.m. UTC | #3
On 5/12/23 07:31, Christoph Hellwig wrote:
> On Fri, May 12, 2023 at 01:29:56AM -0700, Chaitanya Kulkarni wrote:
>> Allow user to set the QUEUE_FLAG_NOWAIT optionally using module
>> parameter to retain the default behaviour. Also, update respective
>> allocation flags in the write path. Following are the performance
>> numbers with io_uring fio engine for random read, note that device has
>> been populated fully with randwrite workload before taking these
>> numbers :-
> Why would you add a module option, except to make everyones life hell?

Will send v2 without the modparam.

-ck
Chaitanya Kulkarni May 13, 2023, 1:06 a.m. UTC | #4
On 5/12/23 07:34, Jens Axboe wrote:
> On 5/12/23 8:31 AM, Christoph Hellwig wrote:
>> On Fri, May 12, 2023 at 01:29:56AM -0700, Chaitanya Kulkarni wrote:
>>> Allow user to set the QUEUE_FLAG_NOWAIT optionally using module
>>> parameter to retain the default behaviour. Also, update respective
>>> allocation flags in the write path. Following are the performance
>>> numbers with io_uring fio engine for random read, note that device has
>>> been populated fully with randwrite workload before taking these
>>> numbers :-
>> Why would you add a module option, except to make everyones life hell?
> Yeah that makes no sense. Either the driver is nowait compatible and
> it should just set the flag, or it's not.
>

Will send v2 without the modparam.

-ck
Chaitanya Kulkarni May 16, 2023, 5:51 a.m. UTC | #5
Hi Jens/Christoph,

On 5/12/23 18:06, Chaitanya Kulkarni wrote:
> On 5/12/23 07:34, Jens Axboe wrote:
>> On 5/12/23 8:31 AM, Christoph Hellwig wrote:
>>> On Fri, May 12, 2023 at 01:29:56AM -0700, Chaitanya Kulkarni wrote:
>>>> Allow user to set the QUEUE_FLAG_NOWAIT optionally using module
>>>> parameter to retain the default behaviour. Also, update respective
>>>> allocation flags in the write path. Following are the performance
>>>> numbers with io_uring fio engine for random read, note that device has
>>>> been populated fully with randwrite workload before taking these
>>>> numbers :-
>>> Why would you add a module option, except to make everyones life hell?
>> Yeah that makes no sense. Either the driver is nowait compatible and
>> it should just set the flag, or it's not.
>>
> send v2 without modparam.
>
> -ck

Removed the modparam, v2 is ready to send, but I've a few concerns about
enabling nowait unconditionally for zram :-

From the brd data [1] and zram data [2] from my setup :-

     | IOPs  (old->new)      | sys cpu% (old->new)
-----|-----------------------|--------------------
brd  | 1.5x  (3919 -> 5874)  | 3x (29 -> 87)
zram | 1.09x ( 841 ->  916)  | 9x (11 -> 97)

brd:-
IOPs increased by               ~1.5  times (50% up)
sys CPU percentage increased by ~3.0  times (200% up)

zram:-
IOPs increased by               ~1.09 times (  9% up)
sys CPU percentage increased by ~8.81 times (781% up)

This comparison clearly demonstrates that zram experiences a much more
substantial CPU load relative to the increase in IOPs compared to brd.
Such a significant difference might suggest a potential CPU regression
in zram?

Especially for zram, if applications are not expecting this high CPU
usage then we'll get regression reports with the default nowait
approach. How about we avoid something like this with one of the
following options?

1. Provide a fix with a module parameter (already NACKed).
2. Allow the user to configure nowait from the command line using zramctl,
   keeping QUEUE_FLAG_NOWAIT disabled by default.
3. Add a generic block layer sysfs attr nowait, like nomerges, since
   similar changes I've posted for pmem [3] and bcache [4] have the same
   issue. This generic way we avoid duplicating the code in each driver
   and give the user the freedom to set or clear it based on their needs,
   again keeping QUEUE_FLAG_NOWAIT disabled by default (rough sketch
   below).

or please suggest any other way you think is appropriate ...
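
For option 3, a rough and untested sketch of what such a generic attribute
could look like, modelled on the existing nomerges handling in
block/blk-sysfs.c (wiring the entry into queue_attrs[] is omitted, and the
exact semantics are an assumption, not a finished patch):

static ssize_t queue_nowait_show(struct request_queue *q, char *page)
{
        return queue_var_show(blk_queue_nowait(q), page);
}

static ssize_t queue_nowait_store(struct request_queue *q, const char *page,
                                  size_t count)
{
        unsigned long nowait;
        ssize_t ret = queue_var_store(&nowait, page, count);

        if (ret < 0)
                return ret;

        /* Only useful when the driver's submission path can actually honour
         * REQ_NOWAIT; the flag stays clear by default. */
        if (nowait)
                blk_queue_flag_set(QUEUE_FLAG_NOWAIT, q);
        else
                blk_queue_flag_clear(QUEUE_FLAG_NOWAIT, q);

        return ret;
}

QUEUE_RW_ENTRY(queue_nowait, "nowait");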

-ck

[1] brd nowait off vs nowait on :-

linux-block (zram-nowait) #  grep cpu  brd-*fio | column -t

brd-default-nowait-off-1.fio:  cpu  :  usr=6.34%,   sys=29.84%,  ctx=216249754,
brd-default-nowait-off-2.fio:  cpu  :  usr=6.41%,   sys=29.83%,  ctx=217773657,
brd-default-nowait-off-3.fio:  cpu  :  usr=6.37%,   sys=30.05%,  ctx=222667703,

brd-nowait-on-1.fio:           cpu  :  usr=10.18%,  sys=88.35%, ctx=23221,
brd-nowait-on-2.fio:           cpu  :  usr=10.02%,  sys=86.82%, ctx=22396,
brd-nowait-on-3.fio:           cpu  :  usr=10.17%,  sys=86.29%, ctx=22207,

linux-block (zram-nowait) #  grep IOPS  brd-*fio | column -t

brd-default-nowait-off-1.fio:  read:  IOPS=3872k,  BW=14.8GiB/s
brd-default-nowait-off-2.fio:  read:  IOPS=3933k,  BW=15.0GiB/s
brd-default-nowait-off-3.fio:  read:  IOPS=3953k,  BW=15.1GiB/s

brd-nowait-on-1.fio:           read:  IOPS=5884k,  BW=22.4GiB/s
brd-nowait-on-2.fio:           read:  IOPS=5870k,  BW=22.4GiB/s
brd-nowait-on-3.fio:           read:  IOPS=5870k,  BW=22.4GiB/s

linux-block (zram-nowait) #  grep slat  brd-*fio | column -t

brd-default-nowait-off-1.fio:  slat  (nsec):  min=440,   max=56072k,  avg=9579.17
brd-default-nowait-off-2.fio:  slat  (nsec):  min=440,   max=42743k,  avg=9468.83
brd-default-nowait-off-3.fio:  slat  (nsec):  min=431,   max=32493k,  avg=9532.96

brd-nowait-on-1.fio:           slat  (nsec):  min=1523,  max=37786k,  avg=7596.58
brd-nowait-on-2.fio:           slat  (nsec):  min=1503,  max=40101k,  avg=7612.64
brd-nowait-on-3.fio:           slat  (nsec):  min=1463,  max=37298k,  avg=7610.89

[2] zram nowait off vs nowait on:-

linux-block (zram-nowait) #  grep IOPS  zram-*fio | column -t
zram-default-nowait-off-1.fio:  read:  IOPS=833k,  BW=3254MiB/s
zram-default-nowait-off-2.fio:  read:  IOPS=845k,  BW=3301MiB/s
zram-default-nowait-off-3.fio:  read:  IOPS=845k,  BW=3301MiB/s

zram-nowait-on-1.fio:           read:  IOPS=917k,  BW=3582MiB/s
zram-nowait-on-2.fio:           read:  IOPS=914k,  BW=3569MiB/s
zram-nowait-on-3.fio:           read:  IOPS=917k,  BW=3581MiB/s

linux-block (zram-nowait) #  grep cpu  zram-*fio | column -t
zram-default-nowait-off-1.fio:  cpu  :  usr=5.18%,  sys=11.31%, ctx=39945072
zram-default-nowait-off-2.fio:  cpu  :  usr=5.20%,  sys=11.49%, ctx=40591907
zram-default-nowait-off-3.fio:  cpu  :  usr=5.31%,  sys=11.86%, ctx=40252142

zram-nowait-on-1.fio:           cpu  :  usr=1.87%,  sys=97.05%, ctx=24337
zram-nowait-on-2.fio:           cpu  :  usr=1.83%,  sys=97.20%, ctx=21452
zram-nowait-on-3.fio:           cpu  :  usr=1.84%,  sys=97.20%, ctx=21051

linux-block (zram-nowait) #  grep slat  zram-*fio | column -t
zram-default-nowait-off-1.fio:  slat  (nsec):  min=420,   max=6960.6k,  avg=1859.09
zram-default-nowait-off-2.fio:  slat  (nsec):  min=411,   max=5387.7k,  avg=1848.79
zram-default-nowait-off-3.fio:  slat  (nsec):  min=410,   max=6914.2k,  avg=1902.51

zram-nowait-on-1.fio:           slat  (nsec):  min=1092,  max=32268k,   avg=51605.49
zram-nowait-on-2.fio:           slat  (nsec):  min=1052,  max=7676.3k,  avg=51806.82
zram-nowait-on-3.fio:           slat  (nsec):  min=1062,  max=10444k,   avg=51625.89

[3] https://lore.kernel.org/nvdimm/b90ff1ba-b22c-bb37-db0a-4c46bb5f2a06@nvidia.com/T/#t
[4] https://marc.info/?l=linux-bcache&m=168388522305946&w=1
https://marc.info/?l=linux-bcache&m=168401821129652&w=2
Sergey Senozhatsky May 16, 2023, 1:08 p.m. UTC | #6
On (23/05/16 05:51), Chaitanya Kulkarni wrote:
> Removed modparam v2 is ready to send, but I've few  concerns enabling
> nowait unconditionally for zram :-
> 
>  From brd data [1] and zram data [2] from my setup :-
> 
>          IOPs  (old->new)    | sys cpu% (old->new)
> --------------------------------------------------
> brd  | 1.5x (3919 -> 5874) | 3x (29 -> 87)
> zram | 1.09x ( 29 ->   87) | 9x (11 -> 97)
> 
> brd:-
> IOPs increased by               ~1.5  times (50% up)
> sys CPU percentage increased by ~3.0  times (200% up)
> 
> zram:-
> IOPs increased by               ~1.09 times (  9% up)
> sys CPU percentage increased by ~8.81 times (781% up)
> 
> This comparison clearly demonstrates that zram experiences a much more
> substantial CPU load relative to the increase in IOPs compared to brd.
> Such a significant difference might suggest a potential CPU regression
> in zram ?
> 
> Especially for zram, if applications are not expecting this high cpu
> usage then they we'll get regression reports with default nowait
> approach. How about we avoid something like this with one of the
> following options ?

Well, zram performs decompression/compression on the CPU (per-CPU
crypto streams) for each IO operation, so zram IO is CPU intensive.
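
The per-CPU stream usage referred to here is, roughly, the following
pattern in zram_read_from_zspool() (paraphrased from mainline, not the
exact code): every read maps the zsmalloc object and decompresses it
inline on the CPU doing the I/O:

        zstrm = zcomp_stream_get(zram->comps[prio]);
        src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
        ret = zcomp_decompress(zstrm, src, size, page_address(page));
        zs_unmap_object(zram->mem_pool, handle);
        zcomp_stream_put(zram->comps[prio]);

With QUEUE_FLAG_NOWAIT set, this decompression runs inline in the
submitting context rather than after a punt to io_uring's worker threads,
which is consistent with the large drop in context switches (ctx) in the
numbers above.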
Chaitanya Kulkarni May 16, 2023, 8:41 p.m. UTC | #7
On 5/16/23 06:08, Sergey Senozhatsky wrote:
> On (23/05/16 05:51), Chaitanya Kulkarni wrote:
>> Removed modparam v2 is ready to send, but I've few  concerns enabling
>> nowait unconditionally for zram :-
>>
>>   From brd data [1] and zram data [2] from my setup :-
>>
>>           IOPs  (old->new)    | sys cpu% (old->new)
>> --------------------------------------------------
>> brd  | 1.5x (3919 -> 5874) | 3x (29 -> 87)
>> zram | 1.09x ( 29 ->   87) | 9x (11 -> 97)
>>
>> brd:-
>> IOPs increased by               ~1.5  times (50% up)
>> sys CPU percentage increased by ~3.0  times (200% up)
>>
>> zram:-
>> IOPs increased by               ~1.09 times (  9% up)
>> sys CPU percentage increased by ~8.81 times (781% up)
>>
>> This comparison clearly demonstrates that zram experiences a much more
>> substantial CPU load relative to the increase in IOPs compared to brd.
>> Such a significant difference might suggest a potential CPU regression
>> in zram ?
>>
>> Especially for zram, if applications are not expecting this high cpu
>> usage then they we'll get regression reports with default nowait
>> approach. How about we avoid something like this with one of the
>> following options ?
> Well, zram performs decompression/compression on the CPU (per-CPU
> crypto streams) for each IO operation, so zram IO is CPU intensive.

and that is exactly why I've raised this issue; are you okay with that?
I'll send V2 with nowait enabled by default ..

-ck
Jens Axboe May 16, 2023, 8:43 p.m. UTC | #8
On 5/16/23 2:41 PM, Chaitanya Kulkarni wrote:
> On 5/16/23 06:08, Sergey Senozhatsky wrote:
>> On (23/05/16 05:51), Chaitanya Kulkarni wrote:
>>> Removed modparam v2 is ready to send, but I've few  concerns enabling
>>> nowait unconditionally for zram :-
>>>
>>>   From brd data [1] and zram data [2] from my setup :-
>>>
>>>           IOPs  (old->new)    | sys cpu% (old->new)
>>> --------------------------------------------------
>>> brd  | 1.5x (3919 -> 5874) | 3x (29 -> 87)
>>> zram | 1.09x ( 29 ->   87) | 9x (11 -> 97)
>>>
>>> brd:-
>>> IOPs increased by               ~1.5  times (50% up)
>>> sys CPU percentage increased by ~3.0  times (200% up)
>>>
>>> zram:-
>>> IOPs increased by               ~1.09 times (  9% up)
>>> sys CPU percentage increased by ~8.81 times (781% up)
>>>
>>> This comparison clearly demonstrates that zram experiences a much more
>>> substantial CPU load relative to the increase in IOPs compared to brd.
>>> Such a significant difference might suggest a potential CPU regression
>>> in zram ?
>>>
>>> Especially for zram, if applications are not expecting this high cpu
>>> usage then they we'll get regression reports with default nowait
>>> approach. How about we avoid something like this with one of the
>>> following options ?
>> Well, zram performs decompression/compression on the CPU (per-CPU
>> crypto streams) for each IO operation, so zram IO is CPU intensive.
> 
> and that is exactly I've raised this issue, are you okay with that ?
> I'll send V2 with nowait enabled by default ..

Did you check that it's actually nowait sane to begin with? I spent
literally 30 seconds on that when you sent the first patch, and the
partial/sync path does not look kosher.
Chaitanya Kulkarni May 16, 2023, 9:32 p.m. UTC | #9
On 5/16/23 13:43, Jens Axboe wrote:
> On 5/16/23 2:41 PM, Chaitanya Kulkarni wrote:
>> On 5/16/23 06:08, Sergey Senozhatsky wrote:
>>> On (23/05/16 05:51), Chaitanya Kulkarni wrote:
>>>> Removed modparam v2 is ready to send, but I've few  concerns enabling
>>>> nowait unconditionally for zram :-
>>>>
>>>>    From brd data [1] and zram data [2] from my setup :-
>>>>
>>>>            IOPs  (old->new)    | sys cpu% (old->new)
>>>> --------------------------------------------------
>>>> brd  | 1.5x (3919 -> 5874) | 3x (29 -> 87)
>>>> zram | 1.09x ( 29 ->   87) | 9x (11 -> 97)
>>>>
>>>> brd:-
>>>> IOPs increased by               ~1.5  times (50% up)
>>>> sys CPU percentage increased by ~3.0  times (200% up)
>>>>
>>>> zram:-
>>>> IOPs increased by               ~1.09 times (  9% up)
>>>> sys CPU percentage increased by ~8.81 times (781% up)
>>>>
>>>> This comparison clearly demonstrates that zram experiences a much more
>>>> substantial CPU load relative to the increase in IOPs compared to brd.
>>>> Such a significant difference might suggest a potential CPU regression
>>>> in zram ?
>>>>
>>>> Especially for zram, if applications are not expecting this high cpu
>>>> usage then they we'll get regression reports with default nowait
>>>> approach. How about we avoid something like this with one of the
>>>> following options ?
>>> Well, zram performs decompression/compression on the CPU (per-CPU
>>> crypto streams) for each IO operation, so zram IO is CPU intensive.
>> and that is exactly I've raised this issue, are you okay with that ?
>> I'll send V2 with nowait enabled by default ..
> Did you check that it's actually nowait sane to begin with? I spent
> literally 30 seconds on that when you sent the first patch, and the
> partial/sync path does not look kosher.
>

I did check for it, and it needs nowait-sane preparation in V2 with
something like this [1], plus returning an error via bio_wouldblock_error()
when we end up in read_from_bdev_sync() and it is not called from
writeback_store() (rough changes in [1]).

But after taking performance numbers repeatedly, the main concern I
failed to resolve is the numbers above and enabling nowait by default
for zram ...

-ck

[1]
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 08d198431a88..c2f154911cc0 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -55,7 +55,7 @@ static const struct block_device_operations zram_devops;

  static void zram_free_page(struct zram *zram, size_t index);
  static int zram_read_page(struct zram *zram, struct page *page, u32 index,
-              struct bio *parent, blk_opf_t opf);
+              struct bio *parent, bool nowait);

  static int zram_slot_trylock(struct zram *zram, u32 index)
  {
@@ -690,7 +690,7 @@ static ssize_t writeback_store(struct device *dev,
          /* Need for hugepage writeback racing */
          zram_set_flag(zram, index, ZRAM_IDLE);
          zram_slot_unlock(zram, index);
-        if (zram_read_page(zram, page, index, NULL, 0)) {
+        if (zram_read_page(zram, page, index, NULL, false)) {
              zram_slot_lock(zram, index);
              zram_clear_flag(zram, index, ZRAM_UNDER_WB);
              zram_clear_flag(zram, index, ZRAM_IDLE);
@@ -772,6 +772,7 @@ struct zram_work {
      unsigned long entry;
      struct page *page;
      int error;
+    bool nowait;
  };

  static void zram_sync_read(struct work_struct *work)
@@ -779,8 +780,9 @@ static void zram_sync_read(struct work_struct *work)
      struct zram_work *zw = container_of(work, struct zram_work, work);
      struct bio_vec bv;
      struct bio bio;
+    blk_opf_t nowait = zw->nowait ? REQ_NOWAIT : 0;

-    bio_init(&bio, zw->zram->bdev, &bv, 1, REQ_OP_READ);
+    bio_init(&bio, zw->zram->bdev, &bv, 1, REQ_OP_READ | nowait);
      bio.bi_iter.bi_sector = zw->entry * (PAGE_SIZE >> 9);
      __bio_add_page(&bio, zw->page, PAGE_SIZE, 0);
      zw->error = submit_bio_wait(&bio);
@@ -792,18 +794,14 @@ static void zram_sync_read(struct work_struct *work)
   * use a worker thread context.
   */
  static int read_from_bdev_sync(struct zram *zram, struct page *page,
-                unsigned long entry, blk_opf_t opf)
+                unsigned long entry, bool nowait)
  {
      struct zram_work work;

      work.page = page;
      work.zram = zram;
      work.entry = entry;
-
-    if (opf & REQ_NOWAIT) {
-        zram_sync_read(&work);
-        return work.error;
-    }
+    work.nowait = nowait;

      INIT_WORK_ONSTACK(&work.work, zram_sync_read);
      queue_work(system_unbound_wq, &work.work);
@@ -814,21 +812,21 @@ static int read_from_bdev_sync(struct zram *zram, struct page *page,
  }

  static int read_from_bdev(struct zram *zram, struct page *page,
-            unsigned long entry, struct bio *parent, blk_opf_t opf)
+            unsigned long entry, struct bio *parent, bool nowait)
  {
      atomic64_inc(&zram->stats.bd_reads);
      if (!parent) {
          if (WARN_ON_ONCE(!IS_ENABLED(ZRAM_PARTIAL_IO)))
              return -EIO;
-        return read_from_bdev_sync(zram, page, entry, opf);
+        return read_from_bdev_sync(zram, page, entry, nowait);
      }
-    read_from_bdev_async(zram, page, entry, parent, opf);
+    read_from_bdev_async(zram, page, entry, parent, nowait);
      return 0;
  }
  #else
  static inline void reset_bdev(struct zram *zram) {};
  static int read_from_bdev(struct zram *zram, struct page *page,
-            unsigned long entry, struct bio *parent, blk_opf_t opf)
+            unsigned long entry, struct bio *parent, bool nowait)
  {
      return -EIO;
  }
@@ -1361,7 +1359,7 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
  }

  static int zram_read_page(struct zram *zram, struct page *page, u32 index,
-              struct bio *parent, blk_opf_t bi_opf)
+              struct bio *parent, bool nowait)
  {
      int ret;

@@ -1378,7 +1376,7 @@ static int zram_read_page(struct zram *zram, struct page *page, u32 index,
          zram_slot_unlock(zram, index);

          ret = read_from_bdev(zram, page, zram_get_element(zram, index),
-                     parent, bi_opf);
+                     parent, nowait);
      }

      /* Should NEVER happen. Return bio error if it does. */
@@ -1395,13 +1393,14 @@ static int zram_read_page(struct zram *zram, struct page *page, u32 index,
  static int zram_bvec_read_partial(struct zram *zram, struct bio_vec *bvec,
                    u32 index, int offset, struct bio *bio)
  {
+    bool nowait = bio->bi_opf & REQ_NOWAIT;
      struct page *page;
      int ret;

-    page = alloc_page(bio->bi_opf & REQ_NOWAIT ? GFP_NOWAIT : GFP_NOIO);
+    page = alloc_page(nowait ? GFP_NOWAIT : GFP_NOIO);
      if (!page)
          return -ENOMEM;
-    ret = zram_read_page(zram, page, index, NULL, bio->bi_opf);
+    ret = zram_read_page(zram, page, index, NULL, nowait);
      if (likely(!ret))
          memcpy_to_bvec(bvec, page_address(page) + offset);
      __free_page(page);
@@ -1413,7 +1412,8 @@ static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
  {
      if (is_partial_io(bvec))
          return zram_bvec_read_partial(zram, bvec, index, offset, bio);
-    return zram_read_page(zram, bvec->bv_page, index, bio, bio->bi_opf);
+    return zram_read_page(zram, bvec->bv_page, index, bio,
+                  bio->bi_opf & REQ_NOWAIT);
  }

  static int zram_write_page(struct zram *zram, struct page *page, u32 index)
@@ -1547,14 +1547,15 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
  static int zram_bvec_write_partial(struct zram *zram, struct bio_vec *bvec,
                     u32 index, int offset, struct bio *bio)
  {
+    bool nowait = bio->bi_opf & REQ_NOWAIT;
      struct page *page;
      int ret;

-    page = alloc_page(bio->bi_opf & REQ_NOWAIT ? GFP_NOWAIT : GFP_NOIO);
+    page = alloc_page(nowait ? GFP_NOWAIT : GFP_NOIO);
      if (!page)
          return -ENOMEM;

-    ret = zram_read_page(zram, page, index, bio, bio->bi_opf);
+    ret = zram_read_page(zram, page, index, bio, nowait);
      if (!ret) {
          memcpy_from_bvec(page_address(page) + offset, bvec);
          ret = zram_write_page(zram, page, index);
Jens Axboe May 16, 2023, 10:03 p.m. UTC | #10
On 5/16/23 3:32 PM, Chaitanya Kulkarni wrote:
> On 5/16/23 13:43, Jens Axboe wrote:
>> On 5/16/23 2:41 PM, Chaitanya Kulkarni wrote:
>>> On 5/16/23 06:08, Sergey Senozhatsky wrote:
>>>> On (23/05/16 05:51), Chaitanya Kulkarni wrote:
>>>>> Removed modparam v2 is ready to send, but I've few  concerns enabling
>>>>> nowait unconditionally for zram :-
>>>>>
>>>>>    From brd data [1] and zram data [2] from my setup :-
>>>>>
>>>>>            IOPs  (old->new)    | sys cpu% (old->new)
>>>>> --------------------------------------------------
>>>>> brd  | 1.5x (3919 -> 5874) | 3x (29 -> 87)
>>>>> zram | 1.09x ( 29 ->   87) | 9x (11 -> 97)
>>>>>
>>>>> brd:-
>>>>> IOPs increased by               ~1.5  times (50% up)
>>>>> sys CPU percentage increased by ~3.0  times (200% up)
>>>>>
>>>>> zram:-
>>>>> IOPs increased by               ~1.09 times (  9% up)
>>>>> sys CPU percentage increased by ~8.81 times (781% up)
>>>>>
>>>>> This comparison clearly demonstrates that zram experiences a much more
>>>>> substantial CPU load relative to the increase in IOPs compared to brd.
>>>>> Such a significant difference might suggest a potential CPU regression
>>>>> in zram ?
>>>>>
>>>>> Especially for zram, if applications are not expecting this high cpu
>>>>> usage then they we'll get regression reports with default nowait
>>>>> approach. How about we avoid something like this with one of the
>>>>> following options ?
>>>> Well, zram performs decompression/compression on the CPU (per-CPU
>>>> crypto streams) for each IO operation, so zram IO is CPU intensive.
>>> and that is exactly I've raised this issue, are you okay with that ?
>>> I'll send V2 with nowait enabled by default ..
>> Did you check that it's actually nowait sane to begin with? I spent
>> literally 30 seconds on that when you sent the first patch, and the
>> partial/sync path does not look kosher.
>>
> 
> I did check for it and it needs a nowait sane preparation in V2 with
> something like this [1] plus returning error with bio_wouldblock_error()

That looks pretty awful... Things like:

> @@ -779,8 +780,9 @@ static void zram_sync_read(struct work_struct *work)
>       struct zram_work *zw = container_of(work, struct zram_work, work);
>       struct bio_vec bv;
>       struct bio bio;
> +    blk_opf_t_nowait = zw->nowait ? REQ_NOWAIT : 0;
> 
> -    bio_init(&bio, zw->zram->bdev, &bv, 1, REQ_OP_READ);
> +    bio_init(&bio, zw->zram->bdev, &bv, 1, REQ_OP_READ | nowait);
>       bio.bi_iter.bi_sector = zw->entry * (PAGE_SIZE >> 9);
>       __bio_add_page(&bio, zw->page, PAGE_SIZE, 0);
>       zw->error = submit_bio_wait(&bio);
>

make absolutely no sense, setting REQ_NOWAIT and then waiting on IO
completion.

> when we end up in read_from_bdev_sync() when it is not called from
> writeback_store(). (rough changes [1])

writeback_store() will never have nowait set. Outside of that, pushing
the nowait handling to the workqueue makes no sense, that flag only
applies for inline issue.

> But after taking performance numbers repeatedly, the main concern I
> failed to resolve is above numbers & default enabling nowait for
> zram ...

It's a choice you make if you do nowait IO, normal block IO would not
change at all. So I think it's fine, but the driver needs to be in a
sane state to support nowait to begin with.
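
For illustration, a rough and untested sketch of that inline-issue handling
(function names follow current mainline zram; the exact placement is an
assumption): the partial-I/O helpers would refuse REQ_NOWAIT bios up front
instead of punting to a workqueue, and the submission path would complete
them with bio_wouldblock_error() so the caller can retry without the flag:

static int zram_bvec_read_partial(struct zram *zram, struct bio_vec *bvec,
                                  u32 index, int offset, struct bio *bio)
{
        struct page *page;
        int ret;

        /* The partial path needs a bounce page and, for written-back pages,
         * a synchronous read from the backing device; neither can honour
         * REQ_NOWAIT, so fail fast. */
        if (bio->bi_opf & REQ_NOWAIT)
                return -EAGAIN;

        page = alloc_page(GFP_NOIO);
        if (!page)
                return -ENOMEM;
        ret = zram_read_page(zram, page, index, NULL);
        if (likely(!ret))
                memcpy_to_bvec(bvec, page_address(page) + offset);
        __free_page(page);
        return ret;
}

        /* ... and in the bio submission path: */
        if (ret == -EAGAIN && (bio->bi_opf & REQ_NOWAIT))
                bio_wouldblock_error(bio);
        else if (ret)
                bio_io_error(bio);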

Patch

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f6d90f1ba5cf..b2e419f15f71 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -36,6 +36,10 @@ 
 
 #include "zram_drv.h"
 
+static bool g_nowait;
+module_param_named(nowait, g_nowait, bool, 0444);
+MODULE_PARM_DESC(nowait, "set QUEUE_FLAG_NOWAIT. Default: False");
+
 static DEFINE_IDR(zram_index_idr);
 /* idr index must be protected */
 static DEFINE_MUTEX(zram_index_mutex);
@@ -1540,9 +1544,10 @@  static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 static int zram_bvec_write_partial(struct zram *zram, struct bio_vec *bvec,
 				   u32 index, int offset, struct bio *bio)
 {
-	struct page *page = alloc_page(GFP_NOIO);
+	struct page *page;
 	int ret;
 
+	page = alloc_page(bio->bi_opf & REQ_NOWAIT ? GFP_NOWAIT : GFP_NOIO);
 	if (!page)
 		return -ENOMEM;
 
@@ -2215,6 +2220,8 @@  static int zram_add(void)
 	/* zram devices sort of resembles non-rotational disks */
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, zram->disk->queue);
 	blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, zram->disk->queue);
+	if (g_nowait)
+		blk_queue_flag_set(QUEUE_FLAG_NOWAIT, zram->disk->queue);
 
 	/*
 	 * To ensure that we always get PAGE_SIZE aligned