diff mbox series

[RESEND,RFC] zram: Allow rw_page when page isn't written back.

Message ID 20220908125037.1119114-1-bgeffon@google.com (mailing list archive)
State New
Headers show
Series [RESEND,RFC] zram: Allow rw_page when page isn't written back. | expand

Commit Message

Brian Geffon Sept. 8, 2022, 12:50 p.m. UTC
Today when a zram device has a backing device we change the ops to
a new set which does not expose a rw_page method. This prevents the
upper layers from trying to issue a synchronous rw. This has the
downside that we penalize every rw even when it could possibly
still be performed as a synchronous rw. By the very nature of
zram all writes are synchronous so it's unfortunate to have to
accept this limitation.

This change will always expose a rw_page function and if the page
has been written back it will return -EOPNOTSUPP which will force the
upper layers to try again with bio.

To safely allow a synchronous read to proceed for pages which have not
yet written back we introduce a new flag ZRAM_NO_WB. On the first
synchronous read if the page is not written back we will set the
ZRAM_NO_WB flag. This flag, which is never cleared, prevents writeback
from ever happening to that page.

This approach works because in the case of zram as a swap backing device
the page is going to be removed from zram shortly thereafter so
preventing writeback is fine. However, if zram is being used as a
generic block device then this might prevent writeback of the page.

This proposal is still very much RFC, feedback would be appreciated.

Signed-off-by: Brian Geffon <bgeffon@google.com>
---
 drivers/block/zram/zram_drv.c | 68 +++++++++++++++++++++--------------
 drivers/block/zram/zram_drv.h |  1 +
 2 files changed, 43 insertions(+), 26 deletions(-)

Comments

Sergey Senozhatsky Sept. 9, 2022, 8:30 a.m. UTC | #1
On (22/09/08 08:50), Brian Geffon wrote:
[..]
> +++ b/drivers/block/zram/zram_drv.h
> @@ -50,6 +50,7 @@ enum zram_pageflags {
>  	ZRAM_UNDER_WB,	/* page is under writeback */
>  	ZRAM_HUGE,	/* Incompressible page */
>  	ZRAM_IDLE,	/* not accessed page since last idle marking */
> +	ZRAM_NO_WB,	/* Do not allow page to be written back */
>  
>  	__NR_ZRAM_PAGEFLAGS,
>  };

Unrelated but somehow related.

I wonder if it's time for us to introduce a dedicated, say u16,
flags member to struct zram_table_entry. Unless my calculations
are extremely wrong, we are about to run out of spare bits in
zram_table_entry::flags on 32-bit systems.
Sergey Senozhatsky Sept. 12, 2022, 4:37 a.m. UTC | #2
On (22/09/09 17:30), Sergey Senozhatsky wrote:
> > +++ b/drivers/block/zram/zram_drv.h
> > @@ -50,6 +50,7 @@ enum zram_pageflags {
> >  	ZRAM_UNDER_WB,	/* page is under writeback */
> >  	ZRAM_HUGE,	/* Incompressible page */
> >  	ZRAM_IDLE,	/* not accessed page since last idle marking */
> > +	ZRAM_NO_WB,	/* Do not allow page to be written back */
> >  
> >  	__NR_ZRAM_PAGEFLAGS,
> >  };
> 
> Unrelated but somehow related.
> 
> I wonder if it's time for us to introduce a dedicated, say u16,
> flags member to struct zram_table_entry. Unless my calculations
> are extremely wrong, we are about to run out of spare bits in
> zram_table_entry::flags on 32-bit systems.

Looking at it more - I wonder why do we define ZRAM_FLAG_SHIFT
as 24? This is far more than maximum zram object size. Our max
size needs PAGE_SHIFT bits (which is normally 12 bits, can be up
to 16 (for 64k arm64 pages)). So we probably can start defining
ZRAM_FLAG_SHIFT as (PAGE_SHIFT + 1).

Or am I missing something?

---

diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index b4eecef2a11f..cb8f1f644baf 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -39,7 +39,7 @@
  * The lower ZRAM_FLAG_SHIFT bits is for object size (excluding header),
  * the higher bits is for zram_pageflags.
  */
-#define ZRAM_FLAG_SHIFT 24
+#define ZRAM_FLAG_SHIFT (PAGE_SHIFT + 1)

 /* Flags for zram pages (table[page_no].flags) */
 enum zram_pageflags {
Sergey Senozhatsky Sept. 12, 2022, 6:07 a.m. UTC | #3
On (22/09/12 13:37), Sergey Senozhatsky wrote:
> On (22/09/09 17:30), Sergey Senozhatsky wrote:
> > > +++ b/drivers/block/zram/zram_drv.h
> > > @@ -50,6 +50,7 @@ enum zram_pageflags {
> > >  	ZRAM_UNDER_WB,	/* page is under writeback */
> > >  	ZRAM_HUGE,	/* Incompressible page */
> > >  	ZRAM_IDLE,	/* not accessed page since last idle marking */
> > > +	ZRAM_NO_WB,	/* Do not allow page to be written back */
> > >  
> > >  	__NR_ZRAM_PAGEFLAGS,
> > >  };
> > 
> > Unrelated but somehow related.
> > 
> > I wonder if it's time for us to introduce a dedicated, say u16,
> > flags member to struct zram_table_entry. Unless my calculations
> > are extremely wrong, we are about to run out of spare bits in
> > zram_table_entry::flags on 32-bit systems.
> 
> Looking at it more - I wonder why do we define ZRAM_FLAG_SHIFT
> as 24? This is far more than maximum zram object size. Our max
> size needs PAGE_SHIFT bits (which is normally 12 bits, can be up
> to 16 (for 64k arm64 pages)). So we probably can start defining
> ZRAM_FLAG_SHIFT as (PAGE_SHIFT + 1).
> 
> Or am I missing something?

So I think what happened was that flags used to be a u8 member of
zram_table_entry, commit d2d5e762c8990 merged u16 size (which was
too large for object size) and u8 flags and just kept 8 bits for
flags (and somehow assumed 32-bit long? 32 - 8)

We definitely can store size in PAGE_SHIFT bits and have some
extra spare bits for zram pageflags. Would have been even nicer
if we could change type of flags from unsigned long to unsigned
int, but bit_lock requires "volatile unsigned long *" data type,
so because of bit_lock our zram_table_entriy is 4 extra bytes in
size. Which probably isn't too bad; having extra pageflag bits
on 32-bit systems is good news.
Minchan Kim Sept. 23, 2022, 7:31 p.m. UTC | #4
On Thu, Sep 08, 2022 at 08:50:37AM -0400, Brian Geffon wrote:
> Today when a zram device has a backing device we change the ops to
> a new set which does not expose a rw_page method. This prevents the
> upper layers from trying to issue a synchronous rw. This has the
> downside that we penalize every rw even when it could possibly

Do you mean addiontal bio alloc/free?
Please specify something more detail.

> still be performed as a synchronous rw. By the very nature of

Even though zram go though the block layer in the case, it's still
synchronous operation against on in-memory compressed data. Only
asynchrnous IO happens for the data in backing device.

> zram all writes are synchronous so it's unfortunate to have to
> accept this limitation.
> 
> This change will always expose a rw_page function and if the page
> has been written back it will return -EOPNOTSUPP which will force the
> upper layers to try again with bio.

Sounds a good idea.

> 
> To safely allow a synchronous read to proceed for pages which have not
> yet written back we introduce a new flag ZRAM_NO_WB. On the first
> synchronous read if the page is not written back we will set the
> ZRAM_NO_WB flag. This flag, which is never cleared, prevents writeback
> from ever happening to that page.

Why do we need a addtional flag?
Why couldn't we do?

1. expose the rw_page all the time.
2. If the page was written back, just return an error in rw_page to make
upper layer retry it with bio.

> 
> This approach works because in the case of zram as a swap backing device
> the page is going to be removed from zram shortly thereafter so
> preventing writeback is fine. However, if zram is being used as a
> generic block device then this might prevent writeback of the page.
> 
> This proposal is still very much RFC, feedback would be appreciated.
> 
> Signed-off-by: Brian Geffon <bgeffon@google.com>
> ---
>  drivers/block/zram/zram_drv.c | 68 +++++++++++++++++++++--------------
>  drivers/block/zram/zram_drv.h |  1 +
>  2 files changed, 43 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 92cb929a45b7..22b69e8b6042 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -52,9 +52,6 @@ static unsigned int num_devices = 1;
>  static size_t huge_class_size;
>  
>  static const struct block_device_operations zram_devops;
> -#ifdef CONFIG_ZRAM_WRITEBACK
> -static const struct block_device_operations zram_wb_devops;
> -#endif
>  
>  static void zram_free_page(struct zram *zram, size_t index);
>  static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
> @@ -309,7 +306,8 @@ static void mark_idle(struct zram *zram, ktime_t cutoff)
>  		 */
>  		zram_slot_lock(zram, index);
>  		if (zram_allocated(zram, index) &&
> -				!zram_test_flag(zram, index, ZRAM_UNDER_WB)) {
> +				!zram_test_flag(zram, index, ZRAM_UNDER_WB) &&
> +				!zram_test_flag(zram, index, ZRAM_NO_WB)) {
>  #ifdef CONFIG_ZRAM_MEMORY_TRACKING
>  			is_idle = !cutoff || ktime_after(cutoff, zram->table[index].ac_time);
>  #endif
> @@ -439,7 +437,6 @@ static void reset_bdev(struct zram *zram)
>  	filp_close(zram->backing_dev, NULL);
>  	zram->backing_dev = NULL;
>  	zram->bdev = NULL;
> -	zram->disk->fops = &zram_devops;
>  	kvfree(zram->bitmap);
>  	zram->bitmap = NULL;
>  }
> @@ -543,17 +540,6 @@ static ssize_t backing_dev_store(struct device *dev,
>  	zram->backing_dev = backing_dev;
>  	zram->bitmap = bitmap;
>  	zram->nr_pages = nr_pages;
> -	/*
> -	 * With writeback feature, zram does asynchronous IO so it's no longer
> -	 * synchronous device so let's remove synchronous io flag. Othewise,
> -	 * upper layer(e.g., swap) could wait IO completion rather than
> -	 * (submit and return), which will cause system sluggish.
> -	 * Furthermore, when the IO function returns(e.g., swap_readpage),
> -	 * upper layer expects IO was done so it could deallocate the page
> -	 * freely but in fact, IO is going on so finally could cause
> -	 * use-after-free when the IO is really done.
> -	 */
> -	zram->disk->fops = &zram_wb_devops;
>  	up_write(&zram->init_lock);
>  
>  	pr_info("setup backing device %s\n", file_name);
> @@ -722,7 +708,8 @@ static ssize_t writeback_store(struct device *dev,
>  
>  		if (zram_test_flag(zram, index, ZRAM_WB) ||
>  				zram_test_flag(zram, index, ZRAM_SAME) ||
> -				zram_test_flag(zram, index, ZRAM_UNDER_WB))
> +				zram_test_flag(zram, index, ZRAM_UNDER_WB) ||
> +				zram_test_flag(zram, index, ZRAM_NO_WB))
>  			goto next;
>  
>  		if (mode & IDLE_WRITEBACK &&
> @@ -1226,6 +1213,10 @@ static void zram_free_page(struct zram *zram, size_t index)
>  		goto out;
>  	}
>  
> +	if (zram_test_flag(zram, index, ZRAM_NO_WB)) {
> +		zram_clear_flag(zram, index, ZRAM_NO_WB);
> +	}
> +
>  	/*
>  	 * No memory is allocated for same element filled pages.
>  	 * Simply clear same page flag.
> @@ -1654,6 +1645,40 @@ static int zram_rw_page(struct block_device *bdev, sector_t sector,
>  	index = sector >> SECTORS_PER_PAGE_SHIFT;
>  	offset = (sector & (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;
>  
> +#ifdef CONFIG_ZRAM_WRITEBACK
> +	/*
> +	 * With writeback feature, zram does asynchronous IO so it's no longer
> +	 * synchronous device so let's remove synchronous io flag. Othewise,
> +	 * upper layer(e.g., swap) could wait IO completion rather than
> +	 * (submit and return), which will cause system sluggish.
> +	 * Furthermore, when the IO function returns(e.g., swap_readpage),
> +	 * upper layer expects IO was done so it could deallocate the page
> +	 * freely but in fact, IO is going on so finally could cause
> +	 * use-after-free when the IO is really done.
> +	 *
> +	 * If the page is not currently written back then we may proceed to
> +	 * read the page synchronously, otherwise, we must fail with
> +	 * -EOPNOTSUPP to force the upper layers to use a normal bio.
> +	 */
> +	zram_slot_lock(zram, index);
> +	if (zram_test_flag(zram, index, ZRAM_WB) ||
> +			zram_test_flag(zram, index, ZRAM_UNDER_WB)) {
> +		zram_slot_unlock(zram, index);
> +		/* We cannot proceed with synchronous read */
> +		return -EOPNOTSUPP;
> +	}
> +
> +	/*
> +	 * Don't allow the page to be written back while we read it,
> +	 * this flag is never cleared. It shouldn't be a problem that
> +	 * we don't clear this flag because in the case of swap this
> +	 * page will be removed shortly after this read anyway.
> +	 */
> +	if (op == REQ_OP_READ)
> +		zram_set_flag(zram, index, ZRAM_NO_WB);
> +	zram_slot_unlock(zram, index);
> +#endif
> +
>  	bv.bv_page = page;
>  	bv.bv_len = PAGE_SIZE;
>  	bv.bv_offset = 0;
> @@ -1827,15 +1852,6 @@ static const struct block_device_operations zram_devops = {
>  	.owner = THIS_MODULE
>  };
>  
> -#ifdef CONFIG_ZRAM_WRITEBACK
> -static const struct block_device_operations zram_wb_devops = {
> -	.open = zram_open,
> -	.submit_bio = zram_submit_bio,
> -	.swap_slot_free_notify = zram_slot_free_notify,
> -	.owner = THIS_MODULE
> -};
> -#endif
> -
>  static DEVICE_ATTR_WO(compact);
>  static DEVICE_ATTR_RW(disksize);
>  static DEVICE_ATTR_RO(initstate);
> diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> index 158c91e54850..20e4c6a579e0 100644
> --- a/drivers/block/zram/zram_drv.h
> +++ b/drivers/block/zram/zram_drv.h
> @@ -50,6 +50,7 @@ enum zram_pageflags {
>  	ZRAM_UNDER_WB,	/* page is under writeback */
>  	ZRAM_HUGE,	/* Incompressible page */
>  	ZRAM_IDLE,	/* not accessed page since last idle marking */
> +	ZRAM_NO_WB,	/* Do not allow page to be written back */
>  
>  	__NR_ZRAM_PAGEFLAGS,
>  };
> -- 
> 2.37.2.789.g6183377224-goog
> 
>
Brian Geffon Sept. 30, 2022, 7:33 p.m. UTC | #5
Hi Minchan,
Thank you for taking a look.

On Fri, Sep 23, 2022 at 3:31 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Thu, Sep 08, 2022 at 08:50:37AM -0400, Brian Geffon wrote:
> > Today when a zram device has a backing device we change the ops to
> > a new set which does not expose a rw_page method. This prevents the
> > upper layers from trying to issue a synchronous rw. This has the
> > downside that we penalize every rw even when it could possibly
>
> Do you mean addiontal bio alloc/free?
> Please specify something more detail.

Yes, the additional overhead is primarily this bio_alloc/free.

>
> > still be performed as a synchronous rw. By the very nature of
>
> Even though zram go though the block layer in the case, it's still
> synchronous operation against on in-memory compressed data. Only
> asynchrnous IO happens for the data in backing device.
>
> > zram all writes are synchronous so it's unfortunate to have to
> > accept this limitation.
> >
> > This change will always expose a rw_page function and if the page
> > has been written back it will return -EOPNOTSUPP which will force the
> > upper layers to try again with bio.
>
> Sounds a good idea.
>
> >
> > To safely allow a synchronous read to proceed for pages which have not
> > yet written back we introduce a new flag ZRAM_NO_WB. On the first
> > synchronous read if the page is not written back we will set the
> > ZRAM_NO_WB flag. This flag, which is never cleared, prevents writeback
> > from ever happening to that page.
>
> Why do we need a addtional flag?
> Why couldn't we do?
>
> 1. expose the rw_page all the time.
> 2. If the page was written back, just return an error in rw_page to make
> upper layer retry it with bio.

Yes this approach is much simpler, I'll send a new patch.

>
> >
> > This approach works because in the case of zram as a swap backing device
> > the page is going to be removed from zram shortly thereafter so
> > preventing writeback is fine. However, if zram is being used as a
> > generic block device then this might prevent writeback of the page.
> >
> > This proposal is still very much RFC, feedback would be appreciated.
> >
> > Signed-off-by: Brian Geffon <bgeffon@google.com>
> > ---
> >  drivers/block/zram/zram_drv.c | 68 +++++++++++++++++++++--------------
> >  drivers/block/zram/zram_drv.h |  1 +
> >  2 files changed, 43 insertions(+), 26 deletions(-)
> >
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index 92cb929a45b7..22b69e8b6042 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -52,9 +52,6 @@ static unsigned int num_devices = 1;
> >  static size_t huge_class_size;
> >
> >  static const struct block_device_operations zram_devops;
> > -#ifdef CONFIG_ZRAM_WRITEBACK
> > -static const struct block_device_operations zram_wb_devops;
> > -#endif
> >
> >  static void zram_free_page(struct zram *zram, size_t index);
> >  static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
> > @@ -309,7 +306,8 @@ static void mark_idle(struct zram *zram, ktime_t cutoff)
> >                */
> >               zram_slot_lock(zram, index);
> >               if (zram_allocated(zram, index) &&
> > -                             !zram_test_flag(zram, index, ZRAM_UNDER_WB)) {
> > +                             !zram_test_flag(zram, index, ZRAM_UNDER_WB) &&
> > +                             !zram_test_flag(zram, index, ZRAM_NO_WB)) {
> >  #ifdef CONFIG_ZRAM_MEMORY_TRACKING
> >                       is_idle = !cutoff || ktime_after(cutoff, zram->table[index].ac_time);
> >  #endif
> > @@ -439,7 +437,6 @@ static void reset_bdev(struct zram *zram)
> >       filp_close(zram->backing_dev, NULL);
> >       zram->backing_dev = NULL;
> >       zram->bdev = NULL;
> > -     zram->disk->fops = &zram_devops;
> >       kvfree(zram->bitmap);
> >       zram->bitmap = NULL;
> >  }
> > @@ -543,17 +540,6 @@ static ssize_t backing_dev_store(struct device *dev,
> >       zram->backing_dev = backing_dev;
> >       zram->bitmap = bitmap;
> >       zram->nr_pages = nr_pages;
> > -     /*
> > -      * With writeback feature, zram does asynchronous IO so it's no longer
> > -      * synchronous device so let's remove synchronous io flag. Othewise,
> > -      * upper layer(e.g., swap) could wait IO completion rather than
> > -      * (submit and return), which will cause system sluggish.
> > -      * Furthermore, when the IO function returns(e.g., swap_readpage),
> > -      * upper layer expects IO was done so it could deallocate the page
> > -      * freely but in fact, IO is going on so finally could cause
> > -      * use-after-free when the IO is really done.
> > -      */
> > -     zram->disk->fops = &zram_wb_devops;
> >       up_write(&zram->init_lock);
> >
> >       pr_info("setup backing device %s\n", file_name);
> > @@ -722,7 +708,8 @@ static ssize_t writeback_store(struct device *dev,
> >
> >               if (zram_test_flag(zram, index, ZRAM_WB) ||
> >                               zram_test_flag(zram, index, ZRAM_SAME) ||
> > -                             zram_test_flag(zram, index, ZRAM_UNDER_WB))
> > +                             zram_test_flag(zram, index, ZRAM_UNDER_WB) ||
> > +                             zram_test_flag(zram, index, ZRAM_NO_WB))
> >                       goto next;
> >
> >               if (mode & IDLE_WRITEBACK &&
> > @@ -1226,6 +1213,10 @@ static void zram_free_page(struct zram *zram, size_t index)
> >               goto out;
> >       }
> >
> > +     if (zram_test_flag(zram, index, ZRAM_NO_WB)) {
> > +             zram_clear_flag(zram, index, ZRAM_NO_WB);
> > +     }
> > +
> >       /*
> >        * No memory is allocated for same element filled pages.
> >        * Simply clear same page flag.
> > @@ -1654,6 +1645,40 @@ static int zram_rw_page(struct block_device *bdev, sector_t sector,
> >       index = sector >> SECTORS_PER_PAGE_SHIFT;
> >       offset = (sector & (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;
> >
> > +#ifdef CONFIG_ZRAM_WRITEBACK
> > +     /*
> > +      * With writeback feature, zram does asynchronous IO so it's no longer
> > +      * synchronous device so let's remove synchronous io flag. Othewise,
> > +      * upper layer(e.g., swap) could wait IO completion rather than
> > +      * (submit and return), which will cause system sluggish.
> > +      * Furthermore, when the IO function returns(e.g., swap_readpage),
> > +      * upper layer expects IO was done so it could deallocate the page
> > +      * freely but in fact, IO is going on so finally could cause
> > +      * use-after-free when the IO is really done.
> > +      *
> > +      * If the page is not currently written back then we may proceed to
> > +      * read the page synchronously, otherwise, we must fail with
> > +      * -EOPNOTSUPP to force the upper layers to use a normal bio.
> > +      */
> > +     zram_slot_lock(zram, index);
> > +     if (zram_test_flag(zram, index, ZRAM_WB) ||
> > +                     zram_test_flag(zram, index, ZRAM_UNDER_WB)) {
> > +             zram_slot_unlock(zram, index);
> > +             /* We cannot proceed with synchronous read */
> > +             return -EOPNOTSUPP;
> > +     }
> > +
> > +     /*
> > +      * Don't allow the page to be written back while we read it,
> > +      * this flag is never cleared. It shouldn't be a problem that
> > +      * we don't clear this flag because in the case of swap this
> > +      * page will be removed shortly after this read anyway.
> > +      */
> > +     if (op == REQ_OP_READ)
> > +             zram_set_flag(zram, index, ZRAM_NO_WB);
> > +     zram_slot_unlock(zram, index);
> > +#endif
> > +
> >       bv.bv_page = page;
> >       bv.bv_len = PAGE_SIZE;
> >       bv.bv_offset = 0;
> > @@ -1827,15 +1852,6 @@ static const struct block_device_operations zram_devops = {
> >       .owner = THIS_MODULE
> >  };
> >
> > -#ifdef CONFIG_ZRAM_WRITEBACK
> > -static const struct block_device_operations zram_wb_devops = {
> > -     .open = zram_open,
> > -     .submit_bio = zram_submit_bio,
> > -     .swap_slot_free_notify = zram_slot_free_notify,
> > -     .owner = THIS_MODULE
> > -};
> > -#endif
> > -
> >  static DEVICE_ATTR_WO(compact);
> >  static DEVICE_ATTR_RW(disksize);
> >  static DEVICE_ATTR_RO(initstate);
> > diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> > index 158c91e54850..20e4c6a579e0 100644
> > --- a/drivers/block/zram/zram_drv.h
> > +++ b/drivers/block/zram/zram_drv.h
> > @@ -50,6 +50,7 @@ enum zram_pageflags {
> >       ZRAM_UNDER_WB,  /* page is under writeback */
> >       ZRAM_HUGE,      /* Incompressible page */
> >       ZRAM_IDLE,      /* not accessed page since last idle marking */
> > +     ZRAM_NO_WB,     /* Do not allow page to be written back */
> >
> >       __NR_ZRAM_PAGEFLAGS,
> >  };
> > --
> > 2.37.2.789.g6183377224-goog
> >
> >
diff mbox series

Patch

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 92cb929a45b7..22b69e8b6042 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -52,9 +52,6 @@  static unsigned int num_devices = 1;
 static size_t huge_class_size;
 
 static const struct block_device_operations zram_devops;
-#ifdef CONFIG_ZRAM_WRITEBACK
-static const struct block_device_operations zram_wb_devops;
-#endif
 
 static void zram_free_page(struct zram *zram, size_t index);
 static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
@@ -309,7 +306,8 @@  static void mark_idle(struct zram *zram, ktime_t cutoff)
 		 */
 		zram_slot_lock(zram, index);
 		if (zram_allocated(zram, index) &&
-				!zram_test_flag(zram, index, ZRAM_UNDER_WB)) {
+				!zram_test_flag(zram, index, ZRAM_UNDER_WB) &&
+				!zram_test_flag(zram, index, ZRAM_NO_WB)) {
 #ifdef CONFIG_ZRAM_MEMORY_TRACKING
 			is_idle = !cutoff || ktime_after(cutoff, zram->table[index].ac_time);
 #endif
@@ -439,7 +437,6 @@  static void reset_bdev(struct zram *zram)
 	filp_close(zram->backing_dev, NULL);
 	zram->backing_dev = NULL;
 	zram->bdev = NULL;
-	zram->disk->fops = &zram_devops;
 	kvfree(zram->bitmap);
 	zram->bitmap = NULL;
 }
@@ -543,17 +540,6 @@  static ssize_t backing_dev_store(struct device *dev,
 	zram->backing_dev = backing_dev;
 	zram->bitmap = bitmap;
 	zram->nr_pages = nr_pages;
-	/*
-	 * With writeback feature, zram does asynchronous IO so it's no longer
-	 * synchronous device so let's remove synchronous io flag. Othewise,
-	 * upper layer(e.g., swap) could wait IO completion rather than
-	 * (submit and return), which will cause system sluggish.
-	 * Furthermore, when the IO function returns(e.g., swap_readpage),
-	 * upper layer expects IO was done so it could deallocate the page
-	 * freely but in fact, IO is going on so finally could cause
-	 * use-after-free when the IO is really done.
-	 */
-	zram->disk->fops = &zram_wb_devops;
 	up_write(&zram->init_lock);
 
 	pr_info("setup backing device %s\n", file_name);
@@ -722,7 +708,8 @@  static ssize_t writeback_store(struct device *dev,
 
 		if (zram_test_flag(zram, index, ZRAM_WB) ||
 				zram_test_flag(zram, index, ZRAM_SAME) ||
-				zram_test_flag(zram, index, ZRAM_UNDER_WB))
+				zram_test_flag(zram, index, ZRAM_UNDER_WB) ||
+				zram_test_flag(zram, index, ZRAM_NO_WB))
 			goto next;
 
 		if (mode & IDLE_WRITEBACK &&
@@ -1226,6 +1213,10 @@  static void zram_free_page(struct zram *zram, size_t index)
 		goto out;
 	}
 
+	if (zram_test_flag(zram, index, ZRAM_NO_WB)) {
+		zram_clear_flag(zram, index, ZRAM_NO_WB);
+	}
+
 	/*
 	 * No memory is allocated for same element filled pages.
 	 * Simply clear same page flag.
@@ -1654,6 +1645,40 @@  static int zram_rw_page(struct block_device *bdev, sector_t sector,
 	index = sector >> SECTORS_PER_PAGE_SHIFT;
 	offset = (sector & (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;
 
+#ifdef CONFIG_ZRAM_WRITEBACK
+	/*
+	 * With writeback feature, zram does asynchronous IO so it's no longer
+	 * synchronous device so let's remove synchronous io flag. Othewise,
+	 * upper layer(e.g., swap) could wait IO completion rather than
+	 * (submit and return), which will cause system sluggish.
+	 * Furthermore, when the IO function returns(e.g., swap_readpage),
+	 * upper layer expects IO was done so it could deallocate the page
+	 * freely but in fact, IO is going on so finally could cause
+	 * use-after-free when the IO is really done.
+	 *
+	 * If the page is not currently written back then we may proceed to
+	 * read the page synchronously, otherwise, we must fail with
+	 * -EOPNOTSUPP to force the upper layers to use a normal bio.
+	 */
+	zram_slot_lock(zram, index);
+	if (zram_test_flag(zram, index, ZRAM_WB) ||
+			zram_test_flag(zram, index, ZRAM_UNDER_WB)) {
+		zram_slot_unlock(zram, index);
+		/* We cannot proceed with synchronous read */
+		return -EOPNOTSUPP;
+	}
+
+	/*
+	 * Don't allow the page to be written back while we read it,
+	 * this flag is never cleared. It shouldn't be a problem that
+	 * we don't clear this flag because in the case of swap this
+	 * page will be removed shortly after this read anyway.
+	 */
+	if (op == REQ_OP_READ)
+		zram_set_flag(zram, index, ZRAM_NO_WB);
+	zram_slot_unlock(zram, index);
+#endif
+
 	bv.bv_page = page;
 	bv.bv_len = PAGE_SIZE;
 	bv.bv_offset = 0;
@@ -1827,15 +1852,6 @@  static const struct block_device_operations zram_devops = {
 	.owner = THIS_MODULE
 };
 
-#ifdef CONFIG_ZRAM_WRITEBACK
-static const struct block_device_operations zram_wb_devops = {
-	.open = zram_open,
-	.submit_bio = zram_submit_bio,
-	.swap_slot_free_notify = zram_slot_free_notify,
-	.owner = THIS_MODULE
-};
-#endif
-
 static DEVICE_ATTR_WO(compact);
 static DEVICE_ATTR_RW(disksize);
 static DEVICE_ATTR_RO(initstate);
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 158c91e54850..20e4c6a579e0 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -50,6 +50,7 @@  enum zram_pageflags {
 	ZRAM_UNDER_WB,	/* page is under writeback */
 	ZRAM_HUGE,	/* Incompressible page */
 	ZRAM_IDLE,	/* not accessed page since last idle marking */
+	ZRAM_NO_WB,	/* Do not allow page to be written back */
 
 	__NR_ZRAM_PAGEFLAGS,
 };