[v3,01/14] fs: Move enum rw_hint into a new header file

Message ID	20231017204739.3409052-2-bvanassche@acm.org (mailing list archive)
State	New, archived
Headers
Return-Path: <linux-block-owner@vger.kernel.org>
From: Bart Van Assche <bvanassche@acm.org>
To: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org, linux-fsdevel@vger.kernel.org, "Martin K . Petersen" <martin.petersen@oracle.com>, Christoph Hellwig <hch@lst.de>, Niklas Cassel <Niklas.Cassel@wdc.com>, Avri Altman <Avri.Altman@wdc.com>, Bean Huo <huobean@gmail.com>, Daejun Park <daejun7.park@samsung.com>, Bart Van Assche <bvanassche@acm.org>, Jan Kara <jack@suse.cz>, Christian Brauner <brauner@kernel.org>, Jaegeuk Kim <jaegeuk@kernel.org>, Chao Yu <chao@kernel.org>, Alexander Viro <viro@zeniv.linux.org.uk>, Jeff Layton <jlayton@kernel.org>, Chuck Lever <chuck.lever@oracle.com>
Subject: [PATCH v3 01/14] fs: Move enum rw_hint into a new header file
Date: Tue, 17 Oct 2023 13:47:09 -0700
Message-ID: <20231017204739.3409052-2-bvanassche@acm.org>
In-Reply-To: <20231017204739.3409052-1-bvanassche@acm.org>
References: <20231017204739.3409052-1-bvanassche@acm.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk
Series	Pass data temperature information to SCSI disk devices

Bart Van Assche Oct. 17, 2023, 8:47 p.m. UTC

Move enum rw_hint into a new header file to prepare for using this data
type in the block layer. Add the attribute __packed to reduce the space
occupied by instances of this data type from four bytes to one byte.
Change the data type of i_write_hint from u8 to enum rw_hint. Replace
the RWH_* constants with literal constants so that
<uapi/linux/fcntl.h> does not have to be included.

Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 fs/f2fs/f2fs.h          |  1 +
 fs/fcntl.c              |  1 +
 fs/inode.c              |  1 +
 include/linux/fs.h      | 16 ++--------------
 include/linux/rw_hint.h | 20 ++++++++++++++++++++
 5 files changed, 25 insertions(+), 14 deletions(-)
 create mode 100644 include/linux/rw_hint.h
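
The size claim in the commit message can be checked outside the kernel. The following is a minimal userspace sketch, assuming GCC or Clang, where the kernel's __packed expands to __attribute__((__packed__)). The names rw_hint_unpacked, demo_inode_u8 and demo_inode_enum are hypothetical and exist only for this illustration; they are not part of the patch.

```c
#include <assert.h>	/* static_assert (C11), assert */

/* Unpacked copy for comparison only; hypothetical name, not in the patch. */
enum rw_hint_unpacked {
	UNPACKED_LIFE_NOT_SET	= 0,
	UNPACKED_LIFE_EXTREME	= 5,
};

/* Same values as the patch; the kernel spells the attribute __packed. */
enum rw_hint {
	WRITE_LIFE_NOT_SET	= 0,
	WRITE_LIFE_NONE		= 1,
	WRITE_LIFE_SHORT	= 2,
	WRITE_LIFE_MEDIUM	= 3,
	WRITE_LIFE_LONG		= 4,
	WRITE_LIFE_EXTREME	= 5,
} __attribute__((__packed__));

/* Mirrors the compile-time check added by the patch. */
static_assert(sizeof(enum rw_hint) == 1, "enum rw_hint must fit in one byte");

/* Hypothetical stand-ins for the two neighbouring struct inode members. */
struct demo_inode_u8 {
	unsigned char	i_blkbits;
	unsigned char	i_write_hint;	/* before the patch: u8 */
};

struct demo_inode_enum {
	unsigned char	i_blkbits;
	enum rw_hint	i_write_hint;	/* after the patch: packed enum */
};

/* Switching the member type should not change the layout. */
static_assert(sizeof(struct demo_inode_enum) == sizeof(struct demo_inode_u8),
	      "packed enum member must not grow the struct");
```

If the attribute is dropped from enum rw_hint, the first static_assert is expected to fail on common ABIs, where an enum takes the size of int unless -fshort-enums is in effect.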

Kanchan Joshi Oct. 30, 2023, 11:11 a.m. UTC | #1

On 10/18/2023 2:17 AM, Bart Van Assche wrote:
> - * Write life time hint values.
> - * Stored in struct inode as u8.
> - */
> -enum rw_hint {
> -	WRITE_LIFE_NOT_SET	= 0,
> -	WRITE_LIFE_NONE		= RWH_WRITE_LIFE_NONE,
> -	WRITE_LIFE_SHORT	= RWH_WRITE_LIFE_SHORT,
> -	WRITE_LIFE_MEDIUM	= RWH_WRITE_LIFE_MEDIUM,
> -	WRITE_LIFE_LONG		= RWH_WRITE_LIFE_LONG,
> -	WRITE_LIFE_EXTREME	= RWH_WRITE_LIFE_EXTREME,
> -};
> -
>   /* Match RWF_* bits to IOCB bits */
>   #define IOCB_HIPRI		(__force int) RWF_HIPRI
>   #define IOCB_DSYNC		(__force int) RWF_DSYNC
> @@ -677,7 +665,7 @@ struct inode {
>   	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
>   	unsigned short          i_bytes;
>   	u8			i_blkbits;
> -	u8			i_write_hint;
> +	enum rw_hint		i_write_hint;
>   	blkcnt_t		i_blocks;
>   
>   #ifdef __NEED_I_SIZE_ORDERED
> diff --git a/include/linux/rw_hint.h b/include/linux/rw_hint.h
> new file mode 100644
> index 000000000000..4a7d28945973
> --- /dev/null
> +++ b/include/linux/rw_hint.h
> @@ -0,0 +1,20 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_RW_HINT_H
> +#define _LINUX_RW_HINT_H
> +
> +#include <linux/build_bug.h>
> +#include <linux/compiler_attributes.h>
> +
> +/* Block storage write lifetime hint values. */
> +enum rw_hint {
> +	WRITE_LIFE_NOT_SET	= 0, /* RWH_WRITE_LIFE_NOT_SET */
> +	WRITE_LIFE_NONE		= 1, /* RWH_WRITE_LIFE_NONE */
> +	WRITE_LIFE_SHORT	= 2, /* RWH_WRITE_LIFE_SHORT */
> +	WRITE_LIFE_MEDIUM	= 3, /* RWH_WRITE_LIFE_MEDIUM */
> +	WRITE_LIFE_LONG		= 4, /* RWH_WRITE_LIFE_LONG */
> +	WRITE_LIFE_EXTREME	= 5, /* RWH_WRITE_LIFE_EXTREME */
> +} __packed;
> +
> +static_assert(sizeof(enum rw_hint) == 1);

Does it make sense to do away with these, and have temperature-neutral 
names instead e.g., WRITE_LIFE_1, WRITE_LIFE_2?

With the current choice:
- If the count goes up (beyond 5 hints), infra can scale fine but these 
names do not. Imagine ULTRA_EXTREME after EXTREME.
- Applications or in-kernel users can specify LONG hint with data that 
actually has a SHORT lifetime. Nothing really ensures that LONG is 
really LONG.

Temperature-neutral names seem more generic/scalable and do not present 
the unnecessary need to be accurate with relative temperatures.
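
For illustration, the proposal above might look as follows. This is only a sketch under stated assumptions: the mapping of WRITE_LIFE_1..WRITE_LIFE_4 onto the numeric values 2..5 and the helper rw_hint_from_class() are hypothetical and do not come from the patch or from the review comment. The numeric values are kept identical to the RWH_* values so that the user-visible interface would not change.

```c
#include <assert.h>	/* static_assert (C11), assert */

/*
 * Hypothetical temperature-neutral spelling of enum rw_hint.
 * Only the names differ from the patch; the values are unchanged.
 */
enum rw_hint {
	WRITE_LIFE_NOT_SET	= 0,
	WRITE_LIFE_NONE		= 1,
	WRITE_LIFE_1		= 2,	/* WRITE_LIFE_SHORT in the patch */
	WRITE_LIFE_2		= 3,	/* WRITE_LIFE_MEDIUM in the patch */
	WRITE_LIFE_3		= 4,	/* WRITE_LIFE_LONG in the patch */
	WRITE_LIFE_4		= 5,	/* WRITE_LIFE_EXTREME in the patch */
} __attribute__((__packed__));

static_assert(sizeof(enum rw_hint) == 1, "enum rw_hint must fit in one byte");

/* Number of lifetime classes in this sketch. */
#define WRITE_LIFE_NR_CLASSES	4u

/*
 * Hypothetical helper: map a 1-based class index to a hint value.
 * Out-of-range indices fall back to WRITE_LIFE_NOT_SET.
 */
enum rw_hint rw_hint_from_class(unsigned int class_idx)
{
	if (class_idx < 1 || class_idx > WRITE_LIFE_NR_CLASSES)
		return WRITE_LIFE_NOT_SET;
	return (enum rw_hint)(class_idx + 1);
}
```

With numbered names, raising the class count would mean adding WRITE_LIFE_5 and bumping WRITE_LIFE_NR_CLASSES rather than inventing a new adjective.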

Bart Van Assche Oct. 30, 2023, 4:10 p.m. UTC | #2

On 10/30/23 04:11, Kanchan Joshi wrote:
> On 10/18/2023 2:17 AM, Bart Van Assche wrote:
>> +/* Block storage write lifetime hint values. */
>> +enum rw_hint {
>> +	WRITE_LIFE_NOT_SET	= 0, /* RWH_WRITE_LIFE_NOT_SET */
>> +	WRITE_LIFE_NONE		= 1, /* RWH_WRITE_LIFE_NONE */
>> +	WRITE_LIFE_SHORT	= 2, /* RWH_WRITE_LIFE_SHORT */
>> +	WRITE_LIFE_MEDIUM	= 3, /* RWH_WRITE_LIFE_MEDIUM */
>> +	WRITE_LIFE_LONG		= 4, /* RWH_WRITE_LIFE_LONG */
>> +	WRITE_LIFE_EXTREME	= 5, /* RWH_WRITE_LIFE_EXTREME */
>> +} __packed;
>> +
>> +static_assert(sizeof(enum rw_hint) == 1);
> 
> Does it make sense to do away with these, and have temperature-neutral
> names instead e.g., WRITE_LIFE_1, WRITE_LIFE_2?
> 
> With the current choice:
> - If the count goes up (beyond 5 hints), infra can scale fine but these
> names do not. Imagine ULTRA_EXTREME after EXTREME.
> - Applications or in-kernel users can specify LONG hint with data that
> actually has a SHORT lifetime. Nothing really ensures that LONG is
> really LONG.
> 
> Temperature-neutral names seem more generic/scalable and do not present
> the unnecessary need to be accurate with relative temperatures.

Thanks for having taken a look at this patch series. Jens asked for data
that shows that this patch series improves performance. Is this
something Samsung can help with?

Thanks,

Bart.

Daejun Park Nov. 1, 2023, 6:39 a.m. UTC | #3

Hi Bart,

>On 10/30/23 04:11, Kanchan Joshi wrote:
>> On 10/18/2023 2:17 AM, Bart Van Assche wrote:
>>> +/* Block storage write lifetime hint values. */
>>> +enum rw_hint {
>>> +        WRITE_LIFE_NOT_SET        = 0, /* RWH_WRITE_LIFE_NOT_SET */
>>> +        WRITE_LIFE_NONE                = 1, /* RWH_WRITE_LIFE_NONE */
>>> +        WRITE_LIFE_SHORT        = 2, /* RWH_WRITE_LIFE_SHORT */
>>> +        WRITE_LIFE_MEDIUM        = 3, /* RWH_WRITE_LIFE_MEDIUM */
>>> +        WRITE_LIFE_LONG                = 4, /* RWH_WRITE_LIFE_LONG */
>>> +        WRITE_LIFE_EXTREME        = 5, /* RWH_WRITE_LIFE_EXTREME */
>>> +} __packed;
>>> +
>>> +static_assert(sizeof(enum rw_hint) == 1);
>> 
>> Does it make sense to do away with these, and have temperature-neutral
>> names instead e.g., WRITE_LIFE_1, WRITE_LIFE_2?
>> 
>> With the current choice:
>> - If the count goes up (beyond 5 hints), infra can scale fine but these
>> names do not. Imagine ULTRA_EXTREME after EXTREME.
>> - Applications or in-kernel users can specify LONG hint with data that
>> actually has a SHORT lifetime. Nothing really ensures that LONG is
>> really LONG.
>> 
>> Temperature-neutral names seem more generic/scalable and do not present
>> the unnecessary need to be accurate with relative temperatures.
>
>Thanks for having taken a look at this patch series. Jens asked for data
>that shows that this patch series improves performance. Is this
>something Samsung can help with?

We analyzed the NAND block erase counter with and without stream separation
under a long-term workload on F2FS.
The analysis showed that the erase counter is reduced by approximately 40%
with stream separation.
The long-term workload is a scenario in which, after a precondition fill for
each F2FS temperature, erase and write operations are repeated per stream.

Thanks,

Daejun.


Bart Van Assche Nov. 1, 2023, 4:45 p.m. UTC | #4

On 10/31/23 23:39, Daejun Park wrote:
>> On 10/30/23 04:11, Kanchan Joshi wrote:
>>> On 10/18/2023 2:17 AM, Bart Van Assche wrote:
>> Thanks for having taken a look at this patch series. Jens asked for data
>> that shows that this patch series improves performance. Is this
>> something Samsung can help with?
> 
> We analyzed the NAND block erase counter with and without stream separation
> through a long-term workload in F2FS.
> The analysis showed that the erase counter is reduced by approximately 40%
> with stream separation.
> Long-term workload is a scenario where erase and write are repeated by
> stream after performing precondition fill for each temperature of F2FS.

Hi Daejun,

Thank you for having shared this data. This is very helpful. Since I'm
not familiar with the erase counter: does the above data perhaps mean
that write amplification is reduced by 40% in the workload that has been
examined?

Thanks,

Bart.

Daejun Park Nov. 2, 2023, 7:31 a.m. UTC | #5

Hi Bart,

>On 10/31/23 23:39, Daejun Park wrote:
>>> On 10/30/23 04:11, Kanchan Joshi wrote:
>>>> On 10/18/2023 2:17 AM, Bart Van Assche wrote:
>>> Thanks for having taken a look at this patch series. Jens asked for data
>>> that shows that this patch series improves performance. Is this
>>> something Samsung can help with?
>> 
>> We analyzed the NAND block erase counter with and without stream separation
>> through a long-term workload in F2FS.
>> The analysis showed that the erase counter is reduced by approximately 40%
>> with stream separation.
>> Long-term workload is a scenario where erase and write are repeated by
>> stream after performing precondition fill for each temperature of F2FS.
>
>Hi Daejun,
>
>Thank you for having shared this data. This is very helpful. Since I'm
>not familiar with the erase counter: does the above data perhaps mean
>that write amplification is reduced by 40% in the workload that has been
>examined?

GC is not the only cause of WAF (write amplification factor); other
factors contribute as well.
During device GC, the valid pages in the victim block are migrated, and a
lower erase counter means that GC is performed more efficiently, by selecting
victim blocks with a small number of valid pages.
Thus, it can be said that the WAF can be decreased by about 40% by selecting
fewer victim blocks during device GC.

Thanks,

Daejun
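
The relation between the erase counter and WAF discussed above can be sketched with some arithmetic. This is a simplified model, not the measurement method used in the thread: it assumes that every erased NAND block is subsequently programmed in full and that the host writes the same amount of data in both runs. The function names and any input numbers are hypothetical.

```c
#include <stdint.h>

/*
 * Bytes programmed to NAND, under the assumption that each erased
 * block is later rewritten completely.
 */
double nand_bytes_written(uint64_t erase_count, uint64_t block_bytes)
{
	return (double)erase_count * (double)block_bytes;
}

/* WAF = bytes written to NAND / bytes written by the host. */
double write_amplification(double nand_bytes, double host_bytes)
{
	return nand_bytes / host_bytes;
}

/*
 * Ratio of WAF with stream separation to WAF without it, for the same
 * host write volume. Under the assumptions above the block size and the
 * host volume cancel out, leaving the ratio of the erase counters.
 */
double waf_ratio(uint64_t erases_with, uint64_t erases_without,
		 uint64_t block_bytes, double host_bytes)
{
	double with = write_amplification(
		nand_bytes_written(erases_with, block_bytes), host_bytes);
	double without = write_amplification(
		nand_bytes_written(erases_without, block_bytes), host_bytes);

	return with / without;
}
```

Under this model, an erase counter that drops from 1000 to 600 for the same host workload corresponds to a WAF ratio of 0.6, that is, a reduction of about 40%. Sources of write amplification other than GC are not captured by this sketch.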

