diff mbox series

[v2,01/15] block: Make bio_set_ioprio() modify fewer bio->bi_ioprio bits

Message ID 20231005194129.1882245-2-bvanassche@acm.org (mailing list archive)
State New, archived
Headers show
Series Pass data temperature information to UFS devices | expand

Commit Message

Bart Van Assche Oct. 5, 2023, 7:40 p.m. UTC
A later patch will store the data lifetime in the bio->bi_ioprio member
before bio_set_ioprio() is called. Make sure that bio_set_ioprio()
doesn't clear more bits than necessary.

Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq.c         |  3 ++-
 include/linux/ioprio.h | 10 ++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

Comments

Kanchan Joshi Oct. 6, 2023, 6:28 a.m. UTC | #1
On Thu, Oct 05, 2023 at 12:40:47PM -0700, Bart Van Assche wrote:
>A later patch will store the data lifetime in the bio->bi_ioprio member
>before bio_set_ioprio() is called. Make sure that bio_set_ioprio()
>doesn't clear more bits than necessary.

Only lifetime bits need to be retained, but the patch retains the CDL
bits too. Is that intentional?
Bart Van Assche Oct. 6, 2023, 6:20 p.m. UTC | #2
On 10/5/23 23:28, Kanchan Joshi wrote:
> On Thu, Oct 05, 2023 at 12:40:47PM -0700, Bart Van Assche wrote:
>> A later patch will store the data lifetime in the bio->bi_ioprio member
>> before bio_set_ioprio() is called. Make sure that bio_set_ioprio()
>> doesn't clear more bits than necessary.
> 
> Only lifetime bits need to be retained, but the patch retains the CDL
> bits too. Is that intentional?

Hi Kanchan,

Yes, this is intentional.

Thanks,

Bart.
Kanchan Joshi Oct. 10, 2023, 5:22 a.m. UTC | #3
On 10/6/2023 1:10 AM, Bart Van Assche wrote:
> A later patch will store the data lifetime in the bio->bi_ioprio member
> before bio_set_ioprio() is called. Make sure that bio_set_ioprio()
> doesn't clear more bits than necessary.
> 
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   block/blk-mq.c         |  3 ++-
>   include/linux/ioprio.h | 10 ++++++++++
>   2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e2d11183f62e..bc086660ffd3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2926,7 +2926,8 @@ static void bio_set_ioprio(struct bio *bio)
>   {
>   	/* Nobody set ioprio so far? Initialize it based on task's nice value */
>   	if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) == IOPRIO_CLASS_NONE)
> -		bio->bi_ioprio = get_current_ioprio();
> +		ioprio_set_class_and_level(&bio->bi_ioprio,
> +					   get_current_ioprio());
>   	blkcg_set_ioprio(bio);
>   }
>   
> diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
> index 7578d4f6a969..f2e768ab4b35 100644
> --- a/include/linux/ioprio.h
> +++ b/include/linux/ioprio.h
> @@ -71,4 +71,14 @@ static inline int ioprio_check_cap(int ioprio)
>   }
>   #endif /* CONFIG_BLOCK */
>   
> +#define IOPRIO_CLASS_LEVEL_MASK ((IOPRIO_CLASS_MASK << IOPRIO_CLASS_SHIFT) | \
> +				 (IOPRIO_LEVEL_MASK << 0))
> +
> +static inline void ioprio_set_class_and_level(u16 *prio, u16 class_level)
> +{
> +	WARN_ON_ONCE(class_level & ~IOPRIO_CLASS_LEVEL_MASK);
> +	*prio &= ~IOPRIO_CLASS_LEVEL_MASK;
> +	*prio |= class_level;

Return of get_current_ioprio() will touch all 16 bits here. So 
user-defined value can alter whatever was set in bio by F2FS (patch 4 in 
this series). Is that not an issue?

And what is the user interface you have in mind. Is it ioprio based, or 
write-hint based or mix of both?
Bart Van Assche Oct. 11, 2023, 4:52 p.m. UTC | #4
On 10/9/23 22:22, Kanchan Joshi wrote:
> On 10/6/2023 1:10 AM, Bart Van Assche wrote:
>> A later patch will store the data lifetime in the bio->bi_ioprio member
>> before bio_set_ioprio() is called. Make sure that bio_set_ioprio()
>> doesn't clear more bits than necessary.
>>
>> Cc: Damien Le Moal <dlemoal@kernel.org>
>> Cc: Niklas Cassel <niklas.cassel@wdc.com>
>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>> ---
>>    block/blk-mq.c         |  3 ++-
>>    include/linux/ioprio.h | 10 ++++++++++
>>    2 files changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index e2d11183f62e..bc086660ffd3 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -2926,7 +2926,8 @@ static void bio_set_ioprio(struct bio *bio)
>>    {
>>    	/* Nobody set ioprio so far? Initialize it based on task's nice value */
>>    	if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) == IOPRIO_CLASS_NONE)
>> -		bio->bi_ioprio = get_current_ioprio();
>> +		ioprio_set_class_and_level(&bio->bi_ioprio,
>> +					   get_current_ioprio());
>>    	blkcg_set_ioprio(bio);
>>    }
>>    
>> diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
>> index 7578d4f6a969..f2e768ab4b35 100644
>> --- a/include/linux/ioprio.h
>> +++ b/include/linux/ioprio.h
>> @@ -71,4 +71,14 @@ static inline int ioprio_check_cap(int ioprio)
>>    }
>>    #endif /* CONFIG_BLOCK */
>>    
>> +#define IOPRIO_CLASS_LEVEL_MASK ((IOPRIO_CLASS_MASK << IOPRIO_CLASS_SHIFT) | \
>> +				 (IOPRIO_LEVEL_MASK << 0))
>> +
>> +static inline void ioprio_set_class_and_level(u16 *prio, u16 class_level)
>> +{
>> +	WARN_ON_ONCE(class_level & ~IOPRIO_CLASS_LEVEL_MASK);
>> +	*prio &= ~IOPRIO_CLASS_LEVEL_MASK;
>> +	*prio |= class_level;
> 
> Return of get_current_ioprio() will touch all 16 bits here. So
> user-defined value can alter whatever was set in bio by F2FS (patch 4 in
> this series). Is that not an issue?

The above is incomprehensible to me. Anyway, I will try to answer.

It is not clear to me why it is claimed that "get_current_ioprio() will
touch all 16 bits here"? The return value of get_current_ioprio() is
passed to ioprio_set_class_and_level() and that function clears the hint
bits from the get_current_ioprio() return value.

ioprio_set_class_and_level() preserves the hint bits set by F2FS.

> And what is the user interface you have in mind. Is it ioprio based, or
> write-hint based or mix of both?

Since the data lifetime is encoded in the hint bits, the hint bits need
to be set by user space to set a data lifetime. In case you would help,
the blktest test that I wrote to test this functionality is available
below.

Thanks,

Bart.


diff --git a/tests/scsi/097 b/tests/scsi/097
new file mode 100755
index 000000000000..01d280021653
--- /dev/null
+++ b/tests/scsi/097
@@ -0,0 +1,62 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-3.0+
+# Copyright (C) 2022 Google LLC
+
+. tests/zbd/rc
+. common/null_blk
+. common/scsi_debug
+
+DESCRIPTION="test block data lifetime support"
+QUICK=1
+
+requires() {
+	_have_fio
+	_have_module scsi_debug
+}
+
+test() {
+	echo "Running ${TEST_NAME}"
+
+	local scsi_debug_params=(
+		delay=0
+		dev_size_mb=1024
+		sector_size=4096
+	)
+	_init_scsi_debug "${scsi_debug_params[@]}" &&
+	local dev="/dev/${SCSI_DEBUG_DEVICES[0]}" fail &&
+	ls -ld "${dev}" >>"${FULL}" &&
+	local i fio_args=(
+		--group_reporting=1
+		--gtod_reduce=1
+		--ioscheduler=none
+		--norandommap
+		--direct=1
+		--rw=randwrite
+		--ioengine=io_uring
+		--time_based
+	) &&
+	for ((i=1; i<=3; i++)); do
+		fio_args+=(
+			--name=whint"$i"
+			--filename="${dev}"
+			--prio=$((i<<6))
+			--time_based
+			--runtime=10
+		)
+	done &&
+	_run_fio "${fio_args[@]}" >>"${FULL}" 2>&1 ||
+	fail=true
+
+	head -n 999 /sys/bus/pseudo/drivers/scsi_debug/group_number_stats >> 
"${FULL}"
+	while read -r group count; do
+		if [ "$count" -gt 0 ]; then echo "$group"; fi
+	done < /sys/bus/pseudo/drivers/scsi_debug/group_number_stats
+	_exit_scsi_debug
+
+	if [ -z "$fail" ]; then
+		echo "Test complete"
+	else
+		echo "Test failed"
+		return 1
+	fi
+}
diff --git a/tests/scsi/097.out b/tests/scsi/097.out
new file mode 100644
index 000000000000..f6ed3d6de861
--- /dev/null
+++ b/tests/scsi/097.out
@@ -0,0 +1,5 @@
+Running scsi/097
+1
+2
+3
+Test complete
Kanchan Joshi Oct. 12, 2023, 8:49 a.m. UTC | #5
On 10/11/2023 10:22 PM, Bart Van Assche wrote:
>>> @@ -2926,7 +2926,8 @@ static void bio_set_ioprio(struct bio *bio)
>>>    {
>>>        /* Nobody set ioprio so far? Initialize it based on task's 
>>> nice value */
>>>        if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) == IOPRIO_CLASS_NONE)
>>> -        bio->bi_ioprio = get_current_ioprio();
>>> +        ioprio_set_class_and_level(&bio->bi_ioprio,
>>> +                       get_current_ioprio());
>>>        blkcg_set_ioprio(bio);
>>>    }
>>> diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
>>> index 7578d4f6a969..f2e768ab4b35 100644
>>> --- a/include/linux/ioprio.h
>>> +++ b/include/linux/ioprio.h
>>> @@ -71,4 +71,14 @@ static inline int ioprio_check_cap(int ioprio)
>>>    }
>>>    #endif /* CONFIG_BLOCK */
>>> +#define IOPRIO_CLASS_LEVEL_MASK ((IOPRIO_CLASS_MASK << 
>>> IOPRIO_CLASS_SHIFT) | \
>>> +                 (IOPRIO_LEVEL_MASK << 0))
>>> +
>>> +static inline void ioprio_set_class_and_level(u16 *prio, u16 
>>> class_level)
>>> +{
>>> +    WARN_ON_ONCE(class_level & ~IOPRIO_CLASS_LEVEL_MASK);
>>> +    *prio &= ~IOPRIO_CLASS_LEVEL_MASK;
>>> +    *prio |= class_level;
>>
>> Return of get_current_ioprio() will touch all 16 bits here. So
>> user-defined value can alter whatever was set in bio by F2FS (patch 4 in
>> this series). Is that not an issue?
> 
> The above is incomprehensible to me. Anyway, I will try to answer.
> 
> It is not clear to me why it is claimed that "get_current_ioprio() will
> touch all 16 bits here"? The return value of get_current_ioprio() is
> passed to ioprio_set_class_and_level() and that function clears the hint
> bits from the get_current_ioprio() return value.

Function does OR bio->bi_ioprio with whatever is the return of 
get_current_ioprio(). So if lifetime bits were set in 
get_current_ioprio(), you will end up setting that in bio->bi_ioprio too.


> ioprio_set_class_and_level() preserves the hint bits set by F2FS.
> 
>> And what is the user interface you have in mind. Is it ioprio based, or
>> write-hint based or mix of both?
> 
> Since the data lifetime is encoded in the hint bits, the hint bits need
> to be set by user space to set a data lifetime.

I asked because more than one way seems to emerge here. Parts of this 
series (Patch 4) are taking inode->i_write_hint (and not ioprio value) 
and putting that into bio.
I wonder what to expect if application get to send one lifetime with 
fcntl (write-hints) and different one with ioprio. Is that not racy?


> In case you would help,
> the blktest test that I wrote to test this functionality is available
> below.
> 
> Thanks,
> 
> Bart.
> 
> 
> diff --git a/tests/scsi/097 b/tests/scsi/097
> new file mode 100755
> index 000000000000..01d280021653
> --- /dev/null
> +++ b/tests/scsi/097
> @@ -0,0 +1,62 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-3.0+
> +# Copyright (C) 2022 Google LLC
> +
> +. tests/zbd/rc
> +. common/null_blk
> +. common/scsi_debug
> +
> +DESCRIPTION="test block data lifetime support"
> +QUICK=1
> +
> +requires() {
> +    _have_fio
> +    _have_module scsi_debug
> +}
> +
> +test() {
> +    echo "Running ${TEST_NAME}"
> +
> +    local scsi_debug_params=(
> +        delay=0
> +        dev_size_mb=1024
> +        sector_size=4096
> +    )
> +    _init_scsi_debug "${scsi_debug_params[@]}" &&
> +    local dev="/dev/${SCSI_DEBUG_DEVICES[0]}" fail &&
> +    ls -ld "${dev}" >>"${FULL}" &&
> +    local i fio_args=(
> +        --group_reporting=1
> +        --gtod_reduce=1
> +        --ioscheduler=none
> +        --norandommap
> +        --direct=1
> +        --rw=randwrite
> +        --ioengine=io_uring
> +        --time_based
> +    ) &&
> +    for ((i=1; i<=3; i++)); do
> +        fio_args+=(
> +            --name=whint"$i"
> +            --filename="${dev}"
> +            --prio=$((i<<6))

This will not work as prio can only take values between 0-7.
Perhaps you want to use "priohint" to send lifetime.
Niklas Cassel Oct. 12, 2023, 2:03 p.m. UTC | #6
On Thu, Oct 12, 2023 at 02:19:02PM +0530, Kanchan Joshi wrote:
> On 10/11/2023 10:22 PM, Bart Van Assche wrote:
> >>> @@ -2926,7 +2926,8 @@ static void bio_set_ioprio(struct bio *bio)
> >>>    {
> >>>        /* Nobody set ioprio so far? Initialize it based on task's 
> >>> nice value */
> >>>        if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) == IOPRIO_CLASS_NONE)
> >>> -        bio->bi_ioprio = get_current_ioprio();
> >>> +        ioprio_set_class_and_level(&bio->bi_ioprio,
> >>> +                       get_current_ioprio());
> >>>        blkcg_set_ioprio(bio);
> >>>    }
> >>> diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
> >>> index 7578d4f6a969..f2e768ab4b35 100644
> >>> --- a/include/linux/ioprio.h
> >>> +++ b/include/linux/ioprio.h
> >>> @@ -71,4 +71,14 @@ static inline int ioprio_check_cap(int ioprio)
> >>>    }
> >>>    #endif /* CONFIG_BLOCK */
> >>> +#define IOPRIO_CLASS_LEVEL_MASK ((IOPRIO_CLASS_MASK << 
> >>> IOPRIO_CLASS_SHIFT) | \
> >>> +                 (IOPRIO_LEVEL_MASK << 0))
> >>> +
> >>> +static inline void ioprio_set_class_and_level(u16 *prio, u16 
> >>> class_level)
> >>> +{
> >>> +    WARN_ON_ONCE(class_level & ~IOPRIO_CLASS_LEVEL_MASK);
> >>> +    *prio &= ~IOPRIO_CLASS_LEVEL_MASK;
> >>> +    *prio |= class_level;
> >>
> >> Return of get_current_ioprio() will touch all 16 bits here. So
> >> user-defined value can alter whatever was set in bio by F2FS (patch 4 in
> >> this series). Is that not an issue?
> > 
> > The above is incomprehensible to me. Anyway, I will try to answer.
> > 
> > It is not clear to me why it is claimed that "get_current_ioprio() will
> > touch all 16 bits here"? The return value of get_current_ioprio() is
> > passed to ioprio_set_class_and_level() and that function clears the hint
> > bits from the get_current_ioprio() return value.
> 
> Function does OR bio->bi_ioprio with whatever is the return of 
> get_current_ioprio(). So if lifetime bits were set in 
> get_current_ioprio(), you will end up setting that in bio->bi_ioprio too.
> 
> 
> > ioprio_set_class_and_level() preserves the hint bits set by F2FS.
> > 
> >> And what is the user interface you have in mind. Is it ioprio based, or
> >> write-hint based or mix of both?
> > 
> > Since the data lifetime is encoded in the hint bits, the hint bits need
> > to be set by user space to set a data lifetime.
> 
> I asked because more than one way seems to emerge here. Parts of this 
> series (Patch 4) are taking inode->i_write_hint (and not ioprio value) 
> and putting that into bio.
> I wonder what to expect if application get to send one lifetime with 
> fcntl (write-hints) and different one with ioprio. Is that not racy?

Hello Kanchan,

The fcntl F_SET_RW_HINT still exists, which sets inode->i_write_hint.
This is currently only used by f2fs.

The usage of inode->i_write_hint was removed from all filesystems
(except f2fs) in:
c75e707fe1aa ("block: remove the per-bio/request write hint").
This commit also removed bi_write_hint from struct bio.

The fcntl F_SET_FILE_RW_HINT, which used to set f->f_write_hint was removed
in:
7b12e49669c9 ("fs: remove fs.f_write_hint")
This commit also removed f_write_hint from struct file.

My thinking when suggesting to reuse ioprio hints, was that we don't need
to readd write_hint struct members to struct bio and struct request.

SCSI can just reuse the hints bits in ioprio.



Note that my filesystem knowledge is not the best...

But for f2fs, I guess it just needs to set the bio->ioprio hint bits
correctly.

I guess the confusion is if an application does both:
ioprio_set() and fcntl(.., F_SET_RW_HINT, ..), what should the filesystem
use?

Or.. if you use e.g. io_uring to write to a file stored on f2fs...
io_uring has sqe->ioprio, which can contain a write hint, does this get
propagated to the filesystem?
And if so, what if you also did fcntl(.., F_SET_RW_HINT, ..) to set
i_write_hint? Which should the filesystem use to set bio->ioprio?


Kind regards,
Niklas
Bart Van Assche Oct. 12, 2023, 5:42 p.m. UTC | #7
On 10/12/23 01:49, Kanchan Joshi wrote:
> Function does OR bio->bi_ioprio with whatever is the return of
> get_current_ioprio().

No, that's not what ioprio_set_class_and_level() does. It clears the 
hint bits before it performs a logical OR.

> So if lifetime bits were set in get_current_ioprio(), you will end up
> setting that in bio->bi_ioprio too.
I'm not sure there are any use cases where it is useful to set the data
lifetime for an entire process.

Anyway, how about replacing this patch with the patch below? This will
allow to set hint information for an entire process.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e2d11183f62e..3419ca4c1bf4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2924,9 +2924,14 @@ static inline struct request 
*blk_mq_get_cached_request(struct request_queue *q,

  static void bio_set_ioprio(struct bio *bio)
  {
+	u16 cur_ioprio = get_current_ioprio();
+
  	/* Nobody set ioprio so far? Initialize it based on task's nice value */
  	if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) == IOPRIO_CLASS_NONE)
-		bio->bi_ioprio = get_current_ioprio();
+		bio->bi_ioprio |= cur_ioprio & IOPRIO_CLASS_LEVEL_MASK;
+	if (IOPRIO_PRIO_HINT(bio->bi_ioprio) == 0)
+		bio->bi_ioprio |= cur_ioprio &
+			(IOPRIO_HINT_MASK << IOPRIO_HINT_SHIFT);
  	blkcg_set_ioprio(bio);
  }

diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index 7578d4f6a969..5697832f35a3 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -71,4 +71,7 @@ static inline int ioprio_check_cap(int ioprio)
  }
  #endif /* CONFIG_BLOCK */

+#define IOPRIO_CLASS_LEVEL_MASK ((IOPRIO_CLASS_MASK << 
IOPRIO_CLASS_SHIFT) | \
+				 (IOPRIO_LEVEL_MASK << 0))
+
  #endif


>> ioprio_set_class_and_level() preserves the hint bits set by F2FS.
>>
>>> And what is the user interface you have in mind. Is it ioprio based, or
>>> write-hint based or mix of both?
>>
>> Since the data lifetime is encoded in the hint bits, the hint bits need
>> to be set by user space to set a data lifetime.
> 
> I asked because more than one way seems to emerge here. Parts of this
> series (Patch 4) are taking inode->i_write_hint (and not ioprio value)
> and putting that into bio.
> I wonder what to expect if application get to send one lifetime with
> fcntl (write-hints) and different one with ioprio. Is that not racy?

There is no race condition. F_SET_RW_HINT can be used to set 
inode->i_write_hint. The filesystem may use the inode->i_write_hint 
information. I think F2FS uses this information in its block allocator.

>> +            --prio=$((i<<6))
> 
> This will not work as prio can only take values between 0-7.
> Perhaps you want to use "priohint" to send lifetime.

Thanks for having mentioned the priohint option. The above works on my
test setup since I modified fio locally to accept larger prio values. I
will switch to the priohint option.

Bart.
diff mbox series

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e2d11183f62e..bc086660ffd3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2926,7 +2926,8 @@  static void bio_set_ioprio(struct bio *bio)
 {
 	/* Nobody set ioprio so far? Initialize it based on task's nice value */
 	if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) == IOPRIO_CLASS_NONE)
-		bio->bi_ioprio = get_current_ioprio();
+		ioprio_set_class_and_level(&bio->bi_ioprio,
+					   get_current_ioprio());
 	blkcg_set_ioprio(bio);
 }
 
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index 7578d4f6a969..f2e768ab4b35 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -71,4 +71,14 @@  static inline int ioprio_check_cap(int ioprio)
 }
 #endif /* CONFIG_BLOCK */
 
+#define IOPRIO_CLASS_LEVEL_MASK ((IOPRIO_CLASS_MASK << IOPRIO_CLASS_SHIFT) | \
+				 (IOPRIO_LEVEL_MASK << 0))
+
+static inline void ioprio_set_class_and_level(u16 *prio, u16 class_level)
+{
+	WARN_ON_ONCE(class_level & ~IOPRIO_CLASS_LEVEL_MASK);
+	*prio &= ~IOPRIO_CLASS_LEVEL_MASK;
+	*prio |= class_level;
+}
+
 #endif