
[4/6] block: avoid ordered task state change for polled IO

Message ID: 20181110151317.3813-5-axboe@kernel.dk
State: New, archived
Series: Various block optimizations

Commit Message

Jens Axboe Nov. 10, 2018, 3:13 p.m. UTC
We only really need the barrier if we're going to be sleeping;
if we're just polling, we're fine with the __set_current_state()
variant.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)
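
For reference, the only difference between the two helpers (simplified
from the era's include/linux/sched.h, ignoring the debug variants) is
the implied full memory barrier:

	/* Plain store of the task state; no memory barrier. */
	#define __set_current_state(state_value)		\
		current->state = (state_value)

	/* Store the task state and issue a full memory barrier;
	 * smp_store_mb() is a store followed by smp_mb(). */
	#define set_current_state(state_value)			\
		smp_store_mb(current->state, (state_value))

The patch keeps the cheap plain store on every polling iteration and
only pays for the full barrier when the task is actually about to
schedule out.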

Comments

jianchao.wang Nov. 12, 2018, 9:35 a.m. UTC | #1
Hi Jens

On 11/10/18 11:13 PM, Jens Axboe wrote:
> We only really need the barrier if we're going to be sleeping;
> if we're just polling, we're fine with the __set_current_state()
> variant.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/block_dev.c | 25 +++++++++++++++++++++----
>  1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index c039abfb2052..e7550b1f9670 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -237,12 +237,23 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	qc = submit_bio(&bio);
>  	for (;;) {
> -		set_current_state(TASK_UNINTERRUPTIBLE);
> +		/*
> +		 * Using non-atomic task state for polling
> +		 */
> +		__set_current_state(TASK_UNINTERRUPTIBLE);
> +
>  		if (!READ_ONCE(bio.bi_private))
>  			break;

When we don't use polling, blkdev_bio_end_io_simple() may run on a
different CPU before submit_bio() returns. For example:

__blkdev_direct_IO_simple
  qc = submit_bio(&bio)
                                          blkdev_bio_end_io_simple
                                            WRITE_ONCE(bio.bi_private, NULL)
                                            wake_up_process(waiter)
  __set_current_state(TASK_UNINTERRUPTIBLE)
  READ_ONCE(bio.bi_private)

In that case, at least an smp_rmb() may be needed before
READ_ONCE(bio.bi_private) to ensure the WRITE_ONCE(bio.bi_private, NULL)
is seen if we miss the wakeup.
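
As a sketch, the suggested barrier (hypothetical, not part of the
posted patch) would sit between the state change and the test:

	qc = submit_bio(&bio);
	for (;;) {
		__set_current_state(TASK_UNINTERRUPTIBLE);

		/*
		 * Suggested: pair with the full barrier implied by
		 * try_to_wake_up() so that a completion racing with us
		 * on another CPU has its WRITE_ONCE(bio.bi_private, NULL)
		 * visible before we test it.
		 */
		smp_rmb();

		if (!READ_ONCE(bio.bi_private))
			break;

		if (!(iocb->ki_flags & IOCB_HIPRI) ||
		    !blk_poll(bdev_get_queue(bdev), qc)) {
			smp_mb();
			io_schedule();
		}
	}
	__set_current_state(TASK_RUNNING);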

> +
>  		if (!(iocb->ki_flags & IOCB_HIPRI) ||
> -		    !blk_poll(bdev_get_queue(bdev), qc))
> +		    !blk_poll(bdev_get_queue(bdev), qc)) {
> +			/*
> +			 * If we're scheduling, ensure that task state
> +			 * change is ordered and seen.
> +			 */
> +			smp_mb();
>  			io_schedule();
> +		}
>  	}
>  	__set_current_state(TASK_RUNNING);
>  
> @@ -403,13 +414,19 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  		return -EIOCBQUEUED;
>  
>  	for (;;) {
> -		set_current_state(TASK_UNINTERRUPTIBLE);
> +		/*
> +		 * See __blkdev_direct_IO_simple() loop
> +		 */
> +		__set_current_state(TASK_UNINTERRUPTIBLE);
> +

Similar to the above.

>  		if (!READ_ONCE(dio->waiter))
>  			break;
>  
>  		if (!(iocb->ki_flags & IOCB_HIPRI) ||
> -		    !blk_poll(bdev_get_queue(bdev), qc))
> +		    !blk_poll(bdev_get_queue(bdev), qc)) {
> +			smp_mb();
>  			io_schedule();
> +		}
>  	}
>  	__set_current_state(TASK_RUNNING);
>  
> 

Thanks
Jianchao
Jens Axboe Nov. 12, 2018, 4:26 p.m. UTC | #2
On 11/12/18 2:35 AM, jianchao.wang wrote:
> Hi Jens
> 
> On 11/10/18 11:13 PM, Jens Axboe wrote:
>> We only really need the barrier if we're going to be sleeping;
>> if we're just polling, we're fine with the __set_current_state()
>> variant.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  fs/block_dev.c | 25 +++++++++++++++++++++----
>>  1 file changed, 21 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index c039abfb2052..e7550b1f9670 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -237,12 +237,23 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>>  
>>  	qc = submit_bio(&bio);
>>  	for (;;) {
>> -		set_current_state(TASK_UNINTERRUPTIBLE);
>> +		/*
>> +		 * Using non-atomic task state for polling
>> +		 */
>> +		__set_current_state(TASK_UNINTERRUPTIBLE);
>> +
>>  		if (!READ_ONCE(bio.bi_private))
>>  			break;
> 
> When we don't use polling, blkdev_bio_end_io_simple() may run on a
> different CPU before submit_bio() returns. For example:
> __blkdev_direct_IO_simple
>   qc = submit_bio(&bio)
>                                           blkdev_bio_end_io_simple
>                                             WRITE_ONCE(bio.bi_private, NULL)
>                                             wake_up_process(waiter)
>   __set_current_state(TASK_UNINTERRUPTIBLE)
>   READ_ONCE(bio.bi_private)

I agree you're right, but that's an existing issue. The store
barrier implied by set_current_state() doesn't fix that.

How about this variant:

http://git.kernel.dk/cgit/linux-block/commit/?h=mq-perf&id=036ba7a8d1334889c0fe55d4858d78f4e8022f12

which then changes the later wake-up helper patch to be:
 
http://git.kernel.dk/cgit/linux-block/commit/?h=mq-perf&id=f8c3f188425967adb040cfefb799b0a5a1df769d
jianchao.wang Nov. 13, 2018, 2:02 a.m. UTC | #3
Hi Jens

On 11/13/18 12:26 AM, Jens Axboe wrote:
> On 11/12/18 2:35 AM, jianchao.wang wrote:
>> Hi Jens
>>
>> On 11/10/18 11:13 PM, Jens Axboe wrote:
>>> We only really need the barrier if we're going to be sleeping;
>>> if we're just polling, we're fine with the __set_current_state()
>>> variant.
>>>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>> ---
>>>  fs/block_dev.c | 25 +++++++++++++++++++++----
>>>  1 file changed, 21 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>>> index c039abfb2052..e7550b1f9670 100644
>>> --- a/fs/block_dev.c
>>> +++ b/fs/block_dev.c
>>> @@ -237,12 +237,23 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>>>  
>>>  	qc = submit_bio(&bio);
>>>  	for (;;) {
>>> -		set_current_state(TASK_UNINTERRUPTIBLE);
>>> +		/*
>>> +		 * Using non-atomic task state for polling
>>> +		 */
>>> +		__set_current_state(TASK_UNINTERRUPTIBLE);
>>> +
>>>  		if (!READ_ONCE(bio.bi_private))
>>>  			break;
>>
>> When we don't use polling, blkdev_bio_end_io_simple() may run on a
>> different CPU before submit_bio() returns. For example:
>> __blkdev_direct_IO_simple
>>   qc = submit_bio(&bio)
>>                                           blkdev_bio_end_io_simple
>>                                             WRITE_ONCE(bio.bi_private, NULL)
>>                                             wake_up_process(waiter)
>>   __set_current_state(TASK_UNINTERRUPTIBLE)
>>   READ_ONCE(bio.bi_private)
> 
> I agree you're right, but that's an existing issue. The store
> barrier implied by set_current_state() doesn't fix that.
> 
> How about this variant:
> 
> http://git.kernel.dk/cgit/linux-block/commit/?h=mq-perf&id=036ba7a8d1334889c0fe55d4858d78f4e8022f12
> 
> which then changes the later wake-up helper patch to be:
>  
> http://git.kernel.dk/cgit/linux-block/commit/?h=mq-perf&id=f8c3f188425967adb040cfefb799b0a5a1df769d
> 

The wake-up path already contains a full memory barrier, via the
raw_spin_lock() and the following smp_mb__after_spinlock():

wake_up_process
  -> try_to_wake_up

	raw_spin_lock_irqsave(&p->pi_lock, flags);
	smp_mb__after_spinlock();

So an smp_rmb() could be enough. :)
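
Spelled out, the pairing being argued for (the waker side is the
existing kernel path; the smp_rmb() is the suggestion from my first
mail):

	/* waker: blkdev_bio_end_io_simple() */
	WRITE_ONCE(bio.bi_private, NULL);
	wake_up_process(waiter);
		/*
		 * try_to_wake_up() takes p->pi_lock and then runs
		 * smp_mb__after_spinlock(), i.e. a full memory barrier
		 * before it reads the task state.
		 */

	/* waiter: __blkdev_direct_IO_simple(), with the suggested barrier */
	__set_current_state(TASK_UNINTERRUPTIBLE);
	smp_rmb();	/* suggested: make the WRITE_ONCE() above visible */
	if (!READ_ONCE(bio.bi_private))
		break;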

Thanks
Jianchao

Patch

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c039abfb2052..e7550b1f9670 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -237,12 +237,23 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 
 	qc = submit_bio(&bio);
 	for (;;) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+		/*
+		 * Using non-atomic task state for polling
+		 */
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+
 		if (!READ_ONCE(bio.bi_private))
 			break;
+
 		if (!(iocb->ki_flags & IOCB_HIPRI) ||
-		    !blk_poll(bdev_get_queue(bdev), qc))
+		    !blk_poll(bdev_get_queue(bdev), qc)) {
+			/*
+			 * If we're scheduling, ensure that task state
+			 * change is ordered and seen.
+			 */
+			smp_mb();
 			io_schedule();
+		}
 	}
 	__set_current_state(TASK_RUNNING);
 
@@ -403,13 +414,19 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 		return -EIOCBQUEUED;
 
 	for (;;) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+		/*
+		 * See __blkdev_direct_IO_simple() loop
+		 */
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+
 		if (!READ_ONCE(dio->waiter))
 			break;
 
 		if (!(iocb->ki_flags & IOCB_HIPRI) ||
-		    !blk_poll(bdev_get_queue(bdev), qc))
+		    !blk_poll(bdev_get_queue(bdev), qc)) {
+			smp_mb();
 			io_schedule();
+		}
 	}
 	__set_current_state(TASK_RUNNING);