[RFC] io_uring: add a memory barrier before atomic_read

Message ID	1563453840-19778-1-git-send-email-liuzhengyuan@kylinos.cn (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-block-owner@kernel.org> From: Zhengyuan Liu <liuzhengyuan@kylinos.cn> To: axboe@kernel.dk Cc: linux-block@vger.kernel.org, liuyun01@kylinos.cn Subject: [RFC PATCH] io_uring: add a memory barrier before atomic_read Date: Thu, 18 Jul 2019 20:44:00 +0800 Message-Id: <1563453840-19778-1-git-send-email-liuzhengyuan@kylinos.cn> Feedback-ID: bizesmtp:kylinos.cn:qybgforeign:qybgforeign2 Sender: linux-block-owner@vger.kernel.org Precedence: bulk
Series	[RFC] io_uring: add a memory barrier before atomic_read \| expand [RFC] io_uring: add a memory barrier before atomic_read

Zhengyuan Liu July 18, 2019, 12:44 p.m. UTC

There is a hang issue while using fio to do some basic test. The issue can
been easily reproduced using bellow scripts:

        while true
        do
                fio  --ioengine=io_uring  -rw=write -bs=4k -numjobs=1 \
                     -size=1G -iodepth=64 -name=uring   --filename=/dev/zero
        done

After serveral minutes, maybe more, fio would block at
io_uring_enter->io_cqring_wait in order to waiting for previously committed
sqes to be completed and cann't return to user anymore until we send a SIGTERM
to fio. After got SIGTERM, fio turns to hang at io_ring_ctx_wait_and_kill with
a backtrace like this:

        [54133.243816] Call Trace:
        [54133.243842]  __schedule+0x3a0/0x790
        [54133.243868]  schedule+0x38/0xa0
        [54133.243880]  schedule_timeout+0x218/0x3b0
        [54133.243891]  ? sched_clock+0x9/0x10
        [54133.243903]  ? wait_for_completion+0xa3/0x130
        [54133.243916]  ? _raw_spin_unlock_irq+0x2c/0x40
        [54133.243930]  ? trace_hardirqs_on+0x3f/0xe0
        [54133.243951]  wait_for_completion+0xab/0x130
        [54133.243962]  ? wake_up_q+0x70/0x70
        [54133.243984]  io_ring_ctx_wait_and_kill+0xa0/0x1d0
        [54133.243998]  io_uring_release+0x20/0x30
        [54133.244008]  __fput+0xcf/0x270
        [54133.244029]  ____fput+0xe/0x10
        [54133.244040]  task_work_run+0x7f/0xa0
        [54133.244056]  do_exit+0x305/0xc40
        [54133.244067]  ? get_signal+0x13b/0xbd0
        [54133.244088]  do_group_exit+0x50/0xd0
        [54133.244103]  get_signal+0x18d/0xbd0
        [54133.244112]  ? _raw_spin_unlock_irqrestore+0x36/0x60
        [54133.244142]  do_signal+0x34/0x720
        [54133.244171]  ? exit_to_usermode_loop+0x7e/0x130
        [54133.244190]  exit_to_usermode_loop+0xc0/0x130
        [54133.244209]  do_syscall_64+0x16b/0x1d0
        [54133.244221]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The reason is that we had added a req to ctx->pending_async at the very end, but
it got no chance to be processed anymore. How could this be happened?

        fio#cpu0                                        wq#cpu1

        io_add_to_prev_work                    io_sq_wq_submit_work

          atomic_read() <<< 1

                                                  atomic_dec_return() << 1->0
                                                  list_empty();    <<< true;

          list_add_tail()
          atomic_read() << 0 or 1?

As was said in atomic_ops.rst, atomic_read does not guarantee that the runtime
initialization by any other thread is visible yet, so we must take care of that
with a proper implicit or explicit memory barrier;

This issue was detected with the help of Jackie's <liuyun01@kylinos.cn>

Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
---
 fs/io_uring.c | 1 +
 1 file changed, 1 insertion(+)

Jens Axboe July 18, 2019, 3:41 p.m. UTC | #1

On 7/18/19 6:44 AM, Zhengyuan Liu wrote:
> There is a hang issue while using fio to do some basic test. The issue can
> been easily reproduced using bellow scripts:
> 
>          while true
>          do
>                  fio  --ioengine=io_uring  -rw=write -bs=4k -numjobs=1 \
>                       -size=1G -iodepth=64 -name=uring   --filename=/dev/zero
>          done
> 
> After serveral minutes, maybe more, fio would block at
> io_uring_enter->io_cqring_wait in order to waiting for previously committed
> sqes to be completed and cann't return to user anymore until we send a SIGTERM
> to fio. After got SIGTERM, fio turns to hang at io_ring_ctx_wait_and_kill with
> a backtrace like this:
> 
>          [54133.243816] Call Trace:
>          [54133.243842]  __schedule+0x3a0/0x790
>          [54133.243868]  schedule+0x38/0xa0
>          [54133.243880]  schedule_timeout+0x218/0x3b0
>          [54133.243891]  ? sched_clock+0x9/0x10
>          [54133.243903]  ? wait_for_completion+0xa3/0x130
>          [54133.243916]  ? _raw_spin_unlock_irq+0x2c/0x40
>          [54133.243930]  ? trace_hardirqs_on+0x3f/0xe0
>          [54133.243951]  wait_for_completion+0xab/0x130
>          [54133.243962]  ? wake_up_q+0x70/0x70
>          [54133.243984]  io_ring_ctx_wait_and_kill+0xa0/0x1d0
>          [54133.243998]  io_uring_release+0x20/0x30
>          [54133.244008]  __fput+0xcf/0x270
>          [54133.244029]  ____fput+0xe/0x10
>          [54133.244040]  task_work_run+0x7f/0xa0
>          [54133.244056]  do_exit+0x305/0xc40
>          [54133.244067]  ? get_signal+0x13b/0xbd0
>          [54133.244088]  do_group_exit+0x50/0xd0
>          [54133.244103]  get_signal+0x18d/0xbd0
>          [54133.244112]  ? _raw_spin_unlock_irqrestore+0x36/0x60
>          [54133.244142]  do_signal+0x34/0x720
>          [54133.244171]  ? exit_to_usermode_loop+0x7e/0x130
>          [54133.244190]  exit_to_usermode_loop+0xc0/0x130
>          [54133.244209]  do_syscall_64+0x16b/0x1d0
>          [54133.244221]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> The reason is that we had added a req to ctx->pending_async at the very end, but
> it got no chance to be processed anymore. How could this be happened?
> 
>          fio#cpu0                                        wq#cpu1
> 
>          io_add_to_prev_work                    io_sq_wq_submit_work
> 
>            atomic_read() <<< 1
> 
>                                                    atomic_dec_return() << 1->0
>                                                    list_empty();    <<< true;
> 
>            list_add_tail()
>            atomic_read() << 0 or 1?
> 
> As was said in atomic_ops.rst, atomic_read does not guarantee that the runtime
> initialization by any other thread is visible yet, so we must take care of that
> with a proper implicit or explicit memory barrier;

Thanks for looking at this and finding this issue, it does looks like a problem.
But I'm not sure about the fix. Shouldn't we just need an smp_mb__after_atomic()
on the atomic_dec_return() side of things? Like the below.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5ec06e5ba0be..3c2a6f88a6b0 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1881,6 +1881,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 	 */
 	if (async_list) {
 		ret = atomic_dec_return(&async_list->cnt);
+		smp_mb__after_atomic();
 		while (!ret && !list_empty(&async_list->list)) {
 			spin_lock(&async_list->lock);
 			atomic_inc(&async_list->cnt);
@@ -1894,6 +1895,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 				goto restart;
 			}
 			ret = atomic_dec_return(&async_list->cnt);
+			smp_mb__after_atomic();
 		}
 	}

Jens Axboe July 18, 2019, 4:43 p.m. UTC | #2

On 7/18/19 9:41 AM, Jens Axboe wrote:
> On 7/18/19 6:44 AM, Zhengyuan Liu wrote:
>> There is a hang issue while using fio to do some basic test. The issue can
>> been easily reproduced using bellow scripts:
>>
>>           while true
>>           do
>>                   fio  --ioengine=io_uring  -rw=write -bs=4k -numjobs=1 \
>>                        -size=1G -iodepth=64 -name=uring   --filename=/dev/zero
>>           done
>>
>> After serveral minutes, maybe more, fio would block at
>> io_uring_enter->io_cqring_wait in order to waiting for previously committed
>> sqes to be completed and cann't return to user anymore until we send a SIGTERM
>> to fio. After got SIGTERM, fio turns to hang at io_ring_ctx_wait_and_kill with
>> a backtrace like this:
>>
>>           [54133.243816] Call Trace:
>>           [54133.243842]  __schedule+0x3a0/0x790
>>           [54133.243868]  schedule+0x38/0xa0
>>           [54133.243880]  schedule_timeout+0x218/0x3b0
>>           [54133.243891]  ? sched_clock+0x9/0x10
>>           [54133.243903]  ? wait_for_completion+0xa3/0x130
>>           [54133.243916]  ? _raw_spin_unlock_irq+0x2c/0x40
>>           [54133.243930]  ? trace_hardirqs_on+0x3f/0xe0
>>           [54133.243951]  wait_for_completion+0xab/0x130
>>           [54133.243962]  ? wake_up_q+0x70/0x70
>>           [54133.243984]  io_ring_ctx_wait_and_kill+0xa0/0x1d0
>>           [54133.243998]  io_uring_release+0x20/0x30
>>           [54133.244008]  __fput+0xcf/0x270
>>           [54133.244029]  ____fput+0xe/0x10
>>           [54133.244040]  task_work_run+0x7f/0xa0
>>           [54133.244056]  do_exit+0x305/0xc40
>>           [54133.244067]  ? get_signal+0x13b/0xbd0
>>           [54133.244088]  do_group_exit+0x50/0xd0
>>           [54133.244103]  get_signal+0x18d/0xbd0
>>           [54133.244112]  ? _raw_spin_unlock_irqrestore+0x36/0x60
>>           [54133.244142]  do_signal+0x34/0x720
>>           [54133.244171]  ? exit_to_usermode_loop+0x7e/0x130
>>           [54133.244190]  exit_to_usermode_loop+0xc0/0x130
>>           [54133.244209]  do_syscall_64+0x16b/0x1d0
>>           [54133.244221]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> The reason is that we had added a req to ctx->pending_async at the very end, but
>> it got no chance to be processed anymore. How could this be happened?
>>
>>           fio#cpu0                                        wq#cpu1
>>
>>           io_add_to_prev_work                    io_sq_wq_submit_work
>>
>>             atomic_read() <<< 1
>>
>>                                                     atomic_dec_return() << 1->0
>>                                                     list_empty();    <<< true;
>>
>>             list_add_tail()
>>             atomic_read() << 0 or 1?
>>
>> As was said in atomic_ops.rst, atomic_read does not guarantee that the runtime
>> initialization by any other thread is visible yet, so we must take care of that
>> with a proper implicit or explicit memory barrier;
> 
> Thanks for looking at this and finding this issue, it does looks like a problem.
> But I'm not sure about the fix. Shouldn't we just need an smp_mb__after_atomic()
> on the atomic_dec_return() side of things? Like the below.
> 
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 5ec06e5ba0be..3c2a6f88a6b0 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -1881,6 +1881,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>   	 */
>   	if (async_list) {
>   		ret = atomic_dec_return(&async_list->cnt);
> +		smp_mb__after_atomic();
>   		while (!ret && !list_empty(&async_list->list)) {
>   			spin_lock(&async_list->lock);
>   			atomic_inc(&async_list->cnt);
> @@ -1894,6 +1895,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>   				goto restart;
>   			}
>   			ret = atomic_dec_return(&async_list->cnt);
> +			smp_mb__after_atomic();
>   		}
>   	}
>   
> 

I don't think this is enough, I actually think your fix is the most
appropriate. I will apply it, thank you!

Jackie Liu July 19, 2019, 12:44 a.m. UTC | #3

> 在 2019年7月19日，00:43，Jens Axboe <axboe@kernel.dk> 写道：
> 
> On 7/18/19 9:41 AM, Jens Axboe wrote:
>> On 7/18/19 6:44 AM, Zhengyuan Liu wrote:
>>> There is a hang issue while using fio to do some basic test. The issue can
>>> been easily reproduced using bellow scripts:
>>> 
>>>          while true
>>>          do
>>>                  fio  --ioengine=io_uring  -rw=write -bs=4k -numjobs=1 \
>>>                       -size=1G -iodepth=64 -name=uring   --filename=/dev/zero
>>>          done
>>> 
>>> After serveral minutes, maybe more, fio would block at
>>> io_uring_enter->io_cqring_wait in order to waiting for previously committed
>>> sqes to be completed and cann't return to user anymore until we send a SIGTERM
>>> to fio. After got SIGTERM, fio turns to hang at io_ring_ctx_wait_and_kill with
>>> a backtrace like this:
>>> 
>>>          [54133.243816] Call Trace:
>>>          [54133.243842]  __schedule+0x3a0/0x790
>>>          [54133.243868]  schedule+0x38/0xa0
>>>          [54133.243880]  schedule_timeout+0x218/0x3b0
>>>          [54133.243891]  ? sched_clock+0x9/0x10
>>>          [54133.243903]  ? wait_for_completion+0xa3/0x130
>>>          [54133.243916]  ? _raw_spin_unlock_irq+0x2c/0x40
>>>          [54133.243930]  ? trace_hardirqs_on+0x3f/0xe0
>>>          [54133.243951]  wait_for_completion+0xab/0x130
>>>          [54133.243962]  ? wake_up_q+0x70/0x70
>>>          [54133.243984]  io_ring_ctx_wait_and_kill+0xa0/0x1d0
>>>          [54133.243998]  io_uring_release+0x20/0x30
>>>          [54133.244008]  __fput+0xcf/0x270
>>>          [54133.244029]  ____fput+0xe/0x10
>>>          [54133.244040]  task_work_run+0x7f/0xa0
>>>          [54133.244056]  do_exit+0x305/0xc40
>>>          [54133.244067]  ? get_signal+0x13b/0xbd0
>>>          [54133.244088]  do_group_exit+0x50/0xd0
>>>          [54133.244103]  get_signal+0x18d/0xbd0
>>>          [54133.244112]  ? _raw_spin_unlock_irqrestore+0x36/0x60
>>>          [54133.244142]  do_signal+0x34/0x720
>>>          [54133.244171]  ? exit_to_usermode_loop+0x7e/0x130
>>>          [54133.244190]  exit_to_usermode_loop+0xc0/0x130
>>>          [54133.244209]  do_syscall_64+0x16b/0x1d0
>>>          [54133.244221]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> 
>>> The reason is that we had added a req to ctx->pending_async at the very end, but
>>> it got no chance to be processed anymore. How could this be happened?
>>> 
>>>          fio#cpu0                                        wq#cpu1
>>> 
>>>          io_add_to_prev_work                    io_sq_wq_submit_work
>>> 
>>>            atomic_read() <<< 1
>>> 
>>>                                                    atomic_dec_return() << 1->0
>>>                                                    list_empty();    <<< true;
>>> 
>>>            list_add_tail()
>>>            atomic_read() << 0 or 1?
>>> 
>>> As was said in atomic_ops.rst, atomic_read does not guarantee that the runtime
>>> initialization by any other thread is visible yet, so we must take care of that
>>> with a proper implicit or explicit memory barrier;
>> 
>> Thanks for looking at this and finding this issue, it does looks like a problem.
>> But I'm not sure about the fix. Shouldn't we just need an smp_mb__after_atomic()
>> on the atomic_dec_return() side of things? Like the below.
>> 
>> 
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 5ec06e5ba0be..3c2a6f88a6b0 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -1881,6 +1881,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>>  	 */
>>  	if (async_list) {
>>  		ret = atomic_dec_return(&async_list->cnt);
>> +		smp_mb__after_atomic();
>>  		while (!ret && !list_empty(&async_list->list)) {
>>  			spin_lock(&async_list->lock);
>>  			atomic_inc(&async_list->cnt);
>> @@ -1894,6 +1895,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>>  				goto restart;
>>  			}
>>  			ret = atomic_dec_return(&async_list->cnt);
>> +			smp_mb__after_atomic();
>>  		}
>>  	}
>> 
>> 
> 
> I don't think this is enough, I actually think your fix is the most
> appropriate. I will apply it, thank you!
> 

Actually, although we can passed test use smp_mb(), but in the end we still do not
understand where the race conditions are, could you explain it. If it is said as 
Zhengyuan, because of atomic_read, I think we should only need smp_rmb. but failed.
smp_rmb can't help us pass the test. At the same time, we have tried smp_wmb, failed too.
it seems that only smp_mb works correctly.

Is it because list_add_tail requires smp_wmb and atomic_read requires smp_rmb? 

--
Jackie Liu

Jens Axboe July 19, 2019, 2:51 p.m. UTC | #4

On 7/18/19 6:54 PM, Zhengyuan Liu wrote:
> 
> On 7/19/19 12:43 AM, Jens Axboe wrote:
>> On 7/18/19 9:41 AM, Jens Axboe wrote:
>>> On 7/18/19 6:44 AM, Zhengyuan Liu wrote:
>>>> There is a hang issue while using fio to do some basic test. The issue can
>>>> been easily reproduced using bellow scripts:
>>>>
>>>>             while true
>>>>             do
>>>>                     fio  --ioengine=io_uring  -rw=write -bs=4k -numjobs=1 \
>>>>                          -size=1G -iodepth=64 -name=uring   --filename=/dev/zero
>>>>             done
>>>>
>>>> After serveral minutes, maybe more, fio would block at
>>>> io_uring_enter->io_cqring_wait in order to waiting for previously committed
>>>> sqes to be completed and cann't return to user anymore until we send a SIGTERM
>>>> to fio. After got SIGTERM, fio turns to hang at io_ring_ctx_wait_and_kill with
>>>> a backtrace like this:
>>>>
>>>>             [54133.243816] Call Trace:
>>>>             [54133.243842]  __schedule+0x3a0/0x790
>>>>             [54133.243868]  schedule+0x38/0xa0
>>>>             [54133.243880]  schedule_timeout+0x218/0x3b0
>>>>             [54133.243891]  ? sched_clock+0x9/0x10
>>>>             [54133.243903]  ? wait_for_completion+0xa3/0x130
>>>>             [54133.243916]  ? _raw_spin_unlock_irq+0x2c/0x40
>>>>             [54133.243930]  ? trace_hardirqs_on+0x3f/0xe0
>>>>             [54133.243951]  wait_for_completion+0xab/0x130
>>>>             [54133.243962]  ? wake_up_q+0x70/0x70
>>>>             [54133.243984]  io_ring_ctx_wait_and_kill+0xa0/0x1d0
>>>>             [54133.243998]  io_uring_release+0x20/0x30
>>>>             [54133.244008]  __fput+0xcf/0x270
>>>>             [54133.244029]  ____fput+0xe/0x10
>>>>             [54133.244040]  task_work_run+0x7f/0xa0
>>>>             [54133.244056]  do_exit+0x305/0xc40
>>>>             [54133.244067]  ? get_signal+0x13b/0xbd0
>>>>             [54133.244088]  do_group_exit+0x50/0xd0
>>>>             [54133.244103]  get_signal+0x18d/0xbd0
>>>>             [54133.244112]  ? _raw_spin_unlock_irqrestore+0x36/0x60
>>>>             [54133.244142]  do_signal+0x34/0x720
>>>>             [54133.244171]  ? exit_to_usermode_loop+0x7e/0x130
>>>>             [54133.244190]  exit_to_usermode_loop+0xc0/0x130
>>>>             [54133.244209]  do_syscall_64+0x16b/0x1d0
>>>>             [54133.244221]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>
>>>> The reason is that we had added a req to ctx->pending_async at the very end, but
>>>> it got no chance to be processed anymore. How could this be happened?
>>>>
>>>>             fio#cpu0                                        wq#cpu1
>>>>
>>>>             io_add_to_prev_work                    io_sq_wq_submit_work
>>>>
>>>>               atomic_read() <<< 1
>>>>
>>>>                                                       atomic_dec_return() << 1->0
>>>>                                                       list_empty();    <<< true;
>>>>
>>>>               list_add_tail()
>>>>               atomic_read() << 0 or 1?
>>>>
>>>> As was said in atomic_ops.rst, atomic_read does not guarantee that the runtime
>>>> initialization by any other thread is visible yet, so we must take care of that
>>>> with a proper implicit or explicit memory barrier;
>>> Thanks for looking at this and finding this issue, it does looks like a problem.
>>> But I'm not sure about the fix. Shouldn't we just need an smp_mb__after_atomic()
>>> on the atomic_dec_return() side of things? Like the below.
>>>
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 5ec06e5ba0be..3c2a6f88a6b0 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -1881,6 +1881,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>>>     	 */
>>>     	if (async_list) {
>>>     		ret = atomic_dec_return(&async_list->cnt);
>>> +		smp_mb__after_atomic();
>>>     		while (!ret && !list_empty(&async_list->list)) {
>>>     			spin_lock(&async_list->lock);
>>>     			atomic_inc(&async_list->cnt);
>>> @@ -1894,6 +1895,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>>>     				goto restart;
>>>     			}
>>>     			ret = atomic_dec_return(&async_list->cnt);
>>> +			smp_mb__after_atomic();
>>>     		}
>>>     	}
>>>     
>>>
>> I don't think this is enough, I actually think your fix is the most
>> appropriate. I will apply it, thank you!
>>
> 
> Hi, Jens.
> I have tested you fix and the issue still existed. Actually the
> implementation of atomic_dec_return has been implicitly surrounded
> already by mb()  and as I know, smp_mb__after/before_atomic are not
> suitable for atomic_t operation which does not return a value.

We aren't guaranteed to see the atomic_dec_return() update if it happens
at the same time. So we can either force ordering with the smp_mb(), or
we can do something ala:

	if (!atomic_sub_return(0, &list->cnt)) {
		...

io_add_to_prev_work() to achieve the same sort of effect. That should
work as well.

Zhengyuan Liu July 20, 2019, 3:14 p.m. UTC | #5

On Sat, Jul 20, 2019 at 2:27 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 7/18/19 6:54 PM, Zhengyuan Liu wrote:
> >
> > On 7/19/19 12:43 AM, Jens Axboe wrote:
> >> On 7/18/19 9:41 AM, Jens Axboe wrote:
> >>> On 7/18/19 6:44 AM, Zhengyuan Liu wrote:
> >>>> There is a hang issue while using fio to do some basic test. The issue can
> >>>> been easily reproduced using bellow scripts:
> >>>>
> >>>>             while true
> >>>>             do
> >>>>                     fio  --ioengine=io_uring  -rw=write -bs=4k -numjobs=1 \
> >>>>                          -size=1G -iodepth=64 -name=uring   --filename=/dev/zero
> >>>>             done
> >>>>
> >>>> After serveral minutes, maybe more, fio would block at
> >>>> io_uring_enter->io_cqring_wait in order to waiting for previously committed
> >>>> sqes to be completed and cann't return to user anymore until we send a SIGTERM
> >>>> to fio. After got SIGTERM, fio turns to hang at io_ring_ctx_wait_and_kill with
> >>>> a backtrace like this:
> >>>>
> >>>>             [54133.243816] Call Trace:
> >>>>             [54133.243842]  __schedule+0x3a0/0x790
> >>>>             [54133.243868]  schedule+0x38/0xa0
> >>>>             [54133.243880]  schedule_timeout+0x218/0x3b0
> >>>>             [54133.243891]  ? sched_clock+0x9/0x10
> >>>>             [54133.243903]  ? wait_for_completion+0xa3/0x130
> >>>>             [54133.243916]  ? _raw_spin_unlock_irq+0x2c/0x40
> >>>>             [54133.243930]  ? trace_hardirqs_on+0x3f/0xe0
> >>>>             [54133.243951]  wait_for_completion+0xab/0x130
> >>>>             [54133.243962]  ? wake_up_q+0x70/0x70
> >>>>             [54133.243984]  io_ring_ctx_wait_and_kill+0xa0/0x1d0
> >>>>             [54133.243998]  io_uring_release+0x20/0x30
> >>>>             [54133.244008]  __fput+0xcf/0x270
> >>>>             [54133.244029]  ____fput+0xe/0x10
> >>>>             [54133.244040]  task_work_run+0x7f/0xa0
> >>>>             [54133.244056]  do_exit+0x305/0xc40
> >>>>             [54133.244067]  ? get_signal+0x13b/0xbd0
> >>>>             [54133.244088]  do_group_exit+0x50/0xd0
> >>>>             [54133.244103]  get_signal+0x18d/0xbd0
> >>>>             [54133.244112]  ? _raw_spin_unlock_irqrestore+0x36/0x60
> >>>>             [54133.244142]  do_signal+0x34/0x720
> >>>>             [54133.244171]  ? exit_to_usermode_loop+0x7e/0x130
> >>>>             [54133.244190]  exit_to_usermode_loop+0xc0/0x130
> >>>>             [54133.244209]  do_syscall_64+0x16b/0x1d0
> >>>>             [54133.244221]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>>>
> >>>> The reason is that we had added a req to ctx->pending_async at the very end, but
> >>>> it got no chance to be processed anymore. How could this be happened?
> >>>>
> >>>>             fio#cpu0                                        wq#cpu1
> >>>>
> >>>>             io_add_to_prev_work                    io_sq_wq_submit_work
> >>>>
> >>>>               atomic_read() <<< 1
> >>>>
> >>>>                                                       atomic_dec_return() << 1->0
> >>>>                                                       list_empty();    <<< true;
> >>>>
> >>>>               list_add_tail()
> >>>>               atomic_read() << 0 or 1?
> >>>>
> >>>> As was said in atomic_ops.rst, atomic_read does not guarantee that the runtime
> >>>> initialization by any other thread is visible yet, so we must take care of that
> >>>> with a proper implicit or explicit memory barrier;
> >>> Thanks for looking at this and finding this issue, it does looks like a problem.
> >>> But I'm not sure about the fix. Shouldn't we just need an smp_mb__after_atomic()
> >>> on the atomic_dec_return() side of things? Like the below.
> >>>
> >>>
> >>> diff --git a/fs/io_uring.c b/fs/io_uring.c
> >>> index 5ec06e5ba0be..3c2a6f88a6b0 100644
> >>> --- a/fs/io_uring.c
> >>> +++ b/fs/io_uring.c
> >>> @@ -1881,6 +1881,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
> >>>              */
> >>>             if (async_list) {
> >>>                     ret = atomic_dec_return(&async_list->cnt);
> >>> +           smp_mb__after_atomic();
> >>>                     while (!ret && !list_empty(&async_list->list)) {
> >>>                             spin_lock(&async_list->lock);
> >>>                             atomic_inc(&async_list->cnt);
> >>> @@ -1894,6 +1895,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
> >>>                                     goto restart;
> >>>                             }
> >>>                             ret = atomic_dec_return(&async_list->cnt);
> >>> +                   smp_mb__after_atomic();
> >>>                     }
> >>>             }
> >>>
> >>>
> >> I don't think this is enough, I actually think your fix is the most
> >> appropriate. I will apply it, thank you!
> >>
> >
> > Hi, Jens.
> > I have tested you fix and the issue still existed. Actually the
> > implementation of atomic_dec_return has been implicitly surrounded
> > already by mb()  and as I know, smp_mb__after/before_atomic are not
> > suitable for atomic_t operation which does not return a value.
>
> We aren't guaranteed to see the atomic_dec_return() update if it happens
> at the same time. So we can either force ordering with the smp_mb(), or
> we can do something ala:
>
>         if (!atomic_sub_return(0, &list->cnt)) {
>                 ...
>
> io_add_to_prev_work() to achieve the same sort of effect. That should
> work as well.

Yeah,  but I'd prefer smp_mb(), since atomic_sub_return(0, &list->cnt) isn't
such clear.

Thanks.
>
> --
> Jens Axboe
>

Jens Axboe July 20, 2019, 3:19 p.m. UTC | #6

On 7/20/19 9:14 AM, Zhengyuan Liu wrote:
> On Sat, Jul 20, 2019 at 2:27 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 7/18/19 6:54 PM, Zhengyuan Liu wrote:
>>>
>>> On 7/19/19 12:43 AM, Jens Axboe wrote:
>>>> On 7/18/19 9:41 AM, Jens Axboe wrote:
>>>>> On 7/18/19 6:44 AM, Zhengyuan Liu wrote:
>>>>>> There is a hang issue while using fio to do some basic test. The issue can
>>>>>> been easily reproduced using bellow scripts:
>>>>>>
>>>>>>              while true
>>>>>>              do
>>>>>>                      fio  --ioengine=io_uring  -rw=write -bs=4k -numjobs=1 \
>>>>>>                           -size=1G -iodepth=64 -name=uring   --filename=/dev/zero
>>>>>>              done
>>>>>>
>>>>>> After serveral minutes, maybe more, fio would block at
>>>>>> io_uring_enter->io_cqring_wait in order to waiting for previously committed
>>>>>> sqes to be completed and cann't return to user anymore until we send a SIGTERM
>>>>>> to fio. After got SIGTERM, fio turns to hang at io_ring_ctx_wait_and_kill with
>>>>>> a backtrace like this:
>>>>>>
>>>>>>              [54133.243816] Call Trace:
>>>>>>              [54133.243842]  __schedule+0x3a0/0x790
>>>>>>              [54133.243868]  schedule+0x38/0xa0
>>>>>>              [54133.243880]  schedule_timeout+0x218/0x3b0
>>>>>>              [54133.243891]  ? sched_clock+0x9/0x10
>>>>>>              [54133.243903]  ? wait_for_completion+0xa3/0x130
>>>>>>              [54133.243916]  ? _raw_spin_unlock_irq+0x2c/0x40
>>>>>>              [54133.243930]  ? trace_hardirqs_on+0x3f/0xe0
>>>>>>              [54133.243951]  wait_for_completion+0xab/0x130
>>>>>>              [54133.243962]  ? wake_up_q+0x70/0x70
>>>>>>              [54133.243984]  io_ring_ctx_wait_and_kill+0xa0/0x1d0
>>>>>>              [54133.243998]  io_uring_release+0x20/0x30
>>>>>>              [54133.244008]  __fput+0xcf/0x270
>>>>>>              [54133.244029]  ____fput+0xe/0x10
>>>>>>              [54133.244040]  task_work_run+0x7f/0xa0
>>>>>>              [54133.244056]  do_exit+0x305/0xc40
>>>>>>              [54133.244067]  ? get_signal+0x13b/0xbd0
>>>>>>              [54133.244088]  do_group_exit+0x50/0xd0
>>>>>>              [54133.244103]  get_signal+0x18d/0xbd0
>>>>>>              [54133.244112]  ? _raw_spin_unlock_irqrestore+0x36/0x60
>>>>>>              [54133.244142]  do_signal+0x34/0x720
>>>>>>              [54133.244171]  ? exit_to_usermode_loop+0x7e/0x130
>>>>>>              [54133.244190]  exit_to_usermode_loop+0xc0/0x130
>>>>>>              [54133.244209]  do_syscall_64+0x16b/0x1d0
>>>>>>              [54133.244221]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>>>
>>>>>> The reason is that we had added a req to ctx->pending_async at the very end, but
>>>>>> it got no chance to be processed anymore. How could this be happened?
>>>>>>
>>>>>>              fio#cpu0                                        wq#cpu1
>>>>>>
>>>>>>              io_add_to_prev_work                    io_sq_wq_submit_work
>>>>>>
>>>>>>                atomic_read() <<< 1
>>>>>>
>>>>>>                                                        atomic_dec_return() << 1->0
>>>>>>                                                        list_empty();    <<< true;
>>>>>>
>>>>>>                list_add_tail()
>>>>>>                atomic_read() << 0 or 1?
>>>>>>
>>>>>> As was said in atomic_ops.rst, atomic_read does not guarantee that the runtime
>>>>>> initialization by any other thread is visible yet, so we must take care of that
>>>>>> with a proper implicit or explicit memory barrier;
>>>>> Thanks for looking at this and finding this issue, it does looks like a problem.
>>>>> But I'm not sure about the fix. Shouldn't we just need an smp_mb__after_atomic()
>>>>> on the atomic_dec_return() side of things? Like the below.
>>>>>
>>>>>
>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>> index 5ec06e5ba0be..3c2a6f88a6b0 100644
>>>>> --- a/fs/io_uring.c
>>>>> +++ b/fs/io_uring.c
>>>>> @@ -1881,6 +1881,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>>>>>               */
>>>>>              if (async_list) {
>>>>>                      ret = atomic_dec_return(&async_list->cnt);
>>>>> +           smp_mb__after_atomic();
>>>>>                      while (!ret && !list_empty(&async_list->list)) {
>>>>>                              spin_lock(&async_list->lock);
>>>>>                              atomic_inc(&async_list->cnt);
>>>>> @@ -1894,6 +1895,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>>>>>                                      goto restart;
>>>>>                              }
>>>>>                              ret = atomic_dec_return(&async_list->cnt);
>>>>> +                   smp_mb__after_atomic();
>>>>>                      }
>>>>>              }
>>>>>
>>>>>
>>>> I don't think this is enough, I actually think your fix is the most
>>>> appropriate. I will apply it, thank you!
>>>>
>>>
>>> Hi, Jens.
>>> I have tested you fix and the issue still existed. Actually the
>>> implementation of atomic_dec_return has been implicitly surrounded
>>> already by mb()  and as I know, smp_mb__after/before_atomic are not
>>> suitable for atomic_t operation which does not return a value.
>>
>> We aren't guaranteed to see the atomic_dec_return() update if it happens
>> at the same time. So we can either force ordering with the smp_mb(), or
>> we can do something ala:
>>
>>          if (!atomic_sub_return(0, &list->cnt)) {
>>                  ...
>>
>> io_add_to_prev_work() to achieve the same sort of effect. That should
>> work as well.
> 
> Yeah,  but I'd prefer smp_mb(), since atomic_sub_return(0, &list->cnt) isn't
> such clear.

In some ways I actually think it's more clear, as it makes it explicit
what we're synchronizing with, and it's then up to the atomic primitives
to use the right barrier. But I'm fine with the smp_mb() and that's what
I already queued up, so let's just stick with that.

[RFC] io_uring: add a memory barrier before atomic_read

Commit Message

Comments

Patch