From patchwork Tue Apr 26 11:15:35 2016
X-Patchwork-Submitter: Roman Pen
X-Patchwork-Id: 8937931
From: Roman Pen
To: unlisted-recipients:; (no To-header on input)
Cc: Roman Pen, Gioh Kim, Michael Wang, Tejun Heo, Jens Axboe,
    linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
    stable@vger.kernel.org
Subject: [PATCH 1/1] workqueue: fix ghost PENDING flag while doing MQ IO
Date: Tue, 26 Apr 2016 13:15:35 +0200
Message-Id: <1461669335-19522-1-git-send-email-roman.penyaev@profitbricks.com>
X-Mailer: git-send-email 2.7.4
X-Mailing-List: linux-block@vger.kernel.org
The bug in a workqueue leads to a stalled IO request in the MQ ctx->rq_list
with the following backtrace:

[ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[ 601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
[ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
[ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[ 601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[ 601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[ 601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[ 601.350965] Call Trace:
[ 601.351203]  [] ? bit_wait+0x60/0x60
[ 601.351444]  [] schedule+0x35/0x80
[ 601.351709]  [] schedule_timeout+0x192/0x230
[ 601.351958]  [] ? blk_flush_plug_list+0xc7/0x220
[ 601.352208]  [] ? ktime_get+0x37/0xa0
[ 601.352446]  [] ? bit_wait+0x60/0x60
[ 601.352688]  [] io_schedule_timeout+0xa4/0x110
[ 601.352951]  [] ? _raw_spin_unlock_irqrestore+0xe/0x10
[ 601.353196]  [] bit_wait_io+0x1b/0x70
[ 601.353440]  [] __wait_on_bit+0x5d/0x90
[ 601.353689]  [] wait_on_page_bit+0xc0/0xd0
[ 601.353958]  [] ? autoremove_wake_function+0x40/0x40
[ 601.354200]  [] __filemap_fdatawait_range+0xe4/0x140
[ 601.354441]  [] filemap_fdatawait_range+0x14/0x30
[ 601.354688]  [] filemap_write_and_wait_range+0x3f/0x70
[ 601.354932]  [] blkdev_fsync+0x1b/0x50
[ 601.355193]  [] vfs_fsync_range+0x49/0xa0
[ 601.355432]  [] blkdev_write_iter+0xca/0x100
[ 601.355679]  [] __vfs_write+0xaa/0xe0
[ 601.355925]  [] vfs_write+0xa9/0x1a0
[ 601.356164]  [] kernel_write+0x38/0x50

The underlying device is a null_blk, with default parameters:

  queue_mode    = MQ
  submit_queues = 1

Verification that nullb0 has something inflight:

root@pserver8:~# cat /sys/block/nullb0/inflight
       0        1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
        ffff8838038e2400
...

During debugging it became clear that the stalled request is always inserted
into the rq_list from the following path:

   save_stack_trace_tsk + 34
   blk_mq_insert_requests + 231
   blk_mq_flush_plug_list + 281
   blk_flush_plug_list + 199
   wait_on_page_bit + 192
   __filemap_fdatawait_range + 228
   filemap_fdatawait_range + 20
   filemap_write_and_wait_range + 63
   blkdev_fsync + 27
   vfs_fsync_range + 73
   blkdev_write_iter + 202
   __vfs_write + 170
   vfs_write + 169
   kernel_write + 56

So blk_flush_plug_list() was called with from_schedule == true.

If from_schedule is true, blk_mq_insert_requests() offloads the execution of
__blk_mq_run_hw_queue() to the kblockd workqueue, i.e. it calls
kblockd_schedule_delayed_work_on().

That means we race with another CPU, which is about to execute the
__blk_mq_run_hw_queue() work.

Further debugging shows the following traces from different CPUs:

  CPU#0                                  CPU#1
  ----------------------------------     -------------------------------
  request A inserted
  STORE hctx->ctx_map[0] bit marked
  kblockd_schedule...() returns 1
                                         request B inserted
                                         STORE hctx->ctx_map[1] bit marked
                                         kblockd_schedule...() returns 0
  *** WORK PENDING bit is cleared ***
  flush_busy_ctxs() is executed, but
  bit 1, set by CPU#1, is not observed

As a result, request B pends forever.
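To make the 1/0 return values in the trace above easier to read, here is a
simplified userspace model of the insert-and-schedule step. This is an
illustration only, not the kernel code: the names insert_and_schedule,
pending and ctx_map are invented for the example, and the model collapses
the real path (blk_mq_insert_requests() plus the kblockd scheduling call)
into one function. The point it shows: a return of 0 means the PENDING bit
was already set, so the work is not queued again and the already-queued
execution is the only one that will ever look at the ctx_map bit stored by
the second caller.

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdio.h>

  /* Illustration only: names and types are invented for this example. */
  static atomic_bool pending;    /* models WORK_STRUCT_PENDING_BIT of the work */
  static atomic_ulong ctx_map;   /* models hctx->ctx_map                       */

  /* Combines "request inserted" + "kblockd_schedule...()" from the trace. */
  static bool insert_and_schedule(unsigned int ctx)
  {
          /* STORE hctx->ctx_map[ctx]: mark this sw context as busy */
          atomic_fetch_or(&ctx_map, 1UL << ctx);
          /* test_and_set_bit(PENDING): 0 means already pending, not queued again */
          return !atomic_exchange(&pending, true);
  }

  int main(void)
  {
          printf("request A: queued = %d\n", insert_and_schedule(0));  /* -> 1 */
          printf("request B: queued = %d\n", insert_and_schedule(1));  /* -> 0 */
          return 0;
  }

Because request B's caller got 0 back, correctness now depends entirely on
the queued work observing ctx_map bit 1 when it runs.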
This behaviour can be explained by a speculative LOAD of hctx->ctx_map on
CPU#0, which is reordered with the clear of the PENDING bit and executed
_before_ the actual STORE of bit 1 on CPU#1.

The proper fix is an explicit full barrier, which guarantees that the clear
of the PENDING bit is executed before any speculative LOADs or STOREs inside
the actual work function.

Signed-off-by: Roman Pen
Cc: Gioh Kim
Cc: Michael Wang
Cc: Tejun Heo
Cc: Jens Axboe
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
---
 kernel/workqueue.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 450c21f..0ec0594 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -649,6 +649,35 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
 	 */
 	smp_wmb();
 	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
+	/*
+	 * The following mb guarantees that the previous clear of a PENDING bit
+	 * will not be reordered with any speculative LOADS or STORES from
+	 * work->current_func, which is executed afterwards.  This possible
+	 * reordering can lead to a missed execution on an attempt to queue
+	 * the same @work.  E.g. consider this case:
+	 *
+	 *   CPU#0                         CPU#1
+	 *   ----------------------------  --------------------------------
+	 *
+	 * 1  STORE event_indicated
+	 * 2  queue_work_on() {
+	 * 3    test_and_set_bit(PENDING)
+	 * 4  }                             set_..._and_clear_pending() {
+	 * 5                                  set_work_data() # clear bit
+	 * 6                                  smp_mb()
+	 * 7                                work->current_func() {
+	 * 8                                  LOAD event_indicated
+	 *                                  }
+	 *
+	 * Without an explicit full barrier the speculative LOAD on line 8 can
+	 * be executed before CPU#0 does the STORE on line 1.  If that happens,
+	 * CPU#0 observes the PENDING bit is still set and new execution of
+	 * a @work is not queued, in the hope that CPU#1 will eventually
+	 * finish the queued @work.  Meanwhile CPU#1 does not see
+	 * event_indicated is set, because the speculative LOAD was executed
+	 * before the actual STORE.
+	 */
+	smp_mb();
 }
 
 static void clear_work_data(struct work_struct *work)
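As a side note, the CPU#0/CPU#1 table in the comment above can also be
written as a userspace analogue, in case the placement of the barrier is
easier to see that way. This is only a sketch (pthreads plus C11 atomics,
with invented names such as event_indicated, pending, cpu0 and cpu1); the
seq_cst fence after the clear plays the role of the smp_mb() added by this
patch. Without that fence the program below would be allowed to finish with
both "queue refused" and "event not seen", which is exactly the lost wakeup
described here; a single run will rarely show it, a real litmus test would
loop.

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  /* Illustration only: names mirror the comment in the patch above. */
  static atomic_int event_indicated;   /* data the work function must observe */
  static atomic_int pending;           /* models WORK_STRUCT_PENDING_BIT      */
  static int queue_refused, event_missed;

  static void *cpu0(void *arg)         /* lines 1-3: the queue_work_on() side */
  {
          (void)arg;
          atomic_store(&event_indicated, 1);        /* 1: STORE event_indicated */
          if (atomic_exchange(&pending, 1))         /* 3: test_and_set_bit()    */
                  queue_refused = 1;  /* already PENDING: rely on running work */
          return NULL;
  }

  static void *cpu1(void *arg)         /* lines 5-8: the worker clearing PENDING */
  {
          (void)arg;
          atomic_store_explicit(&pending, 0, memory_order_relaxed);  /* 5: clear  */
          atomic_thread_fence(memory_order_seq_cst);                 /* 6: smp_mb */
          if (!atomic_load_explicit(&event_indicated, memory_order_relaxed))
                  event_missed = 1;                  /* 8: LOAD event_indicated  */
          return NULL;
  }

  int main(void)
  {
          pthread_t a, b;

          atomic_store(&pending, 1);   /* the work is already queued (line 4) */
          pthread_create(&a, NULL, cpu0, NULL);
          pthread_create(&b, NULL, cpu1, NULL);
          pthread_join(a, NULL);
          pthread_join(b, NULL);

          if (queue_refused && event_missed)
                  puts("lost wakeup: work not requeued and event not seen");
          return 0;
  }

With the fence in place the forbidden outcome cannot occur: the clear of
pending is ordered before the load of event_indicated, just as the added
smp_mb() orders the clear of the PENDING bit before any loads or stores in
work->current_func().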