From patchwork Mon Apr 25 15:22:51 2016
X-Patchwork-Submitter: Roman Pen
X-Patchwork-Id: 8929681
From: Roman Pen
To: unlisted-recipients:; (no To-header on input)
Cc: Roman Pen, Tejun Heo, Jens Axboe, linux-block@vger.kernel.org,
    linux-kernel@vger.kernel.org
Subject: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
Date: Mon, 25 Apr 2016 17:22:51 +0200
Message-Id: <1461597771-25352-1-git-send-email-roman.penyaev@profitbricks.com>
X-Mailer: git-send-email 2.7.4
X-Mailing-List: linux-block@vger.kernel.org

Hi,

This is an RFC, because for the last couple of days I have been hunting a
mystery bug, and even now I cannot rule out that the following story is
nothing but the bug's trick, an attempt to cheat me.

[Sorry, the whole story and explanation are quite long]

The bug reproduces quite often on our server (AMD Opteron(tm) 6386 SE,
64 CPUs, 8 NUMA nodes) running a 4.4.5 kernel.  What I observe is a stalled
IO request in the MQ ctx->rq_list with the following backtrace:

[ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[ 601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
[ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
[ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[ 601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[ 601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[ 601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[ 601.350965] Call Trace:
[ 601.351203]  [] ? bit_wait+0x60/0x60
[ 601.351444]  [] schedule+0x35/0x80
[ 601.351709]  [] schedule_timeout+0x192/0x230
[ 601.351958]  [] ? blk_flush_plug_list+0xc7/0x220
[ 601.352208]  [] ? ktime_get+0x37/0xa0
[ 601.352446]  [] ? bit_wait+0x60/0x60
[ 601.352688]  [] io_schedule_timeout+0xa4/0x110
[ 601.352951]  [] ? _raw_spin_unlock_irqrestore+0xe/0x10
[ 601.353196]  [] bit_wait_io+0x1b/0x70
[ 601.353440]  [] __wait_on_bit+0x5d/0x90
[ 601.353689]  [] wait_on_page_bit+0xc0/0xd0
[ 601.353958]  [] ? autoremove_wake_function+0x40/0x40
[ 601.354200]  [] __filemap_fdatawait_range+0xe4/0x140
[ 601.354441]  [] filemap_fdatawait_range+0x14/0x30
[ 601.354688]  [] filemap_write_and_wait_range+0x3f/0x70
[ 601.354932]  [] blkdev_fsync+0x1b/0x50
[ 601.355193]  [] vfs_fsync_range+0x49/0xa0
[ 601.355432]  [] blkdev_write_iter+0xca/0x100
[ 601.355679]  [] __vfs_write+0xaa/0xe0
[ 601.355925]  [] vfs_write+0xa9/0x1a0
[ 601.356164]  [] kernel_write+0x38/0x50

The underlying device is a null_blk with default parameters:

  root@pserver8:~# cat /sys/module/null_blk/parameters/queue_mode
  2                  # MQ
  root@pserver8:~# cat /sys/module/null_blk/parameters/submit_queues
  1                  # number of hw queues
  root@pserver8:~# cat /sys/module/null_blk/parameters/irqmode
  1                  # softirq

Verification that nullb0 has something in flight:

  root@pserver8:~# cat /sys/block/nullb0/inflight
         0        1
  root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
  ...
  /sys/block/nullb0/mq/0/cpu2/rq_list
  CTX pending:
          ffff8838038e2400
  ...

GDB shows that the request is alive, nothing is corrupted and all bitmaps
are valid:

  root@pserver8:~# gdb /usr/src/linux-4.4.5-1-storage/vmlinux /proc/kcore
  (gdb) p (struct request *)0xffff8838038e2400
  $1 = (struct request *) 0xffff8838038e2400
  (gdb) p *$1
  ...
    atomic_flags = 1,
    cpu = -1,
    __data_len = 98304,
    __sector = 192210432,
  ...
    bio = 0xffff8807ef64d000,
    biotail = 0xffff8807ef64df00,
    start_time = 4294963642,
  ...
  (gdb) p $1->mq_ctx
  $3 = (struct blk_mq_ctx *) 0xffffe8c7e0c9d300
  (gdb) p $1->mq_ctx->cpu
  $4 = 2
  (gdb) p $1->q->queue_hw_ctx[$1->q->mq_map[2]]
  $5 = (struct blk_mq_hw_ctx *) 0xffff8807f1967c00
  (gdb) p $5->ctx_map
  $6 = {
    size = 8,
    bits_per_word = 8,
    map = 0xffff8807f1bb6200
  }
  (gdb) p *$5->ctx_map.map@8
  $18 = {{ word = 4, depth = 8 }, { word = 0, depth = 8 },
         { word = 0, depth = 8 }, { word = 0, depth = 8 },
         { word = 0, depth = 8 }, { word = 0, depth = 8 },
         { word = 0, depth = 8 }, { word = 0, depth = 8 }}
  (gdb) p &$5->ctxs[0 + 2]->rq_list
  $23 = (struct list_head *) 0xffffe8c7e0c9d308
  (gdb) p $5->ctxs[0 + 2]->rq_list
  $24 = {
    next = 0xffff8838038e2400,
    prev = 0xffff8838038e2400     <<<<< our pending request
  }

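The ctx_map dump above already pinpoints the stuck software queue: word = 4
in the first bitmap word is binary 100, i.e. bit 2 is set, which matches
$1->mq_ctx->cpu == 2.  Just to illustrate the decoding, here is a tiny
user-space sketch; decode_ctx_map() is my own throw-away helper, not a
kernel function, and it only assumes the 4.4-era blk_align_bitmap layout
visible in the GDB output:

#include <stdio.h>

struct blk_align_bitmap {               /* layout as dumped by GDB above */
        unsigned long word;
        unsigned long depth;
};

static void decode_ctx_map(const struct blk_align_bitmap *map,
                           unsigned int size, unsigned int bits_per_word)
{
        unsigned int i, bit;

        /* mirrors how flush_busy_ctxs() walks hctx->ctx_map */
        for (i = 0; i < size; i++)
                for (bit = 0; bit < map[i].depth; bit++)
                        if (map[i].word & (1UL << bit))
                                printf("ctx %u is marked pending\n",
                                       i * bits_per_word + bit);
}

int main(void)
{
        /* values taken from the GDB session: map[0].word == 4 (0b100) */
        struct blk_align_bitmap map[8] = { { .word = 4, .depth = 8 } };

        decode_ctx_map(map, 8, 8);      /* prints: ctx 2 is marked pending */
        return 0;
}

So the hardware context does know about the pending software queue of CPU 2;
the work that is supposed to drain it simply never runs again.
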
During debugging it became clear that the stalled request is always inserted
into the rq_list from the following path:

   save_stack_trace_tsk + 34
   blk_mq_insert_requests + 231
   blk_mq_flush_plug_list + 281
   blk_flush_plug_list + 199
   wait_on_page_bit + 192
   __filemap_fdatawait_range + 228
   filemap_fdatawait_range + 20
   filemap_write_and_wait_range + 63
   blkdev_fsync + 27
   vfs_fsync_range + 73
   blkdev_write_iter + 202
   __vfs_write + 170
   vfs_write + 169
   kernel_write + 56

So blk_flush_plug_list() was called with from_schedule == true.

If from_schedule is true, blk_mq_insert_requests() offloads the execution of
__blk_mq_run_hw_queue() to the kblockd workqueue, i.e. it calls
kblockd_schedule_delayed_work_on().  That means we can race with another CPU
which is about to execute the __blk_mq_run_hw_queue() work.

OK, then I decided to trace deeper, to understand whether the race exists
and, if yes, whether it happens because of a still pending hctx->run_work.
For further tracing I used 4 new events:

 1. insertion of a request into a rq_list,
 2. hctx marked as pending,
 3. whether kblockd_schedule_delayed_work_on() succeeded or not,
 4. actual execution of flush_busy_ctxs(),

and inserted these events as in the hunks below:

@@ -690,6 +690,8 @@ static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
        struct blk_mq_ctx *ctx;
        int i;
 
+       trace_hctx_flush_busy_ctxs(hctx);
+
        for (i = 0; i < hctx->ctx_map.size; i++) {
                struct blk_align_bitmap *bm = &hctx->ctx_map.map[i];

@@ -872,8 +876,9 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
                put_cpu();
        }
 
-       kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
-                       &hctx->run_work, 0);
+       queued = kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
+                       &hctx->run_work, 0);
+       trace_hctx_delayed_work_queued(hctx, queued);

@@ -1047,8 +1052,10 @@ static void blk_mq_insert_requests(struct request_queue *q,
                list_del_init(&rq->queuelist);
                rq->mq_ctx = ctx;
                __blk_mq_insert_req_list(hctx, ctx, rq, false);
+               trace_hctx_insert_req_list(hctx, rq);
        }
        blk_mq_hctx_mark_pending(hctx, ctx);
+       trace_hctx_mark_pending(hctx);
        spin_unlock(&ctx->lock);

I have to say that I could not use the kernel trace events framework (i.e.
include/trace/events), because then the bug immediately disappears.  So I
wrote my own static lockless ring buffer with a capacity of 64 events, and I
could not use any sort of atomics either - the bug disappears.  The ring
buffer is implemented per CPU, so afterwards I have to sort all the events
out using some proper, globally accurate clock.  But I could not use
sched_clock() - the bug disappears.  I chose raw rdtsc(), and again the
unordered variant, not rdtsc_ordered(); rdtsc_ordered() once more "fixes"
the bug.  So frankly I am not sure I can trust rdtsc(), which itself can be
reordered, but OK, there is no other option.

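For illustration, a minimal per-CPU tracer along those lines could look like
the sketch below.  This is my own reconstruction, not the actual code I used:
the names (trc, trc_ring, trc_event) are made up, the event types mirror the
four trace points above, and rdtsc() is deliberately the raw, unordered
variant:

#include <linux/percpu.h>
#include <linux/types.h>
#include <asm/msr.h>                    /* rdtsc() */

enum trc_type { TRC_INS_REQ, TRC_MARK_PEND, TRC_KBLK_SCHED, TRC_FLUSH_BUSY };

struct trc_event {
        u64             tsc;            /* unordered rdtsc(), sorted post-mortem */
        enum trc_type   type;
        void            *hctx;
        void            *rq;            /* rq pointer, or the "queued" result */
};

struct trc_ring {
        unsigned int    head;           /* plain counter, wraps at 64 */
        struct trc_event ev[64];
};

static DEFINE_PER_CPU(struct trc_ring, trc_rings);

static inline void trc(enum trc_type type, void *hctx, void *rq)
{
        /* intentionally no locks, no atomics, no barriers: any of them
         * perturbs the timing enough to hide the bug */
        struct trc_ring *r = raw_cpu_ptr(&trc_rings);
        struct trc_event *e = &r->ev[r->head++ % 64];

        e->tsc  = rdtsc();              /* not rdtsc_ordered() on purpose */
        e->type = type;
        e->hctx = hctx;
        e->rq   = rq;
}

The trace points in the hunks above then reduce to calls like
trc(TRC_INS_REQ, hctx, rq), and after the hang is detected the per-CPU rings
are merged and sorted by the tsc field.
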
So here is the sorted output of the tracer for another stalled request:

   rdtsc=849688091004, cpu= 6, INS_REQ    hctx=ffff8800d46c0000, rq=ffff884000343600
   rdtsc=849688091304, cpu= 6, MARK_PEND  hctx=ffff8800d46c0000
   rdtsc=849688099595, cpu= 6, KBLK_SCHED hctx=ffff8800d46c0000, queued=1
   rdtsc=849688106102, cpu= 6, FLUSH_BUSY hctx=ffff8800d46c0000
   rdtsc=849688109586, cpu= 2, INS_REQ    hctx=ffff8800d46c0000, rq=ffff884000343cc0
   rdtsc=849688109957, cpu= 2, MARK_PEND  hctx=ffff8800d46c0000
   rdtsc=849688110921, cpu= 2, KBLK_SCHED hctx=ffff8800d46c0000, queued=0

I always see the same picture for all the runs I did (of course the CPU
numbers and rq addresses vary):

   CPU#6                                  CPU#2
   request ffff884000343600 inserted
   hctx marked as pending
   kblockd_schedule...() returns 1
   *** WORK_STRUCT_PENDING_BIT is cleared ***
   flush_busy_ctxs() is executed
                                          request ffff884000343cc0 inserted
                                          hctx marked as pending
                                          kblockd_schedule...() returns 0
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                          WTF?

As a result the request ffff884000343cc0 stays pending forever.

According to the trace output, the other CPU _always_ observes
WORK_STRUCT_PENDING_BIT as set for that hctx->run_work, even though it was
already cleared on another CPU.

Checking the workqueue.c code I see that clearing the bit is nothing more
than an atomic_long_set(), which on x86 is a plain MOV instruction, in this
function:

   static inline void set_work_data()

In an attempt to "fix" the mystery I replaced the atomic_long_set() call
with atomic_long_xchg(), which on x86 is an XCHG instruction.  The problem
is gone.

To me it looks like test_and_set_bit() (a LOCK BTS instruction) does not
force a flush of CPU caches/store buffers that can still be dirty after the
plain MOV executed on another CPU, whereas XCHG really updates the memory,
so the following LOCK BTS observes that the bit was cleared.

As a conclusion I can say that I am lucky enough to be able to reproduce
this bug within several minutes on a specific load (I tried many other
simple loads using fio, even btrecord/btreplay, with no success), and that
easy reproduction on a specific load gives me some freedom to test and then
to be sure that the problem is gone.

Signed-off-by: Roman Pen
Cc: Tejun Heo
Cc: Jens Axboe
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 kernel/workqueue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 450c21f..60cd6db 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -621,7 +621,7 @@ static inline void set_work_data(struct work_struct *work, unsigned long data,
                                 unsigned long flags)
 {
        WARN_ON_ONCE(!work_pending(work));
-       atomic_long_set(&work->data, data | flags | work_static(work));
+       atomic_long_xchg(&work->data, data | flags | work_static(work));
 }
 
 static void set_work_pwq(struct work_struct *work, struct pool_workqueue *pwq,

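The suspected reordering can also be modelled in plain user space.  The
litmus-style sketch below is entirely my own illustration, not kernel code:
'pending' stands for WORK_STRUCT_PENDING_BIT and 'queued' for a request
sitting in ctx->rq_list.  The worker side clears PENDING with a plain
(relaxed) store, like atomic_long_set(), and then looks at the queue, like
flush_busy_ctxs(); the queueing side inserts a request and then does a
full-barrier exchange, like test_and_set_bit().  The outcome "the worker saw
an empty queue and the queuer saw PENDING still set" is exactly the
lost-request scenario from the trace above.  A single run will almost never
hit the window, so a real harness would loop this millions of times, but the
sketch shows the shape of the race:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_long pending = 1;         /* the work is already queued          */
static atomic_int  queued  = 0;         /* no new request in ctx->rq_list yet  */

static int  worker_saw_queued;          /* what flush_busy_ctxs() read         */
static long queuer_saw_pending;         /* what test_and_set_bit() returned    */

static void *worker(void *arg)          /* CPU#6: runs the queued work         */
{
        (void)arg;
        /* atomic_long_set(): a plain store, no barrier */
        atomic_store_explicit(&pending, 0, memory_order_relaxed);
        /* flush_busy_ctxs(): this load may be satisfied before the store
         * above becomes visible to the other CPU */
        worker_saw_queued = atomic_load_explicit(&queued, memory_order_relaxed);
        return NULL;
}

static void *queuer(void *arg)          /* CPU#2: inserts a new request        */
{
        (void)arg;
        /* insert the request and mark the hctx pending */
        atomic_store_explicit(&queued, 1, memory_order_relaxed);
        /* test_and_set_bit(): full-barrier RMW returning the old value */
        queuer_saw_pending = atomic_exchange(&pending, 1);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, queuer, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);

        if (worker_saw_queued == 0 && queuer_saw_pending == 1)
                puts("ghost PENDING: request missed, nobody re-runs the hw queue");
        return 0;
}

Making the worker-side store a full barrier (XCHG, as the one-line change
above does) forbids that outcome, which matches what I observe.
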