From patchwork Fri Sep 23 00:59:59 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Glauber Costa <glauber@scylladb.com>
X-Patchwork-Id: 9347265
Return-Path: <linux-block-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
	[172.30.200.125])
	by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id
	F3782601C2 for <patchwork-linux-block@patchwork.kernel.org>;
	Fri, 23 Sep 2016 01:00:45 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D333A2ACD5
	for <patchwork-linux-block@patchwork.kernel.org>;
	Fri, 23 Sep 2016 01:00:45 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id C691D2AD26; Fri, 23 Sep 2016 01:00:45 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-6.3 required=2.0 tests=BAYES_00,DKIM_SIGNED,
	RCVD_IN_DNSWL_HI, RCVD_IN_SORBS_SPAM,
	T_DKIM_INVALID autolearn=unavailable version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id DE1AF2ACD5
	for <patchwork-linux-block@patchwork.kernel.org>;
	Fri, 23 Sep 2016 01:00:43 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757638AbcIWBAY (ORCPT
	<rfc822;patchwork-linux-block@patchwork.kernel.org>);
	Thu, 22 Sep 2016 21:00:24 -0400
Received: from mail-it0-f54.google.com ([209.85.214.54]:35583 "EHLO
	mail-it0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1757621AbcIWBAX (ORCPT
	<rfc822;linux-block@vger.kernel.org>);
	Thu, 22 Sep 2016 21:00:23 -0400
Received: by mail-it0-f54.google.com with SMTP id r192so1647570ita.0
	for <linux-block@vger.kernel.org>;
	Thu, 22 Sep 2016 18:00:23 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=scylladb-com.20150623.gappssmtp.com; s=20150623;
	h=from:to:cc:subject:date:message-id;
	bh=PS5IslivI8HhlpcQ3scwSRFaiM/zXYFvQri6yBqef/Y=;
	b=lwOUdypSjkYewkA6pzs4Ctq2o2dh1LUpFaWHG3maD2rHdy98vFlhTDhIbMEHC2JZnM
	l0ytccNLPVCkT/6SDnCABhZ1iRHUXTA28/XhbsG4bTOuwNKQq6+5hpH6uRLEkYo/b+9+
	IDMmvSDjqWm4Q/cxqVhojo5ajgWeLk67cPEfzvig/OrigvZqlyz/qH0zasZW1e9y0+9J
	abEKQjxL1AAie+PQsTgW7u+vds2+5bVvvLqduNzIsB5/DEyol4FZeoEwNlUe09NU2nXM
	Lgg9w86OoweqoMgrZnyh8RIQA2EdQOYjdmdaU/bzF5IFGmubOfzsR6I9q75bXTHYibtv
	uLjw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20130820;
	h=x-gm-message-state:from:to:cc:subject:date:message-id;
	bh=PS5IslivI8HhlpcQ3scwSRFaiM/zXYFvQri6yBqef/Y=;
	b=cz7kgsuywi/BXjlJIveMLZQNFV8hMjkAgm2A8RxhjNAvDJan33/cpqc75FEsjYK8aF
	teQhwBx5qsf2woHYqm92twTj9oJIP4VsiPQwiY36uMV7MmRuCi63CMGWtHo3uyX2OBTL
	7wrJSv7ZQcHcoGPQ9eqKPQ9uqsQZWkJrINuB5dp4W8WnIXn16fSQcJTG4h7XfZw84IlM
	qwuLsuMBZsd1F1H0N4z4wFYyqqbgjqwuDto+5sO3/0hPugITW59lGeXTUfG+HWFxnOMD
	iTJ44ctG70LezFb0forKfAuEgC1pKYZDKF9Z1wNoQvP8GckFGrltD/xcXFn0cjIdyWGl
	LmVg==
X-Gm-Message-State: 
 AA6/9RkT7EICK0zyem2Ml+GzxN23faulg6VmyyNd0/n03mTqKNT+zfPr6w1ITfCWjkhC0LXd
X-Received: by 10.36.137.193 with SMTP id s184mr212492itd.51.1474592422321;
	Thu, 22 Sep 2016 18:00:22 -0700 (PDT)
Received: from trueserverstrongandfree.localdomain
	(CPEbc4dfb55ca31-CMbc4dfb55ca30.cpe.net.cable.rogers.com.
	[99.233.41.230]) by smtp.googlemail.com with ESMTPSA id
	z125sm575855itc.0.2016.09.22.18.00.21
	(version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
	Thu, 22 Sep 2016 18:00:21 -0700 (PDT)
From: Glauber Costa <glauber@scylladb.com>
To: linux-block@vger.kernel.org
Cc: Glauber Costa <glauber@scylladb.com>, Jens Axboe <axboe@kernel.dk>,
	linux-kernel@vger.kernel.org
Subject: [PATCH] cfq: fix starvation of asynchronous writes
Date: Thu, 22 Sep 2016 20:59:59 -0400
Message-Id: <1474592399-2790-1-git-send-email-glauber@scylladb.com>
X-Mailer: git-send-email 2.5.5
Sender: linux-block-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-block.vger.kernel.org>
X-Mailing-List: linux-block@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

While debugging timeouts happening in my application workload (ScyllaDB), I have
observed calls to open() taking a long time, ranging everywhere from 2 seconds -
the first ones that are enough to time out my application - to more than 30
seconds.

The problem seems to happen because XFS may block on pending metadata updates
under certain circumnstances, and that's confirmed with the following backtrace
taken by the offcputime tool (iovisor/bcc):

    ffffffffb90c57b1 finish_task_switch
    ffffffffb97dffb5 schedule
    ffffffffb97e310c schedule_timeout
    ffffffffb97e1f12 __down
    ffffffffb90ea821 down
    ffffffffc046a9dc xfs_buf_lock
    ffffffffc046abfb _xfs_buf_find
    ffffffffc046ae4a xfs_buf_get_map
    ffffffffc046babd xfs_buf_read_map
    ffffffffc0499931 xfs_trans_read_buf_map
    ffffffffc044a561 xfs_da_read_buf
    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
    ffffffffc0452e0f xfs_dir2_leaf_lookup
    ffffffffc044d9d3 xfs_dir_lookup
    ffffffffc047d1d9 xfs_lookup
    ffffffffc0479e53 xfs_vn_lookup
    ffffffffb925347a path_openat
    ffffffffb9254a71 do_filp_open
    ffffffffb9242a94 do_sys_open
    ffffffffb9242b9e sys_open
    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
    00007fb0698162ed [unknown]

Inspecting my run with blktrace, I can see that the xfsaild kthread exhibit very
high "Dispatch wait" times, on the dozens of seconds range and consistent with
the open() times I have saw in that run.

Still from the blktrace output, we can after searching a bit, identify the
request that wasn't dispatched:

  8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
  8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
  8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
  8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (   22911) [xfsaild/sda1]
  <==== 'I' means Inserted (into the IO scheduler) ===================================>
  8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
  <==== Only 15s later the CFQ scheduler dispatches the request ======================>

As we can see above, in this particular example CFQ took 15 seconds to dispatch
this request. Going back to the full trace, we can see that the xfsaild queue
had plenty of opportunity to run, and it was selected as the active queue many
times. It would just always be preempted by something else (example):

  8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
  8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
  8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
  8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
  8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
  8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
  8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0

where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
IO dispatchers.

The requests preempting the xfsaild queue are synchronous requests. That's a
characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
While it can be argued that preempting ASYNC requests in favor of SYNC is part
of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
goal.

Moreover, unless I am misunderstanding something, that breaks the expectation
set by the "fifo_expire_async" tunable, which in my system is set to the
default.

Looking at the code, it seems to me that the issue is that after we make
an async queue active, there is no guarantee that it will execute any request.

When the queue itself tests if it cfq_may_dispatch() it can bail if it sees SYNC
requests in flight. An incoming request from another queue can also preempt it
in such situation before we have the chance to execute anything (as seen in the
trace above).

This patch sets the must_dispatch flag if we notice that we have requests
that are already fifo_expired. This flag is always cleared after
cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
the queue for subsequent requests (unless they are themselves expired)

Care is taken during preempt to still allow rt requests to preempt us
regardless.

Testing my workload with this patch applied produces much better results.
From the application side I see no timeouts, and the open() latency histogram
generated by systemtap looks much better, with the worst outlier at 131ms:

Latency histogram of xfs_buf_lock acquisition (microseconds):
 value |-------------------------------------------------- count
     0 |                                                     11
     1 |@@@@                                                161
     2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
     4 |@                                                    54
     8 |                                                     36
    16 |                                                      7
    32 |                                                      0
    64 |                                                      0
       ~
  1024 |                                                      0
  2048 |                                                      0
  4096 |                                                      1
  8192 |                                                      1
 16384 |                                                      2
 32768 |                                                      0
 65536 |                                                      0
131072 |                                                      1
262144 |                                                      0
524288 |                                                      0

Signed-off-by: Glauber Costa <glauber@scylladb.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: linux-block@vger.kernel.org
CC: linux-kernel@vger.kernel.org

Signed-off-by: Glauber Costa <glauber@scylladb.com>
---
 block/cfq-iosched.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index cc2f6db..5e24d88 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3042,7 +3042,6 @@ static struct request *cfq_check_fifo(struct cfq_queue *cfqq)
 	if (ktime_get_ns() < rq->fifo_time)
 		rq = NULL;
 
-	cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
 	return rq;
 }
 
@@ -3420,6 +3419,9 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	unsigned int max_dispatch;
 
+	if (cfq_cfqq_must_dispatch(cfqq))
+		return true;
+
 	/*
 	 * Drain async requests before we start sync IO
 	 */
@@ -3511,15 +3513,20 @@ static bool cfq_dispatch_request(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 
 	BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));
 
+	rq = cfq_check_fifo(cfqq);
+	if (rq)
+		cfq_mark_cfqq_must_dispatch(cfqq);
+
 	if (!cfq_may_dispatch(cfqd, cfqq))
 		return false;
 
 	/*
 	 * follow expired path, else get first next available
 	 */
-	rq = cfq_check_fifo(cfqq);
 	if (!rq)
 		rq = cfqq->next_rq;
+	else
+		cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
 
 	/*
 	 * insert request into driver dispatch list
@@ -3989,7 +3996,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	 * if the new request is sync, but the currently running queue is
 	 * not, let the sync request have priority.
 	 */
-	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
+	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
 		return true;
 
 	/*