Date: Wed, 23 May 2018 22:56:15 -0700
From: Omar Sandoval
To: Ming Lei
Cc: Jens Axboe, linux-block@vger.kernel.org, stable@vger.kernel.org, Omar Sandoval
Subject: Re: [PATCH V3] blk-mq: avoid to starve tag allocation after allocation process migrates
Message-ID: <20180524055615.GF12533@vader>
References: <20180524030440.15744-1-ming.lei@redhat.com>
In-Reply-To: <20180524030440.15744-1-ming.lei@redhat.com>

On Thu, May 24, 2018 at 11:04:40AM +0800, Ming Lei wrote:
> When the allocation process is scheduled back and the mapped hw queue is
> changed, fake one extra wake up on previous queue for compensating wake up
> miss, so other allocations on the previous queue won't be starved.
>
> This patch fixes one request allocation hang issue, which can be
> triggered easily in case of very low nr_request.

Thanks, Ming, this looks better. One comment below.

> Cc:
> Cc: Omar Sandoval
> Signed-off-by: Ming Lei
> ---
> V3:
> 	- fix comments as suggested by Jens
> 	- remove the wrapper as suggested by Omar
> V2:
> 	- fix build failure
>
>
>  block/blk-mq-tag.c      | 12 ++++++++++++
>  include/linux/sbitmap.h |  7 +++++++
>  lib/sbitmap.c           | 22 ++++++++++++----------
>  3 files changed, 31 insertions(+), 10 deletions(-)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 336dde07b230..a4e58fc28a06 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -134,6 +134,8 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>  	ws = bt_wait_ptr(bt, data->hctx);
>  	drop_ctx = data->ctx == NULL;
>  	do {
> +		struct sbitmap_queue *bt_prev;
> +
>  		/*
>  		 * We're out of tags on this hardware queue, kick any
>  		 * pending IO submits before going to sleep waiting for
> @@ -159,6 +161,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>  		if (data->ctx)
>  			blk_mq_put_ctx(data->ctx);
>
> +		bt_prev = bt;
>  		io_schedule();
>
>  		data->ctx = blk_mq_get_ctx(data->q);
> @@ -170,6 +173,15 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>  			bt = &tags->bitmap_tags;
>
>  		finish_wait(&ws->wait, &wait);
> +
> +		/*
> +		 * If destination hw queue is changed, fake wake up on
> +		 * previous queue for compensating the wake up miss, so
> +		 * other allocations on previous queue won't be starved.
> +		 */
> +		if (bt != bt_prev)
> +			sbitmap_queue_wake_up(bt_prev);
> +
>  		ws = bt_wait_ptr(bt, data->hctx);
>  	} while (1);
>
> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> index 841585f6e5f2..bba9d80191b7 100644
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -484,6 +484,13 @@ static inline struct sbq_wait_state *sbq_wait_ptr(struct sbitmap_queue *sbq,
>  void sbitmap_queue_wake_all(struct sbitmap_queue *sbq);
>
>  /**
> + * sbitmap_queue_wake_up() - Wake up some of waiters in one waitqueue
> + * on a &struct sbitmap_queue.
> + * @sbq: Bitmap queue to wake up.
> + */
> +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq);
> +
> +/**
>   * sbitmap_queue_show() - Dump &struct sbitmap_queue information to a &struct
>   * seq_file.
>   * @sbq: Bitmap queue to show.
> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index e6a9c06ec70c..14e027a33ffa 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -335,8 +335,9 @@ void sbitmap_queue_resize(struct sbitmap_queue *sbq, unsigned int depth)
>  	if (sbq->wake_batch != wake_batch) {
>  		WRITE_ONCE(sbq->wake_batch, wake_batch);
>  		/*
> -		 * Pairs with the memory barrier in sbq_wake_up() to ensure that
> -		 * the batch size is updated before the wait counts.
> +		 * Pairs with the memory barrier in sbitmap_queue_wake_up()
> +		 * to ensure that the batch size is updated before the wait
> +		 * counts.
>  		 */
>  		smp_mb__before_atomic();
>  		for (i = 0; i < SBQ_WAIT_QUEUES; i++)
> @@ -425,7 +426,7 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
>  	return NULL;
>  }
>
> -static void sbq_wake_up(struct sbitmap_queue *sbq)
> +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq)
>  {
>  	struct sbq_wait_state *ws;
>  	unsigned int wake_batch;

Just after this is:

	/*
	 * Pairs with the memory barrier in set_current_state() to ensure the
	 * proper ordering of clear_bit()/waitqueue_active() in the waker and
	 * test_and_set_bit_lock()/prepare_to_wait()/finish_wait() in the
	 * waiter. See the comment on waitqueue_active(). This is __after_atomic
	 * because we just did clear_bit_unlock() in the caller.
	 */
	smp_mb__after_atomic();

But there's no atomic operation before this in blk_mq_get_tag(). I don't
think this memory barrier is necessary for the blk_mq_get_tag() case, so
let's move it to sbitmap_queue_clear():

That comment also mentioned clear_bit(), but we do clear_bit_unlock() now,
so we can update that at the same time.

> @@ -454,23 +455,24 @@ static void sbq_wake_up(struct sbitmap_queue *sbq)
>  		 */
>  		smp_mb__before_atomic();
>  		/*
> -		 * If there are concurrent callers to sbq_wake_up(), the last
> -		 * one to decrement the wait count below zero will bump it back
> -		 * up. If there is a concurrent resize, the count reset will
> -		 * either cause the cmpxchg to fail or overwrite after the
> -		 * cmpxchg.
> +		 * If there are concurrent callers to sbitmap_queue_wake_up(),
> +		 * the last one to decrement the wait count below zero will
> +		 * bump it back up. If there is a concurrent resize, the count
> +		 * reset will either cause the cmpxchg to fail or overwrite
> +		 * after the cmpxchg.
>  		 */
>  		atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wait_cnt + wake_batch);
>  		sbq_index_atomic_inc(&sbq->wake_index);
>  		wake_up_nr(&ws->wait, wake_batch);
>  	}
>  }
> +EXPORT_SYMBOL_GPL(sbitmap_queue_wake_up);
>
>  void sbitmap_queue_clear(struct sbitmap_queue *sbq, unsigned int nr,
>  			 unsigned int cpu)
>  {
>  	sbitmap_clear_bit_unlock(&sbq->sb, nr);
> -	sbq_wake_up(sbq);
> +	sbitmap_queue_wake_up(sbq);
>  	if (likely(!sbq->round_robin && nr < sbq->sb.depth))
>  		*per_cpu_ptr(sbq->alloc_hint, cpu) = nr;
>  }
> @@ -482,7 +484,7 @@ void sbitmap_queue_wake_all(struct sbitmap_queue *sbq)
>
>  	/*
>  	 * Pairs with the memory barrier in set_current_state() like in
> -	 * sbq_wake_up().
> +	 * sbitmap_queue_wake_up().
>  	 */
>  	smp_mb();
>  	wake_index = atomic_read(&sbq->wake_index);
> --
> 2.9.5
>

diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 14e027a33ffa..537328a98c06 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -432,15 +432,6 @@ void sbitmap_queue_wake_up(struct sbitmap_queue *sbq)
 	unsigned int wake_batch;
 	int wait_cnt;
 
-	/*
-	 * Pairs with the memory barrier in set_current_state() to ensure the
-	 * proper ordering of clear_bit()/waitqueue_active() in the waker and
-	 * test_and_set_bit_lock()/prepare_to_wait()/finish_wait() in the
-	 * waiter. See the comment on waitqueue_active(). This is __after_atomic
-	 * because we just did clear_bit_unlock() in the caller.
-	 */
-	smp_mb__after_atomic();
-
 	ws = sbq_wake_ptr(sbq);
 	if (!ws)
 		return;
@@ -472,6 +463,13 @@ void sbitmap_queue_clear(struct sbitmap_queue *sbq, unsigned int nr,
 			 unsigned int cpu)
 {
 	sbitmap_clear_bit_unlock(&sbq->sb, nr);
+	/*
+	 * Pairs with the memory barrier in set_current_state() to ensure the
+	 * proper ordering of clear_bit_unlock()/waitqueue_active() in the waker
+	 * and test_and_set_bit_lock()/prepare_to_wait()/finish_wait() in the
+	 * waiter. See the comment on waitqueue_active().
+	 */
+	smp_mb__after_atomic();
 	sbitmap_queue_wake_up(sbq);
 	if (likely(!sbq->round_robin && nr < sbq->sb.depth))
 		*per_cpu_ptr(sbq->alloc_hint, cpu) = nr;
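
As an aside, the ordering requirement discussed above can be illustrated with a
small user-space analogue; this is only a sketch, and the flag names tag_free
and waiter_parked are invented for illustration, not kernel symbols:

/* Minimal C11 sketch of the lost-wakeup pattern the barrier prevents. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* tag_free plays the role of the bit cleared by sbitmap_clear_bit_unlock();
 * waiter_parked plays the role of the waitqueue becoming non-empty. */
static atomic_bool tag_free;
static atomic_bool waiter_parked;

/* Waker side: free the resource, then a full fence (the analogue of the
 * smp_mb__after_atomic() placed in sbitmap_queue_clear()), then check for
 * waiters to wake. */
static bool waker(void)
{
	atomic_store_explicit(&tag_free, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&waiter_parked, memory_order_relaxed);
}

/* Waiter side: announce that we are about to sleep (set_current_state()
 * implies a full barrier), then recheck the resource before sleeping. */
static bool waiter(void)
{
	atomic_store_explicit(&waiter_parked, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&tag_free, memory_order_relaxed);
}

int main(void)
{
	/* With both fences, "waker misses the waiter" and "waiter misses the
	 * tag" cannot both happen, so a wakeup is never lost; dropping either
	 * fence reintroduces the kind of hang discussed in this thread. */
	printf("waker sees waiter: %d, waiter sees tag: %d\n", waker(), waiter());
	return 0;
}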