[3/4] blk-mq: deal with shared queue mapping reliably

Message ID	20181216022517.26650-4-ming.lei@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-block-owner@kernel.org> From: Ming Lei <ming.lei@redhat.com> To: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org, Ming Lei <ming.lei@redhat.com>, Jeff Moyer <jmoyer@redhat.com>, Mike Snitzer <snitzer@redhat.com>, Christoph Hellwig <hch@lst.de> Subject: [PATCH 3/4] blk-mq: deal with shared queue mapping reliably Date: Sun, 16 Dec 2018 10:25:16 +0800 Message-Id: <20181216022517.26650-4-ming.lei@redhat.com> In-Reply-To: <20181216022517.26650-1-ming.lei@redhat.com> References: <20181216022517.26650-1-ming.lei@redhat.com> Sender: linux-block-owner@vger.kernel.org Precedence: bulk
Series	blk-mq: queue mapping fix & improvement \| expand [0/4] blk-mq: queue mapping fix & improvement [1/4] nvme-pci: correct mapping for poll queue(s) [2/4] blk-mq: fix allocation for queue mapping table [3/4] blk-mq: deal with shared queue mapping reliably [4/4] blk-mq-debugfs: export hctx->type

Ming Lei Dec. 16, 2018, 2:25 a.m. UTC

This patch sets map->nr_queues as zero explictly if there is zero
queues for such queue type, then blk_mq_map_swqueue() can become
more robust to deal with shared mappings.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c          |  3 +++
 drivers/nvme/host/pci.c | 37 ++++++++++++++++++++-----------------
 2 files changed, 23 insertions(+), 17 deletions(-)

Christoph Hellwig Dec. 16, 2018, 4:16 p.m. UTC | #1

On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
> This patch sets map->nr_queues as zero explictly if there is zero
> queues for such queue type, then blk_mq_map_swqueue() can become
> more robust to deal with shared mappings.

This looks a lot more clumsy than what we had before, can you explain
what additional robustnes it buys us?

Mike Snitzer Dec. 16, 2018, 6:39 p.m. UTC | #2

On Sun, Dec 16 2018 at 11:16am -0500,
Christoph Hellwig <hch@lst.de> wrote:

> On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
> > This patch sets map->nr_queues as zero explictly if there is zero
> > queues for such queue type, then blk_mq_map_swqueue() can become
> > more robust to deal with shared mappings.
> 
> This looks a lot more clumsy than what we had before, can you explain
> what additional robustnes it buys us?

It enables nvme IO to complete on my testbed with for-4.21/block
changes, this NUMA layout is what triggered Ming's work:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14
node 0 size: 128605 MB
node 0 free: 128092 MB
node 1 cpus: 1 3 5 7 9 11 13 15
node 1 size: 128997 MB
node 1 free: 128529 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

Without the aggregate changes from this patchset (1-3 anyway) I get IO
hangs in blkdev_fsync().

With that in mind, Jens needs these fixes (or something comparable)
ASAP.

Tested-by: Mike Snitzer <snitzer@redhat.com>

Jens Axboe Dec. 16, 2018, 7:12 p.m. UTC | #3

On 12/16/18 11:39 AM, Mike Snitzer wrote:
> On Sun, Dec 16 2018 at 11:16am -0500,
> Christoph Hellwig <hch@lst.de> wrote:
> 
>> On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
>>> This patch sets map->nr_queues as zero explictly if there is zero
>>> queues for such queue type, then blk_mq_map_swqueue() can become
>>> more robust to deal with shared mappings.
>>
>> This looks a lot more clumsy than what we had before, can you explain
>> what additional robustnes it buys us?
> 
> It enables nvme IO to complete on my testbed with for-4.21/block
> changes, this NUMA layout is what triggered Ming's work:
> 
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6 8 10 12 14
> node 0 size: 128605 MB
> node 0 free: 128092 MB
> node 1 cpus: 1 3 5 7 9 11 13 15
> node 1 size: 128997 MB
> node 1 free: 128529 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
> 
> Without the aggregate changes from this patchset (1-3 anyway) I get IO
> hangs in blkdev_fsync().

Still puzzled. I'm not against the change, but the commit message has
NOTHING in the way of justification. "Make X more robust" doesn't
mean anything. Your followup brings no extra info to the table in
terms of what the bug is here. What exact bug is it fixing? Why is
fsync currently hanging?

Ming Lei Dec. 17, 2018, 1:04 a.m. UTC | #4

On Sun, Dec 16, 2018 at 12:12:17PM -0700, Jens Axboe wrote:
> On 12/16/18 11:39 AM, Mike Snitzer wrote:
> > On Sun, Dec 16 2018 at 11:16am -0500,
> > Christoph Hellwig <hch@lst.de> wrote:
> > 
> >> On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
> >>> This patch sets map->nr_queues as zero explictly if there is zero
> >>> queues for such queue type, then blk_mq_map_swqueue() can become
> >>> more robust to deal with shared mappings.
> >>
> >> This looks a lot more clumsy than what we had before, can you explain
> >> what additional robustnes it buys us?
> > 
> > It enables nvme IO to complete on my testbed with for-4.21/block
> > changes, this NUMA layout is what triggered Ming's work:
> > 
> > # numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 2 4 6 8 10 12 14
> > node 0 size: 128605 MB
> > node 0 free: 128092 MB
> > node 1 cpus: 1 3 5 7 9 11 13 15
> > node 1 size: 128997 MB
> > node 1 free: 128529 MB
> > node distances:
> > node   0   1
> >   0:  10  21
> >   1:  21  10
> > 
> > Without the aggregate changes from this patchset (1-3 anyway) I get IO
> > hangs in blkdev_fsync().
> 
> Still puzzled. I'm not against the change, but the commit message has
> NOTHING in the way of justification. "Make X more robust" doesn't
> mean anything. Your followup brings no extra info to the table in
> terms of what the bug is here. What exact bug is it fixing? Why is
> fsync currently hanging?

As I explained in previous mail, poll queue uses new mapping via
blk_mq_map_queues(), then blk_mq_map_swqueue() may over-write hctx->type
by this new mapping, finally the queue mapping is totally broken.


Thanks,
Ming

Ming Lei Dec. 17, 2018, 1:27 a.m. UTC | #5

On Sun, Dec 16, 2018 at 05:16:50PM +0100, Christoph Hellwig wrote:
> On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
> > This patch sets map->nr_queues as zero explictly if there is zero
> > queues for such queue type, then blk_mq_map_swqueue() can become
> > more robust to deal with shared mappings.
> 
> This looks a lot more clumsy than what we had before, can you explain
> what additional robustnes it buys us?

If there isn't one mapping type, its .nr_queues is set as zero explicitly,
then we don't need to touch the related hctx/ctx which is retrieved via the
shared mapping table.

thanks,
Ming

Christoph Hellwig Dec. 17, 2018, 7:44 a.m. UTC | #6

I suspect we want something like this to make sure we never look
at queue maps that don't have nr_queues set, and then just don't
have to initialize them in nvme:

diff --git a/block/blk-mq.h b/block/blk-mq.h
index b63a0de8a07a..d1ed096723fb 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -105,14 +105,17 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 {
 	enum hctx_type type = HCTX_TYPE_DEFAULT;
 
-	if (q->tag_set->nr_maps > HCTX_TYPE_POLL &&
-	    ((flags & REQ_HIPRI) && test_bit(QUEUE_FLAG_POLL, &q->queue_flags)))
+	if ((flags & REQ_HIPRI) &&
+	    q->tag_set->nr_maps > HCTX_TYPE_POLL && 
+	    q->tag_set->map[HCTX_TYPE_POLL].nr_queues &&
+	    test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
 		type = HCTX_TYPE_POLL;
 
-	else if (q->tag_set->nr_maps > HCTX_TYPE_READ &&
-		 ((flags & REQ_OP_MASK) == REQ_OP_READ))
+	else if (((flags & REQ_OP_MASK) == REQ_OP_READ) &&
+	         q->tag_set->nr_maps > HCTX_TYPE_READ &&
+		 q->tag_set->map[HCTX_TYPE_READ].nr_queues)
 		type = HCTX_TYPE_READ;
-
+	
 	return blk_mq_map_queue_type(q, type, cpu);
 }
 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fb9d8270f32c..698b350b38cf 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -496,11 +496,7 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 		map->nr_queues = dev->io_queues[i];
 		if (!map->nr_queues) {
 			BUG_ON(i == HCTX_TYPE_DEFAULT);
-
-			/* shared set, resuse read set parameters */
-			map->nr_queues = dev->io_queues[HCTX_TYPE_DEFAULT];
-			qoff = 0;
-			offset = queue_irq_offset(dev);
+			continue;
 		}
 
 		/*

Ming Lei Dec. 17, 2018, 8:38 a.m. UTC | #7

On Mon, Dec 17, 2018 at 08:44:42AM +0100, Christoph Hellwig wrote:
> I suspect we want something like this to make sure we never look
> at queue maps that don't have nr_queues set, and then just don't
> have to initialize them in nvme:
> 
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index b63a0de8a07a..d1ed096723fb 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -105,14 +105,17 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
>  {
>  	enum hctx_type type = HCTX_TYPE_DEFAULT;
>  
> -	if (q->tag_set->nr_maps > HCTX_TYPE_POLL &&
> -	    ((flags & REQ_HIPRI) && test_bit(QUEUE_FLAG_POLL, &q->queue_flags)))
> +	if ((flags & REQ_HIPRI) &&
> +	    q->tag_set->nr_maps > HCTX_TYPE_POLL && 
> +	    q->tag_set->map[HCTX_TYPE_POLL].nr_queues &&
> +	    test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
>  		type = HCTX_TYPE_POLL;
>  
> -	else if (q->tag_set->nr_maps > HCTX_TYPE_READ &&
> -		 ((flags & REQ_OP_MASK) == REQ_OP_READ))
> +	else if (((flags & REQ_OP_MASK) == REQ_OP_READ) &&
> +	         q->tag_set->nr_maps > HCTX_TYPE_READ &&
> +		 q->tag_set->map[HCTX_TYPE_READ].nr_queues)
>  		type = HCTX_TYPE_READ;
> -
> +	
>  	return blk_mq_map_queue_type(q, type, cpu);
>  }


>  
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index fb9d8270f32c..698b350b38cf 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -496,11 +496,7 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
>  		map->nr_queues = dev->io_queues[i];
>  		if (!map->nr_queues) {
>  			BUG_ON(i == HCTX_TYPE_DEFAULT);
> -
> -			/* shared set, resuse read set parameters */
> -			map->nr_queues = dev->io_queues[HCTX_TYPE_DEFAULT];
> -			qoff = 0;
> -			offset = queue_irq_offset(dev);
> +			continue;
>  		}
>  
>  		/*

This way works just with a bit cost in blk_mq_map_queue(), given
we have cached hctx in rq->mq_hctx, I think this approach is good.


Thanks,
Ming

[3/4] blk-mq: deal with shared queue mapping reliably

Commit Message

Comments

Patch