
[3/4] blk-mq: deal with shared queue mapping reliably

Message ID 20181216022517.26650-4-ming.lei@redhat.com (mailing list archive)
State New, archived
Series blk-mq: queue mapping fix & improvement

Commit Message

Ming Lei Dec. 16, 2018, 2:25 a.m. UTC
This patch sets map->nr_queues to zero explicitly when there are no
queues for a given queue type, so that blk_mq_map_swqueue() can handle
shared mappings more robustly.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c          |  3 +++
 drivers/nvme/host/pci.c | 37 ++++++++++++++++++++-----------------
 2 files changed, 23 insertions(+), 17 deletions(-)

Comments

Christoph Hellwig Dec. 16, 2018, 4:16 p.m. UTC | #1
On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
> This patch sets map->nr_queues to zero explicitly when there are no
> queues for a given queue type, so that blk_mq_map_swqueue() can handle
> shared mappings more robustly.

This looks a lot more clumsy than what we had before; can you explain
what additional robustness it buys us?
Mike Snitzer Dec. 16, 2018, 6:39 p.m. UTC | #2
On Sun, Dec 16 2018 at 11:16am -0500,
Christoph Hellwig <hch@lst.de> wrote:

> On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
> > This patch sets map->nr_queues to zero explicitly when there are no
> > queues for a given queue type, so that blk_mq_map_swqueue() can handle
> > shared mappings more robustly.
> 
> This looks a lot more clumsy than what we had before; can you explain
> what additional robustness it buys us?

It enables NVMe IO to complete on my testbed with the for-4.21/block
changes; this NUMA layout is what triggered Ming's work:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14
node 0 size: 128605 MB
node 0 free: 128092 MB
node 1 cpus: 1 3 5 7 9 11 13 15
node 1 size: 128997 MB
node 1 free: 128529 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

Without the aggregate changes from this patchset (1-3 anyway) I get IO
hangs in blkdev_fsync().

With that in mind, Jens needs these fixes (or something comparable)
ASAP.

Tested-by: Mike Snitzer <snitzer@redhat.com>
Jens Axboe Dec. 16, 2018, 7:12 p.m. UTC | #3
On 12/16/18 11:39 AM, Mike Snitzer wrote:
> On Sun, Dec 16 2018 at 11:16am -0500,
> Christoph Hellwig <hch@lst.de> wrote:
> 
>> On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
>>> This patch sets map->nr_queues to zero explicitly when there are no
>>> queues for a given queue type, so that blk_mq_map_swqueue() can handle
>>> shared mappings more robustly.
>>
>> This looks a lot more clumsy than what we had before; can you explain
>> what additional robustness it buys us?
> 
> It enables NVMe IO to complete on my testbed with the for-4.21/block
> changes; this NUMA layout is what triggered Ming's work:
> 
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6 8 10 12 14
> node 0 size: 128605 MB
> node 0 free: 128092 MB
> node 1 cpus: 1 3 5 7 9 11 13 15
> node 1 size: 128997 MB
> node 1 free: 128529 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
> 
> Without the aggregate changes from this patchset (1-3 anyway) I get IO
> hangs in blkdev_fsync().

Still puzzled. I'm not against the change, but the commit message has
NOTHING in the way of justification. "Make X more robust" doesn't
mean anything. Your followup brings no extra info to the table in
terms of what the bug is here. What exact bug is it fixing? Why is
fsync currently hanging?
Ming Lei Dec. 17, 2018, 1:04 a.m. UTC | #4
On Sun, Dec 16, 2018 at 12:12:17PM -0700, Jens Axboe wrote:
> On 12/16/18 11:39 AM, Mike Snitzer wrote:
> > On Sun, Dec 16 2018 at 11:16am -0500,
> > Christoph Hellwig <hch@lst.de> wrote:
> > 
> >> On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
> >>> This patch sets map->nr_queues to zero explicitly when there are no
> >>> queues for a given queue type, so that blk_mq_map_swqueue() can handle
> >>> shared mappings more robustly.
> >>
> >> This looks a lot more clumsy than what we had before; can you explain
> >> what additional robustness it buys us?
> > 
> > It enables NVMe IO to complete on my testbed with the for-4.21/block
> > changes; this NUMA layout is what triggered Ming's work:
> > 
> > # numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 2 4 6 8 10 12 14
> > node 0 size: 128605 MB
> > node 0 free: 128092 MB
> > node 1 cpus: 1 3 5 7 9 11 13 15
> > node 1 size: 128997 MB
> > node 1 free: 128529 MB
> > node distances:
> > node   0   1
> >   0:  10  21
> >   1:  21  10
> > 
> > Without the aggregate changes from this patchset (1-3 anyway) I get IO
> > hangs in blkdev_fsync().
> 
> Still puzzled. I'm not against the change, but the commit message has
> NOTHING in the way of justification. "Make X more robust" doesn't
> mean anything. Your followup brings no extra info to the table in
> terms of what the bug is here. What exact bug is it fixing? Why is
> fsync currently hanging?

As I explained in the previous mail, the poll queue uses a new mapping
built via blk_mq_map_queues(); blk_mq_map_swqueue() may then overwrite
hctx->type through this new mapping, and the queue mapping ends up
totally broken.
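For readers following along, here is a rough sketch of the loop in
question, heavily simplified from blk_mq_map_swqueue() (illustrative
only, not the exact upstream code). Because the poll map is built
independently by blk_mq_map_queues() instead of sharing the default
map's table, the cpumask short-circuit does not catch the duplicate,
and an hctx that was already set up as HCTX_TYPE_DEFAULT gets its type
clobbered:

	/* simplified sketch of blk_mq_map_swqueue(), illustrative only */
	for_each_possible_cpu(i) {
		for (j = 0; j < set->nr_maps; j++) {
			/*
			 * For j == HCTX_TYPE_POLL this consults the
			 * independently built poll table even though the
			 * poll type has zero queues of its own.
			 */
			hctx = blk_mq_map_queue_type(q, j, i);

			/*
			 * Meant to skip hctxs shared across queue maps,
			 * but the poll table can point at a *different*
			 * hctx for this CPU, so the test does not fire.
			 */
			if (cpumask_test_cpu(i, hctx->cpumask))
				continue;

			cpumask_set_cpu(i, hctx->cpumask);
			hctx->type = j;	/* clobbers the type assigned on an
					   earlier (HCTX_TYPE_DEFAULT) pass */
		}
	}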


Thanks,
Ming
Ming Lei Dec. 17, 2018, 1:27 a.m. UTC | #5
On Sun, Dec 16, 2018 at 05:16:50PM +0100, Christoph Hellwig wrote:
> On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
> > This patch sets map->nr_queues to zero explicitly when there are no
> > queues for a given queue type, so that blk_mq_map_swqueue() can handle
> > shared mappings more robustly.
> 
> This looks a lot more clumsy than what we had before; can you explain
> what additional robustness it buys us?

If a mapping type has no queues of its own, its .nr_queues is now set to
zero explicitly, so we don't need to touch the related hctx/ctx that is
retrieved via the shared mapping table.

thanks,
Ming
Christoph Hellwig Dec. 17, 2018, 7:44 a.m. UTC | #6
I suspect we want something like this to make sure we never look
at queue maps that don't have nr_queues set, and then just don't
have to initialize them in nvme:

diff --git a/block/blk-mq.h b/block/blk-mq.h
index b63a0de8a07a..d1ed096723fb 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -105,14 +105,17 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 {
 	enum hctx_type type = HCTX_TYPE_DEFAULT;
 
-	if (q->tag_set->nr_maps > HCTX_TYPE_POLL &&
-	    ((flags & REQ_HIPRI) && test_bit(QUEUE_FLAG_POLL, &q->queue_flags)))
+	if ((flags & REQ_HIPRI) &&
+	    q->tag_set->nr_maps > HCTX_TYPE_POLL && 
+	    q->tag_set->map[HCTX_TYPE_POLL].nr_queues &&
+	    test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
 		type = HCTX_TYPE_POLL;
 
-	else if (q->tag_set->nr_maps > HCTX_TYPE_READ &&
-		 ((flags & REQ_OP_MASK) == REQ_OP_READ))
+	else if (((flags & REQ_OP_MASK) == REQ_OP_READ) &&
+	         q->tag_set->nr_maps > HCTX_TYPE_READ &&
+		 q->tag_set->map[HCTX_TYPE_READ].nr_queues)
 		type = HCTX_TYPE_READ;
-
+	
 	return blk_mq_map_queue_type(q, type, cpu);
 }
 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fb9d8270f32c..698b350b38cf 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -496,11 +496,7 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 		map->nr_queues = dev->io_queues[i];
 		if (!map->nr_queues) {
 			BUG_ON(i == HCTX_TYPE_DEFAULT);
-
-			/* shared set, resuse read set parameters */
-			map->nr_queues = dev->io_queues[HCTX_TYPE_DEFAULT];
-			qoff = 0;
-			offset = queue_irq_offset(dev);
+			continue;
 		}
 
 		/*
Ming Lei Dec. 17, 2018, 8:38 a.m. UTC | #7
On Mon, Dec 17, 2018 at 08:44:42AM +0100, Christoph Hellwig wrote:
> I suspect we want something like this to make sure we never look
> at queue maps that don't have nr_queues set, and then just don't
> have to initialize them in nvme:
> 
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index b63a0de8a07a..d1ed096723fb 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -105,14 +105,17 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
>  {
>  	enum hctx_type type = HCTX_TYPE_DEFAULT;
>  
> -	if (q->tag_set->nr_maps > HCTX_TYPE_POLL &&
> -	    ((flags & REQ_HIPRI) && test_bit(QUEUE_FLAG_POLL, &q->queue_flags)))
> +	if ((flags & REQ_HIPRI) &&
> +	    q->tag_set->nr_maps > HCTX_TYPE_POLL && 
> +	    q->tag_set->map[HCTX_TYPE_POLL].nr_queues &&
> +	    test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
>  		type = HCTX_TYPE_POLL;
>  
> -	else if (q->tag_set->nr_maps > HCTX_TYPE_READ &&
> -		 ((flags & REQ_OP_MASK) == REQ_OP_READ))
> +	else if (((flags & REQ_OP_MASK) == REQ_OP_READ) &&
> +	         q->tag_set->nr_maps > HCTX_TYPE_READ &&
> +		 q->tag_set->map[HCTX_TYPE_READ].nr_queues)
>  		type = HCTX_TYPE_READ;
> -
> +	
>  	return blk_mq_map_queue_type(q, type, cpu);
>  }


>  
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index fb9d8270f32c..698b350b38cf 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -496,11 +496,7 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
>  		map->nr_queues = dev->io_queues[i];
>  		if (!map->nr_queues) {
>  			BUG_ON(i == HCTX_TYPE_DEFAULT);
> -
> -			/* shared set, resuse read set parameters */
> -			map->nr_queues = dev->io_queues[HCTX_TYPE_DEFAULT];
> -			qoff = 0;
> -			offset = queue_irq_offset(dev);
> +			continue;
>  		}
>  
>  		/*

This way works with just a small extra cost in blk_mq_map_queue(); given
that we cache the hctx in rq->mq_hctx, I think this approach is good.
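For context, the reason the extra per-lookup checks are cheap: the hctx
chosen by blk_mq_map_queue() is cached in the request when it is
allocated and reused from then on, so the branchier lookup only runs
once per request. A rough sketch (call sites and names simplified, not
the exact upstream code):

	/* at allocation time the map lookup runs once ... */
	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx->cpu);
	rq->mq_hctx = data->hctx;

	/* ... and later paths (completion, requeue, timeout) just use the
	 * cached pointer instead of redoing the lookup */
	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;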


Thanks,
Ming

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a4a0895dae65..a737d912c46b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2435,6 +2435,9 @@  static void blk_mq_map_swqueue(struct request_queue *q)
 		for (j = 0; j < set->nr_maps; j++) {
 			hctx = blk_mq_map_queue_type(q, j, i);
 
+			if (!set->map[j].nr_queues)
+				continue;
+
 			/*
 			 * If the CPU is already set in the mask, then we've
 			 * mapped this one already. This can happen if
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 95bd68be2078..43074c54279e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -492,29 +492,32 @@  static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 	offset = queue_irq_offset(dev);
 	for (i = 0, qoff = 0; i < set->nr_maps; i++) {
 		struct blk_mq_queue_map *map = &set->map[i];
+		unsigned nr_queues;
 
-		map->nr_queues = dev->io_queues[i];
-		if (!map->nr_queues) {
+		nr_queues = map->nr_queues = dev->io_queues[i];
+		if (!nr_queues) {
 			BUG_ON(i == HCTX_TYPE_DEFAULT);
 
-			/* shared set, resuse read set parameters */
-			map->nr_queues = dev->io_queues[HCTX_TYPE_DEFAULT];
+			/* shared set, reuse default set parameters and table */
+			nr_queues = dev->io_queues[HCTX_TYPE_DEFAULT];
 			qoff = 0;
 			offset = queue_irq_offset(dev);
-		}
 
-		/*
-		 * The poll queue(s) doesn't have an IRQ (and hence IRQ
-		 * affinity), so use the regular blk-mq cpu mapping if
-		 * poll queue(s) don't share mapping with TYPE_DEFAULT.
-		 */
-		map->queue_offset = qoff;
-		if (i != HCTX_TYPE_POLL || !qoff)
-			blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
-		else
-			blk_mq_map_queues(map);
-		qoff += map->nr_queues;
-		offset += map->nr_queues;
+			memcpy(map->mq_map, set->map[HCTX_TYPE_DEFAULT].mq_map,
+				   nr_cpu_ids * sizeof(map->mq_map[0]));
+		} else {
+			/*
+			 * The poll queue(s) doesn't have an IRQ (and hence IRQ
+			 * affinity), so use the regular blk-mq cpu mapping.
+			 */
+			map->queue_offset = qoff;
+			if (i != HCTX_TYPE_POLL)
+				blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
+			else
+				blk_mq_map_queues(map);
+		}
+		qoff += nr_queues;
+		offset += nr_queues;
 	}
 
 	return 0;