Message ID | 20181216022517.26650-4-ming.lei@redhat.com (mailing list archive)
---|---
State | New, archived
Series | blk-mq: queue mapping fix & improvement
On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
> This patch sets map->nr_queues as zero explicitly if there are zero
> queues for such queue type, then blk_mq_map_swqueue() can become
> more robust to deal with shared mappings.

This looks a lot more clumsy than what we had before, can you explain
what additional robustness it buys us?
On Sun, Dec 16 2018 at 11:16am -0500,
Christoph Hellwig <hch@lst.de> wrote:

> On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
> > This patch sets map->nr_queues as zero explicitly if there are zero
> > queues for such queue type, then blk_mq_map_swqueue() can become
> > more robust to deal with shared mappings.
>
> This looks a lot more clumsy than what we had before, can you explain
> what additional robustness it buys us?

It enables nvme IO to complete on my testbed with for-4.21/block
changes, this NUMA layout is what triggered Ming's work:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14
node 0 size: 128605 MB
node 0 free: 128092 MB
node 1 cpus: 1 3 5 7 9 11 13 15
node 1 size: 128997 MB
node 1 free: 128529 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

Without the aggregate changes from this patchset (1-3 anyway) I get IO
hangs in blkdev_fsync(). With that in mind, Jens needs these fixes (or
something comparable) ASAP.

Tested-by: Mike Snitzer <snitzer@redhat.com>
On 12/16/18 11:39 AM, Mike Snitzer wrote:
> On Sun, Dec 16 2018 at 11:16am -0500,
> Christoph Hellwig <hch@lst.de> wrote:
>
>> On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
>>> This patch sets map->nr_queues as zero explicitly if there are zero
>>> queues for such queue type, then blk_mq_map_swqueue() can become
>>> more robust to deal with shared mappings.
>>
>> This looks a lot more clumsy than what we had before, can you explain
>> what additional robustness it buys us?
>
> It enables nvme IO to complete on my testbed with for-4.21/block
> changes, this NUMA layout is what triggered Ming's work:
>
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6 8 10 12 14
> node 0 size: 128605 MB
> node 0 free: 128092 MB
> node 1 cpus: 1 3 5 7 9 11 13 15
> node 1 size: 128997 MB
> node 1 free: 128529 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
>
> Without the aggregate changes from this patchset (1-3 anyway) I get IO
> hangs in blkdev_fsync().

Still puzzled. I'm not against the change, but the commit message has
NOTHING in the way of justification. "Make X more robust" doesn't mean
anything. Your followup brings no extra info to the table in terms of
what the bug is here. What exact bug is it fixing? Why is fsync
currently hanging?
On Sun, Dec 16, 2018 at 12:12:17PM -0700, Jens Axboe wrote:
> On 12/16/18 11:39 AM, Mike Snitzer wrote:
> > On Sun, Dec 16 2018 at 11:16am -0500,
> > Christoph Hellwig <hch@lst.de> wrote:
> >
> >> On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
> >>> This patch sets map->nr_queues as zero explicitly if there are zero
> >>> queues for such queue type, then blk_mq_map_swqueue() can become
> >>> more robust to deal with shared mappings.
> >>
> >> This looks a lot more clumsy than what we had before, can you explain
> >> what additional robustness it buys us?
> >
> > It enables nvme IO to complete on my testbed with for-4.21/block
> > changes, this NUMA layout is what triggered Ming's work:
> >
> > # numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 2 4 6 8 10 12 14
> > node 0 size: 128605 MB
> > node 0 free: 128092 MB
> > node 1 cpus: 1 3 5 7 9 11 13 15
> > node 1 size: 128997 MB
> > node 1 free: 128529 MB
> > node distances:
> > node   0   1
> >   0:  10  21
> >   1:  21  10
> >
> > Without the aggregate changes from this patchset (1-3 anyway) I get IO
> > hangs in blkdev_fsync().
>
> Still puzzled. I'm not against the change, but the commit message has
> NOTHING in the way of justification. "Make X more robust" doesn't mean
> anything. Your followup brings no extra info to the table in terms of
> what the bug is here. What exact bug is it fixing? Why is fsync
> currently hanging?

As I explained in the previous mail, the poll queues get a fresh mapping
built via blk_mq_map_queues(), then blk_mq_map_swqueue() may overwrite
hctx->type through this new mapping, and finally the queue mapping is
totally broken.

Thanks,
Ming
On Sun, Dec 16, 2018 at 05:16:50PM +0100, Christoph Hellwig wrote:
> On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
> > This patch sets map->nr_queues as zero explicitly if there are zero
> > queues for such queue type, then blk_mq_map_swqueue() can become
> > more robust to deal with shared mappings.
>
> This looks a lot more clumsy than what we had before, can you explain
> what additional robustness it buys us?

If a mapping type isn't used, its .nr_queues is set to zero explicitly,
so we don't need to touch the related hctx/ctx retrieved via the shared
mapping table.

thanks,
Ming
I suspect we want something like this to make sure we never look
at queue maps that don't have nr_queues set, and then just don't
have to initialize them in nvme:

diff --git a/block/blk-mq.h b/block/blk-mq.h
index b63a0de8a07a..d1ed096723fb 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -105,14 +105,17 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 {
 	enum hctx_type type = HCTX_TYPE_DEFAULT;
 
-	if (q->tag_set->nr_maps > HCTX_TYPE_POLL &&
-	    ((flags & REQ_HIPRI) && test_bit(QUEUE_FLAG_POLL, &q->queue_flags)))
+	if ((flags & REQ_HIPRI) &&
+	    q->tag_set->nr_maps > HCTX_TYPE_POLL &&
+	    q->tag_set->map[HCTX_TYPE_POLL].nr_queues &&
+	    test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
 		type = HCTX_TYPE_POLL;
 
-	else if (q->tag_set->nr_maps > HCTX_TYPE_READ &&
-		 ((flags & REQ_OP_MASK) == REQ_OP_READ))
+	else if (((flags & REQ_OP_MASK) == REQ_OP_READ) &&
+		 q->tag_set->nr_maps > HCTX_TYPE_READ &&
+		 q->tag_set->map[HCTX_TYPE_READ].nr_queues)
 		type = HCTX_TYPE_READ;
-
+
 	return blk_mq_map_queue_type(q, type, cpu);
 }

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fb9d8270f32c..698b350b38cf 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -496,11 +496,7 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 		map->nr_queues = dev->io_queues[i];
 		if (!map->nr_queues) {
 			BUG_ON(i == HCTX_TYPE_DEFAULT);
-
-			/* shared set, resuse read set parameters */
-			map->nr_queues = dev->io_queues[HCTX_TYPE_DEFAULT];
-			qoff = 0;
-			offset = queue_irq_offset(dev);
+			continue;
 		}
 
 		/*
On Mon, Dec 17, 2018 at 08:44:42AM +0100, Christoph Hellwig wrote:
> I suspect we want something like this to make sure we never look
> at queue maps that don't have nr_queues set, and then just don't
> have to initialize them in nvme:
>
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index b63a0de8a07a..d1ed096723fb 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -105,14 +105,17 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
>  {
>  	enum hctx_type type = HCTX_TYPE_DEFAULT;
>  
> -	if (q->tag_set->nr_maps > HCTX_TYPE_POLL &&
> -	    ((flags & REQ_HIPRI) && test_bit(QUEUE_FLAG_POLL, &q->queue_flags)))
> +	if ((flags & REQ_HIPRI) &&
> +	    q->tag_set->nr_maps > HCTX_TYPE_POLL &&
> +	    q->tag_set->map[HCTX_TYPE_POLL].nr_queues &&
> +	    test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
>  		type = HCTX_TYPE_POLL;
>  
> -	else if (q->tag_set->nr_maps > HCTX_TYPE_READ &&
> -		 ((flags & REQ_OP_MASK) == REQ_OP_READ))
> +	else if (((flags & REQ_OP_MASK) == REQ_OP_READ) &&
> +		 q->tag_set->nr_maps > HCTX_TYPE_READ &&
> +		 q->tag_set->map[HCTX_TYPE_READ].nr_queues)
>  		type = HCTX_TYPE_READ;
> -
> +
>  	return blk_mq_map_queue_type(q, type, cpu);
>  }
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index fb9d8270f32c..698b350b38cf 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -496,11 +496,7 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
>  		map->nr_queues = dev->io_queues[i];
>  		if (!map->nr_queues) {
>  			BUG_ON(i == HCTX_TYPE_DEFAULT);
> -
> -			/* shared set, resuse read set parameters */
> -			map->nr_queues = dev->io_queues[HCTX_TYPE_DEFAULT];
> -			qoff = 0;
> -			offset = queue_irq_offset(dev);
> +			continue;
>  		}
>  
>  		/*

This way works with just a bit of cost in blk_mq_map_queue(); given we
have cached the hctx in rq->mq_hctx, I think this approach is good.

Thanks,
Ming
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a4a0895dae65..a737d912c46b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2435,6 +2435,9 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 		for (j = 0; j < set->nr_maps; j++) {
 			hctx = blk_mq_map_queue_type(q, j, i);
 
+			if (!set->map[j].nr_queues)
+				continue;
+
 			/*
 			 * If the CPU is already set in the mask, then we've
 			 * mapped this one already. This can happen if
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 95bd68be2078..43074c54279e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -492,29 +492,32 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 	offset = queue_irq_offset(dev);
 	for (i = 0, qoff = 0; i < set->nr_maps; i++) {
 		struct blk_mq_queue_map *map = &set->map[i];
+		unsigned nr_queues;
 
-		map->nr_queues = dev->io_queues[i];
-		if (!map->nr_queues) {
+		nr_queues = map->nr_queues = dev->io_queues[i];
+		if (!nr_queues) {
 			BUG_ON(i == HCTX_TYPE_DEFAULT);
 
-			/* shared set, resuse read set parameters */
-			map->nr_queues = dev->io_queues[HCTX_TYPE_DEFAULT];
+			/* shared set, reuse default set parameters and table */
+			nr_queues = dev->io_queues[HCTX_TYPE_DEFAULT];
 			qoff = 0;
 			offset = queue_irq_offset(dev);
-		}
-
-		/*
-		 * The poll queue(s) doesn't have an IRQ (and hence IRQ
-		 * affinity), so use the regular blk-mq cpu mapping if
-		 * poll queue(s) don't share mapping with TYPE_DEFAULT.
-		 */
-		map->queue_offset = qoff;
-		if (i != HCTX_TYPE_POLL || !qoff)
-			blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
-		else
-			blk_mq_map_queues(map);
-		qoff += map->nr_queues;
-		offset += map->nr_queues;
+			memcpy(map->mq_map, set->map[HCTX_TYPE_DEFAULT].mq_map,
+			       nr_cpu_ids * sizeof(map->mq_map[0]));
+		} else {
+			/*
+			 * The poll queue(s) doesn't have an IRQ (and hence IRQ
+			 * affinity), so use the regular blk-mq cpu mapping.
+			 */
+			map->queue_offset = qoff;
+			if (i != HCTX_TYPE_POLL)
+				blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
+			else
+				blk_mq_map_queues(map);
+		}
+		qoff += nr_queues;
+		offset += nr_queues;
 	}
 
 	return 0;
This patch sets map->nr_queues as zero explicitly if there are zero
queues for such queue type, then blk_mq_map_swqueue() can become more
robust to deal with shared mappings.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c          |  3 +++
 drivers/nvme/host/pci.c | 37 ++++++++++++++++++++-----------------
 2 files changed, 23 insertions(+), 17 deletions(-)