Message ID | 1508680074-2417-1-git-send-email-israelr@mellanox.com (mailing list archive)
---|---
State | New, archived
> Currently, blk_mq_tagset_iter() iterates over the initial hctx tags only.
> When a scheduler is used, it doesn't iterate over the hctx scheduler tags,
> so the static requests aren't updated.
> For example, when using an NVMe over Fabrics RDMA host, this causes us not to
> reinit the scheduler requests and thus not re-register all the memory regions
> during the tagset re-initialization in the reconnect flow.

I think this is a sign that we should stop embedding memory regions in the
pre-allocated requests. It's too many resources that we waste. In our case,
tags are not really cheap, given that each one takes a physical HW resource
(an RDMA memory region).

I think we should switch (again) to a pool design instead. I guess it's time
for a generic MR pool that will serve nvmf, xprt, srp, iser and friends.
On Sun, Oct 22, 2017 at 09:32:00PM +0300, Sagi Grimberg wrote:
>> Currently, blk_mq_tagset_iter() iterates over the initial hctx tags only.
>> When a scheduler is used, it doesn't iterate over the hctx scheduler tags,
>> so the static requests aren't updated.
>> For example, when using an NVMe over Fabrics RDMA host, this causes us not to
>> reinit the scheduler requests and thus not re-register all the memory regions
>> during the tagset re-initialization in the reconnect flow.
>
> I think this is a sign that we should stop embedding memory
> regions in the pre-allocated requests. It's too many resources
> that we waste. In our case, tags are not really cheap, given
> that each one takes a physical HW resource (an RDMA memory region).
>
> I think we should switch (again) to a pool design instead. I guess it's
> time for a generic MR pool that will serve nvmf, xprt, srp, iser and
> friends.

Like drivers/infiniband/core/mr_pool.c? :)
On Sun, 2017-10-22 at 13:47 +0000, Israel Rukshin wrote:
> @@ -441,6 +442,8 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
>  	if (!hctx->sched_tags)
>  		return -ENOMEM;
>
> +	set->sched_tags[hctx_idx] = hctx->sched_tags;
> +
>  	ret = blk_mq_alloc_rqs(set, hctx->sched_tags, hctx_idx, q->nr_requests);
>  	if (ret)
>  		blk_mq_sched_free_tags(set, hctx, hctx_idx);

Will this work as expected if a tag set is shared across multiple request queues?

> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index c81b40e..c290de0 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -322,6 +322,22 @@ int blk_mq_tagset_iter(struct blk_mq_tag_set *set, void *data,
>  		}
>  	}
>
> +	for (i = 0; i < set->nr_hw_queues; i++) {
> +		struct blk_mq_tags *sched_tags = set->sched_tags[i];
> +
> +		if (!sched_tags)
> +			continue;
> +
> +		for (j = 0; j < sched_tags->nr_tags; j++) {
> +			if (!sched_tags->static_rqs[j])
> +				continue;
> +
> +			ret = fn(data, sched_tags->static_rqs[j]);
> +			if (ret)
> +				goto out;
> +		}
> +	}

If both a scheduler tag and a driver tag have been assigned to a request,
can this cause blk_mq_tagset_iter() to call fn() twice for the same request?

Thanks,

Bart.
>>> Currently, blk_mq_tagset_iter() iterates over the initial hctx tags only.
>>> When a scheduler is used, it doesn't iterate over the hctx scheduler tags,
>>> so the static requests aren't updated.
>>> For example, when using an NVMe over Fabrics RDMA host, this causes us not to
>>> reinit the scheduler requests and thus not re-register all the memory regions
>>> during the tagset re-initialization in the reconnect flow.
>>
>> I think this is a sign that we should stop embedding memory
>> regions in the pre-allocated requests. It's too many resources
>> that we waste. In our case, tags are not really cheap, given
>> that each one takes a physical HW resource (an RDMA memory region).
>>
>> I think we should switch (again) to a pool design instead. I guess it's
>> time for a generic MR pool that will serve nvmf, xprt, srp, iser and
>> friends.
>
> Like drivers/infiniband/core/mr_pool.c? :)

Yea :) forgot we had that... Note that it does introduce a new spinlock to
our hot path, but given the current over-allocation scheme with schedulers,
it's probably better off.
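[Editor's note: the pool design discussed above can be sketched in miniature. The following is a hedged userspace analog, not the kernel's drivers/infiniband/core/mr_pool.c: struct mr, mr_pool_init(), mr_pool_get(), and mr_pool_put() are illustrative names, and the spinlock the real pool takes on the hot path is elided for a single-threaded demo. The point is the shape: allocate a fixed number of MRs up front and let requests borrow one on demand instead of embedding one MR per pre-allocated tag.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct mr {			/* hypothetical stand-in for struct ib_mr */
	struct mr *next;
};

struct mr_pool {
	struct mr *free_list;	/* the kernel pool guards this with a spinlock */
	int nr_free;
};

/* Pre-allocate nr MRs once, at pool creation time. */
static int mr_pool_init(struct mr_pool *pool, int nr)
{
	pool->free_list = NULL;
	pool->nr_free = 0;
	for (int i = 0; i < nr; i++) {
		struct mr *mr = calloc(1, sizeof(*mr));
		if (!mr)
			return -1;
		mr->next = pool->free_list;
		pool->free_list = mr;
		pool->nr_free++;
	}
	return 0;
}

/* Borrow an MR for a request; NULL when the pool is exhausted. */
static struct mr *mr_pool_get(struct mr_pool *pool)
{
	struct mr *mr = pool->free_list;
	if (mr) {
		pool->free_list = mr->next;
		pool->nr_free--;
	}
	return mr;
}

/* Return the MR when the request completes. */
static void mr_pool_put(struct mr_pool *pool, struct mr *mr)
{
	mr->next = pool->free_list;
	pool->free_list = mr;
	pool->nr_free++;
}
```

With a scheduler over-allocating static requests, this keeps the number of HW memory regions bounded by expected queue depth rather than by total allocated requests.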
On Mon, Oct 23, 2017 at 10:27:29AM +0300, Sagi Grimberg wrote:
> Note that it does introduce a new spinlock to our hot path, but given
> the current over-allocation scheme with schedulers, it's probably better
> off.

We could look into llists if it matters.
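[Editor's note: the llist suggestion refers to the kernel's lock-free linked list, where producers push with a cmpxchg loop and only a single consumer may pop, which would avoid the pool spinlock on the hot path. A hedged C11 userspace sketch follows; llist_push() and llist_pop() are illustrative names, not the kernel's llist_add()/llist_del_first().]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct lnode {
	struct lnode *next;
};

struct lhead {
	_Atomic(struct lnode *) first;
};

/* Lock-free push: safe with any number of concurrent producers. */
static void llist_push(struct lhead *h, struct lnode *n)
{
	struct lnode *old = atomic_load(&h->first);
	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&h->first, &old, n));
}

/* Pop the most recently pushed node. Like llist_del_first(), this is
 * only safe with a single consumer; concurrent poppers would hit ABA. */
static struct lnode *llist_pop(struct lhead *h)
{
	struct lnode *old = atomic_load(&h->first);
	while (old && !atomic_compare_exchange_weak(&h->first, &old, old->next))
		;
	return old;
}
```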
>> @@ -441,6 +442,8 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
>>  	if (!hctx->sched_tags)
>>  		return -ENOMEM;
>>
>> +	set->sched_tags[hctx_idx] = hctx->sched_tags;
>> +
>>  	ret = blk_mq_alloc_rqs(set, hctx->sched_tags, hctx_idx, q->nr_requests);
>>  	if (ret)
>>  		blk_mq_sched_free_tags(set, hctx, hctx_idx);
>
> Will this work as expected if a tag set is shared across multiple request queues?

Probably not.

>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index c81b40e..c290de0 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -322,6 +322,22 @@ int blk_mq_tagset_iter(struct blk_mq_tag_set *set, void *data,
>>  		}
>>  	}
>>
>> +	for (i = 0; i < set->nr_hw_queues; i++) {
>> +		struct blk_mq_tags *sched_tags = set->sched_tags[i];
>> +
>> +		if (!sched_tags)
>> +			continue;
>> +
>> +		for (j = 0; j < sched_tags->nr_tags; j++) {
>> +			if (!sched_tags->static_rqs[j])
>> +				continue;
>> +
>> +			ret = fn(data, sched_tags->static_rqs[j]);
>> +			if (ret)
>> +				goto out;
>> +		}
>> +	}
>
> If both a scheduler tag and a driver tag have been assigned to a request,
> can this cause blk_mq_tagset_iter() to call fn() twice for the same request?

It will. It's not a big issue for the reinit functionality, but it might be
if it is used for something else. I don't think it's a good idea to go down
this route.
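[Editor's note: the double-call scenario raised above can be modeled in a few lines. This is a hypothetical toy, not block layer code: if the same request pointer shows up in both arrays a two-pass iterator walks, a naive iterator invokes fn() on that request twice.]

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for struct request. */
struct request {
	int visits;
};

static int count_visit(void *data, struct request *rq)
{
	(void)data;
	rq->visits++;	/* a reinit-style callback would re-register an MR here */
	return 0;
}

/* Naive two-pass iteration over driver and scheduler request arrays,
 * with no check for a request appearing in both. */
static void iter_both(struct request **driver_rqs, struct request **sched_rqs,
		      int nr, int (*fn)(void *, struct request *), void *data)
{
	for (int i = 0; i < nr; i++)
		if (driver_rqs[i])
			fn(data, driver_rqs[i]);
	for (int i = 0; i < nr; i++)
		if (sched_rqs[i])
			fn(data, sched_rqs[i]);
}
```

Harmless for an idempotent reinit callback, but wrong for any fn() that must run exactly once per request.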
>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>> index c81b40e..c290de0 100644
>>> --- a/block/blk-mq-tag.c
>>> +++ b/block/blk-mq-tag.c
>>> @@ -322,6 +322,22 @@ int blk_mq_tagset_iter(struct blk_mq_tag_set *set, void *data,
>>>  		}
>>>  	}
>>>
>>> +	for (i = 0; i < set->nr_hw_queues; i++) {
>>> +		struct blk_mq_tags *sched_tags = set->sched_tags[i];
>>> +
>>> +		if (!sched_tags)
>>> +			continue;
>>> +
>>> +		for (j = 0; j < sched_tags->nr_tags; j++) {
>>> +			if (!sched_tags->static_rqs[j])
>>> +				continue;
>>> +
>>> +			ret = fn(data, sched_tags->static_rqs[j]);
>>> +			if (ret)
>>> +				goto out;
>>> +		}
>>> +	}
>>
>> If both a scheduler tag and a driver tag have been assigned to a request,
>> can this cause blk_mq_tagset_iter() to call fn() twice for the same request?
>
> It will. It's not a big issue for the reinit functionality, but it might be
> if it is used for something else. I don't think it's a good idea to go down
> this route.

Don't you think sched_tags should be part of the tagset (like the driver
tags)? If so, blk_mq_tagset_iter() should call fn() on both scheduler and
driver tags. Otherwise, let's call it blk_mq_tags_iter() instead and pass a
struct blk_mq_tags *.

The private case of fn() == reinit() for NVMe over Fabrics RDMA can be
solved by an MR pool, but maybe we should look at the general case too.
Either way, we currently can't use a scheduler for NVMe over Fabrics RDMA
because of this issue.
On Sun, Oct 22, 2017 at 01:47:54PM +0000, Israel Rukshin wrote:
> Currently, blk_mq_tagset_iter() iterates over the initial hctx tags only.
> When a scheduler is used, it doesn't iterate over the hctx scheduler tags,
> so the static requests aren't updated.
> For example, when using an NVMe over Fabrics RDMA host, this causes us not to
> reinit the scheduler requests and thus not re-register all the memory regions
> during the tagset re-initialization in the reconnect flow.
>
> This may lead to a memory registration error:
> "MEMREG for CQE 0xffff88044c14dce8 failed with status memory
> management operation error (6)"
>
> Signed-off-by: Israel Rukshin <israelr@mellanox.com>
> Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
> ---
>
> The commit is based on the nvme branch for 4.15, which includes
> Sagi's patches for reinit_tagset.
>
> ---
>  block/blk-mq-sched.c   |  3 +++
>  block/blk-mq-tag.c     | 16 ++++++++++++++++
>  block/blk-mq.c         | 14 +++++++++++++-
>  include/linux/blk-mq.h |  1 +
>  4 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 4ab6943..4db9797 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -426,6 +426,7 @@ static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
>  		blk_mq_free_rqs(set, hctx->sched_tags, hctx_idx);
>  		blk_mq_free_rq_map(hctx->sched_tags);
>  		hctx->sched_tags = NULL;
> +		set->sched_tags[hctx_idx] = NULL;
>  	}
>  }
>
> @@ -441,6 +442,8 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
>  	if (!hctx->sched_tags)
>  		return -ENOMEM;
>
> +	set->sched_tags[hctx_idx] = hctx->sched_tags;
> +
>  	ret = blk_mq_alloc_rqs(set, hctx->sched_tags, hctx_idx, q->nr_requests);
>  	if (ret)
>  		blk_mq_sched_free_tags(set, hctx, hctx_idx);
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index c81b40e..c290de0 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -322,6 +322,22 @@ int blk_mq_tagset_iter(struct blk_mq_tag_set *set, void *data,
>  		}
>  	}
>
> +	for (i = 0; i < set->nr_hw_queues; i++) {
> +		struct blk_mq_tags *sched_tags = set->sched_tags[i];
> +
> +		if (!sched_tags)
> +			continue;
> +
> +		for (j = 0; j < sched_tags->nr_tags; j++) {
> +			if (!sched_tags->static_rqs[j])
> +				continue;
> +
> +			ret = fn(data, sched_tags->static_rqs[j]);
> +			if (ret)
> +				goto out;
> +		}
> +	}
> +
>  out:
>  	return ret;
>  }
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 7f01d69..d7675b7 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2576,10 +2576,16 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
>  		return -ENOMEM;
>
>  	ret = -ENOMEM;
> +
> +	set->sched_tags = kzalloc_node(nr_cpu_ids * sizeof(struct blk_mq_tags *),
> +				       GFP_KERNEL, set->numa_node);
> +	if (!set->sched_tags)
> +		goto out_free_tags;
> +
>  	set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
>  				   GFP_KERNEL, set->numa_node);
>  	if (!set->mq_map)
> -		goto out_free_tags;
> +		goto out_free_sched_tags;
>
>  	ret = blk_mq_update_queue_map(set);
>  	if (ret)
> @@ -2597,6 +2603,9 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
>  out_free_mq_map:
>  	kfree(set->mq_map);
>  	set->mq_map = NULL;
> +out_free_sched_tags:
> +	kfree(set->sched_tags);
> +	set->sched_tags = NULL;
>  out_free_tags:
>  	kfree(set->tags);
>  	set->tags = NULL;
> @@ -2614,6 +2623,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
>  	kfree(set->mq_map);
>  	set->mq_map = NULL;
>
> +	kfree(set->sched_tags);
> +	set->sched_tags = NULL;
> +
>  	kfree(set->tags);
>  	set->tags = NULL;
>  }
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index cfd64e5..9ec629f 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -78,6 +78,7 @@ struct blk_mq_tag_set {
>  	void			*driver_data;
>
>  	struct blk_mq_tags	**tags;
> +	struct blk_mq_tags	**sched_tags;

It isn't a good idea to put 'sched_tags' here, because scheduler tags
shouldn't be shared, unlike driver tags. If you want to do something with
scheduler tags, you have to get the 'hctx' instance first.
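[Editor's note: the objection above can also be demonstrated in miniature. A hypothetical toy, not block layer code (attach_sched_tags() and the structs are illustrative): when two request queues share one tag set, the second queue's scheduler tags overwrite the first queue's entry in the per-set array, so the set-level pointer can only ever track one queue.]

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins: each request queue allocates its own
 * scheduler tags, but the patch stores them in the shared tag set. */
struct sched_tags {
	int queue_id;
};

struct tag_set {
	struct sched_tags *sched_tags[1];	/* one slot per hw queue */
};

/* What the patched blk_mq_sched_alloc_tags() effectively does for
 * hw queue 0 of whichever queue is being set up. */
static void attach_sched_tags(struct tag_set *set, struct sched_tags *st)
{
	set->sched_tags[0] = st;
}
```

The scheduler tags belong to the hctx (and thus to one request queue), so any iteration over them has to go through the hctx, not the shared tag set.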
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab6943..4db9797 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -426,6 +426,7 @@ static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
 		blk_mq_free_rqs(set, hctx->sched_tags, hctx_idx);
 		blk_mq_free_rq_map(hctx->sched_tags);
 		hctx->sched_tags = NULL;
+		set->sched_tags[hctx_idx] = NULL;
 	}
 }

@@ -441,6 +442,8 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
 	if (!hctx->sched_tags)
 		return -ENOMEM;

+	set->sched_tags[hctx_idx] = hctx->sched_tags;
+
 	ret = blk_mq_alloc_rqs(set, hctx->sched_tags, hctx_idx, q->nr_requests);
 	if (ret)
 		blk_mq_sched_free_tags(set, hctx, hctx_idx);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index c81b40e..c290de0 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -322,6 +322,22 @@ int blk_mq_tagset_iter(struct blk_mq_tag_set *set, void *data,
 		}
 	}

+	for (i = 0; i < set->nr_hw_queues; i++) {
+		struct blk_mq_tags *sched_tags = set->sched_tags[i];
+
+		if (!sched_tags)
+			continue;
+
+		for (j = 0; j < sched_tags->nr_tags; j++) {
+			if (!sched_tags->static_rqs[j])
+				continue;
+
+			ret = fn(data, sched_tags->static_rqs[j]);
+			if (ret)
+				goto out;
+		}
+	}
+
 out:
 	return ret;
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7f01d69..d7675b7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2576,10 +2576,16 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 		return -ENOMEM;

 	ret = -ENOMEM;
+
+	set->sched_tags = kzalloc_node(nr_cpu_ids * sizeof(struct blk_mq_tags *),
+				       GFP_KERNEL, set->numa_node);
+	if (!set->sched_tags)
+		goto out_free_tags;
+
 	set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
 				   GFP_KERNEL, set->numa_node);
 	if (!set->mq_map)
-		goto out_free_tags;
+		goto out_free_sched_tags;

 	ret = blk_mq_update_queue_map(set);
 	if (ret)
@@ -2597,6 +2603,9 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 out_free_mq_map:
 	kfree(set->mq_map);
 	set->mq_map = NULL;
+out_free_sched_tags:
+	kfree(set->sched_tags);
+	set->sched_tags = NULL;
 out_free_tags:
 	kfree(set->tags);
 	set->tags = NULL;
@@ -2614,6 +2623,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 	kfree(set->mq_map);
 	set->mq_map = NULL;

+	kfree(set->sched_tags);
+	set->sched_tags = NULL;
+
 	kfree(set->tags);
 	set->tags = NULL;
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index cfd64e5..9ec629f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -78,6 +78,7 @@ struct blk_mq_tag_set {
 	void			*driver_data;

 	struct blk_mq_tags	**tags;
+	struct blk_mq_tags	**sched_tags;

 	struct mutex		tag_list_lock;
 	struct list_head	tag_list;