From patchwork Wed Jul 18 11:38:25 2018
X-Patchwork-Submitter: Sagi Grimberg
X-Patchwork-Id: 10532213
Subject: Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
To: Steve Wise, 'Max Gurtovoy', 'Leon Romanovsky'
Cc: 'Doug Ledford', 'Jason Gunthorpe', 'RDMA mailing list', 'Saeed Mahameed',
 'linux-netdev'
References: <20180716083012.15410-1-leon@kernel.org>
 <0cf29652-9034-6283-ef36-95de4588980f@grimberg.me>
 <20180716103046.GJ3152@mtr-leonro.mtl.com>
 <1cb63259-9fb6-59b0-3a34-0659973228ea@mellanox.com>
 <40d49fe1-c548-31ec-7daa-b19056215d69@mellanox.com>
 <243215dc-2b06-9c99-a0cb-8a45e0257077@opengridcomputing.com>
 <3f827784-3089-2375-9feb-b3c1701d7471@mellanox.com>
 <01cd01d41dce$992f4f30$cb8ded90$@opengridcomputing.com>
From: Sagi Grimberg
Message-ID: <0834cae6-33d6-3526-7d85-f5cae18c5487@grimberg.me>
Date: Wed, 18 Jul 2018 14:38:25 +0300
In-Reply-To: <01cd01d41dce$992f4f30$cb8ded90$@opengridcomputing.com>
X-Mailing-List: linux-rdma@vger.kernel.org

>> IMO we must fulfil the user's wish to connect to N queues and not
>> reduce it because of affinity overlaps. So in order to push Leon's
>> patch we must also fix blk_mq_rdma_map_queues to do a best-effort
>> mapping according to the affinity and map the rest in a naive way
>> (that way we will *always* map all the queues).
>
> That is what I would expect also. For example, on my node, which has
> 16 CPUs and 2 NUMA nodes, I observe much better nvmf IOPS performance
> by setting up my 16 driver completion event queues such that each is
> bound to a node-local CPU. So I end up with each node-local CPU having
> 2 queues bound to it. Without adding support in iw_cxgb4 for
> ib_get_vector_affinity(), this works fine. I assumed adding
> ib_get_vector_affinity() would allow this to all "just work" by
> default, but I'm running into this connection failure issue.
>
> I don't understand exactly what the blk_mq layer is trying to do, but
> I assume it has ingress event queues and processing that it is trying
> to align with the driver's ingress CQ event handling, so everybody
> stays on the same CPU (or at least the same node). But something else
> is going on. Is there documentation on how this works somewhere?

Does this (untested) patch help?

It really is still a best effort thing...

---
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 3eb169f15842..dbe962cb537d 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -30,29 +30,34 @@ static int get_first_sibling(unsigned int cpu)
 	return cpu;
 }
 
-int blk_mq_map_queues(struct blk_mq_tag_set *set)
+void blk_mq_map_queue(struct blk_mq_tag_set *set, unsigned int cpu)
 {
 	unsigned int *map = set->mq_map;
 	unsigned int nr_queues = set->nr_hw_queues;
-	unsigned int cpu, first_sibling;
+	unsigned int first_sibling;
 
-	for_each_possible_cpu(cpu) {
-		/*
-		 * First do sequential mapping between CPUs and queues.
-		 * In case we still have CPUs to map, and we have some number of
-		 * threads per cores then map sibling threads to the same queue for
-		 * performace optimizations.
-		 */
-		if (cpu < nr_queues) {
+	/*
+	 * First do sequential mapping between CPUs and queues.
+	 * In case we still have CPUs to map, and we have some number of
+	 * threads per cores then map sibling threads to the same queue for
+	 * performace optimizations.
+	 */
+	if (cpu < nr_queues) {
+		map[cpu] = cpu_to_queue_index(nr_queues, cpu);
+	} else {
+		first_sibling = get_first_sibling(cpu);
+		if (first_sibling == cpu)
 			map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-		} else {
-			first_sibling = get_first_sibling(cpu);
-			if (first_sibling == cpu)
-				map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-			else
-				map[cpu] = map[first_sibling];
-		}
+		else
+			map[cpu] = map[first_sibling];
 	}
+}
+EXPORT_SYMBOL_GPL(blk_mq_map_queue);
+
+int blk_mq_map_queues(struct blk_mq_tag_set *set)
+{
+	for_each_possible_cpu(cpu)
+		blk_mq_map_queue(set, cpu);
 
 	return 0;
 }
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..5e91789bea5b 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -35,6 +35,10 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
 	const struct cpumask *mask;
 	unsigned int queue, cpu;
 
+	/* reset all to */
+	for_each_possible_cpu(cpu)
+		set->mq_map[cpu] = UINT_MAX;
+
 	for (queue = 0; queue < set->nr_hw_queues; queue++) {
 		mask = ib_get_vector_affinity(dev, first_vec + queue);
 		if (!mask)
@@ -44,6 +48,11 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
 		set->mq_map[cpu] = queue;
 	}
 
+	for_each_possible_cpu(cpu) {
+		if (set->mq_map[cpu] == UINT_MAX)
+			blk_mq_map_queue(set, cpu);
+	}
+
 	return 0;
 
 fallback:
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e3147eb74222..7a9848a82475 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -283,6 +283,7 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout);
 
 int blk_mq_map_queues(struct blk_mq_tag_set *set);
+void blk_mq_map_queue(struct blk_mq_tag_set *set, unsigned int cpu);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
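
For readers working through the patch, here is a small self-contained
userspace sketch (plain C, nothing kernel-specific) of the mapping policy
it implements: first honour the per-vector affinity masks, then map every
CPU that was left over in a naive way so that all CPUs always end up with
a queue. The CPU count, queue count and affinity table below are made up
for illustration, and a plain modulo stands in for the sibling-aware
blk_mq_map_queue() fallback used in the actual patch; this is not the
kernel code itself.

/*
 * Userspace illustration of the "best effort" mapping discussed above:
 * CPUs covered by a queue's affinity mask are mapped to that queue
 * first, and any CPU left unmapped falls back to a naive round-robin.
 *
 * NR_CPUS, NR_QUEUES and the affinity table are invented for the
 * example; CPUs 6 and 7 are deliberately covered by no mask to mimic
 * the overlap problem the thread is about.
 */
#include <stdio.h>

#define NR_CPUS    8
#define NR_QUEUES  4
#define UNMAPPED  -1

/* affinity[q][cpu] != 0 means "queue q's vector may run on cpu". */
static const int affinity[NR_QUEUES][NR_CPUS] = {
	{ 1, 1, 0, 0, 0, 0, 0, 0 },
	{ 0, 0, 1, 1, 0, 0, 0, 0 },
	{ 0, 0, 0, 0, 1, 0, 0, 0 },
	{ 0, 0, 0, 0, 0, 1, 0, 0 },
};

int main(void)
{
	int mq_map[NR_CPUS];
	int queue, cpu;

	/* Start with every CPU unmapped. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		mq_map[cpu] = UNMAPPED;

	/* Pass 1: honour the affinity masks where they cover a CPU. */
	for (queue = 0; queue < NR_QUEUES; queue++)
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			if (affinity[queue][cpu])
				mq_map[cpu] = queue;

	/* Pass 2: naive fallback so no CPU is left without a queue. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (mq_map[cpu] == UNMAPPED)
			mq_map[cpu] = cpu % NR_QUEUES;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		printf("cpu %d -> queue %d\n", cpu, mq_map[cpu]);

	return 0;
}

Built with any C compiler, it prints the resulting CPU-to-queue table;
with the table above, CPUs 6 and 7 are not covered by any vector's mask
and only receive a queue through the second pass, which is the situation
the added for_each_possible_cpu() loop in the blk-mq-rdma.c hunk handles.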